Herbarium collections remain essential in the age of community science

Ethics and inclusionThe lands that many now call Canada, are the stolen traditional territories of many diverse First Nations, Métis, and Inuit Peoples. We recognize that the 88 remaining herbaria in Canada (and aspects of herbaria globally60) are a product of the colonization and exploitation of land long stewarded by Indigenous Peoples. Many of the specimens housed in herbarium collections were gathered by botanists who used local Indigenous Knowledge often without appropriate recognition. As a result, despite their critical influence, Indigenous voices are largely absent from Canadian herbarium collections. However, some herbaria are taking steps to amplify Indigenous voices, and ongoing initiatives such as Plenty Canada’s Greenbelt Indigenous Botanical Survey61, Canadian Museum of Nature’s Capture the Collections62, and McGill University Herbarium’s Recovering Lost Voices63, demonstrate ways that herbaria can work towards truth and reconciliation in Canada. Moving forward, as herbarium specimens are digitized and made accessible, investigating and communicating the colonial and Indigenous legacies of these collections is critical for both their scientific and societal value.Species lists and occurrence dataStarting with a full list of Canada’s vascular plants64 we downloaded all the GBIF observations in both Canada and the United States from 1900 to present (January 2024) to enable a contemporary comparison of both iNaturalist and herbarium records65,66. First, GBIF observations with a coordinate uncertainty of over 25 km were removed. Then data were divided into two groups representing iNaturalist records (institution code “iNaturalist”) and herbarium records (basis of record “PRESERVED_SPECIMEN”). This resulted in 5,423,637 research grade iNaturalist observations and 1,742,166 Herbarium records.Temporal biasTo illustrate temporal bias, we simple plotted the accumulation of herbarium and iNaturalist records annually from 1900 till present (January 2024).Spatial biasTo test for spatial biases in each data type we used Nearest Neighbor Index (NNI) which is a measure of spatial autocorrelation67. When log transformed, negative NNI indicates that the geographic locations (latitude, longitude) of records are more clustered than expected and positive NNI indicates records are more spread than expected, with an NNI of 0 indicating points are randomly distributed in space. After calculating NNI for each point for each data type, we used a two-sided t-test to test for significant differences between the means. NNI was calculated using the nni() function in the “spatialEco” package for R68. To visualize spatial bias in number of records, we mapped the total number of records per 25 km2 grid cells across Canada and the United States. To visualize the imbalance between herbarium and iNaturalist sampling effort per cell, we mapped the difference between the relative portion of records of each data type. This produced a map that highlights areas of higher iNaturalist observation density compared to herbarium record density and vice versa. Finally, because past work has demonstrated that iNaturalist observations tend to occur in areas of high human population density, we modeled the relationship between number of records and population density across North America for both data types. To do this we used the Gridded Population of the World (v4) raster for the year 202069, downloaded at 2.5 arc-minute resolution (roughly 5 km2). We then resampled this raster to our 25 km2 grid for North America using bilinear resampling in the terra package for R70. Using the rasterized layers of herbarium and iNaturalist record counts, we then assembled a data frame for all cells with at least 1 record of either data type and extracted the corresponding population density value for each cell. Because we are dealing with integer count data, we fit negative-binomial generalized linear models using the MASS package for R71. We log-transformed population density to improve model fit. We report both model parameters, AIC, and pseudo R2 (Kullback–Leibler) calculated with the performance package for R72.To visualize the spatial distribution of herbarium and iNaturalist records, we rasterized record data at 25 km2 resolution across Canada and the United States. To view the balance between data types across space, we took the difference in number of records per pixel. We overlayed national, provincial, territorial, and state borders73 to aid visualization.Taxonomic biasTo quantify taxonomic bias, we assessed whether the variance across the number of records per species differed between herbarium and iNaturalist data using an F test. To control for the different number of total records per data type, we used relative number of records to perform the F test.Phylogenetic and functional biasTo quantify phylogenetic and functional bias we first needed phylogenetic and functional data. We built a phylogenetic tree of Canadian vascular plants using the “rtrees” package for R74. To visualize and assess bias we built a functional dendrogram to match our phylogenetic tree. First, we downloaded the following plant functional traits from the TRY database75: Seed dry mass, Plant height vegetative, Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded, Plant lifespan (longevity), Plant nitrogen(N) fixation capacity, Plant growth form, Leaf photosynthesis pathway, Dispersal syndrome, Plant reproductive phenology timing, Leaf compoundness, Plant woodiness, Leaf type. These traits were selected based on taxonomic coverage and have been used in the past to capture and represent the functional diversity of Canada’s plants15. Using these traits, we calculated an average trait value for each species of plant and phylogenetically imputed missing values using phylogenetic vector regressions in the “PVR” package for R76 and random forest regression trees in the “missForest” package for R77. Plants with no functional trait data were dropped from the analysis since imputing these species would be solely based on PVR values. This left us with a remaining 4147 (94%) species with complete functional and phylogenetic data. From there, functional traits were used to calculate a Gower’s distance matrix using the “FD” package for R78 which was used to construct a functional dendrogram using UPGMA clustering achieved with the hclust() function in “stats” package included in base R79. Once we had both our phylogenetic tree and functional dendrogram for the remaining 4147 plant species, we used Pagel’s λ which estimates the degree to which shared branch lengths influence the distribution of trait values at the tips of a phylogenetic tree or functional dendrogram80. Values of Pagel’s λ range between zero and one, with zero representing phylogenetic independence and one representing perfect Brownian motion (strong phylogenetic bias). Pagel’s λ has been shown in the past to outperform other estimates of phylogenetic signal81. We chose to use Pagel’s λ to test for both phylogenetic and functional bias to standardize our approach and to allow us to compare the strength in both phylogenetic and functional bias in our data. In our data, bias can be thought of as the evenness of the number of records per species across the phylogenetic tree or functional dendrogram. In this case, a high λ (close to 1) would indicate that the number of records is highly correlated to the phylogenetic or function structure of the tree/dendrogram, meaning that, for example, if one species is represented by a high number of records, more phylogenetically related or functionally similar species are more likely to also be represented by a high number of records compared to a distantly related species. On the other hand, a λ close to 0 would suggest that the number of records is randomly distributed across the phylogenetic tree or functional dendrogram, suggesting that little bias is present in that data type. Pagel’s λ was calculated using the phylosig() function in the “phytools” package for R82 based on the log-transformed (to achieve a normal distribution) number of herbarium and iNaturalist records for each species of plant. Phylogenetic and functional bias were visualized by plotting the square root (instead of log) of these values for each tip in the phylogenetic and functional trees. This was done to enable better visualization of the variation in record number across species.Taxonomic coverageOnce we had assessed bias, we were interested in quantifying the degree to which herbarium and iNaturalist records captured the full diversity of Canada’s vascular plants. Starting with taxonomic coverage, we first quantified the number and proportion of all Canadian vascular plants represented in each data type. Then, to understand how efficiently each data type captures the taxonomic diversity of Canada’s plants, we built modified species accumulation curves. First, each species in the VASCAN list was assigned a value representing its contribution to the total taxonomic diversity. For taxonomic diversity, this was simply one divided by the number of species. Then, working with iNaturalist and herbarium records separately, we then randomly accumulated records, and for each record added, recorded both the index number (e.g., the 5th record added), the species (e.g., Cypripedium parviflorum), and the proportion of taxonomic diversity represented by that species (e.g., one divided by the total number of species in VASCAN). When the added record (e.g., the 5th randomly selected record) corresponded to a species that was already captured in the accumulation by an earlier record (e.g., the 3rd randomly selected record), the proportion of taxonomic diversity represented by that record was recorded as zero, since that species had already been accounted for. This process was repeated until all records had been added, and the entire process was then repeated 1000 times. From these 1000 runs, we calculated the average proportion of taxonomic diversity captured for each index position for each data type. Using these averages and the corresponding index position, we then fit logarithmic beta regressions83 to these curves to allow us to estimate the increase in niche coverage with the addition of novel records. Traditional species accumulation curves are usually used to quantify sampling completeness or estimate the true number of species in a study region (usually by identifying the asymptote of the accumulation curve). In our case, we know the total number of plant species in Canada (based on the VASCAN list). Because of this, we needed to be able to fit a curve that reflected the fact that eventually, if we accumulate an infinite number of herbarium or iNaturalist records, we should be able to reach (at least asymptotically) the total number of plants in the VASCAN list. Logistic Beta regressions allowed us to do this and by log-transforming the independent variable (number of records) we achieved a logarithmic-shaped curve that reaches an asymptote at 1 as expected under sampling theory. Beta regressions operate with a response variable distributed between 0 and 1 and in this study, we used the proportion of all species as the response instead of the total number of species. The slope of this regression indicates how quickly different datatypes accumulate taxonomic diversity. We fit beta regressions using the “betareg” package for R84.Phylogenetic coverageWe used phylogenetic coverage to assess how well each data type captures the full phylogeny, representing millions of years of plant evolution in Canada. Using the phylogenetic tree described earlier, we used the evol_distinct() command in the “phyloregion” package for R85 to estimate the relative phylogenetic distinctiveness of each species. Total coverage then, was calculated as the proportion of all phylogenetic distinctiveness captured by each data type. We then built accumulation curves but instead of taxonomic diversity, we accumulated phylogenetic diversity with the addition of new records. To do so we used the same approach as described above and fit the same beta regressions.Functional coverageWe used functional coverage to assess how well each data type captures Canadian functional diversity, representing the variation in ecological roles played by different species of Canadian plants. Starting with the functional dendrogram described above, we calculated the relative functional distinctiveness of each species again using the evol_distinct() command in the “phyloregion” package for R85. To estimate total coverage, we calculated the proportion of all functional distinctiveness captured by each data type. We then built accumulation curves but instead of accumulating taxonomic diversity, we accumulated functional diversity with the addition of new records. Again, using the same approach and beta regressions as described above.Niche coverageTo estimate how herbarium and iNaturalist records represent the spatial and environmental niches of plants we first had to calculate the extent of their niche space. To do so we relied on the expertly estimated plant ranges provided as polygons in the BIEN dataset, accessed through the “BIEN” package for R86. Of the 4392 vascular plants in Canada, only 3269 (74%) have estimated range maps, of which only 3174 also had GBIF records. We acknowledge that areas within species range polygons do not always indicate species presence. If analyzed at a fine grain size (e.g., 1 km2), one would expect there to be cells within a species range that are not occupied by that species due to variation in suitable habitat across species ranges. To account for this, we chose a coarse grain size (25 km2) to try and maximize the probability that suitable habitat occurred in each grid cell. Furthermore, the range polygons used in our analysis were not simple convex hulls but contained holes to account for regions within the spatial extent of the range where species are likely not present.To estimate spatial coverage, we rasterized range maps at the 25 km2 resolution and simply calculated the proportion of raster cells in the species range that had herbarium/iNaturalist records in them. To go from spatial coverage to environmental niche coverage, we needed to estimate the “proportion” of the species environmental niche present in each raster grid cell. Starting with 5 climate normal layers (Mean Annual Precipitation, Precipitation as Snow, Humidity, Degree days above 0°, and Degree days above 18°) downloaded from AdaptWest Project87 spanning the past 30 years, we first aggregated the climate layers up to 25 km2 resolution to match the spatial grid of already rasterized plant ranges. These climate layers were used to match the climate variables used to construct the SDMs used later in the analysis. For each species of plant, we extracted the climate values for all cells within its rasterized range polygon to assemble a cell by climate matrix. To account for differences in measurement scale, we standardized climate variables using decostand() in the “vegan” package for R88. From there we computed a Euclidean distance matrix, which was clustered to form a dendrogram, like our functional dendrogram but clustering cells by climate similarity instead of by species functional traits. Using this climatic dendrogram, we calculated the individual contribution of each cell in said species range to the total climatic niche of that species using the same evol_distinct() command in the “phyloregion” package for R85. Finally, these values, representing the climatic distinctiveness of each 25 km2 cell in a species range were made relative so that the total across all cells in a species ranges summed to 1. This process was repeated for all species, recalculating the climatic distance matrix and climatic dendrogram each time to allow for differences between species based on differences between their realized climatic niches. Another possible approach would have been to decompose our climate variables into 2-dimensional principal component space, to identify geographic areas with similar environmental conditions. However, at coarse grain sizes such as the one we used, we believe it is safe to assume that each geographic pixel represents a unique combination of environmental variables. As such, we chose to instead assign a proportion of environmental niche space to each geographic pixel to give weight to pixels with highly dissimilar combinations of environmental conditions compared to pixels with more similar conditions.Using these estimates of spatial and environmental coverage per cell, we then built modified accumulation curves to model how rapidly herbarium and iNaturalist records were capturing species niches. First, we removed all herbarium and iNaturalist records found outside of the species rasterized range. Then we assigned each record a cell number corresponding to the 25 km2 grid cell it occurred in. Working with herbarium and iNaturalist records separately, we then randomly accumulated records, and for each record added, recorded both the index number (e.g., the 5th record added), the spatial contribution of that cell (e.g., 1 divided by the number of cells in that species range), and the climatic contribution of that cell (e.g., the relative climatic distinctiveness). When the added record (e.g., the 5th randomly selected record) corresponded to a cell that was already captured in the accumulation by an earlier record (e.g., the 3rd randomly selected record), the spatial and climatic contribution of that record was recorded as zero, since that cell had already been accounted for. This process was repeated until all records had been added, and the entire process was then repeated 1000 times. From these 1000 runs, we calculated the average spatial and environmental contribution of each indexed record for each data type. Using these averages and the corresponding index position, we then fit logarithmic beta regressions to these curves to allow us to estimate the increase in niche coverage with the addition of novel records. While we used both spatial and environmental niches in our analysis, the results were virtually identical, so we chose to report environmental niche results in the main text and figures.Beta regression fitFor the 3174 plants for which we attempted to fit beta regressions, model fit was very good. Goodness of fit (pseudo-R2) ranged from an average 0.996 (min = 0.898, max = 1) for herbarium regressions to an average 0.991 (min = 0.772, max = 1) for iNaturalist regressions. Because some regressions did not converge and some species were excluded due to too few data points, we chose to report community averages instead of individual species results. Regressions that did not converge were removed from the analysis.Extrapolating beta regressionsThere is an estimated 7.3 million undigitized herbarium specimens housed in active Canadian herbaria19. While the exact number is unknown, this estimate is probably on the lower side, since it only reflects information from herbaria for which metadata are available, including herbarium specimens that have at least been counted and catalogued and does not include the potentially millions of other records that remain hidden away in collections that lack resources to make even their metadata accessible. As such, our use of this number to reflect the potential benefits incurred by the digitization of Canada’s remaining herbarium specimens is likely conservative.To estimate how the digitization of herbarium records would add taxonomic, phylogenetic, functional, and niche coverage, we used regression coefficients. Then we used curves fit to iNaturalist data to estimate how many additional iNaturalist records would be required to match the added coverage conferred by herbarium digitization. For niche coverage, since there is taxonomic bias in herbarium representation, we first calculated the relative incidence of each species in existing herbarium records on GBIF. This was used to divide the undigitized 7.3M records in Canada into an expected number of undigitized records per species, which was used to estimate the increase in niche coverage using beta regressions. While this approach assumes that no additional species will be present in the undigitized 7.3M records, our other results suggest that an additional 156 species could be found in the undigitized records. However, these species, not represented in already digitized specimens, are likely rare and represented by only a few of the remaining 7.3M undigitized records. One limitation of this approach is that past digitization may have focused on specific clades (e.g., to understand trait variation in a single genus across space or time) and so it is possible that the taxonomic representation in undigitized collections is different than that of digitized specimens. To estimate the number of iNaturalist records required to match potential coverage given digitization, we simply extrapolated the iNaturalist beta regressions.Translating current and potential niche coverage into benefits for species distribution modellingWe relied on SDMs detailed in Eckert et al. (2023). These models were built using GBIF data downloaded on June 5th, 202189,90 for all Canadian vascular plants. We first thinned observation records down to a single observation per 1 km grid cell in North America. To further clean data points unlikely to represent the native distribution, we removed any points in core urban areas (e.g., areas designated ‘urban or built-up’ in the land-use/land cover data described below), which often included clusters of data points in botanical gardens/zoos/sanctuaries.We used the following set of climatic variables that were biologically meaningful and had low correlation: mean annual precipitation (mm), chilling degree days (Degree-days below 0 °C), precipitation as snow (mm), Hargreave’s climatic moisture index and warming degree-days above 18 °C. We used current climate models from AdaptWest Project (2021). Current climate data is based on PRISM and WorldClim and spans 1991–2020. We also included topographic wetness index (calculated based on the 1-km) digital elevation model using package “dynatopmodel” in R91, topographic ruggedness index (from AdaptWest), and an aggregated land cover layer based on MODIS land cover data and reprojected to our grid and reclassified to: unvegetated, hardwood forests, evergreen forest, mixed forests, shrubs, and grasslands92. Finally, we used three variables to represent soil properties (topsoil silt fraction, subsoil pH, and topsoil organic C content) from the Unified North American Soil Map93 (0.25 degree resolution) that were projected to match the 1-km2 climate raster.We fit Boosted Regression Trees (BRTs)94,95,96 with all environmental, topographic, and soil predictors in the “dismo” package for R97. All presences were used in the models unless they exceeded 5000, in which case 5000 presences were randomly drawn along with 10,000 absences. BRTs were projected across all North America, although only the Canadian ranges were used in subsequent analyses. Model outputs were used to estimate the degree to which our SDMs can “fill” expertly estimated species range maps from the BIEN package86. To do so, we first aggregated projections (keeping the maximum value) to 25 km and thresholded at 0.5 to identify areas where presence is highly likely. While it is unrealistic to expect any single species to occupy the entire extent of its spatial range, once aggregated to a coarse 25 km2 resolution, we operated under the assumption that within a species range boundary, there is likely suitable habitat in each 25 km2 subdivision, which we accounted for by retaining the maximum predicted probability of occurrence during aggregation. We then calculated the number of grid cells in the BIEN range that predicted a presence versus the number of grid cells in the entire range to estimate range filling. For example, if a species was predicted to be present in 10 grid cells in its BIEN range of 100 grid cells, then its range-filling score is 0.1.To estimate how increased niche coverage might impact the ability of our SDMs to fill species ranges, we needed to model the relationship between current niche coverage and range filling. Since range-filling scores were distributed between 0 and 1 (representing the proportion of the species range filled by our SDMs) we again used a beta regression. Because some species had either none or all of their range filled by our models, and traditional beta distributions do not handle 0 s and 1 s, we used an inflated beta distribution (family=BEINF) and logit link functions to fit a Generalized Additive Model for Location Scale and Shape (GAMLSS) in the “gamlss” package for R98 (Table S4). To improve model fit we used the square root of niche coverage as our predictor variable. Model estimates are provided in Table S1. Instead of using the fitted µ curve to generate a single mean estimate of potential gain in predictive power, we used the curve to estimate potential gain for each species of plant, predicting values using potential niche coverage. The average across all plants (including species represented by 0 or 1) is reported in the main text and visualized in Fig. 6b. Finally, to understand how increasing the niche coverage translates to increases in range-filling scores we fit a simple linear regression on log-transformed data (Table S5).Estimating the cost of digitizationTo estimate how much it costs herbaria to digitize a single specimen, we consulted the curators of major herbaria in Canada along with past work and reports from other herbaria around the world. This estimate of $3 per specimen represents the use of traditional workflows involving cameras and humans and does not account for new high-throughput workflows that are largely automated but initially costly to install.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles