DiasMorph: a dataset of morphological traits and images of Central European diaspores

The workflow for seed trait extraction consists of sample preparation, qualitative trait assessment, image acquisition, image processing, and trait measurement with the Traitor software (Fig. 1).

Fig. 1 Workflow overview for dataset.

Sampled taxa

We sampled diaspores available in the seed collection of the Chair of Ecology and Conservation Biology at the University of Regensburg, Germany, which was started and curated by Prof. Peter Poschlod. The collection comprises taxa found in Central Europe, with collections carried out mainly in Germany (Fig. 2), and serves as a reference for identifying diaspores collected during field studies in the region. Germany is home to 4,202 taxa [17] (species and infraspecific taxa) of seed plants; the collection includes 1,048 taxa sourced from Germany, representing about 25% of the country’s flora, making it a substantial and representative sample. Most diaspores were collected within Central Europe, ensuring regional relevance. Additionally, some taxa with wide global distributions that encompass Central Europe were sourced from other areas, further enhancing the dataset’s comprehensiveness.

Fig. 2 Maps showing the number of diaspore collections (A) per country and (B) per locality or geometric centre in the DiasMorph dataset. In (B), coordinates are rounded and grouped to the nearest whole degree. To enhance visualisation, four countries (Ethiopia, Iceland, India, and Namibia), each with a single collection, have been omitted.

In total, our dataset contains images and records of quantitative morphological traits for 94,214 diaspores from 1,442 taxa (including species, infraspecific taxa, and three sections), belonging to 519 genera and 96 plant families (Fig. 3). Taxon names and family information were checked and updated using the functions WFO.match and WFO.one from the R package WorldFlora [18]. The last nomenclature verification was carried out on May 20th, 2024.
The most represented families in the database are Asteraceae (192 taxa; 65 genera), Poaceae (114; 48), Brassicaceae (93; 44), Cyperaceae (86; 10), and Fabaceae (80; 22). This distribution closely reflects the diversity of the most species-rich families within the region [17]. However, there is an exception: the Rosaceae family is underrepresented due to limited collections of the genus Rubus, which comprises hundreds of taxa.

Fig. 3 Cladogram of the phylogeny for the families in the DiasMorph dataset. The barplot represents the number of taxa within each family in the DiasMorph dataset.

Geolocation

Since coordinates were not readily available for the diaspore collection, we used Google Maps to approximate the coordinates for each location. We then categorised each location by its resolution: locality (1,036 cases), which involved specific places such as neighbourhoods, towns, villages, parks, cities, mountain peaks, and communes; region (136 cases), encompassing larger areas such as districts and states within countries; country (50 cases); mountain range (156 cases); river (69 cases); botanic garden (9 cases); and commercial supplier (1 case). The obtained coordinates represent the geometric centre of a polyline (e.g., a river) or polygon (e.g., a region).

Recorded appendages

For each species, we recorded diaspore structures and appendages (Table 1, Fig. 4) following a modified version of the seed structure categories in the LEDA Trait standards [8,12]. As LEDA is a database focused on functional traits, the modifications aimed to improve the objectivity of the classification and facilitate the recognition of morphological structures for identification purposes. For each taxon, appendages and structures were classified as present (1) or absent (0).
In some instances, diaspores of species and genera were found with and without appendages and structures; for these cases, we recorded the structures as present and later specified them as missing from the image (see Sample preparation).

Table 1 Summary of the diaspore appendage and structure categories.

Fig. 4 Example of taxa classified as having bent elongated appendages (first three from left to right) or bearing distinctively crooked elongated appendages (rightmost). From left to right: Avena barbata, Bromus squarrosus, Arrhenatherum elatius (Poaceae), Pulsatilla alpina (Ranunculaceae).

Extraction of quantitative traits

We used an image analysis method described and validated by Dayrell et al. [16] to obtain images and extract quantitative measurements of diaspore morphology.

Sample preparation

We cleaned the diaspores with the aid of a stereo microscope and only selected diaspores with all structures in a well-preserved state, apart from three exceptions. (1) Fleshy covering structures and some fleshy outgrowths were removed due to the pronounced changes that these structures undergo after dispersal, which can lead to unrecognisable colours, shapes, and sizes. (2) We measured diaspores without scales or covering structures when most diaspores in a vial of the seed collection had detached from these structures without handling. (3) Hairy appendages (e.g., pappus and plumes) were removed due to requirements of the method [16]. The structures that were not present in the scanned diaspores were recorded as ‘missing structures’ in the dataset.

Image acquisition

For image acquisition, diaspores were arranged on the flatbed scanner avoiding any contact or overlap. The number of sampled diaspores varied for each taxon according to their availability in the seed collection (Fig. 5). We sampled all available material that met sample preparation standards when 30 or fewer diaspores were available.
In cases where the number of available diaspores exceeded 30, we sampled seeds to cover an area of up to 200 cm². The flatbed scanner was covered with a wooden frame 10 mm thick with a royal blue background. Images were acquired with a flatbed scanner (HP Scanjet G4010) at a resolution of 1,200 DPI to adequately represent small seeds and fine appendages. All automatic correction functions associated with the scanner software were disabled to ensure that the RGB values of the samples were not manipulated. The resulting images were saved in the Joint Photographic Experts Group (JPEG) format with no compression.

Fig. 5 Histogram of the number of diaspores per taxon sampled for quantitative measurements.

Image processing

To allow standardisation of colour measurements, a Spyder Checkr® 24 card (Datacolor, NJ, USA) was scanned in the flatbed scanner under the same settings as the diaspores, and used to calculate a colour conversion matrix (CCM). The CCM was then applied to images for optimal colour reproduction (https://github.com/rdayrell/colour_calibration). In some images, undesired elements, such as broken seeds and particles, were removed from the image with the brush and clone stamp tools in Adobe Photoshop. Images were saved in PNG format throughout all processing steps to avoid compression artifacts. Processed images (Fig. 6) comprise the original image dataset and were used as inputs for automated trait extraction.

Fig. 6 Examples of diaspore images in the DiasMorph dataset.

Extraction with Traitor software

The Traitor software (https://github.com/TankredO/traitor) was used to segment, align, and extract morphological traits from images [16]. The extracted traits include: (1) morphometric measurements (length, width, aspect ratio, area, perimeter, diaspore surface structure, solidity, circularity); (2) colour measurements for human recognition purposes (Fig. 7; mean, median, and most dominant colours in sRGB) and for ecological and evolutionary studies (independent of any particular animal visual system; linear sRGB); (3) standardised contours of diaspores (50 coordinates for each seed) for shape analysis methods. After the extraction, fields containing size measurements in pixels were converted to metric units using the conversion factor of 47.8 pixels per millimetre obtained from a reference scale, which is also included as an image in the dataset.

Fig. 7 PCA scores plot obtained from the median colour values of diaspores from the taxa in the DiasMorph dataset.

Algorithm limitation and correction

One limitation of the image-based trait extraction algorithm is its occasional failure to accurately align diaspores with bent elongated appendages (e.g., bent awns or distinctively crooked elongated appendages; Fig. 4), resulting in incorrect size and morphometric measurements [16]. When we checked the consistency of Traitor’s output (see ‘Technical Validation’ section), this occurred primarily in taxa belonging to the Poaceae family, except for one Ranunculaceae species. Thus, the records of taxa with elongated bent or distinctively crooked appendages were deleted from the quantitative trait dataset obtained from original images, detailed in the previous section.

To provide reliable measurements of taxa with such appendages, we edited the original images of diaspores to make them compatible with the algorithm. We also edited images of Poaceae taxa bearing unbent elongated appendages, even though they provided correct outputs. This was done to provide measurements pertaining to the same structures, making the data consistent and comparable across all the Poaceae taxa.
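The deletion of unreliable original-image records described in this section can be pictured with a minimal sketch. The record layout, the field names (`bent_elongated`, `crooked_elongated`, `taxon`, `image`, `length_px`), the image names, the pixel values, and the placeholder taxon are all hypothetical and are not taken from the DiasMorph files; only the two named species and their appendage classes follow Fig. 4.

```python
# Hypothetical appendage flags per taxon (1 = present, 0 = absent).
# Field names are illustrative, not the dataset's actual column names.
appendages = {
    "Avena barbata": {"bent_elongated": 1, "crooked_elongated": 0},
    "Pulsatilla alpina": {"bent_elongated": 0, "crooked_elongated": 1},
    "Hypothetical taxon A": {"bent_elongated": 0, "crooked_elongated": 0},
}

# Hypothetical trait records extracted from the original (unedited) images.
records = [
    {"taxon": "Avena barbata", "image": "img_a", "length_px": 900.0},
    {"taxon": "Pulsatilla alpina", "image": "img_b", "length_px": 1500.0},
    {"taxon": "Hypothetical taxon A", "image": "img_c", "length_px": 120.0},
]


def is_alignment_risk(flags):
    """True if a taxon bears bent or distinctively crooked elongated appendages."""
    return bool(flags["bent_elongated"] or flags["crooked_elongated"])


def drop_unreliable(records, appendages):
    """Keep only original-image records of taxa without such appendages."""
    return [r for r in records if not is_alignment_risk(appendages[r["taxon"]])]


kept = drop_unreliable(records, appendages)
for record in kept:
    print(record["taxon"], record["image"])
```

Filtering on the qualitative appendage flags, rather than on the measurements themselves, mirrors the text's taxon-level decision; whether the authors implemented the step this way is not stated.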
As a result of this correction process, the final quantitative dataset has two records for each diaspore of taxa with elongated unbent appendages, obtained from original and edited images, while there is only one record for each diaspore of taxa with elongated bent appendages, obtained from edited images.

Image editing consisted of manually erasing the elongated appendage from the image with the brush and clone stamp tools in Adobe Photoshop and saving the image as PNG. The edited images were labelled with the same name as the original image, with the addition of ‘_edit’ (e.g., ‘img_0261’ and ‘img_0261_edit’), and are available in a separate zip file. Traits of edited images were extracted with Traitor and merged with the quantitative dataset described in the previous section. For these images, ‘elongated appendages’ were classified as ‘missing structures’.
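As a worked illustration of the unit conversion described under the Traitor extraction step, the sketch below converts pixel measurements using the stated factor of 47.8 pixels per millimetre. The function names and example values are hypothetical. Dividing areas by the squared factor is ordinary dimensional reasoning and is an assumption here, since the text does not spell out how area fields were handled.

```python
PIXELS_PER_MM = 47.8  # conversion factor reported in the text


def length_px_to_mm(length_px: float) -> float:
    """Convert a linear measurement (length, width, perimeter) to millimetres."""
    return length_px / PIXELS_PER_MM


def area_px_to_mm2(area_px: float) -> float:
    """Convert an area to square millimetres (assumed: divide by the squared factor)."""
    return area_px / PIXELS_PER_MM ** 2


# Hypothetical example: a diaspore 478 px long covering 22,848.4 px^2.
length_mm = length_px_to_mm(478.0)
area_mm2 = area_px_to_mm2(22848.4)

print(f"length: {length_mm:.2f} mm, area: {area_mm2:.2f} mm^2")
```

Dimensionless traits such as aspect ratio, solidity, and circularity are ratios and would need no conversion.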
