Utilization of a natural language processing-based approach to determine the composition of artifact residues | BMC Bioinformatics

Mathematical model
A novel functionality of our approach is a new, automated method for comparing the mass spectral feature similarities between experimental and ancient artifact samples. In this study, as in other ancient residue metabolomics studies, obtaining datasets that contain replicates is often not feasible [1]. This limits our ability to apply multivariate statistical methods. Thus, we developed an algorithm inspired by advances in NLP [11,12,13,14]. Here, we use the following analogy:
$$\begin{array}{c} \text{Words} \longleftrightarrow \text{Mass Spectral Feature Abundances} \\ \text{Documents} \longleftrightarrow \text{Samples.} \end{array}$$
That is, if the words shared between documents can tell us which documents are similar, the mass spectral features shared between samples can tell us which samples are more likely to be similar. The standard technique in NLP is to first transform the original data into the term frequency-inverse document frequency (TF-IDF) matrix [13]. This transformation accounts for the fact that some words (or mass spectral features) appear more often than others. More precisely, the importance of a term (or mass spectral feature) is determined not solely by its frequency (or abundance) in a text (or sample) but also by how rare this term (or the quantifiable intensity of a particular mass spectral feature) is in the other texts in the corpus (or collection of all samples). We note that Brownstein et al. [1] previously used a more qualitative method. Just as comparing words between documents identifies commonalities, we can identify which samples are likely to be similar based on their shared mass spectral features. In other words, analyzing the common mass spectral features allows us to infer which experimental artifact (or experimental clay pipe) matches which ancient artifact (or blind clay pipe). Let us recall these terms mathematically.
Term frequency refers to the frequency (or abundance) of a word (or mass spectral feature) in a particular document (or sample):
$$tf\left( w,d \right) = \frac{\text{count of } w \text{ in } d}{\text{number of words in } d}$$
The inverse document frequency measures how informative a term $w$ (or mass spectral feature) is, based on its prevalence across documents (or samples):
$$idf\left( w \right) = \log \left( \frac{N}{df\left( w \right)} \right) + 1$$
Here, $N$ is the number of documents (or samples) and $df\left( w \right)$ is the number of documents (or samples) containing the word $w$ (or mass spectral feature). We remark that the above formula for IDF is based on what the Sklearn library (scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer) uses in its implementation; therefore, it differs slightly from the standard textbook definition. Specifically, some authors use the following convention for $idf\left( w \right)$:
$$idf\left( w \right) = \log \left( \frac{N}{df\left( w \right) + 1} \right)$$
We believe that, owing to the popularity and simplicity of Sklearn, its use as shown herein can be applied to similar problems in the field. We also remark that we applied other weighting methods available in Sklearn, but the method above performed the best. We refer the reader to our GitHub repository for the performance comparison of these weighting schemes. Finally, once $tf\left( w,d \right)$ and $idf\left( w \right)$ are computed, the TF-IDF score is calculated by the following formula:
$$\text{tf-idf}\left( w,d \right) = tf\left( w,d \right) \cdot idf\left( w \right).$$
In our context, the TF-IDF score describes the relevance of a mass spectral feature within a sample, weighted by how prevalent that feature is across the other samples. Once the TF-IDF scores are computed, we can use cosine similarity to compare two different documents (or samples).
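The formulas above can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the sample names, feature names, and abundance values are hypothetical, and the helper functions are introduced here only for illustration. The first IDF function follows the $\log(N/df) + 1$ formula quoted above, which is what Sklearn documents for its `smooth_idf=False` setting; the text does not state which setting was actually used. The second follows the alternative convention mentioned above.

```python
import math

def term_frequency(sample):
    """tf(w, d): abundance of each feature divided by the sample's total abundance."""
    total = sum(sample.values())
    return {f: a / total for f, a in sample.items() if a > 0}

def document_frequency(samples):
    """df(w): number of samples in which feature w has a nonzero abundance."""
    df = {}
    for sample in samples.values():
        for feature, abundance in sample.items():
            if abundance > 0:
                df[feature] = df.get(feature, 0) + 1
    return df

def idf_sklearn_style(n, df_w):
    """idf(w) = ln(N / df(w)) + 1, the formula quoted in the text."""
    return math.log(n / df_w) + 1

def idf_textbook(n, df_w):
    """idf(w) = ln(N / (df(w) + 1)), the alternative convention."""
    return math.log(n / (df_w + 1))

def tf_idf(samples):
    """tf-idf(w, d) = tf(w, d) * idf(w) for every sample and nonzero feature."""
    n = len(samples)
    df = document_frequency(samples)
    return {
        name: {f: tf * idf_sklearn_style(n, df[f])
               for f, tf in term_frequency(s).items()}
        for name, s in samples.items()
    }

# Hypothetical toy data: two samples, three mass spectral features.
samples = {
    "pipe_A": {"f1": 30.0, "f2": 10.0, "f3": 0.0},
    "pipe_B": {"f1": 20.0, "f2": 0.0, "f3": 20.0},
}
scores = tf_idf(samples)

# A feature found in every sample (here f1, with N = 2 and df = 2):
shared_idf = idf_sklearn_style(2, 2)
shared_idf_textbook = idf_textbook(2, 2)
```

In this toy case the feature shared by both samples keeps an IDF of 1 under the first formula, so its TF-IDF score equals its term frequency, whereas the alternative convention assigns it a negative weight.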
Recall that for two vectors $v,w$, their cosine similarity is defined to be the cosine of the angle $\theta$ between them:
$$\text{similarity} = \cos \left( \theta \right) = \frac{\langle v,w\rangle }{\left\| v \right\|\left\| w \right\|}$$
Here, $\langle v,w\rangle$ is the inner product of $v$ and $w$, and $\left\| v \right\|,\left\| w \right\|$ are the Euclidean norms of $v$ and $w$. We note that the similarity score (which here reflects the abundance of mass spectral features shared between samples) ranges from -1, meaning exactly opposite, to 1, meaning the same, with 0 indicating orthogonality; in-between values indicate intermediate similarity or dissimilarity. Because TF-IDF scores are non-negative, the similarity between two samples in our setting lies between 0 and 1.
Preparation of samples
Seeds of Artemisia ludoviciana (Strictly Medicinal, Williams, OR, USA), Lobelia inflata (Strictly Medicinal, Williams, OR, USA), Nicotiana attenuata (USDA Agricultural Research Services [ARS] National Plant Germplasm System; Accession Number: PI 555476), Nicotiana glauca (USDA ARS National Plant Germplasm System; Accession Number: PI 555686), Nicotiana obtusifolia (USDA ARS National Plant Germplasm System; Accession Number: PI 555573), Nicotiana quadrivalvis (USDA ARS National Plant Germplasm System; Accession Number: PI 555485), Nicotiana rustica (USDA ARS National Plant Germplasm System; Accession Number: PI 555554), Nicotiana tabacum (Strictly Medicinal, Williams, OR, USA), Salvia sonomensis (USDA ARS National Plant Germplasm System; Accession Number: PI 45388), and Verbascum thapsus (Companion Plants, Athens, OH, USA) were sown on Sunshine Mix LC1 soil (sphagnum peat moss and perlite; Sun Gro Horticulture Inc., Agawam, MA, USA). The plants were grown for 60 days under the following greenhouse conditions: average temperatures of 24/17 °C (day/night) and a photoperiod of 16/8 h (day/night), with 1000 W metal-halide lights supplementing natural daylight. The lights were set to come on when the outside light intensity fell below 200 μmol m⁻² s⁻¹.
During the day, the light intensity averaged 350–400 μmol m⁻² s⁻¹ in the greenhouse. The plants were fertilized twice a week with Peters 20–20–20 (N–P–K; JR Peters Inc., Allentown, PA, USA) containing iron chelate, magnesium sulfate, and trace elements.
Arctostaphylos uva-ursi (collected: April 2015; voucher ID: 393408), Cornus sericea (collected: September 2016; voucher ID: 393409), and Rhus glabra (collected: September 2016; voucher ID: 393395) were collected on the Washington State University, Pullman campus. Taxus brevifolia (collected: October 2016; voucher ID: 393425) was collected in the Iller Creek Conservation Area, WA, USA.
After Korey Brownstein confirmed the identity of the fourteen (14) different plants, A. ludoviciana Nutt. (Alu) leaves, A. uva-ursi (L.) Spreng. (Auv) leaves, C. sericea L. (Cse) bark, L. inflata L. (Lin) leaves, N. attenuata Torr. ex S. Watson (Nat) leaves, N. glauca Graham (Ngl) leaves, N. obtusifolia M. Martens & Galeotti (Nob) leaves, N. quadrivalvis Pursh (Nqu) leaves, N. rustica L. (Nru) leaves, N. tabacum L. (Nta) leaves, R. glabra L. (Rgl) autumn leaves, S. sonomensis Greene (Sso) leaves, T. brevifolia Nutt. (Tbr) needles, and V. thapsus L. (Vth) leaves were collected, freeze-dried for 3 days, and crushed for experimental smoking. Voucher specimens from the same plants were also collected by Korey Brownstein and filed in the Marion Ownbey Herbarium, Washington State University, Pullman, WA, USA (herbaria.wsu.edu/web/default.aspx). These specimens can be found by performing a “Collector’s Name” search, i.e., Korey Brownstein, in the following database: intermountainbiota.org/portal/collections/harvestparams.php.
American Spirit (AmSp) tobacco (Santa Fe Natural Tobacco Company, Oxford, NC, USA) was purchased from a local grocery store in Pullman, Washington, USA. The plant materials (n = 5 for each species) and AmSp (n = 5) were smoked in clay pipes following the experimental conditions detailed in Brownstein et al. [1].
The experimentally smoked clay pipes were then completely submerged in acetonitrile:2-propanol:water [3:2:2] and sonicated for 10 min. Five non-smoked blank clay pipes were extracted as controls using the same extraction methods as the experimentally smoked clay pipes. To prepare the experimental clay pipes for liquid chromatography-mass spectrometry (LC–MS) analysis, 3.0 mL from each of the five replicates were combined into a single tube. Only experimental clay pipes subjected to the same conditions or smoked with the same plant species were combined (i.e., non-smoked blank clay pipes were combined, experimental clay pipes smoked with AmSp were combined, experimental clay pipes smoked with Alu were combined, and so forth for the other plant species). The 15.0 mL pooled experimental clay pipe samples were freeze-dried for 3 days and resuspended with 5.0 mL of 0.10% formic acid/water:acetonitrile [1:1]. Afterwards, the resuspended samples were filtered into glass vials using a 0.20 μm filter.
The blind clay pipes (n = 8) were smoked and broken into fragments with a mallet to emulate artifacts found in the field. These fragments were completely submerged in acetonitrile:2-propanol:water [3:2:2] and sonicated for 10 min. Afterwards, the extracts were freeze-dried for 3 days and resuspended with 1.0 mL of 0.10% formic acid/water:acetonitrile [1:1]. The resuspended samples were then filtered through a 0.20 μm filter into a glass vial. To limit biases, the authors did not know which plant species had been smoked in which blind clay pipe. After the experimental clay pipes, non-smoked blank clay pipes, and eight (8) blind clay pipes were analyzed by LC–MS and processed in MZmine 2 following the parameters described in Brownstein et al. [1], the data were exported into .csv files. Mass spectral features with peak heights less than 2.0E3 had their abundance values set to zero.
The .csv files were arranged in the following format: mass spectral features were in rows; each mass spectral feature's unique identifier (ID) number, m/z value, and retention time value (in min) were in the first, second, and third columns, respectively; and mass spectral feature abundance values were listed under each sample in the remaining seventeen (17) columns. To eliminate solvent contaminant noise, mass spectral features present in the blank clay pipes were removed from the experimental clay pipes and blind clay pipes before processing the datasets in our algorithm. Python libraries, such as Sklearn and Pandas, were then used to compute the TF-IDF scores for these datasets. The extracted experimental clay pipes and blind clay pipes were allowed to air-dry on the lab bench. All solvents used for extraction and analysis were of mass spectrometry grade.
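The data-processing steps described above (zeroing peak heights below 2.0E3, removing features present in the blank, computing TF-IDF scores, and comparing samples by cosine similarity) can be sketched as follows. This is an illustrative sketch using only the Python standard library rather than the Sklearn and Pandas code used in the study. The column names, sample names, and abundance values are hypothetical. It also assumes that a feature counts as "present" in the blank when its abundance is nonzero after thresholding, and that the TF-IDF corpus consists of all non-blank samples.

```python
import csv
import io
import math

THRESHOLD = 2.0e3  # peak-height cutoff stated in the text

# Hypothetical export: ID, m/z, retention time, then one column per sample.
CSV_TEXT = """id,mz,rt,blank,exp_Nta,exp_Alu,blind_1
1,163.12,2.1,0,9000,0,4500
2,181.07,3.4,0,3000,0,1000
3,153.05,5.0,0,0,8000,0
4,301.14,6.2,5000,7000,6000,6500
5,195.09,7.7,0,0,2500,2200
"""

def load_samples(text, blank="blank"):
    """Return {sample: {feature_id: abundance}} after thresholding and blank removal."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = [c for c in rows[0] if c not in ("id", "mz", "rt")]
    samples = {n: {} for n in names if n != blank}
    for row in rows:
        # Peak heights below the threshold are set to zero.
        values = {n: float(row[n]) for n in names}
        values = {n: (v if v >= THRESHOLD else 0.0) for n, v in values.items()}
        # Assumption: "present in the blank" means nonzero after thresholding.
        if values[blank] > 0:
            continue
        for n in samples:
            samples[n][row["id"]] = values[n]
    return samples

def tf_idf(samples):
    """tf-idf(w, d) = tf(w, d) * (ln(N / df(w)) + 1) over nonzero features."""
    n = len(samples)
    df = {}
    for s in samples.values():
        for f, a in s.items():
            if a > 0:
                df[f] = df.get(f, 0) + 1
    out = {}
    for name, s in samples.items():
        total = sum(s.values())
        out[name] = {f: (a / total) * (math.log(n / df[f]) + 1)
                     for f, a in s.items() if a > 0}
    return out

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(f, 0.0) for f, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

samples = load_samples(CSV_TEXT)
scores = tf_idf(samples)

# Rank the experimental pipes by similarity to the blind pipe.
blind = scores["blind_1"]
ranking = sorted(
    ((cosine(blind, scores[n]), n) for n in scores if n.startswith("exp_")),
    reverse=True,
)
best_score, best_match = ranking[0]
```

In this toy dataset, feature 4 is dropped because it appears in the blank, and the blind pipe is matched to the experimental pipe with which it shares its most heavily weighted feature. With real data, each blind pipe would be ranked against all experimental pipes in the same way.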
