A natural language processing system for the efficient extraction of cell markers

Identification of gene and cell entities using MarkerGeneBERT

Pretrained NER models for entity extraction have proven effective in various research fields. MarkerGeneBERT integrates three pretrained NER models trained on diverse biomedical corpora. Additionally, we incorporated cell names curated from the Cell Ontology database for exact string matching. Because gene names are standardized, MarkerGeneBERT used only the gene symbol IDs sourced from the GTF file in Cell Ranger for gene entity recognition. Further details can be found in the Methods section.

As detailed in the Methods section, 27323 sentences, initially labeled with cell and gene names and manually annotated by our team for the marker-related sentence classification model, were used to validate the performance of “en_ner_bionlp13cg_md”, “en_ner_craft_md”, “en_ner_jnlpba_md”, and MarkerGeneBERT in identifying cell and gene entities. Compared to the three pretrained NER models used individually, MarkerGeneBERT demonstrated higher precision and recall in the extraction of cell and gene names (Table 4). Specifically, for gene name identification, MarkerGeneBERT achieved an F1 score of 87% (precision: 89%, recall: 99%), surpassing the second-best model by 20%. For cell name identification, MarkerGeneBERT obtained an F1 score of 92% (precision: 86%, recall: 98%), outperforming the second-best model by 8% and offering the best trade-off between precision and recall.

Table 4 Performance of various NER models in identifying gene and cell entities.

Cell–biomarker associative binary classification

We introduced a supervised marker-related text classification model to determine which sentences included not only cell entities and gene entities but also specific syntactic patterns indicating that a gene is a marker of a cell.
More details about the model and the construction of the training dataset are available in the Methods section.

To evaluate how well the marker-related text classification model distinguishes the syntactic patterns indicating that genes are markers of cells, we partitioned the training dataset into 10 subsets, randomly selecting 9 subsets for model training and reserving one subset for validation. The evaluation results in Fig. 4A show a mean average precision (mAP) of 0.876 (ranging from 0.84 to 0.91), a mean precision of 0.844 (ranging from 0.8 to 0.9), and a mean recall of 0.734 (ranging from 0.56 to 0.78).

Fig. 4 Evaluation of the marker-related text classification model. A Precision-recall curve for the trained model on the validation set at various iterations. B F1 scores of the trained model on the validation set for different cutoff values. The mean F1 score is greatest at the vertical red line

The model assigns each sentence a predicted probability. A sentence was classified as marker-related if its predicted probability exceeded a threshold, so the choice of threshold strongly affected the performance of our model. We calculated the F1 score for different thresholds, as illustrated in Fig. 4B, and the best-fitting threshold was 0.7. At this threshold, the F1 score was optimal across the different validation sets.

For the marker-related sentences whose predicted probability was greater than 0.7, we applied syntactic structure-based analysis within each sentence to identify and extract reliable cell-marker relationship pairs. The extraction criteria are described in detail in the Methods section.

In addition, we employed an appropriate NER model, as shown in Table 2, to assess the species, organs, and disease information in each study.
Further details are provided in the Methods section.

Statistics of the NLP system extraction results

We employed MarkerGeneBERT to extract 3280 cell types and 16124 genes from 3702 literature sources (Supplemental Table 1). Compared to existing databases manually curated by domain experts over the years, our model achieved competitive retrieval results (Table 5). The peak memory usage of our system, including all scripts and models, was 21 GB, and the parsing and entity extraction of one paper could be completed in 7 min.

Table 5 Comparison of inclusion results across different databases.

Concordance between MarkerGeneBERT and manually curated databases

To validate the accuracy of the system in detecting cell entities, gene entities, cell-marker pairs, species, tissue, and disease information, we compared its output with CellMarker2.0, which is widely recognized as the gold standard for manual curation. Because our methodology chiefly extracted gene markers from the main text, we specifically compared gene markers from the 1027 articles present in both CellMarker2.0 and our database. Other articles were excluded because they could not be downloaded or because their markers were sourced from supplemental materials; additional details are available in Supplemental Fig. 1.

MarkerGeneBERT identifies most cell and gene entities recorded in databases

In these 1027 studies, CellMarker2.0 manually curated a total of 4646 cell types with 12,874 marker genes, while the main text covered 3185 cell types and 8683 marker genes; approximately 84% of the valuable information was derived from the main text (Supplemental Fig. 2). MarkerGeneBERT identified 90.8% of the marker gene entities (7890/8683) and 92.7% of the cell type entities (2954/3185) in these common studies (Fig. 5A).

Fig. 5 Statistical results for cell and gene entities curated in the CellMarker2.0 database and recognized by MarkerGeneBERT. A Proportion of genes and cell types identified by MarkerGeneBERT.
B Number of cell types recognized by MarkerGeneBERT

Through a systematic comparison of the results extracted from each literature source with those of CellMarker2.0, MarkerGeneBERT revealed an additional 1764 cell types associated with marker genes (Fig. 5B). Among these 1764 newly identified cell types, 1344 were not recorded by CellMarker2.0 for the corresponding article but were reported in other studies of the same tissue.

Notably, 89 cell types were not cataloged in CellMarker2.0 at all; these were primarily tissue-specific cell types. These cells, including enteric mesothelial fibroblasts from the intestine and retinal progenitor cells from ocular tissue, occurred at low frequencies.

Additionally, 302 cell types were recorded in CellMarker2.0 but not for the corresponding tissues. We categorized the 89 newly recorded cell types and the 302 previously reported cell types according to their tissue information (Fig. 6). These cell types primarily represent functional cells distributed across different tissues. For instance, in the literature on human gastric tissue, cancer-associated fibroblasts (CAFs), as central components of the tumor microenvironment in primary and metastatic tumors, profoundly influence the behavior of cancer cells and are involved in cancer progression through extensive interactions with cancer cells and other stromal cells [25]. Our method can be used to directly record CAFs in both cancer and gastric tissues. The detailed cell marker information is available in Supplemental Table 2, and the additional cell types and marker genes identified by MarkerGeneBERT have been manually reviewed.

Fig. 6 MarkerGeneBERT identified novel cell types in specific tissues.
Cell types marked with the suffix * were not documented in any tissue type within the CellMarker2.0 database

High consistency of the marker gene list between MarkerGeneBERT and the database

For each study, we assessed the consistency of the cell marker genes identified by CellMarker2.0 and MarkerGeneBERT. As illustrated in Fig. 7, approximately 47% of the cell types and their corresponding marker gene pairs were the same in the CellMarker2.0 database and MarkerGeneBERT. Additionally, for approximately 23% of the cell types, the marker genes extracted by MarkerGeneBERT were present in CellMarker2.0 and accounted for 87% of the corresponding marker genes recorded in CellMarker2.0. The extraction fell short of 100% primarily because certain cell types have multiple marker genes recorded within a single document, and MarkerGeneBERT may have filtered out some of these marker genes based on preset conditions (Supplemental Fig. 3). Even so, most of these cell markers showed a high level of precision, often reaching 100%. Overall, MarkerGeneBERT exhibited a high percentage of true positives, and the results extracted by MarkerGeneBERT were highly consistent with the CellMarker2.0 database.

Fig. 7 The completeness and accuracy of marker extraction for each cell in every study. Accuracy: for a given cell in a specific article, the number of markers shared between the CellMarker2.0 database and the NLP-based model's extraction, divided by the number of markers extracted by the model.
Completeness: for a given cell in a specific article, the number of markers shared between the CellMarker2.0 database and the model's extraction, divided by the number of markers collected in the CellMarker2.0 database

Additionally, for approximately 13% of the cells, MarkerGeneBERT recovered 100% of the marker genes reported in CellMarker2.0 and, on average, obtained 25% more marker genes that were not recorded by CellMarker2.0. We traced some of the newly discovered marker genes back to the original text and found that CellMarker2.0 may have ignored marker genes inconsistent with the main research themes of the paper, or extracted only the first half of the information while ignoring the second half.

Consistency of species, tissue, and disease

We compared the species, tissue, and disease information extracted from 1540 studies by the NLP system with that recorded in CellMarker2.0. Overall, the consistency rates were 75% for species information, 77% for tissue information, and 66% for disease information (Fig. 8).

Fig. 8 Consistency between the species, tissue, and disease types recorded in CellMarker2.0 and those inferred by MarkerGeneBERT

The lower-than-expected consistency stems primarily from our emphasis on organizing and analyzing information extracted from the full text of each specific study and summarizing the main species, tissues, and disease types studied. In contrast, the CellMarker2.0 database uses literature IDs as indices to trace cell markers referenced from other literature sources, capturing the associated species, tissue, and disease information from both the referenced and the specific literature.
Consequently, the two methods record different information for the same study.

Increased cell type annotation efficiency through multi-marker annotation strategies

MarkerGeneBERT collected 166 brain cell types from approximately 190 studies, including some cell types not previously cataloged in CellMarker2.0, such as tissue-resident memory T cells, neuroblasts, and myeloid-derived suppressor cells (Supplemental Table 3). We applied these 166 brain cell types and their compiled marker gene lists to published posterior hippocampus single-cell RNA data for cell type annotation using scCATCH, a cell type annotation tool based on a preset marker gene list. As illustrated in Fig. 9A, the cell type annotations obtained directly by scCATCH were almost the same as the labels in the original paper [26]. Notably, among the top 5 differentially expressed genes (DEGs) identified by scCATCH for cell type annotation, seven were newly discovered in our database and not recorded in CellMarker2.0 (Fig. 9B). This indicates that although many cell types have representative marker genes that are mentioned and used in numerous articles, such as the CD3 marker for immune cells, a more comprehensive list of marker genes can enhance the annotation efficiency of automated cell type annotation methods or tools.

Fig. 9 Consistency between the cell type annotation results of single-cell sequencing in hippocampus tissue and the original annotation results. A Cell type annotation results of single-cell sequencing in hippocampus tissue. B The seven top DEGs used for scCATCH cell type annotation were newly extracted by MarkerGeneBERT
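
The per-cell accuracy and completeness used in the comparison with CellMarker2.0 (Fig. 7) reduce to simple set operations. The following is a minimal sketch of that calculation, not the authors' implementation: the function name `accuracy_and_completeness` and the example marker lists are hypothetical, and gene symbols are assumed to be already normalized so that they can be compared by exact string equality.

```python
def accuracy_and_completeness(curated, extracted):
    """Compare the markers of one cell type in one article.

    curated   -- markers recorded in the curated database (e.g. CellMarker2.0)
    extracted -- markers extracted by the NLP-based model

    Returns (accuracy, completeness); a value is None when its
    denominator is empty and the ratio is therefore undefined.
    """
    curated = set(curated)
    extracted = set(extracted)
    shared = curated & extracted  # markers found by both sources

    # Accuracy: shared markers as a fraction of what the model extracted.
    accuracy = len(shared) / len(extracted) if extracted else None
    # Completeness: shared markers as a fraction of what the database holds.
    completeness = len(shared) / len(curated) if curated else None
    return accuracy, completeness


# Hypothetical marker lists for a single cell type in a single article.
curated_markers = ["CD3D", "CD3E", "CD8A", "GZMK"]
extracted_markers = ["CD3D", "CD3E", "CD8A", "CCL5", "NKG7"]

acc, comp = accuracy_and_completeness(curated_markers, extracted_markers)
print(f"accuracy={acc:.2f}, completeness={comp:.2f}")
```

In this illustrative case, three of the five extracted markers are also curated, and three of the four curated markers are recovered. Under these definitions, markers that the model finds but the database lacks lower accuracy even when they are genuine, which is why the newly discovered markers were traced back to the original text.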
