Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

To illustrate the overall workflow used to develop our datasets and models, we provide Figure 1, with the individual steps detailed below and in additional figures.

The annotation team

Project manager

The project manager, with more than 15 years of experience in structural biology, more than 500 protein structures in the PDB, and over seven years of experience in developing software and machine learning algorithms, was the lead for the annotation project. As lead, the project manager was responsible for the general management, planning and documentation of the project and was involved in the annotation process.

Annotators

A team of six PDBe biocurators, involved in the curation of protein structures submitted to the PDB, volunteered in the annotation process. All but one had a PhD in biochemistry, bioinformatics or structural biology, with a strong background in biochemistry and/or structural biology. Combined, they had 10 years of experience in bioinformatics, 24 years in biochemistry and structural biology, and 31 years in biocuration. While undertaking the annotation process, the team was split over two different sites and time zones, and annotation was carried out in a fully remote setting.

Literature selection

The general workflow for literature selection is depicted in Fig. 2. In the first step, all the PubMed (https://pubmed.ncbi.nlm.nih.gov/) IDs (PMIDs) for publications linked to a protein structure were retrieved by querying PDBe's ORACLE database on 29th September 2022. On that date, the PDB contained 196,012 PDB entries with 73,019 associated, unique PMIDs.

Fig. 2 Schema of the literature selection workflow. The data batches of 10,000 publications each (except batch 8, which contained 3,019 and later 4,253 publications) were each used to train an independent LitSuggest model based on the titles and abstracts of the documents in the batch. Each batch of data was then given to each model for prediction. Only documents for which all models returned a confidence score of ≥0.8 and which were open access were included in the final list of 14,390 publications considered for developing a named entity recognition system.

LitSuggest20 (https://www.ncbi.nlm.nih.gov/research/litsuggest/), an AI-driven, browser-based trainable system that directly uses PMIDs, was used as a content filtering tool for assessing the abstracts and titles of our short-listed publications. Using the list of PMIDs generated above, we created seven publication batches of 10,000 IDs each (the positive samples); batch 8, as an exception, contained only 3,019 IDs. Each batch was matched with an equal number of randomly picked PMIDs from the entire PubMed set, representing the negative samples. Picking the negative samples was done automatically by the LitSuggest host at NCBI. It should be noted that, due to this selection process, there is a small chance that the negative sets contain some of the PMIDs from the positive batches. For each batch, a model was trained using the titles and abstracts of the individual PMIDs in the batch. The trained models were used to assess the relevance of IDs newly added to PubMed over several weeks, and they continue to be challenged with newly indexed publications on a weekly basis, providing up-to-date, protein-structure-specific literature recommendations. The same exercise was repeated on 23rd January 2023, when 200,612 PDB entries had 74,253 unique PMIDs.
The additional PMIDs were added to batch 8, which then contained 4,253 IDs, while the original batches used in previous training were preserved. Each of the eight trained models was presented with seven batches, excluding the batch used to train that model, to obtain relevance scores for individual PMIDs. A publication was deemed relevant when the predicted confidence score was ≥0.8 across all seven cross-prediction models (a sketch of this consensus filter is given after the batch listing below). This resulted in 63,795 (86%) PMIDs predicted as relevant. The prediction statistics across the eight different models are given in Table S1.

To adhere to open data principles and to be able to annotate full-text articles, only open access publications with PubMedCentral21 (https://www.ncbi.nlm.nih.gov/pmc/) IDs (PMCIDs), identified using EuropePMC's22 (https://europepmc.org/) article API, were included in further work. This further reduced the number of publications included in the study to 14,390, or 19% of the initial starting set of 74,253. Lastly, a number of documents were rejected during the annotation stage because their primary focus was not a protein structure, often covering drug and fragment screening campaigns or nucleic acid structures.

From the 14,390 open-access publications, five batches of ten publications were selected to develop the different models and to create an independent test set, which is described in the section Independent test set – batch 5. The PubMed Central IDs for each batch are given below.

Batch 1: PMC478490923, PMC478678424, PMC479296225, PMC483233126, PMC483386227, PMC484809028, PMC485027329, PMC485028830, PMC485259831, PMC488732632

Batch 2: PMC477211433, PMC484154434, PMC484876135, PMC485431436, PMC487174937, PMC487211038, PMC488028339, PMC491946940, PMC493782941, PMC496811342

Batch 3: PMC478197643, PMC479555144, PMC480204245, PMC480208546, PMC483158847, PMC486912348, PMC488716349, PMC488827850, PMC489674851, PMC491876652

Batch 4: PMC474670153, PMC477309554, PMC477401955, PMC482037856, PMC482205057, PMC482256158, PMC485562059, PMC485700660, PMC488550261, PMC491875962

Batch 5: PMC480629263, PMC481702964, PMC498066665, PMC498140066, PMC499399767, PMC501286268, PMC501408669, PMC506399670, PMC517303571, PMC560372772
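
Returning to the literature selection step above, the consensus filter shown in Fig. 2 can be illustrated with a short sketch; the per-model confidence scores below are hypothetical, in practice they were produced by the LitSuggest cross-prediction models:

```python
# Minimal sketch of the cross-model consensus filter: keep a PMID only if every
# cross-prediction model scores it >= 0.8. The scores below are hypothetical;
# in the actual workflow each PMID was scored by the seven LitSuggest models
# not trained on its own batch, followed by an open-access check.
CONFIDENCE_CUTOFF = 0.8

def consensus_relevant(scores_by_model):
    """True if all cross-prediction models agree the publication is relevant."""
    return all(score >= CONFIDENCE_CUTOFF for score in scores_by_model.values())

# Hypothetical example: one PMID scored by seven cross-prediction models.
scores = {"model_1": 0.93, "model_2": 0.88, "model_3": 0.91, "model_4": 0.84,
          "model_5": 0.90, "model_6": 0.86, "model_7": 0.95}
print(consensus_relevant(scores))   # True -> PMID kept for the open-access check
```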

The number of annotations for each publication, for 19 and 20 entity types where applicable, is given in Table S2.

The annotation tool and schema

For generating training datasets, a manual text annotation project was carried out. A number of free and paid-for annotation tools were evaluated regarding the following features: compatibility with PubMed and PubMedCentral, project management options, multi-user co-annotation, integration of ontologies, open-source distribution, web browser compatibility, ease of use, and available documentation. The annotation tool of choice for our project was TeamTat73 (https://www.teamtat.org/). TeamTat is a free tool focused on biomedical literature and uses PubMed and PubMedCentral to retrieve publication abstracts and metadata such as title, authors, journal and publication year for a given PMID. For open access publications with a PMCID, TeamTat retrieved the entire publication from a BioC XML FTP server74 (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). The BioC XML format75 (https://bioc.sourceforge.net/) was introduced by the BioCreative Initiative76 as a way of making scientific publications interoperable. Open access full-text documents came with in-line figures, figure captions, tables and table captions.

TeamTat also allowed for project management, with the project manager being able to upload and retrieve the relevant literature, assign publications to annotators, and control the start and end of an annotation round. Entity types, relationship types and ontology referencing were set up and updated by the project manager. TeamTat also supported versioning: after each annotation round, merging statistics providing inter-annotator agreement were calculated across the corpus, and a new version of the publication set was created. Documents could be exported at any point in the annotation process as either BioC XML or BioC JSON. We opted for the BioC XML format, as it encloses the plain paragraph text and its identified annotations under the same XML tag (<passage>), which allowed for easy retrieval of individual sentences with their respective in-line annotations for downstream transformer training.

TeamTat provides access to Medical Subject Headings (MeSH)77 and the Gene Ontology (GO)78,79 through hard-coded links. Additional ontologies relevant to our project were the Sequence Ontology (SO)80, Chemical Entities of Biological Interest (ChEBI)81, Gene82 and the PRotein Ontology (PR/PRO)83. For each ontology, a short-hand name, e.g. "MESH:" for MeSH, was created and served as a prefix to link an entity type to an ontology. A "DUMMY:" short-hand name was used to collect terms that were not found in any of the other ontologies. Although we linked the different entity types to ontologies, controlled vocabularies and reference databases, we did not apply grounding of terms in the annotation process by linking text spans to unique references.

The annotation handbook published by the TeamTat developers (Supplemental Materials of Islamaj et al.84) was adapted to suit our project requirements. The final detailed annotation schema was included in our GitHub repository for the project (see Annotation handbook and TeamTat user guide). The project manager generated an initial set of annotated publications to define a set of entity types, which formed the basis for developing initial guidelines.
Those guidelines were revisited in the subsequent annotation rounds following discussions with the biocurators (see Manual annotation of initial set of publications below). The updates included adding or removing entity types and clarifying the guidelines. The guidelines continued to be adapted, to accommodate the increasingly diverse set of publications, even after switching from fully manual annotation by a team of biocurators to a semi-automated process using a trained model. All alterations were made after consultation with the volunteer team of biocurators, either in the form of open discussion or polling. Focusing on structure and sequence features curated by the UniProt biocurators, we selected entity types that captured details about a particular protein, its structural make-up down to residue level, interaction partners, bound molecules, general properties of the protein, changes to its sequence, organism of origin, experimental methods and evidence to support drawn conclusions. The final list of entity types later used in transformer training was: "Bond Interaction", "Chemical", "Complex Assembly", "Evidence", "Experimental Method", "Gene", "Mutant", "Oligomeric State", "Protein", "Protein State", "Protein Type", "PTM", "Residue Name", "Residue Name Number", "Residue Number", "Residue Range", "Site", "Species", "Structure Element", "Taxonomy Domain". The "Materials and Methods" and "References" sections were excluded from the annotation process, as little to no contextual, residue-level information was expected to be present in these sections.

We also developed a detailed user guide (see Annotation handbook and TeamTat user guide) on how to set up and operate TeamTat from both a project manager and a biocurator perspective. This guide was used to support the biocurators when annotating independently after their initial training.

Manual annotation of initial set of publications

Initially, ten publications (PMC478490923, PMC478678424, PMC479296225, PMC483233126, PMC483386227, PMC484809028, PMC485027329, PMC485028830, PMC485259831, PMC488732632) were chosen randomly from the filtered, open access list described in the Literature selection section above. Each biocurator was given two publications to annotate manually, based on the guidance from the example annotations and the handbook. A series of two-hour hackathons was organized weekly to annotate the assigned publications. If biocurators were unable to attend a hackathon, web-based access to the assigned documents through a personalized web link was provided, so that documents could be annotated outside of the dedicated sessions. The project manager annotated those publications that could not be annotated by the biocurators in order to achieve double annotation for each document. We acknowledge that the overrepresentation of documents annotated by the project manager increased the likelihood of bias. However, even with the best annotation guidelines shaped by a team of expert annotators, assigning entity types to terms is a highly subjective process. A different team of experienced annotators may introduce a different set of biases, based on their training and understanding.
The first round of independent annotation lasted approximately four months, after which the annotations across all ten publications were combined and annotation statistics were calculated within TeamTat.

To increase efficiency and accelerate the annotation process, the decision was made to switch from a fully manual to a semi-automated annotation process. The project manager was made responsible for cleaning and consolidating the annotations for the ten initial publications. Upon completion of this task, the cleaned publications were passed to the lead biocurator, who served as a proofreader. In this capacity, the lead biocurator flagged annotations and entity types that were still ambiguous. In a number of discussions between the project manager and the lead biocurator, those ambiguities were resolved, and entity types and annotation guidelines were updated. A graphical illustration of the manual annotation workflow can be found in Fig. 3. The project manager then applied a final pass of cleaning and consolidation across the ten initial publications before using the annotated text to train a named entity recognition system. This final, consolidated version was used as ground truth against which the annotation performance of each annotator could be measured.

Fig. 3 Schema of the manual annotation workflow. The project manager oversees the entire manual annotation workflow, provides guidance, resolves disputes between annotators, is directly involved in the annotation process and undertakes the final cleaning and consolidation work. The lead biocurator may act as a mediator between the biocurator team and the project manager, is directly involved in the annotation process and serves as a proofreader to the project manager in the final cleaning and consolidation step.

Independent test set – batch 5

During the training of the different algorithms, the size of the dataset, the number and type of identified annotations, and the annotation schema changed. The generated models could therefore not be compared directly. An independent test set was created to allow for unbiased comparison between algorithms. For this set of ten publications, model v2.1 was used to pre-annotate the text, followed by manual curation by the project manager using the annotation handbook as a guide. Although batch 5 does not represent a strictly gold-standard set of annotations, it is also not solely machine generated. Batch 5 should therefore be considered a benchmarking set.

Annotation evaluation

The quality of the manual annotations created by the biocurators was judged using the built-in calculation procedures in TeamTat, which follow a partial agreement model. The following six categories of annotation outcomes were determined by TeamTat:

FA – Full Agree: same type, concept ID and text span

CA – Concept Agree: same concept ID and text span, but different types

TA – Type Agree: same type and text span, but different concept IDs

PA – Partial Agree: same type and concept ID for overlapping text

DA – Disagree: different types, different concept IDs for text spans

SN – Single: text annotated by only one of the annotators
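
As an illustration of how these outcomes relate to the compared attributes, a small sketch of the decision logic is given below; this is not TeamTat's actual implementation, and SN applies when an annotation has no counterpart from the second annotator:

```python
# Sketch of the agreement categories for a pair of annotations from two
# annotators. Each annotation is (start, end, entity_type, concept_id).
# Illustrative only, not TeamTat's implementation; the SN (Single) category
# applies when an annotation has no counterpart at all.
def agreement_category(a, b):
    a_start, a_end, a_type, a_concept = a
    b_start, b_end, b_type, b_concept = b
    same_span = (a_start, a_end) == (b_start, b_end)
    overlap = a_start < b_end and b_start < a_end
    if same_span and a_type == b_type and a_concept == b_concept:
        return "FA"   # Full Agree
    if same_span and a_concept == b_concept:
        return "CA"   # Concept Agree (types differ)
    if same_span and a_type == b_type:
        return "TA"   # Type Agree (concept IDs differ)
    if overlap and a_type == b_type and a_concept == b_concept:
        return "PA"   # Partial Agree (overlapping, not identical, spans)
    return "DA"       # Disagree

# Hypothetical example: same span and concept ID, but different entity types.
print(agreement_category((10, 18, "Protein", "PR:000012345"),
                         (10, 18, "Gene", "PR:000012345")))   # CA
```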

The full set of outcomes was only relevant for the initial manual annotation by the biocurators and during the cleaning and remediation steps to create the training data for the initial model. As mentioned in the section The annotation tool and schema, we did not use concept IDs for grounding terms and only evaluated for prefix matches, which, as they were directly linked to an entity type, always returned a perfect match. In order to investigate whether any bias was introduced into the annotations by individual biocurators, we also applied the SemEval procedure to the manually annotated publications (see Annotation evaluation using SemEval procedure).

Annotation processing for training and evaluation

In order to train a transformer-based annotation algorithm and to calculate annotation statistics for monitoring the performance of both the algorithm and the human annotators, the publication text and its in-line annotations needed to be converted from BioC XML into the IOB (Inside Outside Beginning) format85. For each document, we iterated over the individual paragraphs, split them into sentences and combined them with their respective annotations using the offset values available in the BioC XML file. For the total list of isolated sentences, we then generated an index. Next, the isolated sentences were converted into tab-separated TSV files. These TSV files were used to calculate various statistics (see the section Annotation evaluation using SemEval procedure). The index was then randomly split to create three smaller files holding the train, test and development sets, containing 70%, 15%, and 15% of the sentences, respectively.

During the conversion process, it was found that a number of open-access documents retrieved from NCBI's FTP site had line breaks introduced within a paragraph, often in figure or table captions. These line breaks shifted the paragraph offset by "+1", which introduced character position mismatches for the corresponding annotations. Through personal communication with the maintainers of the FTP site, it was found that the offset shift was likely a result of the conversion process from the various input file formats provided by publishers to BioC XML. Occasionally, identical sentences and annotations also occurred more than once; in such cases, only the first occurrence was included in the data. Although all efforts were made to catch as many errors as possible, on average 21 annotations were lost per batch, which amounts to 0.2% of the average of 10,037 annotations per batch.
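
As a minimal illustration of this conversion, the sketch below parses the <passage> elements of a BioC XML file and writes whitespace-tokenised, IOB-labelled sentences to a TSV file; sentence splitting and tokenisation are deliberately simplified, and the file names are hypothetical:

```python
# Minimal sketch: convert a BioC XML file with in-line annotations into
# IOB-tagged, tab-separated token/label pairs. Sentence splitting and
# tokenisation are deliberately naive; the actual pipeline and label
# inventory may differ.
import re
import xml.etree.ElementTree as ET

def passages(bioc_xml_path):
    """Yield (passage_offset, passage_text, annotations) from a BioC XML file."""
    root = ET.parse(bioc_xml_path).getroot()
    for passage in root.iter("passage"):
        offset = int(passage.findtext("offset", default="0"))
        text = passage.findtext("text", default="")
        anns = []
        for ann in passage.findall("annotation"):
            loc = ann.find("location")
            if loc is None:
                continue
            start = int(loc.get("offset")) - offset   # make offsets passage-relative
            length = int(loc.get("length"))
            etype = ann.findtext("infon[@key='type']", default="O")
            anns.append((start, start + length, etype))
        yield offset, text, anns

def to_iob(sentence, sent_start, anns):
    """Whitespace-tokenise one sentence and assign IOB tags by character overlap."""
    rows = []
    for match in re.finditer(r"\S+", sentence):
        tok_start = sent_start + match.start()
        tok_end = sent_start + match.end()
        label = "O"
        for ann_start, ann_end, etype in anns:
            if tok_start < ann_end and tok_end > ann_start:   # token overlaps annotation
                prefix = "B" if tok_start <= ann_start else "I"
                label = f"{prefix}-{etype}"
                break
        rows.append((match.group(), label))
    return rows

def convert(bioc_xml_path, tsv_path):
    """Write one token<TAB>label pair per line, with a blank line between sentences."""
    with open(tsv_path, "w", encoding="utf-8") as out:
        for _, text, anns in passages(bioc_xml_path):
            # naive sentence split on "." boundaries
            for sent_match in re.finditer(r"[^.]+\.?\s*", text):
                for token, label in to_iob(sent_match.group(), sent_match.start(), anns):
                    out.write(f"{token}\t{label}\n")
                out.write("\n")

if __name__ == "__main__":
    convert("PMC4784909.xml", "PMC4784909.tsv")   # hypothetical file names
```
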
Training a first named entity recognition system

Using the TSV files from above, we trained a first model. The basic principles of our algorithm and training process are given in Algorithm 1. The training routine described there also provided the basis for the iterative training used to build a semi-automatic annotation system.

Algorithm 1
Iterative Deep Learning Model Training with Curators in Loop.
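
As a condensed sketch of the iterative, human-in-the-loop routine summarised by Algorithm 1, reconstructed from the description in the following subsections; the helper functions pre_annotate, curate_in_teamtat, convert_to_iob, split_index and fine_tune are hypothetical placeholders for the actual tooling:

```python
# Condensed sketch of the human-in-the-loop training loop (not the original
# Algorithm 1 listing). All helper functions are hypothetical placeholders.
import random

IMPROVEMENT_CUTOFF = 0.005   # stop when the overall gain drops below 0.5%
MAX_PUBLICATIONS = 100       # ...or when 100 publications are reached

def human_in_the_loop(initial_corpus, open_access_pool, model):
    corpus = list(initial_corpus)          # manually annotated starting set
    best_f1 = 0.0
    while len(corpus) < MAX_PUBLICATIONS:
        batch = random.sample(open_access_pool, 10)           # new batch of ten publications
        open_access_pool = [doc for doc in open_access_pool if doc not in batch]
        pre_annotated = [pre_annotate(model, doc) for doc in batch]
        curated = [curate_in_teamtat(doc) for doc in pre_annotated]   # manual correction step
        corpus.extend(curated)

        sentences = convert_to_iob(corpus)                    # BioC XML -> IOB, see sketch above
        train, dev, test = split_index(sentences, 0.70, 0.15, 0.15)   # new split every round
        model, metrics = fine_tune(model, train, dev, test)   # see fine-tuning sketch below

        if metrics["f1"] - best_f1 < IMPROVEMENT_CUTOFF:      # F1/precision/recall plateau
            break
        best_f1 = metrics["f1"]
    return model, corpus
```
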
Taking advantage of the rapid developments in natural language processing (NLP), we chose a starting model based on Google's transformer86. For our objective, NER, we looked at BERT (Bidirectional Encoder Representations from Transformers)-based models such as BioBERT87, PubMedBERT88, and BioFormer89. Furthermore, one author's experience of developing a similar system90 was used to design the general approach and to choose an algorithm for fine-tuning. We employed a pre-trained transformer model from HuggingFace (https://huggingface.co), namely microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext. Fine-tuning was initially conducted for three epochs using the carefully selected hyperparameters listed in Table 1. Optimizing the hyperparameters resulted in an improved initial model, v1.2, which was used to annotate a new batch of publications. We also reduced the number of entity types from 23 to 19, as we found during the data preparation step that some entity types had too few samples to allow for a meaningful split into train, test and evaluation sets and made a negligible contribution to training.

Table 1 Hyperparameters for the Named Entity Recognition Model.
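
A minimal sketch of this fine-tuning setup, assuming the Hugging Face transformers token-classification API; the entity-type subset and the hyperparameter values shown are illustrative placeholders (the values actually used are listed in Table 1), and train_dataset/dev_dataset stand in for tokenised datasets built from the IOB TSV files:

```python
# Minimal sketch of fine-tuning PubMedBERT for token classification.
# Hyperparameter values below are placeholders; see Table 1 for the real ones.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
ENTITY_TYPES = ["Protein", "Residue Name Number", "Site"]   # illustrative subset of the 19/20 types
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

args = TrainingArguments(
    output_dir="ner-v1",
    num_train_epochs=3,              # three epochs, as described above
    learning_rate=5e-5,              # placeholder, see Table 1
    per_device_train_batch_size=16,  # placeholder, see Table 1
)

# train_dataset / dev_dataset are assumed to be tokenised datasets built from
# the IOB TSV files (token/label alignment not shown here).
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```
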
Consecutive rounds of semi-automatic annotation and NER training

To develop a robust algorithm, a diverse corpus larger than the initial ten publications was needed. Therefore, a human-in-the-loop approach combined with a named entity recognition system (see Training a first named entity recognition system) was used to iteratively increase the number of annotated publications in the corpus. The applied workflow is presented in Fig. 4.

Fig. 4 Iterative, human-in-the-loop buildup of training data. The workflow depicts the iterative training of the different models and the creation of the different datasets. After pre-annotation by the current best model, those annotations are manually cleaned following the annotation guidelines and combined with the set of annotations previously used to train the current best model. As a consequence, for the next round of training the set of annotations is split into new fractions to form new train, test and development sets. Models v2.1 and v3.1 performed equally well on their respective development sets, but due to the new split in each round of training, the results were not comparable. A final independent test set was used to compare these models.

In each iteration, a new batch of ten publications randomly selected from the open access list was presented to the current best model to identify text spans and annotate them with their entity types. The returned predictions were in BioC XML format, which allowed for visual inspection in the annotation tool TeamTat.

At the end of each prediction round, the project manager inspected each of the ten publications in the batch and fixed any errors in the annotations produced by the NER model. This curation process did not only look at the predicted annotations; rather, the pre-annotated spans served as a guide for the annotator, who was still required to read the full text and add missing annotations. Such an approach of post-prediction curation has been implemented as a standard tool in the NLP suite "prodigy" version 3 (https://prodi.gy/) in the function "ner.correct" ("ner.make-gold" in version 2). A similar approach was also used by Gnehm et al.91. Any annotations predicted in the "Methods" and "References" sections were removed (see Annotation handbook and TeamTat user guide).

The curated annotations were stored in BioC XML to be later combined with other batches and converted to the IOB format for a new round of model training, or to be used as ground truth for monitoring model performance.

For entity types that repeatedly produced large numbers of false positives or false negatives, i.e. were not correctly identified by the predictor and required manual curation, anonymous biocurator polling was used to improve the annotation process. Here, examples of ambiguously labeled terms were given to all biocurators, who were asked to assign entity types. A majority vote across the responses determined the entity type, and the annotations in the publications were updated accordingly. For example, it was not clear what should be labeled as entity type "mutant"; after polling, point mutations at specific sequence positions and deletions/insertions of sequence ranges or whole domains and proteins were included.

The decision was made that no new publications would be added for training once the NER model no longer improved by more than 0.5% in the overall F1-measure, precision and recall across all entity types, or once 100 publications had been reached, whichever came first.

As a result of changes to the annotation schema in terms of entity types (described in the section The annotation tool and schema) and the addition of new publications over time, splitting into train, test and validation sets was carried out anew for every new model training. In order to judge and compare the performance of models v2.1 and v3.1, we therefore employed an additional set of ten publications that had not been used for any training, testing or validation and thus provided a completely independent test set.

It should also be noted that for inference on a new document, we supplied the publication text split into paragraphs rather than into the sentences used during training. This was intended to exploit the transformer model's ability to contextualize named entities and, as shown by Luoma and Pyysalo92 and Wang et al.93, was expected to improve the model's performance.

Annotation evaluation using SemEval procedure

To monitor and evaluate the performance of the trained predictor, we followed the published assessment process for SemEval94. Each predicted annotation was assessed for whether it had a matching annotation in the ground truth, using the following five categories:

Correct – full agreement between predicted annotation and ground truth annotation in text span and entity type

Incorrect – disagreement between predicted annotation and ground truth annotation in text span and entity type

Partial – text span overlaps in predicted annotation and ground truth annotation but the entity type may differ

Missing – annotation is only found in the ground truth but not in the predicted annotations

Spurious – annotation is only found in the predicted annotations but not in the ground truth

SemEval then evaluated whether a found match belonged to one of four different classes of matches:

Strict – exact text span boundaries and matching entity types are required between predicted and ground truth annotations

Exact – exact text span boundaries are required, but the entity type may differ between predicted and ground truth annotations

Partial – partially overlapping text span boundaries are allowed, and the entity type may differ between predicted and ground truth annotations

Type – the entity type must match between predicted and ground truth annotations, with at least partially overlapping text spans

For each class of match, the precision, recall, and F1-measure were determined. The statistics were calculated for annotations in the selected sections: title, abstract, introduction, results, discussion, tables, and table and figure captions. In order to apply the SemEval procedure to the annotations, the text and in-line annotations had to be converted from BioC XML to the IOB format, as described above in Annotation processing for training and evaluation.

Note that the evaluation was done by comparing the predictions to the ground truth. However, we used all the predictions produced by the different models on the full-text BioC XML, rather than predicting only on the sentences included in the ground truth. As a consequence, the models produced predictions for sentences not found in the ground truth; those additional sentences were given the O (outside) label and appended to the ground truth, and during the evaluation these annotations were classed as "spurious".

For the batches used during the consecutive training described above in Consecutive rounds of semi-automatic annotation and NER training, the evaluation was performed across the entire batch. The independent test set, batch 5, used for the comparison of auto-annotator versions v2.1 and v3.1, additionally underwent a per-document evaluation.
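
The per-class precision, recall and F1 values follow the usual SemEval-2013 style bookkeeping over the five categories above, with the counts tallied separately for each class of match. A minimal sketch of that arithmetic (an illustration of the formulas, not the exact implementation used):

```python
# Minimal sketch of the SemEval-2013 style arithmetic. For each class of match
# (strict, exact, partial, type) the five category counts (Correct, Incorrect,
# Partial, Missing, Spurious) are tallied separately; partial boundary matches
# receive half credit in the "partial" and "type" classes.
def semeval_prf(cor, inc, par, mis, spu, partial_credit=False):
    possible = cor + inc + par + mis              # annotations in the ground truth
    actual = cor + inc + par + spu                # annotations produced by the model
    correct = cor + 0.5 * par if partial_credit else cor
    precision = correct / actual if actual else 0.0
    recall = correct / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts for one entity type under the "partial" class of match:
print(semeval_prf(cor=80, inc=5, par=10, mis=5, spu=8, partial_credit=True))
```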
