Mining impactful discoveries from the biomedical literature

Observations of the parameters

First, we analyze different aspects of the method using the ND subset (see the “Data and preprocessing” section). A small gold standard dataset of 12 ND discoveries is created in order to analyze the method (Table 2). This dataset is built using external sources to collect the year of discovery.Footnote 19 Reliably assessing the true year of discovery is a persistent obstacle in the evaluation of LBD. This issue sometimes causes errors: if the discovery year is postdated, the performance of an LBD system may be overestimated [20]. The choice of the relations is arbitrary and partly based on whether the relation has a clearly established year of discovery; the dataset can therefore not be interpreted as a representative sample. Even in the case of discoveries which are clearly recognized in the ND field, there is often ambiguity about the exact date; for instance, the causal gene for Huntington’s disease was approximately located on chromosome 4 in 1983 but only precisely identified in 1993.

Table 2 Gold standard dataset of 12 ND discoveries

This small-scale ND gold standard dataset is not meant to provide a reliable evaluation of the method, but rather a way to compare its different parameter settings (see the “Measuring literature impact” section). Additionally, it can only be used to measure recall, i.e. the proportion of gold discoveries detected by the method. A discovery is considered detected if the relation is retrieved and its surge year falls within the window \(y \pm n\), where y is the true discovery year and n is a fixed constant. Precision would require a fully annotated subset and thus cannot be measured with this dataset.

The effect of the parameters of the method is evaluated by comparing their performance against the ND gold standard dataset. The surges are extracted for every configuration of parameters among: three window sizes for the moving average (1, 3 and 5); the six association measures (joint probability, PMI, NPMI, MI, NMI and SCP); and the two trend indicators (diff and rate).

For every parameter, Fig. 4 shows the average performance (recall) on the dataset, averaging across the values of the other two parameters. Performance increases with the window size of the moving average: 5 is better than 3, which is better than no moving average (1). The NMI and SCP measures perform best, followed by NPMI. PMI and MI perform poorly, worse even than joint probability. Finally, the diff indicator performs drastically better than rate. Importantly, the first year filter (see the “Detecting surges” section) decreases performance only slightly compared to keeping all the surges; this is an indication that the method works as intended: for every relation which has several high surges, the first surge is very likely to be detected around the true time of discovery.

Fig. 4 Average recall within \(y \pm 3\) years on the gold standard ND dataset (12 discoveries) by parameter. For each parameter value, the mean recall is calculated across the values of the other two parameters. The two colours represent whether all the surges are taken into account (any year) or the first year filter is applied (first year, see the “Detecting surges” section)
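For illustration, the following minimal Python sketch shows one plausible implementation of this pipeline: the SCP association measure, the moving-average smoothing, the diff trend indicator and the recall computed within \(y \pm n\) years. The surge threshold and the way the yearly probabilities are estimated are simplifying assumptions, not the exact procedure of the method.

```python
from statistics import mean

def scp(p_xy, p_x, p_y):
    """Symmetrical conditional probability: P(x,y)^2 / (P(x) * P(y))."""
    return p_xy ** 2 / (p_x * p_y)

def moving_average(values, window):
    """Centered moving average; window=1 leaves the series unchanged."""
    half = window // 2
    return [mean(values[max(0, i - half):min(len(values), i + half + 1)])
            for i in range(len(values))]

def diff_indicator(values):
    """'diff' trend indicator: year-to-year difference of the smoothed values."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]

def first_surge_year(years, assoc_values, window=5, threshold=0.0):
    """Return the first year whose trend indicator exceeds the threshold.

    The threshold is a placeholder: the actual surge criterion may differ
    (e.g. a data-driven cut-off on the indicator values).
    """
    trend = diff_indicator(moving_average(assoc_values, window))
    for year, t in zip(years[1:], trend):
        if t > threshold:
            return year
    return None

def recall_at_n(predictions, gold, n=3):
    """Proportion of gold discoveries whose predicted surge year falls
    within +/- n years of the true discovery year."""
    hits = sum(1 for rel, gold_year in gold.items()
               if predictions.get(rel) is not None
               and abs(predictions[rel] - gold_year) <= n)
    return hits / len(gold)
```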
The best performing individual configurations are consistent with the results by parameter: SCP/diff/5 performs best, identifying the correct discovery year in 8 cases out of 12 (recall 0.67). It is followed by 6 configurations which perform equally well (7 cases out of 12; recall 0.58): NMI with window 1, 3 or 5, SCP with window 1 or 3, and NPMI with window 5 (all with the diff indicator). These results confirm the superiority of the NMI and SCP measures, but the small size of the dataset does not allow any fine-grained comparison between the configurations.

Discoveries across time

In this part we investigate the distribution of the surges predicted by the method across time. There are various potential biases related to the time dimension: for example, the volume of data is not uniform across time, since the number of entries in Medline increases by approximately 4% every year. There can also be artifacts due to the construction of Medline as a resource.Footnote 20

The distribution of the detected surges can be observed in Fig. 5 (middle); the distribution of the first cooccurrence year for all relations, i.e. the input data, is also shown for comparison (top). The surge patterns follow the first cooccurrence patterns quite closely, indicating that the surge detection method does not introduce any visible bias. Most of the years have between 300 and 600 surges, which represents around 10% of the number of first cooccurrence relations. The peak in the 1960s is likely an artifact due to the construction of Medline. The decrease after the 2000s may be partly due to the filtering of relations with fewer than 100 cooccurrences (see the “Data and preprocessing” section), since relations which appeared in recent years have had less time than older ones to accumulate 100 mentions. It is possible that the early years (1950s and 1960s) contain some spurious surges due to the low volume of data and the introduction in the data of concepts which might have existed before. For example, the relation between Adrenal Medulla (D000313) and Pheochromocytoma (D010673) is detected as surging in 1952, although the discovery happened earlier. Nevertheless, other cases among the earliest detected surges appear to be valid, such as the relation between Adenosine Triphosphate (D000255) and Stem Cells (D013234), which has a surge detected in 1952 [21, 22].

Fig. 5 Top: number of unique relations with frequency higher than 100 by year of first cooccurrence. Middle: number of surges by year, in blue for the earliest surge of a relation, red for subsequent surges. Bottom: difference between the earliest year where both concepts appear individually and (1) their first cooccurrence (red), and (2) their first detected surge (blue). Surge parameters: SCP measure, window 5, diff indicator

We also study how long after the introduction of the two concepts the surges happen. Given a relation between two concepts \(c_1\) and \(c_2\) which appear for the first time in years \(y_1\) and \(y_2\) respectively, the earliest possible year for a cooccurrence (and consequently for a surge) is \(\max(y_1, y_2)\). The bottom plot of Fig. 5 shows the distribution of \(y_c - \max(y_1, y_2)\) and \(y_s - \max(y_1, y_2)\), with \(y_c\) the year of the first cooccurrence and \(y_s\) the year where the first surge is detected. With parameters SCP/diff/5, the first surge happens on average 7.6 years after the first cooccurrence. Among the relations which have a surge, 21% have their first cooccurrence in the same year as the two concepts appear, whereas only 5.3% have their first surge that year. Similarly, 10% of the relations have a surge in the first 2 years, while 30% cooccur in the first 2 years. Thus, in a large number of cases, the first cooccurrence appears at the same time as, or soon after, the first year where both concepts exist. However, the first surge usually appears later, indirectly illustrating an important difference between the time-sliced method and our method: the former conflates cooccurrence and discovery, whereas the latter waits for evidence that the relation is an actual discovery.Footnote 21
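The delay statistics reported above can be computed as in the following sketch, assuming that every relation with a detected surge is available as a tuple containing the first year of each concept, the first cooccurrence year and the first surge year (a simplified representation, not the actual data structure used).

```python
def delay_statistics(relations):
    """Delays between concept introduction, first cooccurrence and first surge.

    `relations` is assumed to be a list of (y1, y2, y_cooc, y_surge) tuples:
    the first years of the two concepts, the first cooccurrence year and the
    first detected surge year.
    """
    cooc_delays, surge_delays = [], []
    for y1, y2, y_cooc, y_surge in relations:
        earliest = max(y1, y2)            # earliest possible cooccurrence year
        cooc_delays.append(y_cooc - earliest)
        surge_delays.append(y_surge - earliest)

    n = len(relations)
    return {
        # average gap between first cooccurrence and first surge
        "mean_surge_after_cooc": sum(s - c for c, s in zip(cooc_delays, surge_delays)) / n,
        "cooc_same_year": sum(d == 0 for d in cooc_delays) / n,
        "surge_same_year": sum(d == 0 for d in surge_delays) / n,
        "cooc_within_2y": sum(d <= 2 for d in cooc_delays) / n,
        "surge_within_2y": sum(d <= 2 for d in surge_delays) / n,
    }
```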
In most cases, the first surge occurs within 20 years of the two concepts appearing in the data. Nevertheless, it is also possible for the surge to happen much later; in some cases, both concepts exist from the starting point of our dataset (1950) and have a first surge in the 2010s. For example, Spinal Muscular Atrophy (SMA; D009134) and Oligonucleotides (D009841) first appear in the data in 1951 and surge only in 2016, when a clinical trial of an antisense oligonucleotide drug for SMA proved successful [23]. However, there are also questionable cases involving relations between two general concepts; for example, the relation between Muscle Spasticity (D009128) and Motor Neuron Disease (D016472) has its first detected surge in 2017, while both concepts have existed in the data since 1950. This case likely corresponds to a “gap”, as described by [16] (see the “Impactful discoveries” section).Footnote 22

Comparison against the time-sliced method

Finally, we compare our method against a baseline representing the time-sliced method (see the “Motivations” section). In a real LBD evaluation setting, a cut-off year would be selected, the LBD system would be applied to the data before the cut-off year, and its predictions would be compared to the “true discoveries” happening after the cut-off year. The set of “true discoveries” is determined by the evaluation method: recall that the state-of-the-art time-sliced method consists in considering every cooccurrence of two terms as a “true discovery”. Here the context is different, because we aim to compare the two evaluation methods themselves, not to apply them in order to evaluate some LBD system. In this context, both methods can be seen from the point of view of mining discoveries from the existing literature: the time-sliced method acts as a simplistic baseline in which every existing relation (two terms cooccurring) is automatically labelled as positive, whereas our method tries to distinguish relations which exhibit characteristics associated with a discovery (here, a significant literature impact). Thus the two evaluation methods can be compared simply by examining the set of relations they return and determining which one is more likely to capture true discoveries.

For our method, we select specific values for the parameters based on the previous analysis (SCP/diff/5). The baseline set of discoveries is obtained by extracting the N most frequent relations in the full dataset,Footnote 23 together with their first year of cooccurrence (see details in Table 1).Footnote 24 The original time-sliced method would normally include every cooccurrence which appears in the data: in our dataset (after relations with fewer than 100 cooccurrences were filtered out), this represents 108,794 unique relations after applying the post-processing steps. However, N is chosen to be equal to the number of relations returned by our method (\(N=9092\)), in order to make the two lists of relations comparable.
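A minimal sketch of this baseline construction is shown below; it assumes that the cooccurrence data is available as (concept, concept, year) mentions, which is a simplification of the actual Medline-derived data.

```python
from collections import Counter

def time_sliced_baseline(cooccurrences, n):
    """Build the time-sliced baseline: the n most frequent relations,
    each paired with its first year of cooccurrence.

    `cooccurrences` is assumed to be an iterable of (concept_a, concept_b, year)
    tuples, one per cooccurrence mention.
    """
    freq = Counter()
    first_year = {}
    for a, b, year in cooccurrences:
        rel = tuple(sorted((a, b)))        # treat relations as unordered pairs
        freq[rel] += 1
        first_year[rel] = min(year, first_year.get(rel, year))
    return [(rel, first_year[rel]) for rel, _ in freq.most_common(n)]

# n is set to the number of relations returned by the surge-based method,
# e.g. baseline = time_sliced_baseline(cooccurrences, n=9092)
```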
The same post-processing steps are applied to both methods: filtering of the years 1990–2020, maximum conditional probability 0.6,Footnote 25 and filtering of the four groups ‘Anatomy’, ‘Chemicals and Drugs’, ‘Disorders’ and ‘Genes’. The two resulting lists of discoveries are evaluated as follows: for each list, the top 100 relations are selected, as well as a subset of 100 relations picked randomly from the list. The four subsets of 100 relations are then randomly shuffled into a single dataset which is annotated manually.Footnote 26 The final list contains no indication of which subset a relation comes from, so that the annotator cannot be influenced in any way (a sketch of this blinded construction is given after the annotation criteria below). The annotation process is simplified in order to minimize the subjectivity involved in deciding whether a relation qualifies as a discovery or not. Every pair of concepts is labelled with one of three values, ‘yes’, ‘no’ or ‘maybe’, regarding its discovery status. The annotator relies on Google Scholar queries with the two concept terms in order to determine this status:

If the top results of the query show some evidence of a significant, non-trivial, new and impactful relation between the two concepts, then the relation is annotated as ‘yes’. This requires at least one fairly clear title or abstract mentioning the relation as a discovery, together with a healthy number of citations. The year of the main article is reported as the gold-standard year (whether it is close to the predicted year or not).

If there is evidence that the relation is either trivial, questionable or has very little impact (few papers or citations), then it is labelled as ‘no’. This includes obvious relations (e.g. “Adrenergic Receptor—Adrenergic Antagonists”) and relations involving trivial terms (e.g. “Traumatic Brain Injury—Neurons”).

In any other case, the status is considered ambiguous and the relation is labelled as ‘maybe’. This includes cases where the annotator cannot understand the articles or has doubts about the originality, or where the citation count is only moderate. These cases are ignored in the evaluation results.
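As a complement to the description above, the following sketch illustrates the blinded construction of the annotation sample (top 100 plus 100 random relations per method, shuffled together); the function and variable names are illustrative only.

```python
import random

def build_annotation_sample(baseline, surges, k=100, seed=0):
    """Build the blinded annotation dataset: for each method take the top-k
    relations plus k relations sampled at random from the whole list, then
    shuffle everything so the annotator cannot tell where a relation came from.

    `baseline` and `surges` are assumed to be lists of relations already sorted
    by the relevant criterion (frequency for the baseline, trend strength for
    the surge-based method).
    """
    rng = random.Random(seed)
    subsets = {
        "baseline_top": list(baseline[:k]),
        "baseline_random": rng.sample(list(baseline), k),
        "surges_top": list(surges[:k]),
        "surges_random": rng.sample(list(surges), k),
    }
    blinded = [rel for rels in subsets.values() for rel in rels]
    rng.shuffle(blinded)          # the annotator only sees this shuffled list
    return blinded, subsets       # `subsets` is kept aside for the evaluation
```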

It is worth noting that this annotation policy is fairly strict regarding the discovery status of a relation: for example, in many cases a discovery exists involving one of the terms while the other term is only indirectly related; such cases are labelled as negative. The non-trivial condition also discards many relations which could potentially qualify as discoveries. These strict criteria are intended to make the annotation process as deterministic as possible, even though real applications of LBD might consider a larger proportion of the predicted discoveries as relevant. Thanks to this annotated dataset, it is possible to estimate the precision (proportion of true discoveries among the predicted ones) of our method and to compare it against the baseline.

The results are presented in Table 3. It is striking that both methods return a large number of non-discoveries (marked as “no”) mixed with the actual discoveries (marked as “yes”). Surprisingly, both methods obtain a lower precision on the subset made of the top 100 relations (by frequency for the baseline, by trend for our method) than on the subset made of 100 random relations, even though the top relations would be expected to be more likely discoveries. This could be partly due to chance and possible annotation errors, but in the case of the baseline many of the top relations are visible outliers: due to the particular severity of the pandemic, the most frequent relations involve Covid-19 together with various other generic concepts, often not qualifying as discoveries (e.g. Coronavirus and Emotional Stress).Footnote 27

Table 3 Results of the SCP surges versus the time-sliced baseline on 400 manually annotated relations

The precision obtained by our method is higher than that of the baseline by 14.5 points for the 100 random relations, and by 20.5 points for the top 100 relations. This confirms that our method based on measuring impact offers a better-quality set of discoveries, even if it is still far from perfect. Nevertheless, the \(\chi^2\) test between the two methods, considering only the ‘yes’ and ‘no’ categories, is significant only for the top 100 subset; this is consistent with the fact that the difference between the precision values is smaller for the 100 random subset. And even though our method obtains more than twice the precision of the baseline among the top 100 relations, it is disappointing that it does not exceed 40%. This is a serious limitation that future improvements will hopefully alleviate.
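The \(\chi^2\) test on the ‘yes’/‘no’ counts can be reproduced as in the sketch below; the counts shown are placeholders for illustration only, the actual values being those reported in Table 3.

```python
from scipy.stats import chi2_contingency

# Hypothetical 'yes'/'no' counts for the two top-100 subsets ('maybe' labels
# are excluded); substitute the actual counts from Table 3.
counts = [
    [38, 57],   # surge-based method: yes, no
    [17, 78],   # time-sliced baseline: yes, no
]
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```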
