Imperfect gold standard gene sets yield inaccurate evaluation of causal gene identification methods

Evaluation with PU-labeled gene sets

Genes outside the constructed GS set are more accurately viewed as unlabeled (U) rather than as negative (N). Combined with the accurately positively labeled genes in the GS set, the overall GS gene set should be regarded as positive-unlabeled (PU) data, a term used in semi-supervised machine learning. Using PU data to evaluate performance as though they were positive-negative (PN) labeled data results in inaccurate evaluations4. A PU-labeled gene set with perfect positive labeling consists of three subsets of genes: true causal genes that are correctly identified (labeled positives), true causal genes that are not labeled and therefore assumed to be non-causal (unlabeled positives), and non-causal genes that are unlabeled and therefore correctly assumed to be non-causal (unlabeled negatives, UN) (Fig. 1a). Evaluation treating PU labels with perfect positive labeling as PN labels will always underestimate the positive predictive value (PPV, or precision) and overestimate the negative predictive value (NPV) of a classifier (Fig. 1b).

Fig. 1: Illustration of classifier performance relative to an imperfectly labeled gene set. a Illustration of gene classes by reference and classifier labels. The rectangle represents the set of all genes, with columns corresponding to the three types of labeled genes: correctly labeled positive genes (blue, A/a), positive genes labeled as negative (red, B/b), and correctly labeled negative genes (green, C/c). Horizontal divisions indicate classifications, with the lower divisions (lowercase letters, darker shading) classified as positive and the upper divisions (capital letters, lighter shading) classified as negative. The classifier may not have the same performance on unlabeled positive genes as it has on labeled positive genes (red lines and arrows), which affects estimated sensitivity and specificity.
b Comparison of true and estimated sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) when some positives are unlabeled.

Figure 2 shows four possibilities for the performance of the classifier on the unlabeled positive genes and the corresponding relationship between the estimated and true sensitivity, specificity, and ROC curve. The classifier may perform differently on unlabeled positive genes and labeled positive genes if these two groups differ on important features that either align with or do not align with the features used to construct the classifier.

Fig. 2: Effect of PU labeling on estimated ROC curves. The rectangle in each figure represents the set of total genes, as in Fig. 1. The drawing on the right of each figure represents the expected relationship between the true and estimated ROC curve for each scenario. a Labeled positives are representative of all true positive genes. b The classifier is more sensitive to unlabeled positives than labeled positives. c The classifier is more sensitive to labeled positives than unlabeled positives. d The classifier detects a lower proportion of unlabeled positives than true negatives. Note that in scenario (c), the estimated ROC curve could be either above, below, or intersecting with the true ROC curve.

In almost all scenarios, estimation with PU labels leads to underestimating specificity. To see why, let A, B, C, a, b, and c be defined as illustrated in Fig. 1 as the counts of labeled positive, unlabeled positive, and true negative genes that are classified as non-causal (upper case) or causal (lower case) by the classifier. With these definitions, \(\alpha =\frac{B+b}{C+c}\) is the ratio of unlabeled positive genes to true negative genes. As long as the sensitivity of the classifier to unlabeled positives, b/(B + b), is higher than the probability of falsely predicting a true negative to be causal, c/(C + c), we have B/(B + b) < C/(C + c), which rearranges to B < αC.
Therefore,

$$\mathrm{Spec}_{True}=\frac{C}{C+c}=\frac{(1+\alpha )C}{(1+\alpha )(C+c)}$$

(1)

$$ > \frac{B+C}{(1+\alpha )(C+c)}=\frac{B+C}{B+b+C+c}=\mathrm{Spec}_{Estimated}$$

(2)
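The direction of this bias can be checked numerically. The sketch below plugs hypothetical counts (the A/a, B/b, C/c cells of Fig. 1, chosen arbitrarily for illustration) into the true and estimated specificity formulas:

```python
# Hypothetical cell counts from Fig. 1: upper case = classified non-causal,
# lower case = classified causal. (Values are illustrative only.)
A, a = 20, 80    # labeled positives (classifier misses A, detects a)
B, b = 30, 70    # unlabeled positives
C, c = 800, 100  # true negatives (c are false positives)

# True specificity uses only the true negative genes.
spec_true = C / (C + c)

# Estimated specificity also treats unlabeled positives as negatives.
spec_est = (B + C) / (B + b + C + c)

# The condition B/(B + b) < C/(C + c) holds here, so the estimate is too low.
assert B / (B + b) < C / (C + c)
assert spec_est < spec_true
print(f"true specificity = {spec_true:.3f}, estimated = {spec_est:.3f}")
```

Because B/(B + b) < C/(C + c) for these counts, the estimate 0.830 falls below the true specificity of 0.889, matching inequality (2).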
This means that specificity is underestimated in all cases except the scenario in Fig. 2d. Sensitivity, however, may be either over- or underestimated depending on the feature biases of the labeled positive genes. If the classifier is more sensitive to unlabeled positives than to labeled positives (Fig. 2b), sensitivity will be underestimated. If the classifier is less sensitive to unlabeled positives than to labeled positives (Fig. 2c, d), sensitivity will be overestimated.

Errors in estimating sensitivity and specificity result in errors in the ROC curve and, therefore, in the area under the ROC curve (AUC). These errors also affect other measures that rely on the 2 × 2 confusion matrix, such as the Matthews correlation coefficient and the F1 score. This error applies to evaluating ranking methods as well as methods that return only hard classifications. In the special case in Fig. 2a, the classifier has an equal ability to detect labeled and unlabeled positive genes, so sensitivity is estimated accurately. Motivated by this observation, refs. 4,5 rely on a “PU score”, which is analogous to the F1 score but relies only on sensitivity and not on specificity. However, if the labeled positive genes are not representative of all positive genes, the PU score will also be inaccurate.

In genetics research, we expect labeling biases because there are multiple molecular mechanisms by which a causal gene can affect complex diseases, and different classification and GS identification methods will favor different mechanisms. For example, as shown in Table 1, several GS gene set construction strategies focus on genes with phenotype-associated coding variants.
Genes that affect phenotypes primarily through expression dysregulation may not be represented in these GS gene sets, so classifiers particularly sensitive to causal genes acting through expression changes may appear to perform poorly when evaluated against these gene sets.

Simulated example

To illustrate this issue, we consider a hypothetical example in which each gene has two continuous, measurable features, Pr and Ex. We think of these features as continuous summaries of the evidence that a gene acts on the trait through mechanisms mediated by either protein sequence (Pr) or expression level (Ex). Let Yi be a binary indicator that gene i is causal for the trait of interest. We simulate Pri and Exi from independent standard normal distributions and generate Yi as

$$\begin{array}{l}{Y}_{i} \sim \mathrm{Bern}({\pi }_{i})\\ \mathrm{logit}({\pi }_{i})=-3+6{Pr}_{i}+2{Ex}_{i}+{\epsilon }_{i}.\end{array}$$

In our simulation, the protein feature, Pr, is a stronger predictor of causality than the expression feature, Ex.

In each simulated data set, we generate Pr, Ex, and causal status Y for 20,000 genes, which are divided into 10,000 genes used for training and 10,000 genes used for testing. In the training set, we fit two classifiers, the Pr-classifier and the Ex-classifier, by fitting a logistic regression with Y as the outcome and either Pr or Ex alone as the predictor. This differs from the methods generally used to build causal gene discovery methods, as no perfectly labeled gene sets are available for training in practice. However, this strategy provides a straightforward way to obtain classifiers based on only one of the two gene features.

The 10,000 genes in the testing set function as our GS gene set. We consider three possibilities: either all genes are correctly labeled, positives with high levels of Pr are more likely to be correctly labeled, or positives with high levels of Ex are more likely to be correctly labeled. We refer to these as correct, Pr-enriched, and Ex-enriched labels.
Let ZC,i, ZPr,i, and ZEx,i be the correct, Pr-enriched, and Ex-enriched labels for gene i in the testing set. We generate these as ZC,i = Yi, ZPr,i = YiWPr,i, and ZEx,i = YiWEx,i with

$$\begin{array}{ll}{W}_{Pr,i} \sim \mathrm{Bern}({\theta }_{Pr,i}) & {W}_{Ex,i} \sim \mathrm{Bern}({\theta }_{Ex,i})\\ \mathrm{logit}({\theta }_{Pr,i})=-3+4{Pr}_{i} & \mathrm{logit}({\theta }_{Ex,i})=-3+4{Ex}_{i}.\end{array}$$

The Pr-enriched labels mislabel 4.5% of all positives as negative, while the Ex-enriched labels mislabel 13.5% of all positives as negative.

ROC curves estimated using each of the imperfect label sets are shown in Fig. 3, compared against ROC curves estimated using the perfect labels. In both cases, label enrichment results in biased estimates of classifier performance. When the Pr-enriched labels are used, the AUC of the Pr-classifier is overestimated and the AUC of the Ex-classifier is underestimated; however, the accuracy of the two classifiers is still ordered correctly. When the Ex-enriched labels are used, the pattern is reversed, resulting in a misordering of the two classifiers. These results align with our theoretical expectations. When using the Pr-enriched labels to evaluate the Pr-classifier or the Ex-enriched labels to evaluate the Ex-classifier, we are in the scenario of Fig. 2c, where sensitivity is overestimated and specificity is underestimated, pushing the ROC curve up from its true value. Conversely, when label enrichment does not favor the genes a classifier is most sensitive to, we are in the scenario of Fig. 2b, where sensitivity is underestimated, pushing the ROC curve down.

Fig. 3: Comparison of classifiers on true vs. biased labels. Classifiers evaluated against true labels (solid lines) and biased labels (dotted lines). In both figures, the Pr-classifier uses Pr values to identify causal genes and the Ex-classifier uses Ex values. a Causal genes with large Pr values are more likely to be labeled as positive than causal genes with large Ex values.
Performance of the Pr-classifier is overestimated while performance of the Ex-classifier is underestimated. b Causal genes with large Ex values are more likely to be labeled as positive than causal genes with large Pr values. Performance of the Ex-classifier is overestimated, making it appear more accurate than the Pr-classifier.
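For readers who want to experiment with this effect, a minimal self-contained sketch of the simulation is given below. It follows the generative model above but, to stay dependency-free, uses a smaller gene count (5,000 instead of 10,000), ranks genes by the raw Pr or Ex value instead of fitting a logistic regression (a single-feature logistic classifier is monotone in that feature, so the ROC curves are identical), and evaluates only the Pr-enriched labels. The model coefficients match the equations above; the seed and gene count are arbitrary choices.

```python
import random
from math import exp

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def auc(scores, labels):
    """Rank-based (Mann-Whitney) estimate of the area under the ROC curve."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

n = 5000  # test-set genes (the text above uses 10,000)
pr_scores, ex_scores, y_true, z_pr = [], [], [], []
for _ in range(n):
    pr, ex, eps = (random.gauss(0, 1) for _ in range(3))
    # Causal status: logit(pi) = -3 + 6*Pr + 2*Ex + eps
    y = int(random.random() < sigmoid(-3 + 6 * pr + 2 * ex + eps))
    # Pr-enriched labeling: positives with high Pr are more likely to be
    # kept as labeled positives, logit(theta) = -3 + 4*Pr
    w = int(random.random() < sigmoid(-3 + 4 * pr))
    pr_scores.append(pr); ex_scores.append(ex)
    y_true.append(y); z_pr.append(y * w)

print("True labels:        AUC(Pr) = %.3f, AUC(Ex) = %.3f"
      % (auc(pr_scores, y_true), auc(ex_scores, y_true)))
print("Pr-enriched labels: AUC(Pr) = %.3f, AUC(Ex) = %.3f"
      % (auc(pr_scores, z_pr), auc(ex_scores, z_pr)))
```

With most seeds this reproduces the qualitative pattern of Fig. 3a: under the Pr-enriched labels, the Pr-classifier's estimated AUC rises above its value under the true labels while the Ex-classifier's falls.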
