Machine learning approaches for influenza A virus risk assessment identify predictive correlates using ferret model in vivo data

Source data and supervised classification models tested

Ferrets (n = 717) were inoculated with an extensive panel of 125 unique influenza A viruses (Supplementary Table 1) by standard high-dose intranasal instillation, with multiple virological and clinical parameters captured post-inoculation as specified in the methods. Informed by these data, three overarching supervised classification models (lethality, morbidity, and transmissibility) were developed to classify and predict, on a per-ferret basis, a binary outcome (yes/no for lethality, yes/no for high weight loss, ≥50%/<50% for transmission in a respiratory droplet setting) from aggregated ferret data (Fig. 1). Each classification model was trained on three different source data types emulative of information generated during standard risk assessment activities (see Table 1 for key features included in models and the rationale for their inclusion). The standard data type included viral titers and clinical symptoms obtained from virus-inoculated ferrets, sequence-predicted receptor-binding preference and polymerase activity, and limited other metadata. The molecular only data type included no data from virus-inoculated ferrets and was informed solely by sequence-based and other viral metadata available prior to in vivo experimentation. A combined data type pooled all available data parameters from both the standard and molecular data sets. Details of all 9 models evaluated are presented in Table 2.

Fig. 1: Analysis workflow for generation of models employing machine learning algorithms. IAV metadata and results from in vivo experimentation are collected from pathogenicity (to inform lethality and morbidity classifications, top left depiction) and respiratory droplet transmissibility (to inform transmission classification, top right depiction) experiments in ferrets. Extensive data preprocessing was conducted to train numerous supervised classification models (encapsulating 11 different ML algorithms) and assess relative model performance. Final model selection led to additional model training and tuning, with all chosen models tested with both internally generated data and several external datasets for validation purposes. Illustrations in this figure were generated by the US Centers for Disease Control and Prevention.

Table 1: Summary of key features used to train classification models on different outcome variables

Table 2: Description of supervised classification models trained and tested in this study

Iterative testing of disparate machine learning algorithms

ML algorithms can vary in the relative weight they give different parameters, leading to variability in outcomes and overall performance metrics. Due to a paucity of previously assessed models employing in vivo data in the context of viral infection, and a systematic lack of head-to-head comparisons of ML performance metrics when employing virological data, we chose to test a panel of 11 different ML algorithms spanning several different ML families (see methods) against all 9 model iterations described above. For each of the 11 ML algorithms tested, multiple iterations of feature selection were evaluated (Supplementary Data 1), with final model selection informed by assessing 14 performance metrics (area under the curve [AUC], accuracy, balanced accuracy, detection rate, F1, kappa, negative predictive value, positive predictive value, precision, recall, sensitivity, specificity, logarithmic loss, precision-recall AUC), with a focus on balanced accuracy, sensitivity, specificity, and F1 score.
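As a conceptual illustration of this model-comparison step (not the authors' pipeline, which is described in the methods), the sketch below cross-validates a shortlist of algorithm families named in the text over candidate feature subsets and ranks the combinations by balanced accuracy. The data, feature names, algorithm shortlist, and scikit-learn classes used as analogues are hypothetical placeholders.

```python
# Illustrative sketch: comparing several supervised classification algorithms
# across candidate feature subsets with stratified cross-validation, ranking
# candidates by mean balanced accuracy. All data and labels are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
# Hypothetical per-ferret table: a few in vivo and molecular features plus a
# binary outcome (e.g., lethality yes/no).
df = pd.DataFrame({
    "AUC_6": rng.normal(25, 5, 300),      # nasal wash titer AUC, days 1-6
    "temp_5": rng.normal(1.2, 0.5, 300),  # peak temperature rise, days 1-5
    "wt_loss": rng.normal(12, 6, 300),    # maximum weight loss (%)
    "PA": rng.normal(0, 1, 300),          # sequence-predicted polymerase activity
    "outcome": rng.integers(0, 2, 300),   # binary outcome label
})

# Keys mirror algorithm names used in the text; classes are sklearn analogues.
algorithms = {
    "gbm": GradientBoostingClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=500, random_state=0),
    "nnet": MLPClassifier(max_iter=2000, random_state=0),
    "glm": LogisticRegression(max_iter=2000),
    "rpart": DecisionTreeClassifier(random_state=0),
}
feature_sets = {
    "in_vivo_only": ["AUC_6", "temp_5", "wt_loss"],
    "plus_molecular": ["AUC_6", "temp_5", "wt_loss", "PA"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = []
for set_name, cols in feature_sets.items():
    for algo_name, model in algorithms.items():
        scores = cross_val_score(model, df[cols], df["outcome"],
                                 cv=cv, scoring="balanced_accuracy")
        results.append((set_name, algo_name, scores.mean()))

# Rank algorithm x feature-set combinations, analogous to the comparison
# summarized in Fig. 2.
for set_name, algo_name, bal_acc in sorted(results, key=lambda r: -r[2]):
    print(f"{set_name:>15}  {algo_name:>6}  balanced accuracy = {bal_acc:.3f}")
```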
The full scope of all metrics calculated, for each feature iteration and each model, is presented in Supplementary Data 2–7. An example is provided in Fig. 2 showing one performance metric (balanced accuracy) for 4 of the 11 ML algorithms across each iteration of features tested (Supplementary Data 1) for the L1 model (lethality standard). Employing these metrics for all 9 models evaluated, we found variability in performance depending on the ML algorithm employed and the features selected to inform each model. However, these assessments identified a consistent trend of top-performing (gbm, nnet, rf, ranger) and low-performing (glm, rpart) algorithms independent of the outcome metric assessed. Based on these metrics, a final algorithm of gradient boosting (gbm) was selected for lethality and morbidity models, while random forest (rf) was selected for transmission models. The top three features for each final model are presented in Table 2 and discussed in more detail below. Subsequent refinement of all classification models was performed with hyperparameter tuning (Supplementary Data 8), using the same metrics employed for final model selection.

Fig. 2: Comparison of lethality standard model balanced accuracy and feature selection iterations. A Heat map depicting balanced accuracy performance metrics of four ML algorithms (support vector machine (svm), decision trees (rpart), random forest (rf), and gradient boosting (gbm)) for the lethality standard (L1) model employing different feature selections. Values range from 0 (worst, green) to 1 (best, purple). B Feature inclusion for the ML algorithms shown. Purple, feature inclusion; green, feature exclusion. All L1 models include AUC_6, MBAA, RBS, and PA features (not shown). Origin_orig: virus host origin (human, variant, avian, swine, canine), based on lineage (not species of isolation). Origin: binary virus host origin (avian or mammalian, see methods for definition). Temp: peak rise above pre-inoculation temperature (in degrees C) over 14 days p.i. (temp) or over the first 5 days p.i. only (temp_5). slope1,3: measurement of virus growth or decay in NW specimens between days 1 and 3 p.i. peak_inoc: peak NW titer over days 1–6 p.i. HA: IAV HA subtype only. Subtype: IAV HA and NA subtypes combined. Feature definitions are also provided in Supplementary Fig. 1 and Supplementary Data 1. The full scope of all model metrics and feature selections for models described in Table 2 is reported in Supplementary Data 2–7.

Assessments of final model performance and comparison

Once final algorithms were selected, we next examined in depth the relative performance of all 9 models (lethality, morbidity, and transmissibility, with standard, molecular, or combined data types), with a focus on balanced accuracy, sensitivity, specificity, and F1 score metrics. Balanced accuracy was >0.9 among lethality classification models employing a tuned gradient boosting ML algorithm, with the standard (L1) model consistently showing the highest balanced accuracy (0.9314) followed closely by the combined (L1M) and molecular (LM) models (Fig. 3, Supplementary Data 2, 3), demonstrating that all models, independent of training data type, could accurately categorize both positive (no lethality) and negative (yes lethality) cases among our internal test sets. Sensitivity values were also >0.92 across all lethality models, emphasizing model competence in correctly recognizing true positive outcomes.
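As a reference point for how the headline metrics quoted in this section relate to a binary confusion matrix, a minimal sketch is shown below; the labels and predictions are arbitrary placeholders, not values from this study.

```python
# Minimal sketch of how balanced accuracy, sensitivity, specificity, and F1
# follow from a binary confusion matrix. Counts are arbitrary placeholders.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical held-out labels
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                   # recall on the positive class
specificity = tn / (tn + fp)                   # recall on the negative class
balanced_accuracy = (sensitivity + specificity) / 2
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"balanced accuracy={balanced_accuracy:.3f} F1={f1:.3f}")
```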
Specificity values exhibited greater variability between lethality models but still demonstrated the ability of all lethality models to accurately classify negative events, with values of 0.9394 (L1), 0.8788 (L1M), and 0.8485 (LM). The F1 score, a metric that balances recall and precision, was >0.95 across all models, further supporting that all lethality models could correctly balance detection of positive events with a reduction in false positives, independent of the data type employed for training.

Fig. 3: Performance metrics for lethality, morbidity, and transmission classification models. Heat map depicting 9 performance metrics and the probability of prediction threshold determined for 9 models including standard data (L1, M1, T1), molecular only (LM, MM, TM), and combined data types (L1M, M1M, T1M). All 9 model iterations are presented in Table 2. Values range from −1 (no agreement) to 1 (agreement) for MCC and Kappa, or from 0 (worst, green) to 1 (best, purple) for all other metrics.

While the lethality models were found to be generally robust, models assessing morbidity (as measured by maximum weight loss of virus-inoculated ferrets) underperformed. For morbidity, the standard (M1) model was finalized using a random forest stack of the top individually performing models (neural net, ranger, and gradient boosting), while the molecular (MM) and combined (M1M) models employed a tuned gradient boosting algorithm (Supplementary Data 8). Balanced accuracy was consistently similar across all three morbidity models (0.7492–0.7666) (Fig. 3, Supplementary Data 4, 5). Specificity and sensitivity values were consistent and balanced for M1 (0.75, 0.7485) and M1M (0.7692, 0.7362), respectively, while model MM had higher specificity (0.8462) at the cost of sensitivity (0.6871), resulting in the similar balanced accuracy noted above. The F1 score followed a similar pattern, with values ranging from 0.7915 (MM) to 0.8188 (M1). While the stacked model algorithms were the best performing, the improvement over the top individually performing algorithms for morbidity was negligible.

For virus transmission by respiratory droplets, standard (T1), molecular (TM), and combined (T1M) models were finalized using a tuned random forest algorithm (Supplementary Data 8), though several others (such as ranger, gradient boosting, and neural net) performed comparably. Similar to the lethality classification models, all transmission models were highly predictive when tested with internally generated data, with balanced accuracy >0.95 for all three models (Fig. 3, Supplementary Data 6, 7). All models possessed maximum specificity, with very high sensitivity values of 0.9726 (T1M), 0.9577 (T1), and 0.9178 (TM). A similar pattern held for precision and recall, with maximum precision and high recall values resulting in consistently high F1 scores (>0.95) independent of the data type employed for model training.

To further compare relative model performance, we employed the Matthews correlation coefficient (MCC), a metric that incorporates true positives, true negatives, false positives, and false negatives, producing a high score only when predictive rates are good in each category. The MCC score ranges from −1 (complete misclassification) to 1 (perfect classification), with values near zero indicating random classification. In agreement with the other performance metrics discussed above, MCC supported that the transmission classification models were the most accurate, followed by lethality models, then morbidity models, which performed comparatively poorly (Fig. 3).
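A brief sketch of the MCC calculation just described, using scikit-learn's matthews_corrcoef on placeholder predictions, is shown below.

```python
# Sketch of the Matthews correlation coefficient (MCC) on placeholder data.
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), range -1 to 1.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical labels
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical predictions
print(f"MCC = {matthews_corrcoef(y_true, y_pred):.3f}")
```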
All three transmission models had MCC values >0.9. Lethality models had lower MCC values relative to transmission models (ranging from 0.7424 to 0.8114), with the combined L1M model more accurate than either the standard (L1) or molecular only (LM) model alone. In contrast, morbidity models were not very accurate, with MCC values <0.5 (0.4416–0.4598) for models trained on any data type.

In conclusion, transmission classification models had the overall highest performance metrics and were highly accurate in predicting outcomes when employing internally generated data. Lethality classification models offered similarly high performance and reasonable predictive ability. In contrast, morbidity classification models offered minimal predictive capability. Within each classification, the combined standard and molecular data type models offered the highest predictive value for transmission (T1M) and lethality (L1M), illustrating the usefulness of combining these two types of data for training ML algorithms.

Feature importance for each model

We next examined in more detail the specific features of each model. As shown in Fig. 4A, all three classification models employing the standard data type (L1, M1, T1) shared several common features (area under the curve days 1–6 [AUC_6], hemagglutinin [HA], polymerase activity [PA], receptor binding preference [RBS]; all features defined in Supplementary Data 1), with variability among other features depending on the classification. Both L1 and M1 models included the absence or presence of a multi-basic amino acid HA cleavage site (MBAA) and a temperature input; in contrast, T1 included features (Origin, slope1,3) not present in the highest-performing lethality or morbidity models. RBS, PA, HA, and Origin were included in all combined data type models regardless of classification, further highlighting the critical and multifactorial roles these features play in viral pathogenicity and transmissibility outcomes.

Fig. 4: Variability in feature selection among different models employing the standard data type. A Feature inclusion (purple) or exclusion (green) for lethality (L1), morbidity (M1), and transmission (T1) models employing the standard data type. Individual feature definitions are provided in Supplementary Data 1. B Relative ranked importance of top numeric features included in L1, M1, and T1 models. C Relative ranked importance of human (H), dual (D), or avian (A) predicted receptor binding preference in L1, M1, and T1 models. Relative ranked performance among features of all models is shown in Supplementary Data 9–14; Supplementary Data 10, 12, and 14 contain ranked importance results for models trained with molecular datasets. Relative ranked importance values are set to 100 for the most important feature and scaled to relative importance for the remaining features independently within each model. For each model in (B, C), features are scaled consistently but separated for visual purposes.

Among lethality models, the standard and combined models had comparable features, with weight loss followed by MBAA as the highest-ranked features for both L1 and L1M (Table 2, Fig. 4B, Supplementary Data 9, 10). In contrast, while included in the highest-performing models, predicted receptor binding preference and HA subtype had minimal contributing impact.
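The relative ranked importance convention described for Fig. 4 (top feature set to 100, remaining features scaled relative to it) can be sketched as follows; this is an illustrative analogue with placeholder data and feature names, not the authors' implementation.

```python
# Illustrative sketch: extracting feature importances from a fitted
# tree-based classifier and rescaling them so the top feature is 100,
# mirroring the relative ranked importance convention described for Fig. 4.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
features = ["wt_loss", "MBAA", "AUC_6", "temp", "PA", "RBS_human"]  # placeholders
X = pd.DataFrame(rng.normal(size=(300, len(features))), columns=features)
y = rng.integers(0, 2, 300)                      # synthetic binary outcome

model = GradientBoostingClassifier(random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=features)
scaled = 100 * importance / importance.max()     # top feature scaled to 100
print(scaled.sort_values(ascending=False).round(1))
```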
The highest-importance features of the LM molecular model were HA positions 214V, 160T, and 496R (H3 HA numbering throughout).

Morbidity models also showed similar features of importance across the standard and combined models (Supplementary Data 11, 12). Both M1 and M1M models shared area under the curve of days 1–6 (AUC_6), temperature (temp_5), and MBAA as the three features of highest importance (Table 2). Receptor binding preference, polymerase activity, and HA subtype were less impactful but still notable. Molecular position HA-227S had the highest importance in the MM molecular only model and was moderately important in the combined model; positions HA-196Q and PB2-627K were also highly ranked features across both the molecular and combined models. As with the L1M model, the highest-ranked features in the combined M1M model were derived from in vivo experimentation (AUC_6, temp_5) and not sequence data.

With the transmission standard model (T1), the day 1–6 titer area under the curve showed the highest importance, followed by slope1,3, H5 subtype, and RBS (Table 2, Fig. 4B); polymerase activity (PA) had minimal influence (Supplementary Data 13, 14). For the TM molecular only model, PB2-627E, HA-138A, and HA-21S had the strongest impact. Interestingly, unlike the L1M and M1M combined models, the most impactful features of the combined T1M model were derived from molecular-based and not in vivo-derived data (Table 2).

Assessments of relative ranked importance between the three classification models employing similar data types further highlight the variable weight ML algorithms assign to different features. AUC_6 was among the top three ranked features across all standard data type classification models but was substantially less critical in the lethality model (L1) than in either M1 or T1, as wt_loss was the most important feature in L1 (Fig. 4B). Interestingly, among categorical features such as predicted receptor binding preference, models differentially weighted specific variables within a feature (Fig. 4C). For example, while RBS was a comparably weighted feature in both the M1 and T1 models (with categorical responses of avian, human, or dual predicted binding), avian binding was ranked highest among the three responses in the morbidity standard model yet lowest in the transmission standard model. Collectively, close attention to which features are included in or excluded from different classification models sourced from different data types, as well as the relative ranked importance of features within each model, provides valuable context toward understanding the drivers of the phenotypic outcomes predicted by these ML algorithms.

Validation of model predictive metrics on simulated and externally generated in vivo data

The findings discussed above support that the lethality and transmission models trained on our primary dataset of experimentally generated in vivo data had high performance metrics, but it was unknown whether this high performance would be maintained when test data were generated under conditions that diverged from the training data. We first evaluated the performance of models informed by in vivo data metrics (standard lethality and morbidity) by testing data generated from ferret inoculations with two H1N1 IAV from 11 different laboratories (n = 88)24, or from simulated values based upon our primary dataset (Table 2) (see methods).
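The external-validation step described here can be sketched as follows: a model trained on internal data is applied unchanged to an independently generated test set, class calls are made at a chosen probability of prediction threshold, and the same metric panel is recomputed. The model choice, threshold, and data below are placeholders, not those used in this study.

```python
# Hedged sketch of external validation: apply an already-trained classifier
# to an independent test set and recompute headline performance metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, matthews_corrcoef

rng = np.random.default_rng(2)
cols = ["AUC_6", "temp_5", "wt_loss", "PA"]       # placeholder feature names

# Stand-ins for the internal training data and an external laboratory's data.
X_train = pd.DataFrame(rng.normal(size=(400, len(cols))), columns=cols)
y_train = rng.integers(0, 2, 400)
X_ext = pd.DataFrame(rng.normal(size=(88, len(cols))), columns=cols)
y_ext = rng.integers(0, 2, 88)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
threshold = 0.5                                    # placeholder probability threshold
prob = model.predict_proba(X_ext)[:, 1]
pred = (prob >= threshold).astype(int)

print("balanced accuracy:", round(balanced_accuracy_score(y_ext, pred), 3))
print("F1:", round(f1_score(y_ext, pred), 3))
print("MCC:", round(matthews_corrcoef(y_ext, pred), 3))
```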
Overall, performance metrics from the H1N1 (L1-H1N1) and simulated (L1-sim) data were generally comparable to those obtained with our primary dataset, with consistency in some model metrics and a decrease in others (Fig. 5). For L1-H1N1, balanced accuracy (0.821) was lower than for L1 (0.9314), while sensitivity (0.9753) and F1 (0.9814) were higher than for the L1 model (0.9235 and 0.9548, respectively). However, specificity (0.6667) showed a noticeable drop, which also impacted the MCC (0.5594). Metrics for L1-sim were not too dissimilar from L1, but with an increase in specificity (0.7264) and MCC (0.645) and a decrease in sensitivity (0.9061) and F1 (0.8927) relative to L1-H1N1.

Fig. 5: Performance metrics for models tested with externally generated data. Heat map depicting 9 performance metrics and the probability of prediction threshold determined for models including standard data (L1, M1) and molecular only data (LM, TM) from three externally generated datasets (H1N1, sim, pub) as described in Table 2 and the methods. Values range from −1 to 1 (MCC, Kappa) or from 0 (worst, green) to 1 (best, purple) (all other metrics).

Consistent with the other morbidity models, both the H1N1 (M1-H1N1) and simulated (M1-sim) data performed poorly and consistently worse across metrics than the M1 model tested with the primary dataset (Fig. 5). The simulated data, while not very accurate (0.227), performed better than the H1N1 data, which amounted to a random prediction (0.0803). These results support that our well-performing lethality L1 model (which includes features derived from in vivo experimentation) maintained high performance metrics when data were generated under a consistent protocol in-house or when certain inclusion criteria were met between laboratories providing data, despite limitations in the H1N1 dataset due to limited sample size and viral diversity (Table 2); use of simulated data provided a secondary validation approach to overcome these limitations, despite being inherently unrealistic compared to the primary dataset presented here.

To rigorously evaluate the performance of models informed by molecular features alone (lethality LM and transmission TM models), we tested these models for lethality (LM-pub) and transmission (TM-pub) with a dataset of previously published data sourced from 68 publications external to our group that employed experimental conditions comparable to our primary dataset (Supplementary Data 18). Strikingly, the LM-pub model, tested only on ferret lethality outcomes from in vivo experimentation by external research groups, performed comparably well to all lethality models tested with our primary dataset (Fig. 5). We found comparably high balanced accuracy between the LM-pub and LM models (0.888 and 0.9051, respectively), consisting of near-equal sensitivity (0.9036 and 0.9617, respectively) and specificity (0.8723 and 0.8485, respectively). LM-pub and LM also had high F1 scores (0.9417 and 0.967, respectively), with a slightly diminished MCC for LM-pub (0.6283 versus 0.7424) driven by higher false positives. In contrast, the TM-pub model performed poorly (balanced accuracy, 0.4453) and showed a near-random prediction with a slight misclassification bias (MCC, −0.1097) for classification outcomes, suggesting that the TM model did not perform well with independent data, likely due to model overfitting on the internal training data.
Collectively, we found that our LM model, but not our TM model, maintained high predictive accuracy with externally generated data from a variety of independent laboratories, underscoring the importance of including external datasets when validating ML models.
