Disease prediction with multi-omics and biomarkers empowers case–control genetic discoveries in the UK Biobank

Ethics

Our research complies with all relevant ethical regulations. Both biobanks have been approved by the relevant board or committee. Ethics approval for the UKB study was obtained from the North West Centre for Research Ethics Committee (11/NW/0382)45. The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa approved the FinnGen study protocol (number HUS/990/2017)38. The FinnGen study is approved by the Finnish Institute for Health and Welfare (approval numbers THL/2031/6.02.00/2017, amendments THL/1101/5.05.00/2017, THL/341/6.02.00/2018, THL/2222/6.02.00/2018, THL/283/6.02.00/2019 and THL/1721/5.05.00/2019), the Digital and Population Data Service Agency (VRK43431/2017-3, VRK/6909/2018-3 and VRK/4415/2019-3), the Social Insurance Institution (KELA) (KELA 58/522/2017, KELA 131/522/2018, KELA 70/522/2019 and KELA 98/522/2019) and Statistics Finland (TK-53-1041-17).

UKB cohort

The UKB comprises data from 502,226 participants aged 37–73 years at recruitment, with a median age of 58 years46. Of these, 54.4% are female. The data collected from these participants include, but are not limited to, up-to-date diagnosis information, body size measures, blood count measures, blood biochemistry measures, genomics data45 and proteomics data14,15 (for 10% of participants). All participants provided informed consent and participation was voluntary.

FinnGen cohort

FinnGen comprises data from 412,181 individuals (55.9% female) with a median age of 63 years38. All participants provided informed consent and participation was voluntary. We did not apply for access to patient-level data and used only FinnGen GWAS summary statistics to validate our findings.

Population descriptors in UKB

We used peddy47 and 1000 Genomes Project data to classify genetic ancestries at a broad continental level (peddy_prob ≥ 0.9). We additionally removed participant samples where principal component values were more than 4 s.d. 
from the mean for the first four principal components; this step was applied to the large EUR ancestry population only.

Defining cases in UKB

All participants diagnosed with a given ICD10 code across four different UKB fields (UKB v.673123, released in April 2023) were considered as cases. These UKB fields are 41270 (diagnoses: ICD10); 40001 (underlying (primary) cause of death: ICD10); 40002 (contributory (secondary) cause of death: ICD10); and 40006 (type of cancer: ICD10). Out of 13,942 ICD10 codes, 12,081 already had entries for more than 50% of cases in field 41270 (Supplementary Fig. 23a). When analyzing a parent node such as N18, patients diagnosed with any of its sub-nodes (N18.0, N18.1, N18.2, N18.3, N18.4, N18.5, N18.8 or N18.9) were included as well. If, after taking this union, there were at least 100 known cases for an ICD10 code, that code was processed further (Supplementary Fig. 24). This resulted in 3,213 unique disease phenotypes analyzed by MILTON across all ancestries (Supplementary Table 1a), 2,392 of which achieved an AUC > 0.60 and are reported in our public MILTON portal. 
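As an illustration of this case definition, the following minimal sketch merges a participant's diagnosis codes (here assumed to be pre-pooled from fields 41270, 40001, 40002 and 40006 into one mapping) and applies the parent-node union and the 100-case threshold; the function name and data layout are ours, not part of the MILTON code:

```python
def define_cases(diagnoses, icd10_code, min_cases=100):
    """Participants whose pooled diagnosis records contain the code itself
    or any of its sub-nodes (e.g. N18 matches N18.1); returns None when the
    minimum case count is not met, so the code is not processed further."""
    cases = {pid for pid, codes in diagnoses.items()
             if any(c == icd10_code or c.startswith(icd10_code + ".")
                    for c in codes)}
    return cases if len(cases) >= min_cases else None
```

For example, with participants {1: {"N18.1"}, 2: {"N18.3", "E11"}, 3: {"I10"}, 4: {"N18"}} and a lowered threshold of 3, `define_cases(..., "N18", min_cases=3)` returns participants 1, 2 and 4.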
The following additional filtering steps were applied to these cases when applicable.

Filtering based on lag between sample collection and diagnosis

To assess the impact of the time-lag between biomarker sample collection and diagnosis dates for all ICD10 codes with at least 100 known cases, cases were filtered based on given (time-model, time-lag) pairs: (1) time-agnostic (no lag): include all cases irrespective of any time difference between biomarker sample collection and diagnosis; (2) prognostic ('t' years lag): only include cases who received a diagnosis between 0 and 't' years (inclusive) after biomarker sample collection; (3) diagnostic ('t' years lag): only include cases who received a diagnosis between 0 and 't' years (inclusive) before biomarker sample collection.

Therefore, if time-model = prognostic and time-lag = 5 years, the MILTON pipeline is trained on all known cases diagnosed within 5 years after biomarker collection (Supplementary Fig. 24). In the case of multiple lag values per subject, the shortest time-lag was used for analysis. This can happen when analyzing a parent node such as N18 for a patient already diagnosed with sub-nodes such as N18.1 and N18.3 at two different time-points. As diagnosis information can be retrieved from any of the four UKB fields (41270, 40001, 40002 and 40006), the corresponding date field was used to retrieve the time of diagnosis: date of first in-patient diagnosis (41280), date of death (40000) and date of cancer diagnosis (40005). Note that the time-lag was calculated as the difference between biomarker collection date (or biomarker sample collection date when the biomarker was measured from blood or urine samples) and diagnosis date. 
This difference was converted to days and divided by 365, rounding to the closest integer, to obtain the time-lag in years.

We performed a sensitivity analysis of the effect of the sample collection–diagnosis time-lag on 400 randomly selected ICD10 codes. Within each time-model, no significant difference in performance was observed across time-lags (two-sided MWU: P > 0.05; Supplementary Fig. 1a). Furthermore, diagnostic models had consistently higher performance than prognostic models across all time-lags tested (two-sided MWU: P < 0.05; Supplementary Fig. 1a). This difference remained consistent across all ICD10 chapters (Supplementary Fig. 1b) and increased with increasing time-lag (Supplementary Fig. 1c). However, for time-lags ≥ 5 years, diagnostic models had significantly fewer cases for training than prognostic models (two-sided MWU: P ≤ 1 × 10−3; Supplementary Fig. 1d). Additionally, the number of cases available for training increased with increasing time-lag (Supplementary Fig. 1d). This is also evident from the lag distribution across all UKB individuals irrespective of specific diagnosis (Fig. 2a). A longer time-lag resulted in more ICD10 codes satisfying the minimum 100-case criterion for training MILTON models (Methods). Therefore, we selected 10 years as the optimal time-lag, since: (1) the median performance of prognostic models started dropping after the 10-year time-lag; (2) it had comparable performance to other time-lags for diagnostic models; and (3) the number of cases attained with a 10-year time-lag was significantly higher than with the 1- or 5-year time-lags (Supplementary Fig. 1d).

Filtering based on sex-specificity of an ICD10 code

Some diseases might affect the opposite sex on rare occasions or not at all, given anatomical differences between men and women. 
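The lag computation described earlier (date difference in days, divided by 365 and rounded) and the three (time-model, time-lag) filters can be sketched as follows; this is a minimal illustration and the function names are ours:

```python
from datetime import date

def time_lag_years(collection: date, diagnosis: date) -> int:
    """Signed lag in whole years; positive when diagnosis follows collection."""
    return round((diagnosis - collection).days / 365)

def keep_case(lag: int, time_model: str, t: int) -> bool:
    """Apply the (time-model, time-lag) filter from the text."""
    if time_model == "time-agnostic":
        return True
    if time_model == "prognostic":   # diagnosed 0..t years after collection
        return 0 <= lag <= t
    if time_model == "diagnostic":   # diagnosed 0..t years before collection
        return -t <= lag <= 0
    raise ValueError(time_model)
```

For example, a sample collected on 1 January 2010 with a diagnosis on 20 December 2014 gives a lag of 5 years, which passes the prognostic filter at t = 5 or t = 10 but fails the diagnostic one.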
To have sufficiently clean data for training a machine-learning model, cases from the opposite sex were filtered out if an ICD10 code was deemed to dominantly affect one sex (Supplementary Fig. 24). To find a suitable threshold for filtering, 117 male-specific and 606 female-specific ICD10 codes were shortlisted by keyword search. For male-specific diseases, the keywords ‘male’, ‘prostate’, ‘testi’ and ‘patern’ were used. For female-specific diseases, the keywords ‘female’, ‘breast’, ‘ovar’, ‘uter’, ‘pregnan’ and ‘matern’ were used. Only ICD10 codes with at least 100 cases in total were considered for this analysis, resulting in 31 male-specific and 201 female-specific ICD10 codes (Supplementary Table 10). As shown in Supplementary Fig. 23b, 231 out of 232 diseases had a proportion of the dominant sex above 0.9, with the highest number of women being 12 for N40 hyperplasia of prostate (n male = 26,033) and the highest number of men being 128 for C50.9 breast, unspecified (n female = 15,311). The ICD10 code N62 hypertrophy of breast (n male = 317, n female = 864) did not satisfy the 0.9 threshold and includes a condition called gynecomastia, which causes abnormal enlargement of breasts in men. Therefore, N62 was not considered a female-specific phenotype in our analysis.

Defining controls in UKB

All remaining UKB participants not diagnosed with any ICD10 code in the chapter to which the current ICD10 code belongs were considered ‘potential controls’ (Supplementary Fig. 24). To maintain a similar case-to-control ratio across ICD10 codes, the number of controls was restricted to at most a control-case ratio (ctrl_ratio) times the number of cases, where ctrl_ratiomax = 19 when sufficient data were available for robust results (mean Pearson’s r between each replicate pair > 0.9; Supplementary Fig. 25a,b) or ctrl_ratiomax = 9 otherwise (Supplementary Fig. 25c). 
These ‘controls’ were randomly selected from the list of ‘potential controls’, and further filtering was applied when applicable. The final ctrl_ratio may be less than ctrl_ratiomax depending on how many controls remain after applying all filters.

Filtering to match baseline characteristics of cases

According to UKB field 31 (sex), 54.40% of participants are female (n = 273,326) and 45.60% male (n = 229,086) across the entire UKB cohort. This imbalance was significant (P < 1 × 10−3) across multiple age groups (Supplementary Fig. 23c) as well as multiple ICD10 codes with varying case numbers (Supplementary Fig. 23d). To maintain similar distributions of age (UKB field 21003) and sex (UKB field 31) across cases and controls, age was divided into four bins. Across these four bins, the distribution of women (and men) in controls was matched with the distribution of women (and men) in cases. For example, if ten women and five men were diagnosed with a given ICD10 code and their hypothetical distribution across the four age bins is as given in Supplementary Table 11, then controls were sampled to match that case distribution (Supplementary Table 11). This criterion meant that cases were sometimes dropped to obtain matching distributions between cases and controls (Supplementary Fig. 24).

Features

UKB 67 traits

For each participant, quantitative traits (n = 67) collected from blood assays (n = 50), urine assays (n = 4) and physical measures (n = 10), along with the covariates fasting time, sex and age, were used as features for training the machine-learning models (Supplementary Table 12). If a feature was collected during multiple assessment visits (instances 0–3 in UKB), only the first non-null value for each trait was used. The correlation between these features is shown in Supplementary Fig. 26. Low-density lipoprotein direct was highly correlated with apolipoprotein B (Pearson’s correlation coefficient = 0.96) and cholesterol (Pearson’s correlation coefficient = 0.95). 
Similarly, HDL cholesterol was highly correlated with apolipoprotein A (Pearson’s correlation coefficient = 0.92) and negatively correlated with triglycerides (Pearson’s correlation coefficient = −0.44) and waist circumference (Pearson’s correlation coefficient = −0.48).

Despite taking the first non-null value across multiple instances, rheumatoid factor and estradiol had more than 80% of values missing (Supplementary Fig. 22). The UKB has recommended treating these missing values as ‘naturally low’ instead of missing48. Furthermore, we observed that the amount of missingness correlated among features obtained from the same assay category, suggesting that these participants did not have the corresponding tests taken (Supplementary Fig. 27). It is unclear why microalbumin in urine had 68.84% missing values while other features obtained from urine assays had only around 3.50% missing values. Its missingness pattern was also similar to those of estradiol and rheumatoid factor.

It is important to note that the UKB was designed for research purposes, and the same measurements were collected during the baseline visit of each participant to one of the available recruitment centers, regardless of any underlying diagnoses or conditions. These measurements were collected uniformly (except for some missing data for a small number of participants) and in a standardized manner by the UKB for all participants. No additional measurements from any other labs, for example from in-patient hospitalization visits, were included. Thus, there is no mixture of research-based measurements with uncommon clinical measurements that would require further harmonization.

UKB plasma proteomics data

Normalized protein expression values for 2,923 proteins were available for 46,327 UKB individuals of EUR ancestry14,15. If a single protein was recorded across multiple panels, the column with the lowest proportion of missing values was used. 
Age, sex, body mass index, smoking status (20116), alcohol intake (1558), paternal history (20107), maternal history (20110) and blood type (23165) were added as covariates.

UKB date fields for calculating time-lags

Blood sample collection dates and urine sample collection dates were recorded in UKB fields 21842 and 21851, respectively. The dates for recording physical measures, fasting time, sex and age were assumed to be the date when a participant visited the assessment center (UKB field 53). As the difference between these dates is 0 for most participant IDs (Supplementary Table 13), blood sample collection dates were used for all features when calculating the time-lag between sample collection date and diagnosis date (Methods section ‘Filtering based on lag between sample collection and diagnosis’).

Feature pre-processing

UKB has taken extensive measures to ensure the quality of values (https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/biomarker_issues.pdf). To account for missing values and transform the data for machine-learning pipelines, the following pre-processing steps were applied first to the training data and then to the testing data. (1) Missing values in each feature (except rheumatoid factor, estradiol and testosterone) were imputed with the median value of that feature (Supplementary Table 12). As it is recommended to treat missing values in rheumatoid factor and estradiol as ‘naturally low’ instead of missing48, rheumatoid factor was imputed with 0 IU ml−1 (https://www.ouh.nhs.uk/immunology/diagnostic-tests/tests-catalogue/rheumatoid-factor.aspx) and estradiol was imputed with 36.71 pmol l−1 for men and 110.13 pmol l−1 for women (https://www.southtees.nhs.uk/services/pathology/tests/oestradiol/). Testosterone was imputed with the respective median values for men and women. (2) Categorical features were one-hot encoded. 
(3) All features were standardized to have zero mean and unit variance.

Model training

In MILTON, we first select cases and controls for each ICD10 code10 (Data-Coding 19), and then extract biomarker values for each participant to derive a disease-specific signature. Next, we use an eXtreme Gradient Boosting (XGBoost49) model on a single balanced case–control set to find the optimal hyperparameters for a given ICD10 code. We train an ensemble model with fivefold cross-validation using up to four balanced case–control sets (the bound is defined by the initial number of cases and the total UKB cohort size) and repeat for ten stochastic iterations to ensure the entire control set is fed to the model. We then apply the final model to the entire UKB cohort to predict the probability of each UKB participant being a case for the given disease. We then assign individuals as ‘cases’ based on four different probability thresholds (named L0–L3). The L0 class uses the strictest cut-off whereas L3 uses the most lenient, including the largest number of predicted cases. 
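The pre-processing steps (1)–(3) above can be sketched as follows; this is a hedged, stdlib-only illustration in which the feature names and the "M"/"F" sex encoding are our assumptions, and the one-hot encoding step is omitted:

```python
from statistics import median, mean, pstdev

# imputation constants taken from the text (IU/ml and pmol/l)
SPECIAL = {
    "rheumatoid_factor": lambda sex: 0.0,
    "estradiol": lambda sex: 36.71 if sex == "M" else 110.13,
}

def impute(values, name, sexes=None):
    """Step (1): fill missing (None) entries; special rules for rheumatoid
    factor and estradiol, otherwise the column median (testosterone, with
    sex-specific medians, is omitted here for brevity)."""
    if name in SPECIAL:
        return [SPECIAL[name](s) if v is None else v
                for v, s in zip(values, sexes)]
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]

def standardize(values):
    """Step (3): rescale a column to zero mean and unit variance."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]
```

As in the text, the imputation constants would be learned from (or fixed for) the training data and then reused unchanged on the testing data.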
Finally, we repeat rare-variant collapsing analysis4 on each of the augmented cohorts and compare the results against the baseline cohorts used to train the models, as well as against the original PheWAS results that relied exclusively on positively labeled cases4.

Training of the XGBoost classifier was divided into either: (1) two steps when training on the 67 traits: hyperparameter tuning and model prediction; or (2) three steps when training on UKB protein expression data: feature selection, hyperparameter tuning and model prediction.

Package requirements

For development of the MILTON pipeline, the following Python packages were used: python (v.3.10.13), pandas (v.2.1.4), numpy (v.1.22.4), scipy (v.1.11.4), scikit-learn (v.1.3.0), scikit-image (v.0.22.0), xgboost (v.1.7.3), pyarrow (v.14.0.2), holoviews (v.1.18.3), matplotlib (v.3.8.0), pytest (v.7.4.0), Jinja2 (v.3.1.3), pyparsing (v.3.0.9), dask (v.2023.11.0), dask-ml (v.2023.3.24), dask-jobqueue (v.0.8.1), numba (v.0.56.0), tqdm (v.4.65.0), beautifulsoup4 (v.4.12.2) and boruta_py (v.0.3).

For data analysis and visualization, the following Python packages were used: python (v.3.10.13), pandas (v.2.1.4), numpy (v.1.22.4), matplotlib (v.3.8.0), seaborn (v.0.12.2), statannotations (v.0.5.0), upSetPlot (v.0.8.0), missingno (v.0.5.1) and scipy (v.1.11.4).

Feature selection (for proteomics data only)

Training cases were divided into 50% for feature selection and 50% for hyperparameter tuning and model prediction. The Boruta feature selection algorithm with a random forest classifier was trained on this subset of training cases and an equal number of randomly sampled controls (Methods section ‘Defining controls in UKB’). This process was repeated nine times and the set union of ‘confirmed’ features from all iterations was used for further analysis.

Hyperparameter tuning

The number of XGBoost trees (‘n_estimators’) was tuned for each ICD10 code. 
For this, an equal number of age- and sex-matched controls as cases was shortlisted for each ICD10 code. The XGBoost classifier was then fit on the entire dataset using each of the n_estimators values {50, 100, 200, 300} and its performance was recorded as the area under the receiver operating characteristic curve (AUROC). The n_estimators value giving the highest AUROC for each ICD10 code was used for further analysis.

Model prediction

Once the optimal n_estimators was determined for each ICD10 code, the next step was to predict, for each UKB participant, the probability of being a case given their biomarker profile. A predicted probability closer to 1 denotes a higher chance of a participant being a case, while a value closer to 0 denotes a higher chance of being a control. To obtain these predicted probabilities, the steps outlined in the paragraph below were repeated ten times and the average prediction probability across these ten iterations was calculated for each participant in the entire UKB.

In each iteration, all cases (n) diagnosed with the ICD10 code were taken (Methods section ‘Defining cases in UKB’) and controls (ctrl_ratio × n) were randomly sampled (Methods section ‘Defining controls in UKB’). Fivefold cross-validation was performed: in each fold, four-fifths of the data (cases = n × 4/5, controls = ctrl_ratio × n × 4/5) were used for training a ‘Balanced Ensemble Classifier’ and the remaining one-fifth (cases = n × 1/5, controls = ctrl_ratio × n × 1/5) was used for testing model performance. A Balanced Ensemble Classifier trains an XGBoost classifier on an equal number of cases (n × 4/5) and controls (n × 4/5) by randomly subsampling from the ctrl_ratio × n × 4/5 controls without replacement, and repeats the training process ctrl_ratio times to include all controls. 
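The control-subsampling scheme of the Balanced Ensemble Classifier can be illustrated as follows. This is a sketch of the partitioning step only; the actual pipeline fits an XGBoost model on each balanced set, which we omit here:

```python
import random

def balanced_subsamples(case_ids, control_ids, seed=0):
    """Partition the controls into disjoint subsamples, each the same size
    as the case set, so that every control is used exactly once across the
    ensemble (ctrl_ratio = len(control_ids) // len(case_ids))."""
    rng = random.Random(seed)
    controls = list(control_ids)
    rng.shuffle(controls)
    n = len(case_ids)
    ctrl_ratio = len(controls) // n
    return [controls[i * n:(i + 1) * n] for i in range(ctrl_ratio)]
```

Each subsample, paired with the full case set, would train one ensemble member; with ctrl_ratio = 19 this gives the 19 models per fold referred to under ‘FIS calculation’.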
The average performance of these four Balanced Ensemble Classifier repeats is reported as the performance of each cross-validation fold, and the average performance across all five folds is reported as the performance of the overall iteration. An XGBoost classifier is then fitted on the entire training dataset (cases = n, controls = ctrl_ratio × n) to generate prediction probabilities for all UKB participants.

We also compared the predictive performance of XGBoost with other classification models, namely logistic regression (LR) and an autoencoder deep learning (DL) model. We observed that XGBoost performed better than LR and comparably to DL (AUCXGBoost versus AUCLR: 0.655 ± 0.09 versus 0.646 ± 0.09, MWU two-sided P = 0.000012; AUCXGBoost versus AUCDL: 0.655 ± 0.09 versus 0.656 ± 0.08, MWU two-sided P = 0.025; Supplementary Fig. 28).

Additionally, we compared performance reported on the cross-validation set (averaged across five folds before calculating the median across ten replicates) with performance reported on a held-out test set when only 85% of the data were used for training MILTON models and the remaining 15% of unseen data were used for testing. Across 100 ICD10 codes, performance measures were comparable (mostly within a ±0.1 difference range; Supplementary Fig. 29) across the two sets, so we decided to use all available data for training MILTON while reporting performance measures from the cross-validation sets.

MILTON-augmented case cohort generation

For each ICD10 code, four novel case ratio (NCR)-based cohorts, namely ‘L0’, ‘L1’, ‘L2’ and ‘L3’, were generated. Here, NCR is defined as 1 + ((number of new cases)/(number of known cases)). 
The distribution of sex in each new cohort was matched with that of known cases.

To generate the NCR-based MILTON cohorts with known and new cases for each ICD10 code, all known cases were assigned an average prediction probability of 1 and any not-known participant with probability less than 0.7 was filtered out. All remaining not-known participants were ranked in decreasing order of their average prediction probability; that is, the participant with the highest prediction probability has rank 1, the second highest has rank 2, and so on. These ranks were then divided by the number of known cases to obtain an (NCR − 1) score per participant. Therefore, if the total number of known cases is ten, then adding all not-known participants with rank less than or equal to 3 gives an NCR of 1 + (3/10) = 1.3. The range of NCR was clipped between 1 (no new cases) and 10 (number of new cases = 9 × number of known cases) and divided into four quantiles: NCRq1, NCRq2, NCRq3 and NCRq4. All not-known cases with rank less than or equal to (NCRq1 − 1) × (number of known cases) were added to the ‘L0’ cohort, all not-known cases with rank less than or equal to (NCRq2 − 1) × (number of known cases) were added to the ‘L1’ cohort, and so on for the ‘L2’ and ‘L3’ cohorts. All known cases were then added to each of these NCR-based case cohorts.

To assess the suitability of using dichotomized MILTON-augmented cohorts, that is, considering an individual as either a case or a control, we also repeated the rare-variant collapsing analysis taking the MILTON-predicted probability scores as continuous measures (Supplementary Notes). Upon comparing gene–ICD10 associations (P < 10−8) obtained from this new approach with those obtained using dichotomized cohorts, only 30.37% (n = 82) of baseline associations were recovered with the same direction of effect, compared with 87.41% (n = 236) recovered when using the dichotomized cohorts (Supplementary Fig. 
10 and Supplementary Table 14). Furthermore, we noticed that while we recovered 99.62% (n = 1,043) of known quantitative and 73.63% (n = 134) of putative novel associations with the new approach, it also led to a staggeringly high number of putative novel associations (n = 7,365). Given the low sensitivity achieved when using prediction scores as a continuous phenotype, we expect a very large number of false positives among the putative novel hits. We believe that these signals may be contributed by the genetic architecture of individuals with lower probability scores and therefore decided to proceed with the original dichotomized cohorts.

PheWAS collapsing analysis on case–control cohorts

The PheWAS collapsing analysis was performed on each {ICD10 code, case cohort, QV model} combination using whole-genome sequencing data from UKB participants of different ancestries (EUR = 419,391; SAS = 8,577; AFR = 7,731; EAS = 2,312; AMR = 706). All non-cases according to each {ICD10 code, case cohort} pair were taken as controls and matched with the sex distribution of cases. A 2 × 2 contingency table was generated between cases and controls with or without QVs in each gene4. A two-sided Fisher’s exact test was performed with the null hypothesis that there is no difference in the number of QVs in each gene between cases and controls4. All genes with P < 1 × 10−8 were considered significantly associated with the given ICD10 code for a given QV model and case–control cohort.

Clonal hematopoiesis is the presence of clonal somatic mutations in the blood of individuals without a hematologic malignancy50, and these somatic variants can be detected with germline variant callers51. 
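The per-gene test above is a two-sided Fisher's exact test on the 2 × 2 table of QV carrier counts. A dependency-free sketch is shown below (the production analysis presumably uses a library implementation; this version sums the hypergeometric probabilities of all tables no more likely than the observed one):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the table [[a, b], [c, d]], e.g.
    a = cases with a QV, b = cases without, and c/d likewise for controls."""
    n = a + b + c + d
    r1, c1 = a + b, a + c                      # row and column margins
    denom = comb(n, c1)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    def p_table(x):                            # probability of a table with a = x
        return comb(r1, x) * comb(n - r1, c1 - x) / denom
    p_obs = p_table(a)
    # sum all tables at most as probable as the observed one (small tolerance
    # guards against floating-point ties)
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs * (1 + 1e-12))
```

For the table [[3, 1], [1, 3]] this returns 34/70 ≈ 0.486, matching the standard two-sided result.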
To avoid conflating somatic with germline variants, and to avoid age-confounded disease associations, we removed from further analysis 16 genes we previously identified in the UKB as carrying clonal somatic mutations15,52 (ASXL1, BRAF, DNMT3A, GNB1, IDH2, JAK2, KRAS, MPL, NRAS, PPM1D, PRPF8, SF3B1, SRSF2, TET2, TP53 and CALR). A few other problematic genes listed in ref. 4 were also removed from further investigation.

GWAS analysis on case–control cohorts

We selected 20 binary phenotypes for common-variant GWAS analysis by randomly selecting one ICD10 code from each of the four AUC-specific and five training-case-specific bins obtained from time-agnostic, EUR ancestry models (Supplementary Table 15). We used UKB imputed genotypes (UKB field 22828) to perform GWAS across the baseline and L0–L3 MILTON cohorts using REGENIE (v.3.1)53 with sex, age, age × sex, age2, age2 × sex, array batch and ancestry principal components (principal components 1–10, as supplied by the UKB) as covariates. For step 1, we used high-quality genotyped (UKB field 22000) variants with minor allele frequency (MAF) > 0.01, minor allele count (MAC) > 100, missing genotype fail rate < 1% and Hardy–Weinberg P > 1 × 10−15, which we pruned based on linkage disequilibrium (LD) using PLINK2 (R2 < 0.8, using a 1-Mb window with 100-kb step size). For step 2, we included all imputed single nucleotide polymorphisms with an INFO score > 0.7 and a minimum MAC of 400, used the same covariates as for step 1 and applied approximate Firth regression to adjust for case–control imbalance.

To define association loci for each trait, we first selected all variants with P < 5 × 10−8 across the ‘known’ and L0–L3 cohorts. We then iteratively clumped these variants based on a 2-Mb region, regardless of cohort, taking the variant with the smallest P value to index the locus across all cohorts. 
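The iterative clumping step can be sketched as follows. This is a minimal version under our assumption that the 2-Mb region means ±1 Mb around the index variant; `variants` holds (chromosome, position, P value) tuples pooled across cohorts:

```python
def clump_loci(variants, window_bp=2_000_000):
    """Repeatedly take the remaining variant with the smallest P value as a
    locus index, then drop all other variants on the same chromosome within
    half the window on either side of it."""
    remaining = sorted(variants, key=lambda v: v[2])   # ascending P value
    loci = []
    while remaining:
        chrom, pos, p = remaining.pop(0)
        loci.append((chrom, pos, p))
        remaining = [v for v in remaining
                     if v[0] != chrom or abs(v[1] - pos) > window_bp // 2]
    return loci
```

For example, two chromosome-1 hits 499 kb apart collapse into one locus indexed by the smaller P value, while a hit 3 Mb away defines a separate locus.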
We annotated the closest gene using Ensembl (GRCh37) BioMart via the biomaRt package.

FIS calculation

During model training, the XGBoost classifier was enabled to return relative biomarker (feature) importance scores (FISs) for each feature. The FIS was calculated for each tree (‘n_estimator’) in XGBoost using the Gain measure, which quantifies the average performance improvement when a given feature is used for a decision split. This process was repeated for all ‘n_estimators’ and the average FIS was reported for each trained XGBoost model. Because we used a Balanced Ensemble Classifier to train 19 XGBoost models on 19 different subsamples of controls (ctrl_ratio parameter; Methods section ‘Model prediction’) within fivefold cross-validation for each of ten replicates, the mean of those 19 × 5 × 10 average FISs was reported for each ICD10 code and feature as the final importance score.

Finding the optimal FIS cut-off

For each time-model, nonzero FISs per ICD10 code were first log-transformed and then standardized to have zero mean and unit standard deviation. Transformed FISs of the top features per ICD10 code per time-model were extracted and compared with the median AUC of ICD10 codes across ten replicates. Transformed FISs of top features correlated highly with the AUC performance of ICD10 codes (Pearson’s correlation coefficient > 0.80; Supplementary Fig. 7e). This suggested that high-performing models with AUC > 0.80 identified dominating features and assigned them high FISs (>4 s.d. away from the mean FIS). A threshold of 1.2 was therefore chosen for further analysis because it captured dominating features for most ICD10 codes with median AUC > 0.6, for which we made results available. This value comes from rounding up to the first decimal place the 25th percentile of feature-importance Z-scores for top features of ICD10 codes with AUC 0.60–0.65 (Supplementary Fig. 7e). 
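The transformation behind this cut-off can be sketched as follows; a minimal illustration in which the feature names and values are invented, with real FISs coming from the trained XGBoost models:

```python
import math
from statistics import mean, stdev

def fis_zscores(fis):
    """Log-transform the nonzero FISs of one ICD10 code and standardize to
    zero mean and unit s.d.; features with a Z-score above the 1.2 threshold
    would then count as dominating."""
    logged = {f: math.log(s) for f, s in fis.items() if s > 0}
    mu, sd = mean(logged.values()), stdev(logged.values())
    return {f: (x - mu) / sd for f, x in logged.items()}
```

Zero-importance features drop out before the transform, and by construction the returned Z-scores are mean-centered, so a single dominating feature stands out with a large positive score.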
By selecting this cut-off, we confirmed as a positive control for PKD1 (which had a relatively large number of putative novel associations in the unfiltered set of results) that the number of putative novel associations remained below the number of known binary associations, while it started to exceed them at higher FIS Z-score cut-offs (Supplementary Fig. 7f).

Comparison with Mantis-ML (v.2.0)

Stepwise hypergeometric tests proceed by identifying a ranked list of genes (ordered from most to least associated according to Mantis-ML (v.2.0)39,40) and a set of genes ‘known’ to be associated. The latter may be a list of tentatively associated genes, provided, for example, by a PheWAS with a relaxed P value threshold. The test then iterates through the ranked list, marking the top N, N + 1, N + 2 and so on genes as associated, and quantifies the overlap between this set and the known associated genes using Fisher’s exact test. In this way, we can observe whether there is substantial, non-chance overlap between the two without specifying a threshold for the ranked list. The output of this exercise is a sequence of P values that can be converted to scores via the −10log10P transformation. Finally, to provide a simple summary statistic for the test, we take the AUC, where the curve is defined as max(−10log10(0.05), −10log10P), that is, the area exceeding α = 0.05 significance. Requiring that the curves exceed 0.05 significance allows a certain degree of noise to be filtered out. In this setting, genes from MILTON with weak significance were used, and the top 5% of Mantis-ML (v.2.0)-associated genes were used as the ranked list. 
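Under our reading, the stepwise test can be sketched with a hypergeometric survival function standing in for the one-sided Fisher's exact test on the implied 2 × 2 overlap table; the function names and this simplification are ours:

```python
from math import comb, log10

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M total, of which n are 'known'."""
    return sum(comb(n, x) * comb(M - n, N - x)
               for x in range(k, min(n, N) + 1)) / comb(M, N)

def stepwise_scores(ranked, known, total_genes):
    """Score the top-1, top-2, ... prefixes of the ranked list; each
    -10*log10(P) value is floored at alpha = 0.05 via the max(...) rule."""
    known = set(known)
    floor = -10 * log10(0.05)
    scores, hits = [], 0
    for i, gene in enumerate(ranked, start=1):
        hits += gene in known
        p = hypergeom_sf(hits, total_genes, len(known), i)
        scores.append(max(floor, -10 * log10(p)))
    return scores
```

Summing (or integrating) the resulting curve above the α = 0.05 floor then gives the single AUC summary described in the text.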
Of note, Mantis-ML (v.2.0) uses HPO terms while MILTON uses ICD10 codes; thus, manual HPO-to-ICD10 mapping was first performed (Supplementary Table 16).

Comorbidity enrichment analysis

The comorbidity enrichment analysis performs a Fisher’s exact test for all distinct ICD10 diagnoses within the selected cohort of UKB participants to find diseases with significantly higher incidence in the cohort than in the general population (the rest of the UKB participants). ICD10 codes are scored according to the results of their Fisher’s tests to highlight the most relevant ones. Fisher’s exact test produces two numbers: a P value and an odds ratio. The former measures how significant the disease enrichment in the selected cohort is, while the latter corresponds to the effect size, that is, how much more likely participants from the selected cohort are to be diagnosed with the disease. The score is taken to be the log of the odds ratio, capped at a high value and normalized to 100 to facilitate visualization of the results.

Statistics and reproducibility

Data distribution was assumed to be normal, but this was not formally tested. N = 484,230 whole-genome sequencing samples from the UKB were derived after quality-control checks and filtering for relatedness. ICD10 codes with fewer than 100 diagnosed cases were excluded from analysis (Methods section ‘Defining cases in UKB’). Under the prognostic or diagnostic time-models, individuals diagnosed before or after sample collection, respectively, were excluded (Methods section ‘Filtering based on lag between sample collection and diagnosis’). For ICD10 codes inferred to be sex-specific, individuals of the opposite sex were excluded from analysis (Methods section ‘Filtering based on sex-specificity of an ICD10 code’). 
For a given ICD10 code, individuals diagnosed with any code within the same ICD10 chapter were excluded from the control set (Methods section ‘Defining controls in UKB’). Randomization was performed ten times to select case-matched controls for each ICD10 code (Supplementary Fig. 24 and Methods section ‘Model prediction’). The investigators were not blinded to allocation during experiments and outcome assessment. Data collection and analysis were not performed blind to the conditions of the experiments.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
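As a final illustration, the scoring rule from ‘Comorbidity enrichment analysis’ above can be sketched as follows; the cap of 100 on the odds ratio is our illustrative choice, as the text only says the log odds ratio is "capped at a high value":

```python
import math

def comorbidity_score(odds_ratio, cap=100.0):
    """log(odds ratio), clipped to [0, log(cap)] and rescaled to 0-100."""
    clipped = min(max(math.log(odds_ratio), 0.0), math.log(cap))
    return 100.0 * clipped / math.log(cap)
```

With this cap, an odds ratio of 1 scores 0, an odds ratio of 10 scores 50 and any odds ratio of 100 or more scores the maximum of 100.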
