expHRD: an individualized, transcriptome-based prediction model for homologous recombination deficiency assessment in cancer | BMC Bioinformatics

Development of the HRD prediction model using the transcriptomeA comprehensive dataset of 10,068 samples representing 34 cancer types from the TCGA-pan cancer cohort, encompassing both tumour and blood samples with SNP array copy number variation and WTS, was harnessed for training, development, and evaluation of the HRD prediction model (Fig. 1). The scarHRD score for each sample was derived by aggregating HRD-associated genomic scars—namely, LOH, LST, and TAI.Fig. 1Schematic representation of transcriptome-based HRD prediction model development and validation. Schematic illustration describing the overall process of serial machine-learning training and validation for the development of an HRD prediction algorithm in the cancer transcriptomeIn a sequential manner, the dataset was stratified into two subsets: a training set (n = 8041) comprising 80% of the samples, and a test/validation set (n = 2027) comprising the remaining 20%. The training set included 1422 HRD and 6619 HR-proficient (HRP) samples, while the test/validation set contained 353 HRD and 1674 HRP samples. Gene selection was executed by retaining 20,502 genes from RNA-seq data and filtering down to 4436 DEG based on scarHRD status within the TCGA-OV set—an essential component of the clinical HRD prediction algorithm evaluation.Gene feature optimisation was accomplished through iterative machine-learning training employing elastic net and bootstrap methodologies. A total of 356 genes were meticulously chosen to facilitate the calculation of expHRD, which has applicability at the individual patient level in clinical contexts. Subsequent validation processes encompassed internal validation (TCGA-OC test, n = 58) and external validation (GDC PanCanAltas, ovarian cancer, n = 112).Performance evaluation of the HRD prediction modelThe expression data of initially selected genes (n = 4436) from the pan-cancer training set were subjected to various linear regression models, such as Ridge, Lasso, elastic net regression, SVM, Gradient Boosting Regressor (GBM), and Multilayer Perceptron (MLP), for the prediction of corresponding sample scarHRD scores (Table S1). The gene count ranged from 862 to 4436, yielding R-squared values varying from 0.6375 to 0.7364, dependent upon the regression models (Table 1). The R-squared value is indicative of the predictive accuracy of our machine learning model, representing the proportion of variance in the actual scarHRD scores that our model can explain. Notably, the elastic net regression model, among the regression options, exhibited the highest correlation with the scarHRD score and was subsequently chosen for refining the prediction algorithm. Five-fold cross-validation demonstrated feature number saturation (n = 2538) and yielded a Pearson’s correlation coefficient (PCC) of 0.858 (P < 0.0001) and 0.788 (P < 0.0001) in the TCGA-pan cancer and OV test cohorts, respectively (Fig. 2a). The Pearson’s correlation coefficient measures the linear correlation between the actual scarHRD scores and the HRD scores predicted by our machine learning model, providing insight into the strength and direction of this linear relationship.Table 1 Performance evaluation of various machine-learning methodsFig. 2Evaluation of the HRD prediction model in the TCGA-pan cancer. a Cross-validation analysis of the machine-learning model. The x-axis denotes the number of elastic net cross-validation iterations. The left-y-axis signifies the count of features (genes), while the right-y axis indicates Pearson’s correlation coefficient (PCC) with the scarHRD score post-machine learning. Black closed circles linked by solid lines and white circles connected by dotted lines correspond to the gene count and PCC, respectively, across each cross-validation step. b Correlation pattern across TCGA-pan cancer cohorts. Bar graph depicting the PCC between the predicted HRD score and scarHRD score in the TCGA-pan cancer test set, encompassing various cancer types including KIRP (kidney renal clear papillary cell carcinoma), UCEC (uterine corpus endometrial carcinoma), BRCA (breast invasive carcinoma), KICH (kidney chromophobe), BLCA (bladder urothelial carcinoma), CESC (cervical squamous cell carcinoma and endocervical adenocarcinoma), OV (ovarian serous cystadenocarcinoma), STAD (stomach adenocarcinoma), SARC (sarcoma), UCS (uterine carcinosarcoma), LIHC (liver hepatocellular carcinoma), PRAD (prostate adenocarcinoma), LGG (brain lower grade glioma), TNBC (triplet negative breast cancer), HNSC (head and neck squamous cell carcinoma), MESO (mesothelioma), READ (rectum adenocarcinoma), SKCM (skin cutaneous melanoma), LUAD (lung adenocarcinoma), PAAD (pancreatic adenocarcinoma), ESCA (esophageal carcinoma), COAD (colon adenocarcinoma), KIRC (kidney renal clear cell carcinoma), ACC (adrenocortical carcinoma), LUSC (lung squamous cell carcinoma), THYM (thymoma), CHOL (cholangiocarcinoma), PCPG (pheochromocytoma and paraganglioma), GBM (glioblastoma multiforme), THCA (thyroid carcinoma), UVM (uveal melanoma), DLBC (lymphoid neoplasm diffuse large B-cell lymphoma), and TGCT (testicular germ cell tumours). Significance levels denoted as *, **, and *** indicate P-values < 0.05, < 0.001, and < 0.0001, respectively. The frequency of HRD (scarHRD score ≥ 42) in each tumour type is displayed. c Correlation between scarHRD and predicted HRD score (pHRD) in the TCGA-pan cancer test set. Pearson’s correlation-regression line was calculated, with the dark dotted line illustrating pan-cancer correlation and the red line representing TCGA-OV set correlation. The numeric number in each bar plot represents the frequency of HRD positive samples in cancer types. Frequency: the number of HRD positive sample / the number of sample with scarHRD scoreFollowing elastic net regression, the R-squared values were as follows: pan-cancer, 0.7364; ovarian cancer (OV), 0.592; and triple-negative breast cancer (TNBC), 0.4682. Similar conclusions were drawn from the performance metrics of various machine-learning models for predicting HRD-high samples (scarHRD ≥ 42) within the TCGA-OV test set (Table S2).The PCC of the HRD prediction score with scarHRD was 0.8584 (P < 0.0001) in the pan-cancer test set. In most cancer types, excluding those with limited clinical significance in relation to HRD status (e.g., cholangiocarcinoma, uveal melanoma, diffuse large B-cell lymphoma, testicular germ cell tumour), a significant correlation with the scarHRD score was observed (Fig. 2b, c). Notably, HRD prediction value in clinically relevant cancer types—such as OV, breast cancer (BRCA), TNBC, and prostate adenocarcinoma (PRAD)—displayed substantial correlations with scarHRD score (PCC > 0.7). Particularly in the case of TCGA-OV, a validation cohort with clinical relevance, the HRD prediction score showcased a significant and remarkable correlation with the scarHRD score (PCC = 0.7879, P = 2.1556e−13, Fig. 2c).Bootstrap for feature optimisation of the HRD prediction modelFurther refinement of gene features was achieved through sequential machine-learning analysis employing elastic net and bootstrap techniques (Fig. 1). The gene feature count for HRD score prediction was optimised, reducing from 4436 to 2538 via elastic net regression (Ela). Subsequently, a second-round elastic net regression (Re-Ela) further streamlined this number to 2337. To achieve maximal gene feature optimisation, we embraced the bootstrap method, maintaining elastic net regression parameters. Through this iterative process involving random sampling and training set generation (n = 229) over 100 iterations, 356 genes consistently emerged in 98 out of 100 bootstrap instances (Fig. S1a). Notably, the PCC with scarHRD score improved from 0.533 to 0.767 across optimisation stages, while gene features dwindled from 4436 to 356.The final selection encompassed 356 gene features classified as positively correlated (n = 183) and negatively correlated (n = 173), out of which a significant majority—94.8% for positively correlated and 94.0% for negatively correlated—were identified as coding sequences (CDS) (Fig. S1b and S2). In addition, we used the EnrichR package to perform gene set enrichment analysis to investigate the association of HRD score with positively associated genes (Table S3) and negatively associated genes (Table S4) with specific pathways. The results revealed that genes such as ATR and AURKA and pathways related to “Regulation of DNA repair” and “Cell cycle process” exhibited significant positive correlation with HRD score. On the other hand, genes and pathways associated with “Cell cycle” or “DNA damage checkpoint”, including BRCA1, demonstrated a negative correlation with HRD.Development and validation of the expHRD algorithmCalculation of expHRD was accomplished through single-sample gene set analysis (ssGSEA) using a gene set derived from the bootstrap process. This computation was executed using a newly devised equation to tailor the expHRD values per sample: The expHRD score is determined by subtracting the ssGSEA score of HR-negative genes from that of HR-positive genes. Notably, the PCC of expHRD against scarHRD score was 0.768 (P = 2.045e−12) in the TCGA-OV test set (Fig. 3a) and 0.633 (P = 6.655e−14) in the GDC ovarian cancer cohort (Fig. S3a). For evaluating the predictive performance of expHRD concerning scarHRD-high samples, receiver operating characteristics (ROC) curve analysis was conducted, revealing area under curve (AUC) values of 0.872 and 0.806 in TCGA-OV (Fig. 3b) and the GDC ovarian cancer cohort (Fig. S3b), respectively. To classify samples as having high or low expHRD, we utilized a regression analysis to align expHRD scores with those obtained from scarHRD, thereby standardizing our scoring against a recognized metric. Samples exhibiting an expHRD score of 1000 or higher were classified as having a high level of HRD.Fig. 3Validation of expHRD performance in the TCGA-OV test set. a Pearson’s correlation between expHRD and scarHRD in the TCGA-OV test set (n = 58, PCC = 0.768, p = 2.045e−12). The blue line denotes the regression line, while the shaded area represents the 95% confidence interval (CI). b Receiver operating characteristic (ROC) curve plotting sensitivity against 1-specificity values for expHRD score’s capacity to predict scarHRD-high instances within TCGA-OV test samples. c, d Kaplan–Meier overall survival analysis contrasting patients with high vs. low scarHRD (c) or expHRD (d) status within the TCGA-OV test set. P-values obtained via the log-rank testNext, we analyzed the association between the mutation and promoter methylation of BRCA1/2 genes, and the distribution of expHRD calculated using our algorithm in the TCGA-OV and -TNBC test set (n = 58 and 12, respectively). The results revealed a statistically significant enrichment of BRCA1/2 mutations and methylation in samples with higher expHRD scores (expHRD ≥ 1000; P = 4.973e−04; Fig. S4). The accuracy and precision for predicting scarHRD were 0.71 and 0.58, respectively.Also, we analyzed the association of Classifier of Homologous Recombination Deficiency (CHORD) which predicts the status of HRD by utilizing specific single nucleotide variants (SNV), short insertions/deletions (indel) and structural variants (SV) types [9], scarHRD, and expHRD with 23 samples in the TCGA-OV cohort. CHORD exhibited significant associations not only with scarHRD (Spearman’s rho = 0.660; p = 6.16e−4) but also with expHRD (Spearman’s rho = 0.484; p = 1.63e−2; Fig. S5). Furthermore, we analyzed the distribution patterns of expHRD and CHORD based on mutations and methylation within BRCA1/2 genes (Fig. S6). Both the Australian Ovarian Cancer Study (AOCS) and TCGA-OV/GDC-OV cohorts showed higher expHRD scores in samples with BRCA mutations and methylation compared to the wild type, and CHORD exhibited a similar pattern in AOCS (Fig. S6a, c) and TCGA-OV/GDC-OV (Fig. S6b, d) dataset.To compare the performance of expHRD with another HRD-associated gene set (n = 230) identified by Peng et al., we performed HRD scoring based on ssGSEA using the HRD-associated genes presented in the paper and tested its ability to distinguish scarHRD high and low [25]. The results showed that our expHRD algorithm yielded notably enhanced prediction potency, as reflected in elevated accuracy and AUC values in both TCGA-OV test set and GDC cohorts (Table 2).Table 2 Performance comparison with other HRD prediction tools on the TCGA-OV and GDC data setsSubsequent scrutiny of the clinical relevance of expHRD encompassed intra-cohort (TCGA-OV test set, n = 58, Fig. 3c, d) and extra-cohort (GDC ovarian cancer, n = 112, Fig. S5c, d) samples. Recent studies have reported that the HRD score is a clinically proven indicator for ovarian cancer prognosis [21, 30]. As anticipated, patients with scarHRD-high tumours exhibited significantly improved overall survival (TCGA-OV median survival: 1538 days ± confidence interval (CI) 95% (range: 1187–4624), GDC median survival: 1511 days ± CI 95% (range: 1355–2154)) relative to those with scarHRD-low tumours (TCGA-OV median survival: 679 days ± CI 95% (range: 455–1448), GDC median survival: 871 days ± CI 95% (range: 681–1662)), as evidenced in both the TCGA-OV test set (p = 0.0004, Fig. 3c) and GDC ovarian cancer cohort (p = 0.00037, Fig. S7a). Moreover, patients with high expHRD tumours (TCGA-OV median survival: 2717 days ± CI 95% (range: 1249–4624), GDC median survival: 1442 days ± CI 95% (range: 1264 ~ 1933)) experienced significantly superior overall survival compared to cases with low expHRD (TCGA-OV median survival: 760 days ± CI 95% (range: 627–1448), GDC median survival: 949 days ± CI 95% (range: 690–1946) in both the TCGA-OV test set (p = 0.00047, Fig. 3d) and GDC ovarian cancer cohort (p = 0.011, Fig. S7b).We next validated whether the expHRD scoring system could reflect the functional restoration of HRD in recurrent samples with BRCA reversion mutations compared to primary tumors. To this end, we analyzed expHRD in five pairs of primary and recurrent samples with identified BRCA reversion mutations from the AOCS cohort [38]. The analysis revealed that two pairs of samples (AOCS_091 and AOCS_167) exhibited decreased expHRD scores in recurrent tumors, while this pattern was not observed in the remaining three pairs (Fig. S8). These results suggest that the expHRD scoring system, designed primarily to predict scarHRD scores, may have limitations in detecting functional restoration, similar to the scarHRD system.Web server development for expHRD calculationTo facilitate expHRD calculation, we developed a user-friendly web service enabling researchers to obtain predicted HRD scores akin to scarHRD by uploading their transcriptome data, even for a single tumour. The expHRD webserver (http://www.genome-intelligence-lab.org/expHRD/) is powered by Apache and constructed using the Django framework. Data management is governed by sqlite3 (https://www.sqlite.org/). On the client side, HTML5 (https://html.com/html5/) and JavaScript (https://www.javascript.com/) were employed to create interactive user interface components. Ensuring a responsive user experience, the web server employs bootstrap (https://getbootstrap.com/), jqWidgets (https://www.jqwidgets.com), and plotly libraries (https://plotly.com/javascript/). Noteworthy, expHRD prioritises user privacy, refraining from cookies or personal information collection. Moreover, it guarantees compatibility across major web browsers—Microsoft Edge, Google Chrome, Apple Safari, and Mozilla Firefox. The website is accessible without login requisites and is open to all users. The web server showcases calculated HRD scores (expHRD) upon clicking “Upload” to input DESeq2-based gene-expression profiles of single or multiple samples, followed by the “Run” button (Fig. 4a, b). Users are provided estimated values, along with 95% confidential intervals, derived from expHRD, allowing distribution-based comparative analyses with TCGA-OV samples (Fig. 5a, b). The predicted HRD is a score recalculated from the ssGSEA-based expHRD score of the designated samples and parameters obtained from the linear regression model of scarHRD (Fig. 5).Fig. 4The web interface for expHRD calculation. The front page of the web interface for expHRD calculation demonstrates the query for uploading (a) and the input file overview (b) of the user’s gene-expression dataFig. 5The result page of the web interface. a The result page of the web interface exhibited upon selecting “Run” after file upload presents the HRD score graph with the TCGA-OV test sample’s regression function (a, left). The histogram illustrates the distribution of HRD scores (a, right). b The result table showcases the calculated expHRD alongside the predicted HRD (p_hrd), both derived from the expHRD outcomes. The values for lower_hrd and higher_hrd are determined by the 95% confidence intervals (CI) of the predicted HRD

expHRD: an individualized, transcriptome-based prediction model for homologous recombination deficiency assessment in cancer | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis