Identification of lysine lactylation (kla)-related lncRNA signatures using XGBoost to predict prognosis and immune microenvironment in breast cancer patients

Collection of kla-related lncRNAsInitially, we extracted 327 genes associated with kla from previously published literature. Subsequently, co-expression analysis employing the Pearson correlation algorithm identified 756 lncRNAs associated with kla in the TCGA database.Construction of a prognostic risk assessment modelIn this investigation, we enrolled 1,082 BC patients, excluding those with missing clinical data. Random division assigned them to a training set (n = 756) and a test set (n = 326) following a 7:3 ratio. Statistical analysis revealed no significant differences in demographics, including age, gender, or staging, between the two groups (Table 1). Initially, uni-Cox regression analysis identified 48 lncRNAs (Fig. 1A). Following this, XGBoost analysis screened 10 distinctive lncRNAs (Fig. 1B), with 8 being incorporated into the multivariate Cox proportional risk model. The risk score was determined through the application of the multivariate Cox regression formula as follows: risk score = MIR4435-2HG × (0.867022528575273) + EGOT × (-0.235669144798444) + ST7-AS1 × (-1.11551827892176) + PDCD6IP-DT × (− 0.750754400726742) + ERICH6-AS1 × (− 1.20438658732549) + LINC00310 × (0.882320522314723) + OTUD6B-AS1 × (0.423845118028363) + LINC01871 × (− 0.415778456572796). The resulting risk scores for each modeled lncRNA are presented in the three-line table (Supplementary Table 1). Simultaneously, Sankey diagrams depicted the correlation between the 8 lncRNAs and the initial set of 13 genes related to kla (Fig. 1C). People received risk scores through the scoring system and were then grouped into high-risk or low-risk categories according to the median risk score value.
Table 1 Demographics of 1082 patients.Fig. 1Identification of prognostic kla‐related lncRNAs in BC. (A) Prognostic lncRNAs that were derived from the uni‐Cox regression analysis. (B) Negatively biased logarithms of the XGBoost algorithm for cox risk-proportional regression vary depending on the number of iterations. (C) Sankey diagram of the co-expression relationship between 8 kla-related lncRNAs and related mRNAs. (D) PCA between the high- and low-risk groups based on total genes. (E) PCA between the high- and low-risk groups based on kla-related genes. (F) PCA between the high- and low-risk groups based on kla-related lncRNAs. (G) PCA between the high- and low-risk groups based on signature lncRNAs.Validation of prognostic risk assessment modelsAs depicted in Fig. 2A–C, our analysis indicated that individuals in the high-risk category experienced a less favorable prognosis compared to those in the low-risk group. Expanding on this finding, Fig. 2D–F demonstrate the validation of our risk assessment model. Consistent results from Kaplan–Meier analysis demonstrated that individuals in the high-risk category had a lower overall survival than those in the low-risk group (Fig. 2D). This trend persisted in both the test cohort and the overall cohort (Fig. 2E, F). In Fig. 3B–K we assessed the prognostic significance of eight kla-related lncRNAs across various patient subgroups (Age ≤ 65, Age > 65, Metastasis negative, Metastasis positive, node negative, node negative, Stage I–II, Stage III–IV, T1–2, T3–4). Our findings demonstrated the predictive potential of these features across all examined subgroups. Patients classified in the high-risk category exhibited poorer prognoses compared to those in the low-risk category. These results indicate that these eight kla-related lncRNAs hold predictive value for breast cancer prognosis across different clinical stages. The discriminatory efficacy of various gene sets was demonstrated using PCA, depicted in four plots (Fig. 1D–G). Our findings emphasize that among the four gene sets (signature lncRNAs, kla-related lncRNAs, kla-related genes, and total genes), signature lncRNAs exhibited the most robust discriminatory efficacy.Fig. 2Prognosis of the risk model in the different sets. (A) Risk score distribution, patients’ survival status, and heatmap of 8‐lncRNA prognostic signature in the training set. (B) Risk score distribution, patients’ survival status, and heatmap of 8‐lncRNA prognostic signature in the test set. (C) Risk score distribution, patients’ survival status, and heatmap of 8‐lncRNA prognostic signature in the total set. (D–F) K‐M survival curves of OS of patients between the two groups in the training, test, and total set. (G–I) Time‐dependent ROC analysis for 1-, 3‐ and 5‐year survival probability by the 8‐lncRNA signature in the training set, test set, and total set.Fig. 3Risk curve under different clinical and pathological variables. (A) The correlation between 8 lncRNAs and clinicopathologic features was evaluated. (B, C)Age. (D, E)M. (F, G)N. (H, I)Stage. (J, K)T. Abbreviations: T, tumor; N, lymph node; M, metastasis.Prognostic properties of 8 lncRNAsWe assessed the correlation between eight lncRNAs and clinicopathologic features. The low-risk profile group exhibited elevated expression levels of EGOT, ST7-AS1, PDCD6IP-DT, ERICH6-AS1, and LINC01871, whereas the high-risk group demonstrated increased expression of MIR4435-2HG, LINC00310, and OTUD6B-AS1 (Fig. 3A). Additionally, survival analysis of lactylation profiles was conducted across various subgroups with distinct clinical characteristics. Figure 3B–K depict the prognostic significance of the 8 lncRNAs in different patient subgroups. The findings indicate that prognostic characterization is applicable across all subgroups. Those in the high-risk category exhibited a less favorable prognosis compared to individuals in the low-risk group. These results imply that the 8 lncRNAs serve as predictive indicators for the prognosis of BC, although prognostic features may not be applicable to breast cancer cases with distant metastases.The construction and validation of a nomogramIllustrated in Fig. 4A, B, we utilized univariate and multivariate Cox regressions to evaluate the prognostic importance of variables. In both single-factor and multi-factor regressions, the p-value associated with our risk score was below 0.05, confirming the independent prognostic significance (HR = 1.408, 95% CI 1.268–1.562, p < 0.001). Age emerged as the sole remaining independent prognostic factor (HR = 1.033, 95% CI 1.014–1.051, p < 0.001). Furthermore, Kaplan–Meier analysis validated the prognostic significance of the PFS risk score (Fig. 4G). To evaluate the predictive capacity of independent prognostic factors, we employed the receiver operating characteristic curve (ROC). In the 3-year ROC model, the AUC of the risk score was 0.756, indicating a robust predictive ability in comparison to other clinicopathologic features (Fig. 4C). In the training set, the AUC values for the ROC curves at 1, 3, and 5 years were 0.691, 0.756, and 0.722, respectively. This affirms the predictive capacity of our risk score across different time frames (Fig. 2G). This discovery was consistently validated in both the test cohort and the overall cohort (Fig. 2H, I). Furthermore, the C-index demonstrated that our risk score displayed superior predictive accuracy in comparison to other factors (Fig. 4D). Consequently, we generated nomogram plots to facilitate quantitative prognosis prediction (Fig. 4E), and its predictive accuracy was confirmed through subsequent calibration (Fig. 4F).Fig. 4Verification of a prognosis risk assessment model and a nomogram of construction. (A) Uni‐Cox analyses of clinicopathologic factors and risk score with OS. (B) Multi‐Cox analyses of clinicopathologic factors and risk score with OS. (C) The ROC curves of risk score and clinicopathologic features. (D) C-index demonstrated the predictive accuracy of the risk score was superior to other clinical parameters. (E) Nomogram for predicting overall survival. (F) A calibration plot to assess the prediction power of the nomogram. (G) Kaplan–Meier curves of PFS.GSEA of high- and low-risk groupsWe employed GSEA to investigate variations in GO and KEGG between the two groups. Within the high-risk subgroup, KEGG analysis confirmed enriched signaling pathways associated with tumor initiation and progression, including ECM receptor interaction, focal adhesion, and tight junction signaling pathways (Fig. 5D). Moreover, the high-risk subgroup exhibited enrichment genetically altered biological processes, such as cellular component assembly involved in morphogenesis and muscle cell development, as evidenced by the GO analysis (Fig. 5B). On the flip side, the low-risk subgroup exhibited enriched signaling pathways, including Graft-versus-host disease, allograft rejection, cytokine-cytokine receptor interaction, and Intestinal immune network for IgA production, all interconnected with immunity based on the KEGG analysis (Fig. 5C). Embedded biological processes that are associated with immune function were highlighted by the GO analysis, which included an adaptive immune response and T cell activation (Fig. 5A).Fig. 5Differences between high- and low-risk groups in gene set enrichment analysis (GSEA) and tumor microenvironment (TME). The GSEA approach was utilized to detect and depict distinct Gene Ontology (GO) (A, B) as well as enrichment analyses for the Kyoto Encyclopedia of Genes and Genomes (KEGG) (C, D) in both the high- and low-risk groups. Results of infiltrating fractions of immune cells. (E) Results of immune functions. (F) The presence of immune cells was assessed in both low- and high-risk groups. (G) The expression of immune checkpoint inhibitors varied among the high- and low-risk groups. (H) The proposed model examined correlations between TIICs and 8 kla‐related lncRNAs. *p < 0.05, **p < 0.01, ***p < 0.001, ns (no significance).Investigation of immunization characteristics of different risk groupsWe delved deeper into exploring the potential correlation of kla-related lncRNA features with tumor immunity. We calculated the difference in the proportion of tumor-infiltrating immune cells exhibiting lactylation between low- and high-risk subgroups using multiple databases, such as TIMER, CIBERSORT, CIBERSORT-ABS, MCPCOUNTER, QUANTISEQ, EPIC, and XCEL immunization databases (Supplementary Fig. 1). Furthermore, we assessed the relative proportions of 22 tumor-infiltrating immune cells in each sample through CIBERSORT, unveiling distinctive abundances in the two risk groups (Fig. 5G). An intricate correlation between the TIICs and the 8 kla-related lncRNAs was observed (Fig. 5I). These results imply that the 8 lncRNAs incorporated into our risk model can discern different aspects of immune cell infiltration in breast cancer. We conducted a thorough quantification of 16 immune cells and their respective 13 immune pathways and functions using single-sample gene set enrichment analysis (ssGSEA). Exploring correlations between various risk scores and immune pathways, we identified significant associations between 11 immune functions and 15 immune cell types with Kla-related risk scores (Fig. 5E, F). Further analysis explored the connection between risk scores and immune checkpoint genes, revealing heightened expression of 23 gene types in the low-risk category, encompassing immunosuppressive molecules like CD274, CTLA4, BTLA, IDOI, and ICOS (Fig. 5H).TIDE, TMB and therapeutic drug sensitivityWe investigated the prevalence of somatic mutations and copy number variations (CNVs) in genes associated with breast cancer, revealing missense mutations as the most common variant category, with single nucleotide polymorphisms (SNPs) being the most prevalent type. The most frequently observed SNV classification has been C > T (Supplementary Fig. 1). Subsequently, we explored the frequency and categories of mutations in two distinct risk groups. Mutations were identified in 453 (88.3%) out of 513 breast cancer samples in the low-risk subgroup and 454 (90.26%) out of 503 breast cancer samples in the high-risk subgroup. The predominant type of mutation observed was missense mutation. Within the high-risk subgroup, PIK3CA exhibited a notable mutation rate (32%), ranking second only to the TP53 mutation rate (41%). Meanwhile, in the low-risk subgroup, PIK3CA had the highest number of mutations (37%) (Fig. 6A, B). Moreover, in the computation of the tumor mutational burden (TMB) for each breast cancer patient, we observed increased TMB in patients associated with the high-risk group in comparison to those in the low-risk group (Fig. 6C). The survival curve based on the TMB illustrated that individuals with lower TMB had a more favorable prognosis (Fig. 6D p = 0.010). Individuals with both low risk and low tumor mutational burden in breast cancer demonstrated the most favorable prognosis compared to other groups (Fig. 6E). The TIDE score was notably lower in the high-risk group in comparison to the low-risk group (Fig. 6F). We analyzed drug sensitivity using the GDSC and CTRP databases with the R package “oncoPredict” to predict drug response in cancer patients based on cell line screening data. We observed variations in the IC50 value of the PLK1 inhibitor BI 2536 between the two groups. Breast cancer patients in the high-risk group exhibited increased sensitivity to this drug (Fig. 6G).Fig. 6TMB, TIDE, and sensitivity to chemotherapeutics. (A) waterfall plot illustrating the top 20 mutant genes is generated for patients in the low-risk group. (B) Patients in the high-risk group face a heavier tumor mutational burden. (C) TMB between high‐ and low‐risk groups. (D) K–M survival curves between the low‐TMB and high‐TMB groups. (E) K‐M survival curves between the two groups. (F) TIDE score between the high‐risk and low‐risk groups. (G) IC50 difference in BI 2536.

Hot Topics

Related Articles