Apelin (APLN) is a biomarker contributing to the diagnosis and prognosis of hepatocellular carcinoma

Establishment and analysis of a gene set for model construction

To construct high-quality diagnostic and prognostic models for HCC, we first curated a purpose-specific gene set. To improve its quality and limit the number of genes included, we set three selection criteria: (1) expression of the gene should differ significantly between tumor and normal tissues; (2) expression of the gene should correlate with patient survival in the tumor cohort; and (3) the gene should be classified as an immune-related factor.

Our primary data were sourced from the TCGA-LIHC dataset, which comprises 424 samples, including 50 from normal tissues and 374 from tumor tissues. Principal component analysis (PCA) separated the samples into two distinct clusters corresponding to normal and tumor tissues (Fig. 1A). Screening for differentially expressed genes (DEGs) identified 1372 coding genes, of which 864 were upregulated and 508 were downregulated in tumor tissues compared with normal tissues, as depicted in the volcano plot (Fig. 1B). A batch Cox regression analysis was then used to explore the relationship between gene expression and survival outcome in the tumor samples, identifying 4527 genes that correlated significantly with patient survival. Subsequently, a list of 1793 immune-related genes was obtained from the ImmPort database. Finally, the intersection of these three gene sets yielded a 36-gene set, which was used for the subsequent development of the diagnostic and prognostic models (Fig. 1C).

Fig. 1 Establishment of a 36-gene dataset for model construction. (A) Principal component analysis (PCA) on TCGA-LIHC dataset after sample filtering. (B) Volcano plot of DEGs between HCC and normal liver tissues.
(C) Venn diagram demonstrating the selection of 36 genes for subsequent model development. (D,E) GO/KEGG enrichment analysis of the identified 36 genes. (F) PPI network analysis of 36 genes, 29 of which were related (combined score > 0.4).

According to the GO and KEGG analyses, these 36 genes are involved in biological processes related to development and growth (such as morphogenesis and regulation of axonogenesis). Most of them are implicated in receptor ligand activity, signaling receptor activation, or protein modification processes (Fig. 1D,E). We also conducted a protein–protein interaction (PPI) network analysis and found that 29 of the 36 genes are linked to one another with moderate to high confidence (combined score > 0.4), as illustrated in Fig. 1F. Collectively, the enrichment and PPI network analyses shed light on the functions and interrelationships of these genes and point to a potentially important role in distinguishing tumor from normal tissues.

Establishment and comparison of the diagnostic models

After obtaining the 36-gene set, we set out to construct an early diagnostic model for HCC. We divided the TCGA-LIHC and GTEx-Liver datasets into training and validation cohorts at a 7:3 ratio, with the former used for model development and the latter for validation. Our initial analysis focused on the expression correlations among these 36 genes, because in our experience weakly correlated genes are advantageous for binary classification models (Fig. 2A). Using LASSO regression, we reduced the gene set from 36 to 12 (Fig. 2B,C). We then narrowed it further to 7 key genes (MAPT, VIPR1, TNFRSF4, CCR1, BIRC5, RFX5, and APLN) through logistic regression analysis (Fig. 2D).
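The 7:3 partition described above can be sketched as follows. This is a minimal illustration in Python, not the authors' procedure: the text does not say whether the split was stratified by tissue type or which random seed was used, so both are assumptions here, and the function name `split_7_3` and the sample identifiers are hypothetical. Only the TCGA-LIHC counts (374 tumor, 50 normal) come from the text.

```python
import random


def split_7_3(labels, train_fraction=0.7, seed=0):
    """Split sample IDs into training and validation lists.

    `labels` maps sample ID -> class label. The split is done within each
    class (stratified), which is an assumption not stated in the text.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample_id, label in labels.items():
        by_label.setdefault(label, []).append(sample_id)

    train, validation = [], []
    for label in sorted(by_label):
        ids = sorted(by_label[label])  # fixed order before shuffling
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        validation.extend(ids[cut:])
    return train, validation


# Hypothetical sample IDs; the class sizes are the TCGA-LIHC counts from the text.
labels = {f"tumor_{i}": "tumor" for i in range(374)}
labels.update({f"normal_{i}": "normal" for i in range(50)})

train_ids, validation_ids = split_7_3(labels)
print(len(train_ids), len(validation_ids))
```

Splitting within each class keeps the tumor-to-normal ratio roughly equal in both cohorts, which matters when one class is much smaller than the other.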
The logistic regression model, expressed as Risk Score = (-1.00 × CCR1 expression) + (1.77 × RFX5 expression) + (1.01 × APLN expression) + (0.49 × MAPT expression) + (0.82 × TNFRSF4 expression) + (0.72 × BIRC5 expression) - (0.52 × VIPR1 expression), showed high predictive capacity and accuracy, as evidenced by the decision curve and calibration analysis (Fig. 2E). The model’s area under the curve (AUC) was 0.996 in both the training and validation cohorts (Fig. 2F), and it maintained robust performance when validated on external cohorts (Fig. 2G).

Fig. 2 Evaluation and comparison of the diagnostic models. (A) Correlation of expression levels of 36 genes. (B,C) Gene selection via the Lasso algorithm, focusing on genes selected at the lambda-min for further analysis. (D) Construction of a nomogram model based on seven key genes included in the logistic regression model. (E) Calibration curve for the logistic regression model. (F) Receiver operating characteristic (ROC) curves for the logistic regression model in both training and validation cohorts, with the DeLong test confirming consistency. (G) ROC curves for the logistic regression model across test cohorts (including GSE69715, GSE102079, GSE76427), demonstrating predictive reliability. (H) Top 20 diagnostic models built with machine learning algorithms.

To further refine the diagnostic models for HCC, we turned to machine learning algorithms. We tested 113 combinations of 12 machine learning binary classification algorithms on the 36-gene set (Fig. 2H and Supplementary Fig. S1). The first- and second-ranked models, the plsRglm model and the Ridge model, used all 36 genes. The third-ranked model, a combination of Random Forest (RF) and Naive Bayes algorithms, also showed superior predictive performance compared with our logistic regression model.
This model used a more concise 16-gene set comprising VIPR1, BIRC5, SEMA3F, GHR, RFX5, APLN, MDK, ESR1, TNFRSF4, MAPT, JUN, CCR1, RFXANK, PIK3R2, ADM, and PSMD4. It attained an AUC of 0.993 in the training cohort, 0.991 in the validation cohort, and AUCs of 0.999, 0.966, and 0.964 in three external cohorts (Fig. 2H).

In conclusion, we established an early diagnostic model for HCC using a machine learning approach. The model is based on a 16-gene set and built with the RF + NaiveBayes algorithms. It performed well in both the training and validation cohorts, distinguishing HCC from normal tissue samples with high precision. Using the predict function in the R language environment together with our training cohort, the model can be readily applied to clinical diagnostics. In practice, a sample is classified as tumorous when the model predicts a probability greater than 0.5, providing a straightforward tool for early HCC detection.

Establishment and comparison of the prognostic models

We evaluated the prognostic significance of the genes in our 36-gene set using the clinical data from TCGA. The preceding survival correlation analysis indicated the potential of these genes as independent prognostic indicators. However, given patient heterogeneity and varied tumor subtypes, we recognized the need for a multivariate model to enhance prognostic accuracy.

First, we selected the top 7 survival-correlated genes from the 36-gene set through LASSO regression (Fig. 3A,B): SPP1, BIRC5, APLN, MAPT, PLXNA1, NDRG1, and CACYBP. Using stepwise Cox regression, we refined this to a 5-gene set. Because CACYBP did not meet the proportional-hazards assumption, we established a 4-gene model (Fig. 3C–E), with each gene proving to be an independent prognostic factor.
The 4-gene model, which meets the proportional-hazards assumption (global p = 0.352), is defined as Risk Score = (0.087 × SPP1 expression) + (0.208 × BIRC5 expression) + (0.177 × APLN expression) + (0.276 × PLXNA1 expression). The model performed well in the training cohort, with an AUC of 0.748 at 1 year, 0.685 at 2 years, and 0.671 at 3 years (Fig. 3F). It also performed well in the validation cohort, with an AUC of 0.748 at 1 year, 0.722 at 2 years, and 0.696 at 3 years (Fig. 3G), and in the external test cohort, with an AUC of 0.620 at 1 year, 0.623 at 2 years, and 0.632 at 3 years (Fig. 3H). Overall, the four-factor Cox regression model has practical decision-making value.

Fig. 3 Evaluation and comparison of the prognostic models. (A,B) Gene selection via the Lasso algorithm, focusing on genes selected at the lambda-min for further analysis. (C) Assessment of individual factor performance within the multi-Cox model. (D) Calibration curve for the multi-Cox model. (E) Construction of a nomogram model based on four key genes included in the multi-Cox model. (F) ROC curves for the Cox regression model in training cohorts. (G) ROC curves for the Cox regression model in validation cohorts. (H) ROC curves for the Cox regression model in test cohorts ICGC-LIRI. (I) Top 20 prognostic models built with machine learning algorithms. (J) ROC curves for the multi-Cox model across different age groups (youth, middle, and old), with age thresholds set at 45 and 65 years. (K) ROC curves for the multi-Cox model distinguishing between male and female. (L) ROC curves for the multi-Cox model distinguishing between early and advanced HCC stages.

We also explored additional prognostic models with the help of machine learning algorithms, testing 101 combinations of ten machine learning algorithms. While several machine learning-based models outperformed the Cox regression model (Fig. 3I and Supplementary Fig.
S2), the top-ranked Ridge model, which uses numerous genes, demonstrated superior AUC performance across all cohorts. The training cohort had an AUC of 0.775 at 1 year, 0.708 at 2 years, and 0.699 at 3 years; the validation cohort had an AUC of 0.756 at 1 year, 0.771 at 2 years, and 0.716 at 3 years; and the external test cohort had an AUC of 0.637 at 1 year, 0.647 at 2 years, and 0.637 at 3 years. Nevertheless, given the complexity of that model, the streamlined 4-factor Cox model remains valuable for practical prognostic decision-making.

Performance evaluation of the prognostic model

We evaluated the performance of the prognostic model by AUC across conventional clinical subgroups defined by T stage, gender, and age. Samples were divided into male and female groups by gender, and into young (below 45 years), middle-aged (45–65 years), and old (above 65 years) groups by age. Samples were also divided into early-stage (T1 and T2) and advanced-stage (T3 and T4) groups by T stage. The model performed well across all of these subgroups, suggesting robust discriminatory capability. Notably, it was particularly effective in young patients, males, and patients with advanced-stage HCC (Fig. 3J–L).

APLN is a gene closely associated with HCC

In our DEG analysis and model construction, APLN emerged as a noteworthy gene: it featured in many of the models and is integral to our final diagnostic and prognostic models, yet it has received little attention in HCC and its role in HCC remains unclear. This led us to investigate APLN in detail. First, we reviewed the enrichment analysis findings related to APLN. The results primarily connected APLN to receptor ligand activity, G protein-coupled receptors, and cell proliferation.
Considering its link to liver cancer, we speculated that alterations in APLN-associated receptor ligand activity could lead to endothelial cell proliferation (Fig. 4A). Second, we examined the differential expression of APLN across a spectrum of cancers. Pan-cancer analysis showed that APLN expression differed significantly between tumor and normal tissues in many cancers, predominantly with upregulation in tumors. Notable exceptions were kidney renal papillary cell carcinoma (KIRP), lung adenocarcinoma (LUAD), and lung squamous cell carcinoma (LUSC), in which APLN was significantly downregulated (Fig. 4B,C), suggesting diverse regulatory mechanisms of APLN across cancer types. In HCC, we found significant upregulation of APLN (Fig. 4D), corroborated by our paired-sample analysis, indicating that APLN upregulation is a potential distinguishing feature of HCC.

Fig. 4 Correlation of APLN expression with different factors. (A) APLN-related items in enrichment analysis. (B) Comprehensive analysis of APLN expressions between tumor and normal tissues in a variety of cancers. (C) Specific analysis of APLN expressions between tumor and matched paracancerous tissues in a variety of cancers. (D) Focused analysis of APLN expressions between tumor and matched paracancerous tissues in liver cancer. (E–G) Correlation analyses of APLN expression with age (E), gender (F), and T stage (G).

We further analyzed APLN expression across clinical subgroups defined by T stage, age, and gender. Samples were divided into three age groups: under 45, between 45 and 65, and above 65 years. Samples were also divided into male and female groups by gender, and into four groups representing stages T1–T4. APLN expression was stable across these subgroups, except for a significant difference within the early T stages (T1 and T2).
This consistent expression pattern of APLN, irrespective of age, gender, and advanced T stage (T3 and T4), underscores its potential as a robust biomarker for HCC, indicating stability and reliability independent of conventional clinical factors (Fig. 4E–G).

Investigation of APLN in single-cell sequencing data

To further investigate the potential mechanisms underlying the effects of APLN on HCC and the cell subsets involved, we examined single-cell sequencing data. Our analysis is based on the scRNA-seq dataset GSE149614. After data filtering, we analyzed the remaining 62,774 valid cells and 24,879 genes from tumor and normal tissue samples across multiple cases. Cells were classified into 28 distinct clusters by the FindClusters function and annotated into 9 types using common marker genes (Fig. 5A,B). Notably, we observed a significant reduction of NK cells in tumor tissue compared with normal tissue, while hepatocytes were markedly increased (Fig. 5C). This result was consistent across all seven paired-sample analyses, illustrating changes in immune cell content and in the predominant proliferating cell types in HCC tissue, regardless of potential biases caused by uneven sampling. Next, we examined APLN expression across cell clusters. In normal tissues, APLN showed minimal expression in endothelial cells. In tumor tissues, however, APLN was highly expressed in endothelial cells and weakly expressed in other cell types. This dramatic difference in APLN expression patterns between tumor and normal tissues, together with each sample’s contribution to the assessment of APLN expression, is depicted in Fig. 5D. Our results indicate that, despite minor variation across samples, the significant upregulation of APLN in tumor endothelial cells was a consistent and reliable observation (Fig. 5E).

Fig. 5 Classification of cell types and analysis of APLN expressions across different cell types. (A,B) Identification of nine distinct cell types using common markers.
(C) Proportions of different cell types in each sample. (D) Examination of APLN expression across different cell types in the scRNA-seq dataset. (E) Elevated APLN expression in tumor endothelial cells. The left panel is a normal sample, and the right panel is a tumor sample. (F,G) Visualization of APLN expression in endothelial cells of clusters 5 and 17 from normal (F) and tumor (G) tissues.

During cell clustering, we observed that endothelial cells were divided into two clusters (clusters 5 and 17). We examined the expression levels of APLN in these two clusters in both normal and tumor tissues. In normal tissues, APLN expression was low in both clusters. In tumor tissues, however, cluster 5 showed significantly elevated APLN expression (Fig. 5F,G). Because APLN expression in cluster 17 showed minimal variation, we restricted our DEG analysis between normal and tumor tissues to cluster 5. Although APLN was not the most significantly altered gene in this cluster, its differential expression was distinct and significant, indicating a potential role in the pathogenesis of HCC in specific endothelial cell subsets.

Subsequently, we divided endothelial cells into the APLNPos group (261 cells) and the APLNNeg group (1633 cells), based on the detection of APLN expression (Fig. 6A,B). Pseudotime trajectory analysis showed that APLNNeg cells diverge into two separate evolutionary paths: Branch A, primarily composed of APLNNeg cells, and Branch B, in which APLNPos cells are significantly enriched. Under the hypothesis that the expression level of APLN is related to endothelial cell cancerization, we considered Branch B (APLNPos cells) to be a manifestation of endothelial cell carcinogenesis during the occurrence of HCC (Fig. 6C,D).
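The APLNPos/APLNNeg grouping used above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' code: the text says only that cells were grouped "based on the detection of APLN expression", so treating any count above zero as detected is an assumption, and the function name and cell identifiers are hypothetical.

```python
def group_by_apln_detection(apln_counts, threshold=0):
    """Partition cells by whether APLN is detected.

    `apln_counts` maps cell ID -> APLN expression value for that cell.
    A cell is called APLNPos when its value exceeds `threshold`
    (default 0, an assumed definition of "detected").
    """
    apln_pos = [cell for cell, value in apln_counts.items() if value > threshold]
    apln_neg = [cell for cell, value in apln_counts.items() if value <= threshold]
    return apln_pos, apln_neg


# Hypothetical toy input: four endothelial cells with made-up APLN counts.
toy_counts = {"cell_a": 0, "cell_b": 3, "cell_c": 0, "cell_d": 1}
apln_pos, apln_neg = group_by_apln_detection(toy_counts)
print(len(apln_pos), len(apln_neg))
```

Because scRNA-seq counts are sparse, a detection-based split of this kind typically yields a small positive group and a much larger negative group, as in the 261 versus 1633 cells reported here.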
We then examined the differential expression of APLN and established HCC markers from the CellMarker 2.0 database (http://bio-bigdata.hrbmu.edu.cn/CellMarker/index.html), such as VWF and CDH5, in all endothelial cells (Fig. 6E). VWF and CDH5 are known markers of liver cancer in endothelial cells and are distinct from the genes in our study, so we used them to verify whether APLN can serve as a marker of liver cancer endothelial cells. The expression patterns of these genes were similar, suggesting that APLN, like VWF and CDH5, could indicate the carcinogenic state of endothelial cells. A subsequent DEG analysis of APLNPos versus APLNNeg cells, paired with GO enrichment, revealed significant enrichment in biological processes, cellular components, and molecular functions closely linked to immune responses (Fig. 6F). This supports the hypothesis that APLNPos cells represent a carcinogenic branch of endothelial cells. KEGG pathway analysis showed that the Apelin pathway, which involves APLN, is a principal distinction between the APLNPos and APLNNeg groups (Fig. 6G). When cells were instead grouped by their origin from normal or tumor tissues and analyzed for differential gene expression with GO and KEGG enrichment, the results, predominantly associated with the extracellular matrix, appeared somewhat disordered (Supplementary Fig. S3A,B). This may reflect the complexity of cell carcinogenesis and could be related to the diverse endothelial cell differentiation pathways observed in our pseudotime analysis.

Fig. 6 Detailed analysis of endothelial cell subtypes, differentiation trajectories, gene enrichment, and cell–cell communication. (A) Dimension-reduced visualization of cell types from the scRNA-seq dataset, with endothelial cells categorized into APLNPos and APLNNeg groups. (B) APLN expression across various cell types. (C,D) Differentiation trajectory analysis of endothelial cells based on pseudotime, states, or cell types (APLNPos and APLNNeg).
(E) The expression levels of APLN, VWF, and CDH5 in APLNPos and APLNNeg endothelial cells. (F) The biological processes, cellular components, and molecular functions enriched in the GO analysis of DEGs identified between APLNPos and APLNNeg endothelial cells. (G) KEGG analysis of DEGs identified between APLNPos and APLNNeg endothelial cells. (H) Cell types involved in the Apelin pathway and the extent of their involvement. (I) Expression patterns of APLN and its receptor APLNR across various cell types. (J) Involvement of different cell types in the Apelin signaling pathway.

Finally, we conducted cell–cell communication analysis among endothelial cells (APLNPos and APLNNeg) and other cell types (Supplementary Fig. S3C–E). Analyzing the Apelin pathway identified in the enrichment analysis, we found that endothelial cells are the sole participants in this pathway, with APLNPos cells showing considerably more activity than their APLNNeg counterparts (Fig. 6H). APLN and its receptor APLNR are both highly expressed only in APLNPos cells. Meanwhile, some APLNR expression is found in APLNNeg cells, which suggests that the evolution of APLNNeg cells may be influenced by APLNPos cells (Fig. 6I,J). In addition, APLNPos endothelial cells are more capable of communicating with other cell types than APLNNeg endothelial cells (Supplementary Fig. S3C).

Investigation into APLN’s role in HCC

We studied the correlation between APLN and liver cancer from multiple aspects. In the immune infiltration analysis, we observed a strong correlation between the infiltration levels of specific immune cells and their presence in tumor samples (Supplementary Fig. S4A). We further investigated the relationship between APLN expression and immune infiltration within tumor samples. There was no significant correlation between APLN expression and most immune cell types, except for resting memory CD4 T cells and eosinophils (Supplementary Fig.
S4B). In evaluating the impact of APLN on the tumor microenvironment (TME), we found a mild correlation between APLN expression and tumor purity (Supplementary Fig. S4C,D). We also investigated the impact of other possible APLN alterations on HCC; these alterations (single nucleotide variation, copy number variation, and methylation) were not significantly associated with APLN expression or prognostic outcomes in HCC.

Taken together, our analysis led to the conclusion that APLN expression mainly correlates with endothelial cell carcinogenesis. The upregulation of APLN expression during carcinogenesis appears to occur independently of the aforementioned genetic and epigenetic modifications.
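
As a worked illustration of the two risk-score formulas reported in the results above, the following Python sketch evaluates each one as a weighted sum of expression values. The coefficients are copied from the text; everything else is an assumption made for illustration: the function names, the example expression values, and the scale of the inputs, since this section does not state how expression was normalized. The diagnostic formula is given without an intercept, so the sketch reports only the linear score and does not convert it to a probability.

```python
import math

# Coefficients of the 7-gene logistic regression (diagnostic) score, from the text.
DIAGNOSTIC_COEFFICIENTS = {
    "CCR1": -1.00, "RFX5": 1.77, "APLN": 1.01, "MAPT": 0.49,
    "TNFRSF4": 0.82, "BIRC5": 0.72, "VIPR1": -0.52,
}

# Coefficients of the 4-gene Cox regression (prognostic) score, from the text.
PROGNOSTIC_COEFFICIENTS = {
    "SPP1": 0.087, "BIRC5": 0.208, "APLN": 0.177, "PLXNA1": 0.276,
}


def risk_score(expression, coefficients):
    """Return the weighted sum of expression values for the model's genes."""
    missing = [gene for gene in coefficients if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def relative_hazard(score_a, score_b):
    """Hazard of patient A relative to patient B under a Cox model.

    In a Cox model the hazard ratio between two patients equals
    exp(difference of their linear risk scores).
    """
    return math.exp(score_a - score_b)


# Hypothetical expression values for two patients (not from the study).
patient_a = {"SPP1": 6.0, "BIRC5": 3.5, "APLN": 2.0, "PLXNA1": 4.0}
patient_b = {"SPP1": 4.0, "BIRC5": 2.0, "APLN": 0.5, "PLXNA1": 3.5}

score_a = risk_score(patient_a, PROGNOSTIC_COEFFICIENTS)
score_b = risk_score(patient_b, PROGNOSTIC_COEFFICIENTS)
print(round(score_a, 3), round(score_b, 3))
print(round(relative_hazard(score_a, score_b), 3))
```

Because all four prognostic coefficients are positive, higher expression of any of the four genes raises the score and therefore the estimated hazard. In the diagnostic score, CCR1 and VIPR1 carry negative weights, so higher expression of those two genes lowers the score.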
