Integrative biomarker discovery and immune profiling for ulcerative colitis: a multi-methodological approach

Integration of differential analysis and WGCNA for gene selectionFigure 1 shows the flowchart of this study. The analysis was conducted using data from GSE87466, which comprised a total of 87 UC and 21 normal control tissues. All expression values were normalized (Fig. 2A,B), and disease-associated genes were screened using a combined approach of differential analysis and WGCNA. We employed the “limma” package to identify differentially expressed genes (DEGs) with an adjusted p-value (adj.P.Val) < 0.05 and log fold change (logFC) > 1 or < -1. Within the GSE87466 dataset, a comprehensive analysis revealed a collection of 91 DEGs, consisting of 55 upregulated genes and 36 downregulated genes. To provide a visual representation of the expression patterns of these DEGs across individual samples, a heatmap was generated (Fig. 2C). The DEGs were further visualized using volcano plot and divergent bar plot. (Figs. 2D,E). Following the identification of DEGs through the differential analysis, we proceeded to subject these DEGs to Gene Set Enrichment Analysis (GSEA). The pathways significantly enriched in the UC group were predominantly associated with amoebiasis, complement and coagulation cascades, leishmaniasis, and staphylococcus aureus infection, as revealed by the GSEA analysis (Fig. 2F,G). To establish a Weighted Gene Co-Expression Network for the GSE87466 dataset, we employed the “WGCNA” package, a powerful computational tool widely employed for gene co-expression analysis. As illustrated in Fig. 3A, we set the soft threshold to 6, ensuring an \(\:{R}^{2}\) value greater than 0.9 and a high mean connectivity. After merging highly correlated modules, a total of 10 modules were identified for further investigation. The resulting modules were then displayed together under the clustering tree, showcasing the merged and primed modules. (Fig. 3B). The robustness of module delineation was validated through transcriptional correlation analysis within modules, which revealed no significant inter-module linkage. This suggests that the identified modules are distinct and not strongly correlated with each other (Fig. 3C). Correlations and significance were calculated between the modules and the disease and control groups, respectively. Among them, the brown module showed the highest correlation and was selected as the key module (Fig. 3D). In the MM (module membership) versus GS (gene significance) scatter plot of the brown module, a positive correlation was observed between MM and GS, indicating that these highly disease-related genes play a pivotal role in the key module (Fig. 3E). Clinically meaningful modules were identified, and further examination was conducted on all the genes in the brown module. Finally, we took the intersection of the genes identified by differential analysis and all the genes in the brown module of WGCNA to obtain disease-associated genes (Fig. 3F).Fig. 1Flow chart of this study.Fig. 2Identification and enrichment analysis of differential genes between UC and normal control. (A) Raw data of the training set without batch correction. (B) Expression matrix of the training set after batch correction to remove batch effects. (C) Heat map of differential genes between UC and normal control. The blue module represents normal and the red module represents UC, and the red color in the heat map represents up-regulated gene expression and blue color represents down-regulated gene expression. (D) Volcano plot of differential genes. Expression is up-regulated when |logFC|>0, p < 0.05 and gene expression is down-regulated when |logFC|<0, p < 0.05. Differential genes were screened by |logFC|>1, p < 0.05, blue indicates down-regulation and red indicates up-regulation. (E) Dispersion bar graph of differential genes. Blue represents up-regulated genes, yellow represents down-regulated genes, and the length of the bar represents the value of logFC. (F) GSEA enrichment analysis of differential genes and selection of the five pathways with the strongest enrichment significance for display. (G) GSEA enrichment analysis of differential genes and selection of the five pathways with the weakest enrichment significance for display.Fig. 3Construction of gene co-expression network. (A) The left panel is set when the scale-free topological fit index R2 = 0.9, and the best soft threshold β = 6 is chosen to obtain the best average connectivity of the co-expression network on the right panel. (B) Sample clustering tree with the modules before and after merging, Dynamic Tree Cut for the original modules and Merged dynamic for the result after merging the strongly associated modules. (C) Heat map of correlation between merged modules and modules. (D) Heat map of correlations between modules and clinical traits. Red indicates positive correlation, blue indicates negative correlation, and the darker the color, the stronger the correlation. The numbers in parentheses are the p-values of correlations between modules and traits to test whether they are statistically significant with each other. The numbers above the brackets indicate the magnitude of correlation between modules and traits. (E) Scatter plot between brown module affiliation and UC gene significance with a correlation between each other of cor = 0.71, p < 1.3e−80. (F) Wayne diagram of 91 intersecting genes obtained from DEGs and co-expressed genes with the strongest association with UC as disease associated genes.Functional enrichment analysisThe GO enrichment analysis results revealed that disease-associated genes were significantly enriched in biological processes such as positive regulation of leukocyte cell-cell adhesion, positive regulation of cell-cell adhesion, regulation of cell-cell adhesion, adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains, regulation of leukocyte cell-cell adhesion, positive regulation of T cell activation, and secretory granule membrane, indicating their potential involvement in these pathways (Fig. 4A). The KEGG enrichment analysis results demonstrated that disease-associated genes were significantly enriched in pathways such as complement and coagulation cascades, viral myocarditis, ECM-receptor interaction, staphylococcus aureus infection, hematopoietic cell lineage, and Toll-like receptor signaling pathway (Fig. 4B). Additionally, we conducted enrichment analysis of disease-related genes using the Metascape website, which revealed that these genes were primarily involved in biological processes such as positive regulation of immune response, positive regulation of leukocyte cell-cell adhesion, regulation of peptidyl-tyrosine phosphorylation, neutrophil degranulation, complement system, and innate immune response (Fig. 4C). The relationship between enriched terms is illustrated in Fig. 4D. In summary, the functional enrichment analysis reveals that the disease-associated genes are predominantly involved in complement and coagulation cascades, regulation of adhesion between various cells, and immune response, as indicated by the enrichment results from the bioinformatics analysis.Fig. 4Functional enrichment analysis. (A) Histogram of GO enrichment analysis of disease associated genes. (B) Histogram of KEGG enrichment analysis of disease associated genes. (C) Histogram of Matascape enrichment analysis results. (D) Node Network of Matascape enrichment analysis results.Selection of feature genesThree machine learning algorithms and PPI network analysis were utilized in combination to identify the optimal feature genes for the diagnosis of UC. The LASSO regression algorithm identified 11 genes, while the RF algorithm identified 32 genes, and the SVM-RFE algorithm identified 16 genes (Fig. 5A-D). We utilized Venn diagrams to identify six genes that showed overlap in the results obtained from the three machine learning algorithms mentioned above (Fig. 5E). Additionally, we performed PPI network analysis and identified two key genes in the constructed network (Fig. 5F, G). Lastly, we utilized the overlapping genes from the three machine learning algorithms, in combination with the hub genes identified from the PPI network, namely B4GALNT2, PDZK1IP1, FAM195A, REG4, MTMR11, FLJ35024, CD55, and CD44, as feature genes for subsequent analyses.Fig. 5Machine learning screening of feature genes. (A) LASSO algorithm for feature gene screening. (B) Combination of number of decision trees and error rate of RF algorithm. (C) RF calculates the top 20 genes for gene importance for ranking. (D) support vector machine recursive feature elimination (SVM-RFE) algorithm to screen biologic feature genes, and the point with the lowest accuracy and error rate is used as the number of feature genes screened by SVM-RFE. (E) Wayne diagram to obtain the feature genes screened by three machine learning algorithms. (G) PPI network of disease associated genes, the circles indicate protein nodes, and the lines between each other indicate the existence of interrelationship between two proteins. (F) Interacting networks of hub genes in PPI networks. (H) Correlation line graph between 8 disease signature genes.Validation of feature gene expressionWe further investigated the expression of the selected feature genes in GSE87466 using a box plot, and validated the findings using an additional dataset, GSE92415 (Fig. 6A,B). Gene correlations were also examined, as shown in Fig. 5H.Fig. 6Validation of feature genes expression. (A) Box line plot of differential expression of feature genes between UC and normal individuals in the training group GSE87466. The size range of the box line plot indicates the range of their gene expression, while the black line in the plot indicates the mean of their expression. The values between the two box-line plots are P values, and the differences are statistically significant when P < 0.05. (B) Box line plot of differential expression of characteristic genes between UC and normal individuals in the validation group GSE92415.Analysis of the feature genes using GSEATo better understand the role of feature genes in UC, we analyzed them using single-gene GSEA enrichment analysis. The results of the single-gene GSEA enrichment analysis revealed that all the selected feature genes, except for REG4, were significantly enriched in the complement and coagulation cascades. REG4 was found to be enriched in the Adherens junction pathway (Fig. 7A–H).Fig. 7Single gene GSEA analysis. (A–H) Single gene GSEA analysis of feature genes, respectively, and the top five pathways with the highest and lowest enrichment significance were selected for display.Modeling and testing of diagnostic performance using neural network and generation of diagnostic column line graphWe employed a neural network model to effectively model and test the feature genes, leveraging its robust non-linear modeling capabilities to uncover intricate patterns and associations in gene expression data and evaluate the importance of our screened signature genes. (Fig. 8A). The ROC curve of the neural network model was generated using the training datasets (GSE87466) and test datasets (GSE92415), and the performance of the model was comprehensively evaluated by calculating the AUC to assess its overall predictive accuracy (Fig. 8B,C). Then we utilized the Rms package to construct UC diagnostic column line graph models for the feature genes, including B4GALNT2, PDZK1IP1, FAM195A, REG4, MTMR11, FLJ35024, CD55, and CD44 (Fig. 8D). The calibration curves demonstrated that the disparity between the actual and predicted risks of UC was very minimal, suggesting that the column line graph model for UC is highly accurate(Fig. 8E). The Decision Curve Analysis (DCA) was performed to evaluate the impact of utilizing the model for diagnosis at various prediction probability thresholds, providing further assessment of the performance and utility of the model (Fig. 8F). To further validate the diagnostic value of B4GALNT2, PDZK1IP1, FAM195A, REG4, MTMR11, FLJ35024, CD55, and CD44, we conducted receiver operating characteristic (ROC) analysis. B4GALNT2 (AUC:0.952), PDZK1IP1 (AUC:0.969), FAM195A (AUC:0.907), REG4 (AUC:0.929), MTMR11 (AUC:0.996), FLJ35024 (AUC:0.949), CD55 (AUC:0.958), CD44 (AUC:0.957) (Fig. 8G).Fig. 8Construction and validation of Artificial Neural Network (ANN) model and nomogram model. (A) ANN model constructed using feature genes, containing input layer, hidden layer and output layer. (B) Training set ROC curve with AUC = 0.984, 95% CI  0.964-1.000, used to illustrate whether the model has good prediction performance. (C) Validation set ROC curve, AUC = 0.957, 95% CI  0.903-1.000, used to demonstrate whether the stability and generalization of the model are good. (D) Nomogram plot of the characteristic gene construction, each element followed by a scoring scale. The scores of each element are summed to obtain a total score to predict the risk of disease. (E) Calibration curve for the evaluation of nomogram prediction performance. The higher the overlap between the solid and dashed lines and the closer the diagonal line, the better the performance. (F) Decision curve analysis (DCA), which compares the clinical benefit between the nomogram model and other diagnostic indicators. the higher the AUC, the higher the clinical benefit in the range of possible thresholds from 0 to 1. (G) The ROC curves of the eight feature genes in the training set were used to assess whether each gene had good disease predictive ability.Immunological infiltration in the UC group and healthy controlsImmune cells play an important role in the development and progression of UC. To gain insights into the relative proportion of different immune cell types that undergo changes in the disease and control groups, we utilized the CIBERSORT package for immune cell infiltration analysis. After excluding non-statistically significant immune cell infiltration, our results revealed that plasma cells, T cells CD8, T cells follicular helper, NK cells activated, macrophages M0, macrophages M2, and neutrophils were found to be higher in the disease group compared to the control group (Fig. 9A). Our analysis revealed that FAM195A was significantly and positively correlated with NK cells resting, T cells CD4 memory activated, macrophages M0, and T cells CD4 naive. On the other hand, FAM195A was significantly and negatively correlated with NK cells activated, plasma cells, and mast cells resting. MTMR11 was significantly positively correlated with NK cells resting and T cells CD4 naive, while it was negatively correlated with NK cells activated and plasma cells. CD44 showed a significant positive correlation with mast cells activated, NK cells activated, and plasma cells, while it was negatively correlated with NK cells resting and T cells CD4 naive. B4GALNT2 showed a significant positive correlation with T cells CD4 naive, and a significant negative correlation with NK cells activated and plasma cells (Fig. 9B). In conclusion, there are significant differences in immune infiltration between the disease group and the normal group in UC. Moreover, most of the characteristic genes, including FAM195A, MTMR11, CD44, and B4GALNT2, showed significant correlations with specific immune cell types.Fig. 9Analysis of immune infiltration in UC group and normal control group. (A) The difference in immune infiltration between the UC and normal control groups with orange representing the UC group and green representing the normal control group.*p < 0.05, **p < 0.01, ***p < 0.001. (B) The correlation heatmap between 20 immune cells and eight feature genes with orange representing positive correlation and green representing negative correlation. The darker the color, the stronger the correlation. *p < 0.05, **p < 0.01, ***p < 0.001.Identification of subtypes of UCWe combined and removed batch effects from the UC group data in GSE87466 and GSE92415 to identify distinct subtypes of UC. UC samples were classified into subtypes using the consensus clustering method based on gene expression profiles. The optimal number of subtypes was determined as 2 using various criteria including consensus matrix plots, CDF plots, relative changes in regions under the CDF curve, and trace plots (Fig. 10A-C,E). Principal Component Analysis (PCA) revealed notable distinctions among the subtypes, which were labeled as C1 and C2 (Fig. 10F). The heatmap visualizes the differential gene expression across the two identified subtypes (Fig. 10D).Fig. 10Consensus clustering of UC samples and immune infiltration analysis between the subtypes of UC. (A) Consensus matrix plot, the cleaner the blank area between the blue modules indicates the more successful analysis. (B) CDF plot of consensus clustering, showing the relative change of consensus index from k = 2 to k = 9 with the change of CDF value, and the k value of the curve with the most stable change is the optimal fractal number. (C) Relative change in area under the CDF curve. (D) Heat map between C1 and C2 and genes, red indicates up-regulated gene expression and blue indicates down-regulated gene expression. (E) Trace plot of k = 2 to k = 9. (F) PCA plot of UC samples. The scatter plot allows visualization of the characteristic genes that classify UC into two subtypes, C1 and C2. (G) Stacked histogram of the infiltration ratio of 22 immune cells in each sample of the two subtypes C1 and C2. (H) Box plot of the difference in infiltration between C1 and C2 for the 22 immune cells. *p < 0.05, **p < 0.01, ***p < 0.001.Different immunological characteristics of the two subtypesAs illustrated in Fig. 10G,H, our analysis of immune infiltration in UC revealed that levels of T cells CD8, T cells CD4 memory resting, T cells gamma delta, NK cells activated, monocytes, and macrophages M0 were higher in the C1 subtype compared to the C2 subtype, after excluding non-statistically significant immune cell infiltration. Compared to the C1 subtype, the C2 subtype exhibited higher levels of B cells memory, T cells follicular helper, NK cells resting, macrophages M2, dendritic cells activated, mast cells activated, and neutrophils.Expression of diagnostic genes in single-cell dataFrom the GSE182270 dataset, we selected three UC samples (GSM5525955, GSM5525956, GSM5525957) and three healthy control samples (GSM5525960, GSM5525961, GSM5525962) for scRNA-seq analysis. Scatter plots of PCA-reduced data illustrated the distribution of cells from different samples, with the Elbow Plot indicating the importance of each principal component (Fig. 11A). We selected the top ten principal components for further dimension reduction and clustering, resulting in 18 cell clusters. These clusters were visualized using the t-SNE method (Fig. 11B). To accurately annotate these subclusters, we referred to marker genes reported in the literature and supplemented them from the CellMarker database. Based on these markers, we identified cell types. Histograms displayed the proportions of various cell types across the samples (Fig. 11C). Figure 11D shows the distribution and cell type information of cells from the normal group samples. We plotted the expression of diagnostic genes in normal group sample cells using feature plots; however, due to differences in sequencing technologies, the gene FLJ35024 was absent in the single cell sequencing data. Figure 11F shows the distribution and cell type information of cells from the UC group samples. Figure 11G displays the expression of diagnostic genes in UC group cells.Fig. 11Expression of diagnostic genes in single-cell data. (A) Left: PCA-based representation of cell distributions in the normal control group vs. UC group. Right: Contribution of the top 10 principal components. (B) t-SNE-based nonlinear dimensionality reduction clustering of all single-cell data, resulting in 18 cell clusters. (D) t-SNE distribution of cells in normal samples, with colors representing different cell types. (E) Feature plot illustrating the expression levels of diagnostic gene across cells in the normal group. (F) t-SNE distribution of cells in UC samples, with colors representing different cell types. (G) Feature plot illustrating the expression levels of diagnostic gene across cells in the UC group.Analysis of CD44 and CD55 in cellular communicationWe found that two UC diagnostic genes, CD44 and CD55, have a more pronounced and widespread expression in the single-cell data. They differ greatly in their molecular functions and mechanisms of action, but they are both expressed on the cell surface and encode proteins that are cell surface proteins involved in the regulation of immune responses and protection of cells from damage. These commonalities reveal their importance in the study of cell-environment interactions, immune regulation, and disease development. We used the CellChat package to analyze the role of CD44 and CD55 in cellular communication, particularly in understanding cell-cell interactions, signaling mechanisms, and disease development. Compared with the normal group samples, interactions between cells were more frequent in the UC group, but the strength of communication was decreased (Fig. 12A). In UC, inflammation leads to a massive infiltration of immune cells and an increase in the number and density of cells, allowing for increased physical contact and interaction between cells. Despite the frequency of interactions, the effectiveness or strength of each interaction may be reduced due to impaired cellular function in the inflammatory state. In the normal and UC groups, the strength of communication between different cell types also changed, but fibroblasts consistently maintained an active signal output (Fig. 12B). Cells such as B cells, epithelial cells, and T cells underwent very pronounced changes. The cellular communication families with significant differences in intensity in the normal and UC groups were demonstrated by histograms (Fig. 12C). The violin plots demonstrated the expression of two genes, CD44 and CD55, in cells of the 9 middle cell types (Fig. 12D). In the cellular communication network, we extracted all ligand-receptor pairs containing CD44 and CD55.MIF is an important inflammatory factor with its key role in multiple immune responses and inflammatory reactions. Its primary receptors include CD74 and CD44. this ligand-receptor pair complex is active in cellular communication between epithelial cells and T cells, B cells, and myeloid cells (Fig. 12E). The interaction of LGALS9 with CD44 plays an important role in a variety of biological processes, particularly in immune regulation. In UC samples, the LGALS9-CD44 ligand receptor pair played an important role in cellular communication between myeloid and epithelial cells (Fig. 12F). Among them, the communication of myeloid cells with T cells through this ligand-receptor pair attracted our attention. It has been shown that LGALS9 can modulate the function of T cells and other immune cells by binding to CD44. The LGALS9-CD44 interaction inhibits T cell activation and proliferation, promotes immune tolerance, and reduces excessive immune responses. FN1 and CD44 ligand receptor pairs play important roles in cell adhesion, migration, and tissue repair. This ligand-receptor pair plays an important role in cellular communication between smooth muscle cells and fibroblasts (Fig. 12G). This is essential for the repair and regeneration of colonic epithelial cells after injury. COL1A2-CD44 also played an important role in smooth muscle cells and fibroblasts, which may be closely related to connective tissue formation and tissue repair (Fig. 12H). SELE, a member of the selectin family, is a cell adhesion molecule that is primarily expressed by activated endothelial cells. In immune and inflammatory responses, the SELE-CD44 ligand receptor modulates the migration and localization of immune cells and promotes cell-cell and cell-matrix interactions (Fig. 12I). The ADGRE5-CD55 ligand receptor pair is active in cellular communication between myeloid and T cells and may play a role in regulating immune responses and inflammation (Fig. 12J).Fig. 12Analysis of cellular communication networks. (A) Histogram of the number and intensity of communications. (B) Scatter plot of cell signal intensity. (C) Histogram of cellular communication family strength. (D) Violin plots of CD44 and CD55 expression in cells of various cell types. (E-J) Cellular communication networks mediated by specific ligand-receptor pairs.

Hot Topics

Related Articles