Biomarker identification and risk assessment of cardiovascular disease based on untargeted metabolomics and machine learning

Clinical traits of participants

The clinical traits of the subjects are listed in Table 1. The number of subjects was limited, which is a common problem in population-based disease studies. Chi-square tests and ANOVA revealed that eight of the 38 traits differed significantly between groups (p < 0.05), in some cases at the p < 0.01 level. They were age, total bilirubin (TBIL), direct bilirubin (DBIL), indirect bilirubin (IBIL), aspartate aminotransferase/alanine aminotransferase (AST/ALT), inorganic phosphate (IP), magnesium (Mg) and anion gap (AG). Their boxplots (Fig. S3 in the Supporting Information) show the distribution and dispersion of each trait in each group. For instance, age, TBIL, DBIL and IBIL varied widely in the control group.

Table 1 Clinical traits of the participants.

The eight traits of the three groups were submitted to PCA. Figure 2A shows the resulting scores. Clearly, PCA failed to distinguish the HD, IS and control subjects. Therefore, RFE was performed to further refine these eight traits. Ultimately, age, IP, AG and DBIL were selected and subjected to five machine learning methods to establish discrimination models (Table 2). LDA performed well on both the calibration and validation sets, with fair accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC), F1 score, and AUC, and the values for the calibration set were close to those for the validation set. Hence, LDA was subsequently used to distinguish between the HD and IS groups, as well as between CVD and non-CVD participants. Figure 2B depicts the resulting ROC curves. In the HD versus control case, the AUC reached 0.923 for the calibration set and 0.875 for the validation set. In the IS versus control case, both AUC values approached 1.0.
This indicated that the trait panel of age, IP, AG and DBIL discriminated the HD, IS and control groups efficiently.

Fig. 2 CVD risk assessment using selected clinical traits. (A) PCA score plot. (B) ROC curves for risk assessment of CVD using the trait panel.

Table 2 Risk assessment of cardiovascular disease by five machine learning methods with age, direct bilirubin, inorganic phosphate, and anion gap.

Metabolite detection and annotation

A total of 4819 metabolic features were detected in the original UHPLC-MS/MS data of the HD, IS and control groups within the analysis time of 12.0 min, including 2532 features in the positive mode and 2287 features in the negative mode (Fig. S4 in the Supporting Information). A complete list of the features is available in Supplementary Data 1. After data preprocessing, 848 features remained for metabolite annotation (Fig. 3A), as detailed in the peak table of Supplementary Data 2. Metabolites were annotated by searching known databases, comparing the experimentally obtained retention time (tR), MS1, and MS2 data with those of standard compounds in libraries, and scoring the similarities within MS-Finder [33], mzCloud [34], SIRIUS [23], Metfrag [35], and MetaboAnalyst [29].

Close inspection of the annotated metabolites revealed that lipids and lipid-like molecules made up the largest proportion (33%) of the metabolite profiles, followed by organic acids and derivatives (24%), unannotated metabolites (14%), and organoheterocyclic compounds (8%). Figure 3B summarizes the classes of annotated metabolites. Although PCA was applied to visualize these metabolic profiles, the control, HD, IS, and QC groups were not separated from one another on the score plot (Fig. 3C). In contrast, the seven QC samples clustered within a small area, and their Cronbach’s alpha was 0.89, considerably higher than those of the control, HD, and IS samples (Table S1 in the Supporting Information).
This at least implied satisfactory stability of the UHPLC-MS/MS measurements and reliability of the data processing.

Fig. 3 Metabolic profiling results of plasma samples from 37 CVD patients and 20 control subjects. (A) Heat map of the annotated metabolites, shown as relative intensity of peak area. (B) Chemical categories of the annotated metabolites. (C) PCA score plot.

Metabolic pathway of CVD and control groups

To associate metabolic pathways with CVD, the metabolite profiles were subjected to ANOVA, which identified 165 differential metabolites between the CVD and control groups (Fig. 4A), accounting for 19.34% of the 848 features. The correlation between age and the metabolites was also analysed (Fig. 4B) by a significance test with covariate adjustment for age. Finally, 164 differential metabolites (p < 0.05) were found in the intersection of the ANOVA and the age-adjusted linear correlation results. Their heat map (Fig. 4C) covered a variety of metabolite categories (Fig. 4D) annotated at four levels (Fig. 4E). Among them, 42 lipids and lipid-like molecules made up the largest share, 25.61%. These metabolites are tabulated in Supplementary Data 3.

To understand the metabolome of CVD, both pathway enrichment and biofunction analyses were performed with a threshold of p < 0.05. Nine metabolic pathways were identified (Fig. S5A in the Supporting Information). These pathways were primarily associated with energy metabolism and membrane components. Glycerophospholipid metabolism had the smallest p-value, 7.88 × 10^-6. The biofunction with the smallest p-value, 1.00 × 10^-8, was energy source and membrane component. Figure 5 illustrates these metabolic pathways in the CVD patients.
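
The idea behind a pathway enrichment p-value can be illustrated with a minimal sketch of over-representation analysis. It assumes a hypergeometric model, which may differ from the exact test used in this study, and it is not the authors' code. The function name `enrichment_p_value` and all counts in the example are hypothetical and are not taken from the data.

```python
from math import comb


def enrichment_p_value(universe: int, pathway_size: int,
                       n_selected: int, n_hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= n_hits).

    universe     -- number of metabolites that could have been selected
    pathway_size -- number of those metabolites that belong to the pathway
    n_selected   -- number of differential metabolites
    n_hits       -- differential metabolites that fall in the pathway
    """
    if not (0 <= pathway_size <= universe and 0 <= n_selected <= universe):
        raise ValueError("pathway_size and n_selected must lie in [0, universe]")
    first = max(n_hits, 0)
    last = min(pathway_size, n_selected)
    total = comb(universe, n_selected)
    # Sum the probabilities of every outcome at least as extreme as n_hits.
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, n_selected - k)
        for k in range(first, last + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only (not values from the study).
example_p = enrichment_p_value(universe=1000, pathway_size=36,
                               n_selected=164, n_hits=9)
print(f"enrichment p-value: {example_p:.3g}")
```

A pathway would be reported as enriched when this probability falls below the chosen threshold, here p < 0.05. In practice the p-values of several pathways are usually corrected for multiple testing as well.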
Particular attention was given to glycerophospholipid metabolism and biosynthesis of unsaturated fatty acids, which involved nine metabolites: choline (C00114), lysophosphatidylcholine (LPC, C04230), phosphatidylcholine (PC, C00157), oleic acid (C00712), 1-acyl-sn-glycero-3-phosphoethanolamine (C04438), 2-acyl-sn-glycero-3-phosphoethanolamine (C05973), linoleic acid (LA, C01595), eicosadienoic acid (C16525), and docosahexaenoic acid (DHA, C06429). The two pathways were linked through the conversion of PC to LA, which accounted, to a certain extent, for the major metabolomic differences between the CVD and control groups.

Fig. 4 Comparative metabolic analysis of IS, HD and control groups. (A) ANOVA of metabolites. (B) Correlation of age with metabolites. (C) Heat map of differential metabolites. (D, E) Chemical categories and annotation levels of differential metabolites.

Fig. 5 Significant metabolic pathways accounting for the difference between CVD patients and the control group. “Matched metabolites” refer to the differential metabolites identified through comparative analysis between the CVD and control groups. “Unmatched metabolites” are those involved in the main pathways yet not detected during the analysis.

Metabolic alteration in HD and IS

Given the differences in symptoms and clinical diagnoses between HD and IS patients, pairwise comparisons were made for HD versus control, IS versus control, and IS versus HD. The p-value of each metabolite feature was computed and then log10-transformed. As a result, 132 differential metabolites were identified and classified into three main groups. Nine of them were specifically associated with HD, four of which (44.44%) belonged to the benzene series. This metabolite alteration involved four perturbed metabolic pathways and one disturbed biofunction, protein synthesis (Table S2 and Fig. S5B in the Supporting Information).
Phenylacetate metabolism and phenylalanine metabolism were closely related to protein synthesis and amino acid biosynthesis.

There were 107 metabolites specific to the IS metabolic alterations, mainly lipids and lipid-like molecules (25%) and organic acids and derivatives (25%). Among the seven important pathways and six disordered biofunctions (Table S2 and Fig. S5C in the Supporting Information), glycerophospholipid metabolism was the most strongly associated, followed by α-linolenic acid and LA metabolism, biosynthesis of unsaturated fatty acids, and LA metabolism, respectively. The disturbed biofunctions were mainly membrane component and energy source.

The IS and HD groups shared 16 significant differential metabolites, half of which (50%) were lipids and lipid-like molecules. Eight perturbed pathways and four biofunctions were significantly altered (Table S2 and Fig. S5D in the Supporting Information).

To examine the differential expression of these 132 metabolites in the plasma of the IS, HD, and control subjects, the Mfuzz package [36] was used to cluster them. The clustering yielded eight clusters of significant metabolite features with distinct expression patterns (clusters 1–8; Table S3 and Figs. S6–14 in the Supporting Information). For instance, cluster 7 comprised 15 metabolites with a consistent pattern: expression was lowest in the control group, intermediate in the HD group, and nearly the highest in the IS group. The five metabolites with the largest membership values, from 0.64 down to 0.42, were (9Z,12Z,15Z)-octadecatrien-1-ol, oleic acid, linoleamide, arachidyl linoleate, and oleamide, respectively.

Assessment of CVD risk with metabolic biomarkers and machine learning

Five machine learning methods, LDA, PLS-DA, SVM, GBM and RF, were employed for the discrimination and risk assessment of CVD.
First, three crucial metabolites, 1594-pos (palmitic amide), 1698-pos (oleic acid), and 138-pos (138th feature in the positive ion mode, unannotated), were identified from the above-mentioned 164 features through recursive feature selection (Fig. 6A). When these three metabolites were fed individually to the five methods, the resulting LDA model showed the best performance (Table S4 in the Supporting Information). However, the corresponding accuracies, sensitivities and F1 scores were only just above 0.80 for both the calibration and validation sets, so this model did not seem good enough for CVD risk assessment. Therefore, a final biomarker panel was constructed that included not only palmitic amide, oleic acid and 138-pos, but also PC and LA, which link glycerophospholipid metabolism to the biosynthesis of unsaturated fatty acids, and the clinical traits of age, DBIL and IP (Fig. 6B). This panel was then used to build another five models (Table 3). The accuracy, sensitivity, specificity, F1 score, and AUC values of the LDA model were all above 0.90, which met the requirements of risk assessment. The MCC values of the validation and calibration sets were quite similar, both above 0.86. With this panel, the LDA model was able to discriminate any one of the HD, IS, and control groups from the other two. Figure 6C depicts the ROC curves, for which all the AUC values were equal or close to 1.00. This indicated good efficiency in CVD discrimination.

Fig. 6 Machine learning prediction results based on selected metabolic biomarkers for disease differentiation. (A) Recursive feature selection from metabolites. (B) Expression level distributions of the eight traits in the biomarker panel (metabolites in the first five box plots, clinical traits in the last three box plots) used for machine learning.
(C) ROC curves for risk assessment of CVD using the biomarker panel.

Table 3 Risk assessment of cardiovascular disease by five machine learning methods with the biomarker panel.
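
The performance measures reported in Tables 2 and 3 follow standard definitions. As a minimal sketch, not the authors' code, the first function below derives accuracy, sensitivity, specificity, F1 score and MCC from the counts of a binary confusion matrix. The second computes AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The function names and all numbers in the example are hypothetical and are not taken from the study.

```python
from math import sqrt


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts (positive = patient)."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


def auc_from_scores(positive_scores, negative_scores) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical counts and classifier scores, for illustration only.
example_metrics = binary_metrics(tp=18, fp=1, tn=9, fn=2)
example_auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3])
for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc: {example_auc:.3f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when the groups are unequal in size, as with the 37 patients and 20 controls here.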
