An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles

Clinical sample collection

This study was approved by the Ethics Committees of the Sun Yat-Sen University Cancer Centre, the First Affiliated Hospital of Sun Yat-Sen University and the Affiliated Cancer Hospital of Zhengzhou University. A total of 210 healthy individuals, 323 patients with benign nodules and 326 patients with lung adenocarcinoma were enrolled. Cases of lung adenocarcinoma were collected prior to lung resection surgery and had pathological confirmation. Serum from individuals with benign nodules was collected during annual physical examinations. Participants with benign nodules were defined as those with stable computed tomography (CT) scans over 3–5 years of follow-up at the time of analysis. The sample collection period was from January 2018 to October 2022. The sex of the participants was determined by self-report. Informed consent was obtained from all participants. The study design and conduct complied with all relevant regulations regarding the use of human study participants and was conducted in accordance with the principles of the Declaration of Helsinki.

In addition, we collected serum samples from 100 healthy blood donors, including 50 males and 50 females, aged between 40 and 55 years, from the Department of Cancer Prevention and Medical Examination, Sun Yat-Sen University Cancer Centre. All these samples were mixed in equal amounts, and the resultant mixture was aliquoted and stored. These mixtures were used as reference samples for quality control and data normalisation in the conventional metabolomic analysis as previously described34.

Serum metabolite extraction

Fasting blood samples were collected in serum separation tubes without the addition of anticoagulants, allowed to clot for 1 h at room temperature, and then centrifuged at 2851 × g for 10 min at 4 °C to collect the serum supernatant. The serum was aliquoted and then frozen at −80 °C until metabolite extraction. Reference serum and study samples were thawed, and a combined extraction method (methyl tert-butyl ether/methanol/water) was used to extract metabolites. Briefly, 50 μL of serum was mixed with 225 μL of ice-cold methanol and 750 μL of ice-cold methyl tert-butyl ether (MTBE). The mixture was vortexed and incubated for 1 h on ice. Then 188 μL of MS-grade water containing internal standards (13C-lactate, 13C3-pyruvate, 13C-methionine and 13C6-isoleucine, all from Cambridge Isotope Laboratories) was added and vortexed. The mixture was centrifuged at 15,000 × g for 10 min at 4 °C, and the bottom phase was transferred to two tubes (each containing 125 μL) for LC-MS analysis in positive and negative modes. Finally, the samples were dried in a high-speed vacuum concentrator.

Untargeted liquid chromatography-mass spectrometry

The dried metabolites were resuspended in 120 μL of 80% acetonitrile, vortexed for 5 min and centrifuged at 15,000 × g for 10 min at 4 °C. The supernatant was transferred to a glass amber vial with a micro insert for metabolomic analysis. Untargeted metabolomic analysis was performed on an ultra-performance liquid chromatography-high-resolution mass spectrometry (UPLC-HRMS) platform. The metabolites were separated using a Dionex Ultimate 3000 UPLC system with an ACQUITY BEH Amide column (2.1 × 100 mm, 1.7 μm, Waters). In positive mode, mobile phases A and B comprised 95% and 50% acetonitrile, respectively, both containing 10 mmol/L ammonium acetate and 0.1% formic acid.
In negative mode, mobile phases A and B comprised 95% and 50% acetonitrile, respectively, both containing 10 mmol/L ammonium acetate and adjusted to pH 9. Gradient elution was performed as follows: 0–0.5 min, 2% B; 0.5–12 min, 2–50% B; 12–14 min, 50–98% B; 14–16 min, 98% B; 16–16.1 min, 98–2% B; 16.1–20 min, 2% B. The column temperature was maintained at 40 °C, and the autosampler was set at 10 °C. The flow rate was 0.3 mL/min, and the injection volume was 3 μL. A Q-Exactive Orbitrap mass spectrometer (Thermo Fisher Scientific) with an electrospray ionisation (ESI) source was operated in full-scan mode coupled with ddMS2 monitoring mode for mass data acquisition. The following mass spectrometer settings were used: spray voltage +3.8 kV/−3.2 kV; capillary temperature 320 °C; sheath gas 40 arb; auxiliary gas 10 arb; probe heater temperature 350 °C; scan range 70–1050 m/z; resolution 70,000. Xcalibur 4.1 (Thermo Fisher Scientific) was used for data acquisition.

In this study, all serum samples were analysed by LC-MS in 10 batches. To assess data quality, a mixed quality control (QC) sample was generated by pooling 10 μL of supernatant from each sample in the batch. Six QC samples were injected at the beginning of the analytical sequence to assess the stability of the UPLC-MS system, with additional QC samples injected periodically throughout the batch. Serum pooled from 100 healthy donors was used as reference material in each batch to monitor extraction and batch effects. All untargeted metabolomic analysis was performed at the Sun Yat-Sen University Metabolomics Centre.

Public dataset collection

The raw dataset for pan-cancer lipid metabolomics data of CCLE was downloaded from the Metabolomics Workbench database with accession ST001142 (https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST001142). There are 946 samples in total, covering 23 cancer types. The quantitative lipid metabolite matrix and the DNA methylation matrix were downloaded from the appendix of the article2. The LC-MS dataset of colon cancer was downloaded from the MetaboLights database (https://www.ebi.ac.uk/metabolights/editor/MTBLS1129/descriptors), with 236 samples in total, including 197 colon cancer cases and 39 healthy controls. Because the public datasets and the private lung adenocarcinoma dataset differ in disease samples, classification purposes, instruments, and LC-MS parameters, the DeepMSProfiler model needs to be re-trained on each public dataset.

Data format conversion

The raw format files of LC-MS data were converted to mzML format using the MSConvert software. The data used to train the end-to-end model were sampled directly from the mzML files without any further processing, so the raw data could be used directly as input to the model. An mzML file records the ion intensity and mass-to-charge ratio (m/z) of each ion point for each time interval. Ion points were mapped into a 3D space by their retention time (RT) and m/z, and a 2D matrix was sampled from this 3D point array using a max-pooling convolution kernel. Sampling started at RT 0.5 min and m/z 50, with a sampling interval of 0.016 min in RT and 1 in m/z; together these defined the sampling ranges of retention time and mass-to-charge ratio. Using the sampling interval as a sliding window, the maximum ion intensity within each window was taken to obtain a two-dimensional matrix of 1024 × 1024 ion intensities. This sampling step is sketched below.
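As an illustration, the following minimal Python sketch (not the released implementation) reads an mzML file with Pyteomics and max-pools the ion points onto the 1024 × 1024 grid described above; it assumes retention times are stored in minutes and that the matrix is built from MS1 full scans.

```python
import numpy as np
from pyteomics import mzml

# Sampling grid described above: start at RT 0.5 min / m/z 50,
# bin widths of 0.016 min (RT) and 1 (m/z), yielding 1024 x 1024 bins.
RT_START, RT_STEP = 0.5, 0.016
MZ_START, MZ_STEP = 50.0, 1.0
SIZE = 1024

def mzml_to_matrix(path):
    """Max-pool raw ion points from an mzML file into a 2D intensity matrix."""
    matrix = np.zeros((SIZE, SIZE), dtype=np.float32)
    with mzml.read(path) as reader:
        for spectrum in reader:
            if spectrum.get('ms level') != 1:  # assume MS1 full scans only
                continue
            # Retention time, assumed in minutes (convert if stored in seconds).
            rt = float(spectrum['scanList']['scan'][0]['scan start time'])
            i = int((rt - RT_START) / RT_STEP)
            if not 0 <= i < SIZE:
                continue
            mz = spectrum['m/z array']
            inten = spectrum['intensity array']
            j = ((mz - MZ_START) / MZ_STEP).astype(int)
            keep = (j >= 0) & (j < SIZE)
            # Max-pooling: keep the largest intensity falling into each bin.
            np.maximum.at(matrix[i], j[keep], inten[keep])
    return matrix
```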
Extraction and annotation of metabolic peaks

We used Compound Discoverer v3.1 and TraceFinder v4.0 (Thermo Fisher Scientific) for peak alignment and extraction. These steps resulted in a matrix containing the retention time, mass-to-charge ratio and peak area of each metabolite feature. To eliminate potential batch-to-batch variation, we used the Ref-M method to correct peak areas: the peak area of each feature in a study sample was divided by the peak area of the corresponding feature in the reference sample from the same batch, yielding relative abundances. We then used several annotation tools, such as MetID, MetDNA and NetID in MZmine, to annotate the metabolite features and combined the annotation results52,58,59,60. These tools draw on mass spectral information from databases such as the HMDB, MassBank and MassBank of North America (MoNA)41,61. In addition, we performed annotation using MS2 data in Compound Discoverer v3.1 and finally selected bio-compounds with MS2 matches to certified standards, or compounds with inferred annotations based on complete matches in mzCloud (score > 85) or ChemSpider, as the precise annotation results62.

Raw data visualisation

The mzML files were read using the Pyteomics package in Python, and records were traversed for all time points in the sampling interval63. For each time index in an mzML file, the preset scan configuration, the scan window, the ion injection time, the intensity array, and the m/z array are recorded. The intensity array and m/z array were selected to form an array of data points, with retention time, mass-to-charge ratio, and intensity as the row names. The intensity values were log2 transformed. The 3D point cloud data was then visualised using the Matplotlib Toolkits package in Python64. The 2D matrices were obtained by down-sampling the 3D point cloud, pooling the 3D data with median and maximum convolution kernels spanning RT: 0.1 min and m/z: 0.001. Heatmaps and contours were plotted using Matplotlib. Retention time-intensity curves were also plotted using Matplotlib with an m/z span of 0.003.

Dataset division and assignment

The dataset of each batch was randomly divided into a training dataset and an independent testing dataset in a ratio of 4:1. The data from the first to the seventh batch contained 90 samples each, including 30 healthy individuals, 30 lung nodules, and 30 lung adenocarcinoma samples. The eighth and ninth batches did not contain healthy samples, and the tenth batch contained only nodule samples. To avoid the effect of class imbalance, we constrained the training and independent testing datasets to have the same sample type and sex ratios. Because the samples came from patients of different ages and sexes, and the lesion sizes of lung nodule and lung adenocarcinoma patients also varied, sex, age, and lesion size were used as additional constraints for dataset division so that these attributes would not bias the model. The difference in the distribution of sample attributes between the training dataset and the independent testing dataset was verified by an unpaired t-test. One way to implement such a constrained split is sketched below.
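For illustration, a constrained split of one batch could be sketched as follows with scikit-learn; the column names and age bins are hypothetical, and the authors' exact constraint procedure (which also covers lesion size) may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative sketch (not the authors' exact code): stratify the 4:1 split
# on a composite key so that class, sex, and binned age are balanced between
# the training and independent testing datasets. Column names are hypothetical.
def split_batch(samples: pd.DataFrame, seed: int = 0):
    strata = (
        samples['label'].astype(str)
        + '_' + samples['sex'].astype(str)
        + '_' + pd.cut(samples['age'], bins=[0, 50, 60, 120]).astype(str)
    )
    train, test = train_test_split(
        samples, test_size=0.2, stratify=strata, random_state=seed
    )
    return train, test
```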
Deep learning model construction in detail

In this step, we aimed to construct a model to predict the class label of each metabolomic sample. For this, we first set X and Y as the input and label spaces, respectively. A single end-to-end model consisted of three parts: a dimension converter based on pooling, a feature detector based on convolutional neural networks, and a multi-layer perceptron (MLP) classifier. The input data taken directly from the raw data were extremely large and contained substantial redundant information, so a pooling step was required to reduce the resolution for downstream convolution operations. The input data were first reduced by a maximum pooling layer to obtain D(X), and then entered the feature extractor, dominated by convolutional layers, to obtain F(D(X)). Convolutional neural networks have local displacement invariance and are therefore well suited to the RT offset problem in metabolomic data. Because of the relatively large size of the data, more than 5 layers of convolutional operations were required to reduce the dimensionality of the data to a level that the available computing power could handle. Different architectures were compared on the tuning set, including VGGNet (VGG16, VGG19), Inception (InceptionV3), ResNet (ResNet50), DenseNet (DenseNet121), and EfficientNet (EfficientNetB0-B5)33,65,66,67. In addition, two optimised models based on DenseNet121 were created to simplify the DenseNet network: the direct connection route replaced the last dense layer of DenseNet121 with a convolutional layer, and the optimisation route replaced the last dense layer with a convolutional layer that retained a one-hop connection. The pre-training parameters of pre-trained models were derived from ImageNet. Each architecture was tested on both the TensorFlow + Keras platform and the PyTorch platform. To reduce overfitting, we used only one linear layer in the MLP classifier; in the TensorFlow + Keras model, a SoftMax activation layer preceded the output layer. The output of the model was C(F(D(X))). The positive- and negative-mode spectral data used separate convolutional layers for feature extraction, whose pre-training parameters were shared, and their features were concatenated before the fully connected layer. For a model trained on both positive- and negative-mode spectral data, a cross-entropy loss was used.

Model training

20% of the discovery dataset was set aside as a tuning set, which was used to guide model architecture selection, hyperparameter selection, and model optimisation; the remaining 80% was used for model training. Sample category balance was used as a constraint for this split. Model architectures were evaluated by both validation performance and operational performance: we counted the number of model parameters to assess model complexity, and the average of 10 runs was used as the runtime. Hyperas was used to preliminarily select the optimal model hyperparameters and initialisation weights68. The optimal initialisation method was he_normal, but we opted for pre-training on the ImageNet dataset because of its comparable performance and faster execution. After reducing the size of the parameter search space, we used grid search for hyperparameter tuning. A minimal sketch of the two-branch model is shown below.
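The following Keras sketch illustrates the three-part, two-branch design described above (maximum pooling converter, a DenseNet121 feature extractor per ionisation mode, feature concatenation, and a single SoftMax-activated linear layer). It is a minimal illustration under assumed input shapes and channel handling, not the released model code; here each branch is simply initialised from the same ImageNet weights.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Minimal sketch of the three-part model: dimension converter (max pooling)
# -> CNN feature extractor (DenseNet121) -> one-layer MLP classifier, with
# separate positive/negative-mode branches fused before the classifier.
def build_branch(name):
    inp = layers.Input(shape=(1024, 1024, 1), name=f'{name}_input')
    x = layers.MaxPooling2D(pool_size=4, name=f'{name}_pool')(inp)   # dimension converter
    x = layers.Conv2D(3, 1, padding='same', name=f'{name}_to_rgb')(x)  # match the 3-channel backbone
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights='imagenet', pooling='avg'
    )
    backbone._name = f'{name}_densenet121'  # avoid sub-model name clashes
    return inp, backbone(x)

pos_in, pos_feat = build_branch('pos')
neg_in, neg_feat = build_branch('neg')
feat = layers.Concatenate()([pos_feat, neg_feat])    # fuse modes before the MLP
out = layers.Dense(3, activation='softmax')(feat)    # single linear layer + SoftMax
model = Model([pos_in, neg_in], out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```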
Ensemble strategy

DeepMSProfiler is an ensemble of several trained end-to-end sub-models: the average of the classification prediction probabilities from all sub-models was used as the final prediction probability. The ensemble model thus computed a score vector of length 3, one score per class, and the category with the maximum score was selected as the predicted class. Each end-to-end sub-model was trained on the discovery dataset. The architecture of each sub-model was the same, but some hyperparameters differed: two learning rates of 1e-3 and 1e-4 were used. The optimiser was Adam ('adam') with beta_1 = 0.9, beta_2 = 0.999, epsilon = 0.001, decay = 0.0, and amsgrad = False. The batch size was set to 8 and training ran for 200 epochs. A model typically took about 2 h to train on a GP100GL (16 GB) GPU server. Each sub-model contributed equally to the final prediction, without specific weights. The independent testing dataset was not used in model training or hyperparameter selection.

Machine learning models for comparison

To compare DeepMSProfiler with existing tools, we selected several common traditional machine learning methods to build tri-classification models on the peak area data obtained in the previous steps. These methods included Extreme Gradient Boosting (XGBoost), RF, Adaptive Boosting (AdaBoost), SVM, and DNN. The training and independent testing datasets were divided in the same way as for the deep learning algorithm, and the numbers of estimators for the AdaBoost and XGBoost algorithms were the same as for DeepMSProfiler. XGBoost was implemented with the XGBClassifier function of the xgboost library; the other methods were implemented with the scikit-learn library. SVM was adopted using the svm function with a 'linear' kernel, RF was implemented through the RandomForestClassifier function, AdaBoost through the AdaBoostClassifier function, and DNN through the MLPClassifier function. Optimal hyperparameters were obtained by grid search.

Performance metrics

We evaluated the performance of the model on the independent testing dataset. The evaluation metrics included accuracy, precision, sensitivity and F1 score, with micro-averaging used for the multi-class model. Confidence intervals were estimated using 1000 bootstrap iterations; during the bootstrapping procedure, our model was an ensemble of 20 models trained on the discovery dataset. In addition, we calculated confusion matrices and ROC curves with AUC values to demonstrate the performance of the model on the three classes of lung adenocarcinoma, benign nodules and healthy individuals. Specificity at sensitivities of 0.7 and 0.9 was computed from the sensitivity-specificity curve, which was interpolated using the NEAST method. The bootstrap procedure is sketched below.
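A minimal sketch of the bootstrap estimate, assuming per-sample predictions on the independent testing dataset are already available (accuracy is used here as the example metric):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative sketch of the bootstrap confidence interval: resample the
# independent testing dataset 1000 times and recompute the metric each time.
def bootstrap_ci(y_true, y_pred, n_iter=1000, alpha=0.95, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_iter):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample with replacement
        scores.append(accuracy_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [(1 - alpha) / 2 * 100, (1 + alpha) / 2 * 100])
    return np.mean(scores), (lo, hi)
```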
Visualisation of "black-box"

In the end-to-end neural network, data flowed along the chain X → D(X) → F(D(X)) → C(F(D(X))) from the input layer through the hidden layers to the output layer. In the feature extraction layers, dominated by convolutional layers, information was passed in the same chain manner. After inputting X, we obtained the corresponding output L of different hidden layers to open the black-box process. To observe the space of intermediate features, PCA was used to reduce L to its principal components, and the PCA result was visualised with the Matplotlib package in Python.

To evaluate the correlation of a hidden layer's output with the batch label and the type label, respectively, we calculated the normalised mutual information (NMI), adjusted Rand index (ARI), and average silhouette coefficient (ASC) using the following formulas, where L is the layer output and C denotes the cluster labels used for cluster evaluation.

$${\rm{MI}}\left({\bf{L}},{\bf{C}}\right)=\sum_{i}^{Max_{L}}\sum_{j}^{Max_{C}}P_{i,j}\log\frac{P_{i,j}}{P_{i}P_{j}}$$
(1)
$${\rm{NMI}}\left({\bf{L}},{\bf{C}}\right)=\frac{2\,{\rm{MI}}\left({\bf{L}},{\bf{C}}\right)}{H\left({\bf{L}}\right)+H\left({\bf{C}}\right)}$$
(2)
In the above equations, the mutual information (MI) is computed from the layer outputs L and the cluster labels C. \(P_{i,j}\) is the joint distribution probability of i and j, \(P_{i}\) and \(P_{j}\) are the marginal distribution probabilities of i and j, and \(H\left({\bf{L}}\right)\) and \(H\left({\bf{C}}\right)\) are the entropies of L and C, respectively. The clusters of the output layer were obtained with the K-nearest neighbour algorithm.

$${\rm{ARI}}\left({\bf{L}},{\bf{C}}\right)=\frac{2\left({TP}\times{TN}-{FN}\times{FP}\right)}{\left({TP}+{FN}\right)\left({FN}+{TN}\right)+\left({TP}+{FP}\right)\left({FP}+{TN}\right)}$$
(3)
In the above equation, TP is the number of point pairs that belong to the same cluster in both the real and experimental partitions, FN the number of pairs in the same cluster in the real partition but not in the experimental one, FP the number of pairs in the same cluster in the experimental partition but not in the real one, and TN the number of pairs in different clusters in both. The range of ARI is [−1, 1], and larger values indicate closer agreement with the real partition, that is, better clustering.

$${\rm{ASC}}\left({\bf{L}},{\bf{C}}\right)=\frac{1}{N}\sum_{i}^{N}\frac{b_{i}-a_{i}}{\max\left[a_{i},\,b_{i}\right]}$$
(4)

The output layer was first dimensionally reduced by PCA, and the clusters were specified by the real labels. In the above equation, \(a_{i}\) is the average dissimilarity of vector i to the other points in the same cluster, and \(b_{i}\) is the minimum dissimilarity of vector i to the points of other clusters. All three clustering metrics are also available in scikit-learn, as sketched below.
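For reference, the three metrics can be computed directly with scikit-learn; the arrays below are toy stand-ins for the PCA-reduced layer outputs, the clustering result, and the batch or disease-type labels.

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

# Illustrative toy example: score how strongly a (PCA-reduced) hidden-layer
# output clusters by a ground-truth grouping. Real inputs would be the layer
# outputs and the batch/type labels.
layer_pca = np.random.rand(90, 2)                     # stand-in for PCA-reduced layer output
true_labels = np.repeat([0, 1, 2], 30)                # stand-in for batch/type labels
cluster_labels = (layer_pca[:, 0] > 0.5).astype(int)  # stand-in clustering result

nmi = normalized_mutual_info_score(true_labels, cluster_labels)  # NMI
ari = adjusted_rand_score(true_labels, cluster_labels)           # ARI
asc = silhouette_score(layer_pca, true_labels)  # ASC, clusters from real labels
```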
Feature contributions

In previous studies, different lines of work used different methods to compute feature contributions to the final classification; such methods help in understanding features and their impact on model predictions. Gradient-based methods, such as GradCAM, calculate the gradients of the last convolutional layer by backpropagation from the category with the highest confidence69. Owing to its convenience, this approach is widely used in computer vision tasks, but it has a significant drawback: the resolution of the feature contribution heatmap is extremely low, far below what is needed to distinguish most signal peaks, because the size of the heatmap corresponds to the last convolutional layer of the model and the weight of each feature is the average of the gradients over all features. Perturbation-based methods, such as RISE and Local Interpretable Model-Agnostic Explanations (LIME), instead measure the importance of features by obscuring some pixels in the raw data and observing how much this affects the prediction of the black box37,70. Perturbation-based methods yield higher resolution and more accurate contribution estimates, but their runtimes are longer. To improve computing speed in this study, we made some improvements to RISE, using boost sampling for the mask process.

Using RISE, we determined the feature contributions over RT and m/z for each sample according to its true category. The feature contribution heatmap uses RT as the horizontal axis and m/z as the vertical axis to show the feature contribution at different positions for each sample. The average feature contribution of all samples correctly predicted as their true category was taken as the feature contribution of that category. Meanwhile, from the peak extraction in the previous steps, we determined the RT range and the median m/z of each signal peak; the feature contribution at the corresponding RT and median m/z coordinates was then taken as the contribution of that signal peak. The core of the RISE procedure is sketched below.
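The following sketch illustrates the core RISE computation (random low-resolution masks, prediction-weighted averaging) for a single-channel input and a Keras-style model with a `predict` method; the authors' boost-sampling modification is omitted here.

```python
import numpy as np

# Minimal sketch of the RISE idea: occlude the input with random
# low-resolution masks and weight each mask by the model's predicted
# probability for the true class.
def rise_saliency(model, x, class_idx, n_masks=2000, grid=16, p_keep=0.5, seed=0):
    rng = np.random.default_rng(seed)
    h, w = x.shape[:2]
    saliency = np.zeros((h, w), dtype=np.float64)
    for _ in range(n_masks):
        # Coarse binary mask, upsampled to the input resolution.
        coarse = rng.random((grid, grid)) < p_keep
        mask = np.kron(coarse, np.ones((h // grid, w // grid)))
        masked = x * mask[..., None]          # obscure part of the raw data
        prob = model.predict(masked[None], verbose=0)[0, class_idx]
        saliency += prob * mask               # accumulate masks weighted by prediction
    return saliency / n_masks
```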
Network analysis and pathway enrichment

The extracted metabolic peaks with a contribution score greater than 0.70 for the lung cancer classification were retained. The mass-to-charge ratios and available identification information of these metabolites, together with their classification contribution scores, were used as input data. For some of the metabolic signal peaks, we had precisely identified molecular formulae and compound names by secondary mass spectrometry; owing to the limitations of existing databases, however, many unknown signals cannot be identified in this way. Therefore, PIUMet was also adopted to search for hidden metabolites and related proteins.

PIUMet builds disease-related metabolite-protein networks based on the prize-collecting Steiner forest algorithm. First, PIUMet integrated the iRefIndex (v13), HMDB (v3) and Recon2 databases to obtain the relationships between m/z values, metabolites and proteins, and generated an overall metabolite-protein network. The edges were weighted by the confidence of the correlation reported in these databases: iRefIndex provides details on physical interactions between proteins detected by different experimental methods; the protein-metabolite relationships in Recon2 are based on involvement in the same reactions; and HMDB includes proteins involved in metabolic pathways or enzymatic reactions, as well as metabolites that play a role in protein pathways, based on known biochemical and molecular biological data. The disease-related metabolite peaks obtained by DeepMSProfiler were matched to the metabolite nodes of the overall network by their m/z, or directly to the terminal metabolite nodes after annotation. The corresponding feature contributions obtained by DeepMSProfiler served as prizes for these metabolite nodes. The network structure was then optimised using the prize-collecting Steiner forest algorithm to minimise the total network cost while connecting all terminal nodes, thereby removing low-confidence relationships and yielding disease-related metabolite sub-networks.

Metabolite identification is an important issue in metabolomics research, and there are different levels of confidence in identification. Referring to the highest level considered71, we analysed authentic chemical standards and validated 11 of the metabolites discovered by PIUMet from m/z alone (Supplementary Table 5). The disease-related metabolites and proteins were then used for pathway analysis39. The hidden metabolites and proteins from PIUMet were subjected to KEGG pathway enrichment analysis using MetaboAnalyst (v6.0): we used the joint pathway analysis module with the hypergeometric test for enrichment and degree centrality as the topology measure, on the integrated version of the KEGG pathways (year 2019). Pathways were filtered using a p-value cut-off of 1e-5 72. The SYMBOL IDs of the proteins were converted to KEGG IDs with the ClusterProfiler package in R73.

Ablation experiment

We searched the PubMed database and retrieved a total of 5088 articles published from 2010 to 2022 using the terms "serum", "lung cancer" and "metabolism". By reading their titles and abstracts, we excluded publications that used non-serum samples such as urine and tissue, as well as publications that used non-mass-spectrometry methods such as chromatography, nuclear magnetic resonance, and infrared spectroscopy. We then further screened the selected literature to exclude studies that did not report metabolic biomarkers. Finally, 49 publications remained, reporting 811 serum metabolic biomarkers for lung cancer. Some of the publications provide the retention time and mass-to-charge ratio of the biomarkers, whereas others give only the name of the identified biomarker; for the latter, we looked up the molecular weights of these metabolites in the HMDB database to match the corresponding m/z, taking full account of the effect of adducts. Based on the number of publications reporting each biomarker, we determined the range of retained signals as the m/z values corresponding to biomarkers that exceeded a threshold number of publications. We then filtered the raw data to exclude signals that did not fall into the 3 ppm intervals around these m/z values, and the filtered raw data were used as input to the model. Such a filter can be sketched as follows.
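A minimal sketch of this m/z filter, with hypothetical inputs:

```python
import numpy as np

# Illustrative sketch of the ablation filter: keep only ion points whose m/z
# lies within +/- 3 ppm of a literature biomarker m/z.
def ppm_filter(mz_array, intensity_array, biomarker_mz, ppm=3.0):
    mz_array = np.asarray(mz_array, dtype=float)
    keep = np.zeros(mz_array.shape, dtype=bool)
    for target in biomarker_mz:
        tol = target * ppm * 1e-6           # 3 ppm window half-width
        keep |= np.abs(mz_array - target) <= tol
    out = np.asarray(intensity_array, dtype=float).copy()
    out[~keep] = 0.0                        # zero out signals outside all windows
    return out
```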
Statistical analysis

All statistical analyses were performed using the stats package in Python. Data distributions were plotted using the Seaborn package in Python. The correlation between patient information and labels was calculated using Pearson's, Spearman's and Kendall's correlation coefficients: Pearson's correlation coefficient was preferred for linear correlations, and Spearman's and Kendall's rank correlation coefficients were used to detect non-linear correlations. P-values below 0.05 were considered significant.

Figure preparation

The main figures in this paper were assembled in Adobe Illustrator. The photo of the mass spectrometry instruments was taken of the actual equipment. The data structure diagrams were obtained by fitting simulated functions in Python. Some cartoon components were drawn with FigDraw (www.figdraw.com) under a free-use license (code TAIRAff614).

AI-assisted technologies in the writing process

At the end of the preparation of this work, the authors used ChatGPT to proofread the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
