Using neural networks to obtain NMR spectra of both small and macromolecules from blood samples in a single experiment

Dataset

Separating signals in the 1H NOESY-presat spectra of blood samples involves distinguishing overlapping peaks of small molecules, identifying low signal-to-noise spectral peaks that overlap with the broad peak in the low-field region of albumin, and dealing with the discrepancy between the high-intensity peaks of lipid chains and the bimodal peaks of lactate. To address these complex scenarios, this study carefully designed factors such as the dataset, neural network architecture, and loss function.

To effectively train neural network models and recognize underlying data patterns, extensive training with large datasets is crucial. Optimal generalization performance in real-world applications also depends on training and validation datasets that closely resemble the actual task. To achieve this, and given the similar composition of serum and plasma28, we randomly selected a 1H NOESY-presat spectrum of plasma (heparin) as the base data for training. In addition, a set of serum samples was selected as the validation dataset to optimize the hyperparameters of the model and assess its performance. These serum samples were subjected to two different 1H NMR experiments, i.e., NOESY-presat and CPMG-presat.

To generate the training dataset, the first step is to extract peak information from the 1H NOESY-presat spectrum using the peak_widths and find_peaks functions provided by the scipy.signal library29. The peak parameters extracted from the selected plasma NOESY-presat spectrum, including resonance frequency, linewidth, and peak intensity, were then randomly adjusted within a defined range. The threshold for the linewidth at half-height of the peaks distinguishing between small and macromolecules was 3.66 Hz (Supplementary Note 1 and Supplementary Fig. 1).
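The peak extraction and linewidth-based classification just described can be sketched as follows. This is a minimal illustration, not the authors' code: the helper `extract_peaks`, the toy Lorentzian spectrum, the digital resolution, and the `min_height` threshold are all assumptions; only `find_peaks`, `peak_widths`, and the 3.66 Hz cutoff come from the text.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths


def extract_peaks(spectrum, hz_per_point, min_height, linewidth_cutoff_hz=3.66):
    """Locate peaks and classify them by linewidth (hypothetical helper)."""
    idx, props = find_peaks(spectrum, height=min_height)
    # rel_height=0.5 measures the width at half of each peak's prominence,
    # which approximates the half-height linewidth on a flat baseline.
    widths_pts = peak_widths(spectrum, idx, rel_height=0.5)[0]
    widths_hz = widths_pts * hz_per_point
    return {
        "position_hz": idx * hz_per_point,
        "height": props["peak_heights"],
        "linewidth_hz": widths_hz,
        "is_small": widths_hz < linewidth_cutoff_hz,  # narrow = small molecule
    }


def lorentzian(freq, center, fwhm, amplitude):
    """Lorentzian line with full width at half maximum `fwhm`."""
    hwhm = fwhm / 2.0
    return amplitude * hwhm**2 / ((freq - center) ** 2 + hwhm**2)


# Toy spectrum (assumed values): one narrow and one broad line.
hz_per_point = 0.1
freq = np.arange(0, 400, hz_per_point)
spectrum = lorentzian(freq, 100.0, 1.0, 1.0) + lorentzian(freq, 300.0, 20.0, 1.0)

peaks = extract_peaks(spectrum, hz_per_point, min_height=0.5)
print(peaks["linewidth_hz"], peaks["is_small"])
```

In the study, the extracted frequency, linewidth, and intensity values are subsequently perturbed at random; that step is not shown here because the perturbation ranges are not given in this section.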
By applying the free induction decay (FID) signal formula, we simulated spectral peaks with various parameter distributions to comprehensively model the signal distribution of potential small and macromolecules in blood sample 1H NOESY-presat spectra. In addition, random noise was introduced to make simulated spectra more closely resemble real ones. The input features of the training dataset consist of simulated spectra containing signals from both small and macromolecules, while the output labels represent the corresponding simulated spectra of macromolecules. By subtracting the latter from the former, we can obtain the spectrum of small molecules, thus achieving simultaneous acquisition of signals from both small and macromolecules (Supplementary Note 2 and Supplementary Fig. 2).

The validation dataset used authentic 1H NOESY-presat spectra as input features, with output labels derived from small molecule peaks extracted from CPMG-presat spectra using the peak_widths and find_peaks functions from the scipy.signal library29. Reconstruction of the small molecule spectra was achieved by applying the FID signal equation, with a cut-off value of 3.66 Hz (“Theory” section, Supplementary Fig. 1). This dataset was crucial to ensure that the model could accurately handle real NOESY-presat spectra and produce reliable results30.

SENNet model

Using the analogy of image segmentation31, we tackled the problem of discriminating between small and macromolecular signals in 1H NOESY-presat spectra by adapting the Unet architecture, originally designed for medical imaging tasks32. The basic version of Unet, which includes downsampling, upsampling and skip connections, was designed to extract features from two-dimensional images. Our modified architecture, called SENNet, is suitable for 1D NMR spectra, and utilizes peak linewidth information to ensure precise spectral editing, as shown in Fig. 1.
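Returning to the training data described in the Dataset subsection, the construction of one simulated training pair can be sketched as follows. This is only an illustration under stated assumptions: the FID model (a sum of exponentially decaying complex sinusoids), the peak lists, the noise level, and the reduced number of points are hypothetical, and the exact formula used in the study is the one given in its Theory section. The 12,000 Hz spectral width is taken from the Fig. 1 caption.

```python
import numpy as np


def simulate_spectrum(peaks, sw_hz=12000.0, n_points=16384):
    """Real spectrum from a sum of decaying complex exponentials.

    peaks: iterable of (offset_hz, linewidth_hz, amplitude), where the
    offset is measured from the center of the spectrum (assumed layout).
    """
    t = np.arange(n_points) / sw_hz
    fid = np.zeros(n_points, dtype=complex)
    for f0, lw, amp in peaks:
        # A Lorentzian with half-height linewidth lw decays as exp(-pi*lw*t).
        fid += amp * np.exp(2j * np.pi * f0 * t) * np.exp(-np.pi * lw * t)
    fid[0] *= 0.5  # halve the first point to limit baseline offset
    return np.fft.fftshift(np.fft.fft(fid)).real


rng = np.random.default_rng(0)

# Hypothetical peak lists: narrow lines (small) and broad lines (macro).
small_peaks = [(-1500.0, 1.2, 1.0), (800.0, 1.5, 0.6)]
macro_peaks = [(-1400.0, 40.0, 5.0), (900.0, 60.0, 3.0)]

label = simulate_spectrum(macro_peaks)                # training label
clean = simulate_spectrum(small_peaks + macro_peaks)  # noise-free mixture
noise = rng.normal(0.0, 0.01 * np.abs(clean).max(), size=clean.shape)
model_input = clean + noise                           # training input

# Subtracting the label from the input leaves small-molecule signal plus noise.
small_estimate = model_input - label
```

Because the Fourier transform is linear, `clean - label` equals the spectrum simulated from `small_peaks` alone, which is the property the subtraction step relies on.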
The specific parameters of the basic building blocks are given in Supplementary Table 1 and the detailed parameter scales of these building blocks are given in Supplementary Table 2 (Supplementary Note 3).

Fig. 1: The SENNet architecture is a modification of the classical Unet and consists of three elements: downsampling, upsampling, and skip connections. The input features of the data are intensity-normalized NOESY-presat spectra with a spectral width of 12,000 Hz and 128 k data points, while the output labels represent the spectra of the corresponding macromolecules. Subtracting the latter from the former gives the spectrum of the corresponding small molecules, allowing the spectrum to be edited. The downsampling module (red arrows) consists of two successive 1D convolutional layers and a max pooling layer. These downsampling modules progressively reduce the size of the input spectrum from 128 k through 32 k, 8 k, 2 k, 512, 128, 32, 8, and 2 to 1 data points, while simultaneously increasing the number of channels. The upsampling module (dark yellow arrows) consists of two successive 1D convolutional layers and a 1D transposed convolutional layer. In contrast to the downsampling modules, the upsampling modules progressively increase the size of the spectrum while reducing the number of channels. The output of each downsampling module is channel-concatenated with the output of the corresponding upsampling module via skip connections (black arrows). The final layer (light yellow arrow) consists of three successive 1D convolutional layers that transform the output into a single channel of 128 k data points.

Since the output spectrum of the model represents the signal from the macromolecules, the ideal output should be smooth, with minimal sharp spikes relative to the label. Therefore, the loss function used during training focuses on minimizing errors in regions where the output fluctuates sharply.
The loss function combines total variation error (TVE) and normalized mean squared error (NMSE). Total variation (TV) regularization is commonly used in computer vision tasks to suppress unwanted noise33,34. The training loss function is formulated as follows:

$$\mathrm{loss}=\frac{w}{n}\sum_{i=2}^{n}\left|x_{i}-x_{i-1}\right|+\mathrm{NMSE}$$

In this formula, “x” is the difference between the output and label spectra, with x_i denoting its i-th of n data points, NMSE is the normalized mean squared error between the output and label, and “w” is the weight assigned to the TVE term.

In this study, we used the modified TVE as the loss function to train the SENNet model. To optimize the hyperparameters of the model and evaluate its performance, we used the validation dataset consisting of NMR spectra of 113 serum samples obtained from the MetaboLights database35 (accession number MTBLS37411).

To determine the optimal value for parameter “w”, multiple tests were conducted with a fixed number of training iterations, followed by the calculation of the Pearson correlation coefficient between the model’s output small molecule spectra and the small molecule signals in the validation dataset. It was found that a value of 22 for “w” yielded the best performance, i.e., the small molecule spectra generated by the model were closest to the small molecule spectra obtained from the CPMG-presat experiments (Supplementary Note 4 and Supplementary Fig. 4). Consequently, this value was chosen as the final parameter for the SENNet model. After training, the Pearson correlation coefficient between the small molecule average bin spectra (1.8 Hz) generated by SENNet and the small molecule bin spectra (1.8 Hz) reconstructed from the averaged CPMG-presat spectrum was 92.6% (Supplementary Fig. 3).

It is worth noting that the reconstructed small molecule spectrum represents the signals of small molecules identifiable in CPMG-presat spectra using the find_peaks and peak_widths functions from the scipy.signal library29.
These two functions exclude signals below a certain intensity and do not accurately extract the intensity of overlapping peaks (Supplementary Figs. 3 and 5 and Supplementary Notes 2 and 5). While the signals in the reconstructed small molecule spectrum may differ slightly from the actual small molecule signals, this correlation analysis partially validates the performance of the model and underscores the need for our approach.

To systematically assess the generalization ability of SENNet across different spectra and to demonstrate the applicability of the model, we applied the trained SENNet model to several datasets (Table 1). These datasets were derived from NMR metabolomics studies performed on 600 MHz and 700 MHz NMR spectrometers and included samples such as plasma (heparin, EDTA) and serum. First, the NOESY-presat spectra were processed using the trained SENNet to generate small molecule and macromolecule spectra. Then, the generated spectra were compared to experimental CPMG-presat and diffusion-edited spectra in terms of peak intensity and principal component analysis (PCA) results.

Table 1 Sample information for datasets

Application of SENNet to plasma samples

To demonstrate the ability of SENNet to process NOESY-presat spectra of plasma samples (heparin), we analyzed the spectra of 120 plasma samples acquired on a 600 MHz NMR spectrometer (600-plasma-heparin dataset). The original study used 1H NMR spectra and PCA to investigate ibuprofen-plasma interactions36. As shown in Fig. 2, SENNet effectively discriminated between signals from small and macromolecules in the NOESY-presat spectra. In the δ 0.7–1.3 region, it accurately discriminated high-intensity signals from lactate and lipid chains, while detecting lower-intensity signals from free amino acids.
In the δ 3.0–4.5 region, the model accurately isolated several small molecule metabolites within complex overlapping regions and handled broad baselines well. Furthermore, in the δ 6.5–8.5 region, it identified 1H signals from aromatic rings and albumin in low-intensity, low signal-to-noise regions.

Fig. 2: A 1H NOESY-presat spectrum of a plasma sample acquired on a 600 MHz NMR spectrometer was processed using the SENNet model. The model effectively separated peaks with larger linewidths at half-height, as shown by the orange dashed line (macro), allowing the extraction of a small molecule spectrum similar to that obtained from the CPMG-presat experiment (CPMG). A shows the range from 5.4 ppm to 0.7 ppm, in which the model accurately discriminated between small molecule and macromolecule signals in several complex situations. B shows SENNet’s effective isolation of 1H signals from aromatic rings and albumin in low signal-to-noise regions (8.6–5.65 ppm), magnified 18 times for clarity. The blue solid line represents the NOESY-presat spectrum (NOESY), the orange dashed line represents macromolecule signals separated by SENNet (macro), the green solid line represents the CPMG-presat spectrum (CPMG), the red solid line represents small molecule signals separated by SENNet (small), and the purple solid line represents the 1D diffusion-edited spectrum (LEDBP).

To assess the quantitative capability of the SENNet model, we selected ten peaks that were not affected by protein signals. We normalized the intensities of these peaks in the CPMG-presat spectra and the SENNet-extracted spectra, using the lactate peak at 4.135 ppm as a reference (Peak 0), and compared their Pearson correlation coefficients and regression coefficients (slopes).
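The per-peak comparison described above can be sketched as follows. The helper `compare_peak` and all intensities are synthetic assumptions chosen for illustration; the sketch only shows how normalizing both spectra to a reference peak and regressing the CPMG values on the SENNet values yields a correlation coefficient and a slope.

```python
import numpy as np
from scipy.stats import linregress


def compare_peak(sennet_peak, cpmg_peak, sennet_ref, cpmg_ref):
    """Pearson r and slope for one peak across samples (hypothetical helper).

    Each intensity is first divided by the reference-peak intensity from
    the same spectrum; the SENNet-derived values are placed on the x-axis.
    """
    x = np.asarray(sennet_peak, dtype=float) / np.asarray(sennet_ref, dtype=float)
    y = np.asarray(cpmg_peak, dtype=float) / np.asarray(cpmg_ref, dtype=float)
    fit = linregress(x, y)
    return fit.rvalue, fit.slope


rng = np.random.default_rng(0)
n_samples = 120

# Synthetic intensities (assumed): in the CPMG spectra the reference peak is
# attenuated to 50% and the peak of interest to 75% of its SENNet intensity.
sennet_ref = rng.uniform(0.8, 1.2, n_samples)
sennet_peak = rng.uniform(0.1, 0.5, n_samples) * sennet_ref
cpmg_ref = 0.50 * sennet_ref
cpmg_peak = 0.75 * sennet_peak

r, slope = compare_peak(sennet_peak, cpmg_peak, sennet_ref, cpmg_ref)
print(f"r = {r:.3f}, slope = {slope:.2f}")
```

In this toy setup the slope equals the ratio of the two attenuation factors (0.75 / 0.50), so a reference peak that loses more signal than the peak of interest pushes the slope above 1 without reducing the correlation.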
The Pearson correlation coefficients and slopes of these ten peaks are as follows (with the small molecule spectra extracted by the SENNet model on the x-axis): Peak I (δ 3.914): 0.996, 1.54; Peak II (δ 3.857): 0.996, 1.50; Peak III (δ 3.738): 0.996, 1.52; Peak IV (δ 3.268): 0.996, 1.45; Peak V (δ 1.501): 0.993, 1.41; Peak VI (δ 8.478): 0.927, 1.54; Peak VII (δ 7.821): 0.981, 1.56; Peak VIII (δ 7.209): 0.984, 1.47; Peak IX (δ 7.087): 0.994, 1.30; and Peak X (δ 6.915): 0.984, 1.48. Scatter plots of these correlation analyses are shown in Supplementary Fig. 6 (Supplementary Note 6). The Pearson correlation coefficients of peak intensities between the small molecule spectra and the CPMG-presat spectra are close to 1.0, indicating a strong association between them. In addition, the regression slope of the peak intensities was approximately 1.5 when the small molecule signal extracted by SENNet was placed on the x-axis. This can be attributed to the 100 ms T2 relaxation effect of the signals in the CPMG-presat experiment and the 100 ms mixing time in the NOESY-presat experiment (noesypr1d). Specifically, lactate has a faster T2 relaxation at a total spin echo time of 100 ms compared to these ten peaks.

Figure 2 shows the differences between the CPMG-presat spectrum (CPMG) and the small molecule spectrum extracted by SENNet (small). The CPMG spectrum has several broader peaks attributed to incompletely attenuated lipoprotein signals, whereas SENNet excludes signals with larger linewidths from the small molecule signals and categorizes them as macromolecular signals (macro). In addition, Fig. 2 shows the peaks with wider linewidths (macro) obtained by processing NOESY-presat spectra with the SENNet model. In the 5.4–0.7 ppm region of Fig. 2, the extracted macromolecular signals (macro) are similar to the experimental spectra (LEDBP), although not completely identical.
This difference can be attributed to the diffusion-edited experiment, which effectively removes small molecule signals but also attenuates signals from albumin37. Based on the results presented in Fig. 2, we concluded that SENNet can accurately and effectively identify larger linewidth peaks in plasma 1H NOESY-presat spectra with a high degree of similarity to the experimental data.

To further demonstrate the power of SENNet, we performed PCA on the spectra extracted from the 120 plasma NOESY-presat spectra (small and macro) and, in the same way, on the experimental spectra (CPMG and LEDBP). Each of these four datasets consisted of two groups of 60 samples each, one with and one without ibuprofen. In the CPMG-presat spectra, the cumulative explained variance of the first three principal components (PCs) reached 89.64% (Fig. 3A). Similarly, in the small molecule spectra extracted by SENNet (small), the cumulative explained variance of the first three PCs reached 90.74% (Fig. 3C). In addition, slight differences were observed between the PCA of the diffusion-edited (LEDBP) spectra and that of the SENNet-extracted macromolecular signals (macro), as shown in Fig. 3B, D.

Fig. 3: The PCA score plots of experimental spectra (CPMG and LEDBP) and SENNet-extracted spectra (small and macro) for 120 plasma samples divided into two groups of 60 samples each. These plots are based on CPMG-presat spectra (89.64%) (A), LEDBP spectra (94.04%) (B), small molecule signals (90.74%) (C), and macromolecular signals (94.82%) (D) obtained by processing NOESY-presat spectra. Comparison of the score plots of the SENNet-extracted spectra with the experimental NMR spectra (CPMG and LEDBP) showed similar sample distribution patterns and clusters within the dataset. Red triangle symbols represent samples with ibuprofen, while blue circle symbols represent samples without ibuprofen.

Comparing Fig. 3A (CPMG) with Fig. 3C (small) and Fig. 3B (LEDBP) with Fig. 3D (macro), we observed that the PCA score plots have similar sample distribution patterns and clustering (Fig. 3). These results suggest that SENNet’s separation of small and macromolecule data achieves sample grouping and pattern recognition similar to conventional NMR spectral editing methods, albeit with some nuances. The subtle differences between Fig. 3A (CPMG) and Fig. 3C (small) are due to the retention of some macromolecular signals, such as lipoproteins, in the CPMG spectra. The small difference between Fig. 3B (LEDBP) and Fig. 3D (macro) arises because SENNet captured all macromolecular signals, whereas LEDBP lost some signals from albumin. These differences are illustrated in Fig. 2.

We have also used small molecule spectra extracted by SENNet to predict drug-plasma interactions, which could be useful in guiding personalized therapy. The ibuprofen-induced change in Euclidean distance can be defined as an interaction index that measures the strength of the drug-plasma interaction36. In this study, we evaluated the Pearson correlation coefficient between the total-sum-normalized spectral bin integrals (1.8 Hz) and the high-dimensional Euclidean distances for 60 pairs of plasma samples, where one sample of each pair was supplemented with ibuprofen and the other was not. This coefficient measures the relationship between the Euclidean distances of the paired samples and the signal intensities in the absence of ibuprofen, thus capturing the effects induced by ibuprofen36,38.

Figure 4 displays a heat map of the Pearson correlation coefficients between the Euclidean distances and the spectral bin integrals obtained from the CPMG-presat experiments and SENNet, respectively, where the color scale represents the absolute value of the Pearson correlation coefficient between the signal bin integrals and the change in Euclidean distance.
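One plausible reading of this correlation analysis is sketched below. The function names, array shapes, and random data are assumptions made for illustration, and the cited studies may define the interaction index differently in detail: each pair contributes one Euclidean distance between its total-sum-normalized bin vectors, and each bin's intensity in the sample without ibuprofen is then correlated with those distances across the pairs.

```python
import numpy as np


def total_sum_normalize(bins):
    """Scale each spectrum (row) so that its bin integrals sum to 1."""
    bins = np.asarray(bins, dtype=float)
    return bins / bins.sum(axis=1, keepdims=True)


def distance_bin_correlation(without_drug, with_drug):
    """Per-pair Euclidean distance and its Pearson r with every bin.

    Both inputs have shape (n_pairs, n_bins); row i of each array belongs
    to the same plasma sample (assumed pairing).
    """
    a = total_sum_normalize(without_drug)
    b = total_sum_normalize(with_drug)
    dist = np.linalg.norm(b - a, axis=1)  # one distance per pair
    a_centered = a - a.mean(axis=0)
    d_centered = dist - dist.mean()
    denom = np.sqrt((a_centered**2).sum(axis=0) * (d_centered**2).sum())
    r = (a_centered * d_centered[:, None]).sum(axis=0) / denom
    return dist, r


# Synthetic stand-in data: 60 pairs, 50 bins (values are arbitrary).
rng = np.random.default_rng(1)
without_drug_bins = rng.uniform(1.0, 2.0, size=(60, 50))
with_drug_bins = without_drug_bins + rng.normal(0.0, 0.05, size=(60, 50))

dist, r = distance_bin_correlation(without_drug_bins, with_drug_bins)
heatmap_values = np.abs(r)  # absolute values, as used for the color scale
```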
Higher correlation coefficients (red) indicate that the metabolite contributes more to the classification in the multivariate analysis, which helps to visually identify significantly correlated peaks. The SENNet-extracted small molecule spectra in Fig. 4 were processed in the same way as the CPMG spectra. This correlation analysis highlights the advantages of using SENNet-extracted small molecule spectra in investigating drug-plasma interactions, especially when compared to CPMG-presat spectra containing undecayed macromolecular signals. The results shown in Fig. 4 provide important insights into the identification of potential biomarkers and also highlight the utility of SENNet in small molecule biomarker discovery.

Fig. 4: Pearson correlation plot between group Euclidean distances in high-dimensional space and the bin integrals of 60 samples. The plot shows the Pearson correlation coefficients from two different datasets (i.e., CPMG and small), with the absolute values of the coefficients corresponding to the colors of the variables. A higher correlation coefficient indicates a greater contribution of the metabolite to the group classification in the multivariate analysis. The CPMG-presat spectra contain undecayed macromolecular signals as indicated by the arrows.

In metabolomics research, some studies on plasma use EDTA as an anticoagulant12. To further demonstrate SENNet’s ability to process NMR spectra of plasma samples containing EDTA, we analyzed the spectrum of a plasma-EDTA sample collected on a 700 MHz NMR spectrometer (700-plasma-EDTA data). Supplementary Fig. 7 shows that SENNet can also separate the small and macromolecule signals in NOESY-presat spectra acquired at 700 MHz (Supplementary Note 6).

Based on the above findings, SENNet effectively processes 1H NOESY-presat spectra of plasma samples acquired on 600 MHz and 700 MHz NMR spectrometers, accurately extracting signals from both small and macromolecules.

Application of SENNet to serum samples

To demonstrate the ability of SENNet to process serum samples, we selected a dataset collected on a 600 MHz NMR spectrometer (600-serum dataset). This dataset consisted of samples from 106 severely obese patients collected at multiple time points before and after (3 months, 6 months, 9 months, and 12 months) gastric bypass surgery9. As shown in Fig. 5, by applying the SENNet model, we successfully separated small and macromolecular signals in the serum NOESY-presat spectra, as with the plasma NMR spectra. The obtained small and macromolecule profiles showed similar signal intensities and distributions compared to the CPMG-presat and LEDBP profiles. Similarly, incompletely attenuated lipid signals were observed in the CPMG-presat spectra. In contrast, the extracted macromolecular signals contained all albumin signals that were completely attenuated in the LEDBP spectra.

Fig. 5: A 1H NOESY-presat spectrum of a serum sample acquired on a 600 MHz NMR spectrometer was processed using the trained SENNet model. The model effectively separated peaks with larger linewidth at half-height, as shown by the orange dashed line (macro), allowing for the extraction of a small molecule spectrum similar to that obtained from the CPMG-presat experiment (CPMG). A illustrates the range from 5.4 ppm to 0.7 ppm, while B shows SENNet’s effective isolation of 1H signals from aromatic rings and albumin in low signal-to-noise regions (8.6–5.65 ppm), magnified 20 times for clarity.
The blue solid line represents NOESY-presat spectral data (NOESY), the orange dashed line represents macromolecule signals separated by SENNet (macro), the green solid line represents cpmgpr1d spectral data (CPMG), the red solid line represents small molecule signals separated by SENNet (small), and the purple solid line represents 1D diffusion-edited spectral data (LEDBP).

To further validate the quantitative capability of SENNet, we selected 10 peaks unaffected by protein signals. We normalized the intensities of these peaks in the CPMG-presat spectra and the SENNet-extracted small molecule spectra, using the lactate peak at 4.135 ppm as a reference (Peak 0), and calculated the Pearson correlation coefficients and regression coefficients (slopes) between the two. The results are as follows: Peak I (δ 3.914): 0.992, 1.07; Peak II (δ 3.859): 0.994, 0.99; Peak III (δ 3.757): 0.998, 0.95; Peak IV (δ 3.271): 0.994, 1.03; Peak V (δ 1.502): 0.981, 0.89; Peak VI (δ 8.481): 0.984, 1.06; Peak VII (δ 7.452): 0.984, 0.98; Peak VIII (δ 7.218): 0.983, 0.98; Peak IX (δ 7.008): 0.953, 0.91; and Peak X (δ 6.922): 0.992, 1.09. Scatter plots of these correlation analyses are shown in Supplementary Fig. 8 (Supplementary Note 7). The Pearson correlation coefficients of peak intensities between the small molecule spectra and the CPMG-presat spectra are close to 1.0, indicating a strong association between them.
In addition, the regression slope of the peak intensities was approximately 1.0 when the small molecule signal extracted by SENNet was placed on the x-axis. This can be attributed to the 76.8 ms T2 relaxation effect of the signals in the CPMG-presat experiment and the 10 ms mixing time in the NOESY-presat experiment (noesygppr1d), and indicates that these peaks have relaxation behavior similar to lactate in this CPMG-presat experiment.

The SENNet model was then applied to extract small and macromolecular spectra from all the 1H NOESY-presat spectra. The extracted spectra (small and macro) and the spectra obtained from the NMR experiments (CPMG and LEDBP) were each subjected to PCA to demonstrate the efficiency of the SENNet method. Each sub-dataset contained samples taken before and after surgery (3 months, 6 months, 9 months, and 12 months) with sample sizes of 105, 97, 98, 92, and 71, respectively, for a total of 463 samples. PCA of the CPMG-presat spectra showed a cumulative explained variance of 81.66%. In contrast, the cumulative explained variance of the first three PCs of the small molecule spectra extracted by SENNet was 88.30% (Fig. 6A). In addition, PCA of the macromolecular signals extracted by SENNet and of the LEDBP spectra showed that the cumulative explained variance of the first three PCs was 92.85% and 93.46%, respectively (Supplementary Fig. 9). The score plots also show a clear grouping pattern and provide detailed insight into the five time points (Fig. 6 and Supplementary Fig. 9).

Fig. 6: Comparison of the PCA score plots generated from CPMG-presat experimental spectra and processed small molecule spectra for 463 serum samples. The datasets consist of samples taken before and after surgery (3 months, 6 months, 9 months, and 12 months), with sample sizes of 105, 97, 98, 92, and 71, totaling 463 samples.
The PCA score plots demonstrate similar distribution patterns, capturing inter-group and intra-group separations. A shows the PCA score plot (PC1 vs PC2) based on CPMG-presat spectra of blood serum samples. Paired PCA was conducted on the samples of 106 severely obese patients using the small molecule spectral dataset processed by SENNet (B). The plots exhibit consistent distribution patterns, with the horizontal line marker, left triangle marker, vertical line marker, cross marker, and plus marker representing samples collected before surgery and at 3 months, 6 months, 9 months, and 12 months after surgery, respectively.

Figure 6 and Supplementary Fig. 9 compare the PCA score plots of the spectra obtained by the SENNet model and the NMR experiments (CPMG and LEDBP). For both small and macromolecular information, both methods (NMR experiment and SENNet) reveal similar sample distribution patterns and effectively capture both inter- and intra-group separations, demonstrating the reliability of the SENNet model in capturing spectral information from the NOESY-presat signal. Similar to the case of plasma samples, the minor differences between the two methods arise mainly because the CPMG-presat spectra contain signals from macromolecules and the LEDBP spectra lose signals from albumin, whereas SENNet is able to completely separate all signals from both small and macromolecules. Notably, while the CPMG-presat and LEDBP experiments each took approximately 8 min per sample, SENNet generated signals from both small and macromolecules in less than 1.0 s on a personal computer.

To further demonstrate the generalizability of SENNet, we processed a serum NMR spectrum collected on a 700 MHz NMR spectrometer (700-serum data). Supplementary Fig. 10 shows that SENNet can also separate the small and macromolecular signals in 700 MHz NOESY-presat spectra (Supplementary Note 7).
In summary, SENNet effectively distinguishes signals from both small and macromolecules in NOESY-presat spectra obtained from serum samples at both 600 MHz and 700 MHz.
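The PCA comparisons reported in both application subsections rest on the cumulative explained variance of the first three PCs. A minimal sketch of that quantity is given below, assuming mean-centering only (the text does not state which scaling was applied) and using a hypothetical helper `pca_scores` with synthetic data in place of binned spectra.

```python
import numpy as np


def pca_scores(data, n_components=3):
    """PCA via SVD of the mean-centered matrix (rows are samples).

    Returns the score matrix and the cumulative explained variance ratio
    of the first `n_components` components.
    """
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components].sum()


# Synthetic stand-in for binned spectra: 120 samples, 200 bins,
# driven by three latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(120, 3))
loadings = rng.normal(size=(3, 200))
data = latent @ loadings + 0.05 * rng.normal(size=(120, 200))

scores, cumulative = pca_scores(data)
print(f"cumulative explained variance (PC1-PC3): {100 * cumulative:.2f}%")
```

Applying the same function to the experimental and the SENNet-derived matrices, and plotting the first two columns of `scores`, would give score plots and percentages comparable in kind to those in Figs. 3 and 6.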
