COVID-19 detection from exhaled breath

We propose a detection system that leverages mass spectrometry and AI to rapidly assess exhaled breath samples from patients and identify the presence of COVID-19. Recent studies have shown that COVID-19 patients exhibit distinct VOC profiles in their breath7,18,19. VOCs (volatile organic compounds) are a significant group of chemicals that evaporate easily at room temperature. They are present in various products, including paints, cleaning agents, and building materials. Exhaled breath contains several VOCs in addition to nitrogen, oxygen, carbon dioxide, and water vapor. Recent studies have identified specific VOCs as biomarkers for several respiratory diseases, including lung cancer, cystic fibrosis, asthma, chronic obstructive pulmonary disease (COPD), and COVID-19. Furthermore, variations in VOC profiles can help distinguish between smokers and non-smokers. For example, elevated levels of specific VOCs in exhaled breath have been correlated with lung cancer diagnosis, suggesting their potential utility in early detection20. In cystic fibrosis, VOCs may indicate disease severity and exacerbations, providing a non-invasive monitoring tool20. However, VOC analysis requires specialized techniques and hardware for detecting and selecting specific VOCs7,9,10,16,21,22. Electronic nose technology and other analytical methods have demonstrated high sensitivity and specificity in detecting these compounds, making VOC analysis a promising tool for rapid COVID-19 diagnosis7,9,10,21.

The proposed approach completely eliminates the need for prior identification of specific VOCs, focusing instead on the direct analysis of the breath fingerprint through its mass spectrum. This method is straightforward, easy to implement, and aims to establish a correlation between a specific breath fingerprint and the presence of COVID-19 without explicitly defining or detecting individual VOCs.
Breath samples can be conveniently stored in specialized containers, simplifying collection procedures that can be performed by non-specialized personnel in various locations. Our system utilizes a proprietary nano-sampling device coupled with a high-precision mass spectrometer capable of performing efficient mass spectrum analysis within the 10–351 m/z range23; this analysis usually requires a few seconds, and never more than a few minutes. The raw data are processed by in-house developed Python tools: first the spectra are aligned to the baseline, then filtered to reduce measurement noise, and finally a data augmentation step enhances robustness and diversity. We employed standard ML classifiers from a state-of-the-art data analysis library24 to detect the presence of COVID-19. Notably, this system operates without the need for reagents and generates no hazardous waste, making it both efficient and environmentally friendly.

Breath samples collection

Figure 1 Schematic of the mass spectra analyzer.

Figure 2 Diagram of the sampling and processing procedure.

For each patient under test, ambient air is first sampled to verify environmental parameters and ensure the stability of the instrument. Then, the subject's breath is collected into a Tedlar sampling bag with a defined volume of 3 L, by having the subject blow through a straw directly into the bag until it is filled with approximately 3 L of air. This large volume is necessary to establish the stable sampling pressure required by the adopted technology23,25,26 and also implicitly averages the different phases of the expiratory flow. The filled bag is then connected to the inlet valve of the MS apparatus.
The inlet valve has two possible settings: the first allows the sample mixture, at atmospheric pressure, to flow from the bag to the ionization chamber directly through an original Micro Electro-Mechanical System (MEMS) interface27; the second connects the MEMS interface to a membrane pump in order to clean the inlet line, bringing it to vacuum conditions (\(\simeq 10^{-3}\,\hbox {mbar}\)). A schematic of the sampling system is provided in Fig. 1, and an illustrative diagram of the sample collection process is shown in Fig. 2.

Mass spectra are recorded via the Varian 1200L mass analyzer software, which allows the setting of acquisition parameters such as mass range, acquisition time, and electron multiplier (EM) voltage. The latter parameter ultimately sets the detector amplification factor. We recorded mass spectra in the following ranges:

10–51 m/z, with an acquisition time of 10 s and EM voltage of 1000 V;

49–151 m/z, with an acquisition time of 14 s and EM voltage of 1800 V;

149–251 m/z, with an acquisition time of 14 s and EM voltage of 1800 V;

249–351 m/z, with an acquisition time of 14 s and EM voltage of 1800 V.

To avoid signal saturation, the amplification in the first mass range was reduced due to the presence of the most abundant breath components, namely the main peaks of \(\text {CO}_{2}\) (44 m/z), \(\text {N}_2\) (28 m/z), and \(\text {O}_{2}\) (32 m/z). For each breath sample, 10–20 acquisitions were taken at fixed time intervals, allowing for the collection of multiple data points from each patient. Acquiring multiple samples per patient enhances the robustness of the mass spectrum analysis by averaging out potential variations and noise in individual measurements. Finally, the raw spectra are filtered and analyzed using our proposed method. The successive analysis of these multiple acquisitions improves the reliability and accuracy of the breath fingerprint profile, leading to more consistent and representative results. This methodology ensures that the breath fingerprints are not anomalies but reflect the patient's actual metabolic state, thereby increasing the diagnostic precision of the breath analysis.

The acquisition of each mass range for a subject under test takes less than two minutes (approximately one and a half minutes), so acquiring all 4 ranges and thus obtaining the complete spectrum requires about six minutes. Although the approach is not real-time, it is still significantly faster than traditional methods.

By summing all the intensities for each m/z in each acquisition, we obtain the Total Ion Current (TIC) curve. Figure 3 shows the TIC behavior when the breath sample flows into the MS system: the initial increase is due to the abrupt pressure change at the valve opening and, after a few tens of seconds, the TIC curve reaches a plateau region27, when the flux stabilizes. These procedures allowed for storing a dataset composed of the acquisitions of the spectra for each patient.

Figure 3 TIC of a recording from one sample.
The recording is made up of about 10 acquisitions (green dots), each corresponding to a mass spectrum. The spectra used for the analysis are selected on the plateau of the TIC (red dotted region).

Pre-processing

Once the raw measures have been obtained, the data are cleaned through a pre-processing procedure that reduces noise and machine variation in the acquisitions. For each acquisition, the recorded m/z positions may be shifted by a specific offset, due to measurement noise, when the machine records the quantity of ionized molecules. A peak-alignment procedure is thus necessary: it reduces the machine noise and compacts the information. The procedure moves each peak to the nearest integer m/z position, using these positions as anchors. The curves between two nearby peaks are stretched or compressed to preserve their original shape, preventing information loss; this stretching and compression is done by linear interpolation, fitting the corresponding segments in the reference. A plot of the result after peak alignment is shown in Fig. 4.

Figure 4 Aligned and non-aligned peaks of the mass spectrum of a single patient.

As previously mentioned, multiple acquisitions are taken for each patient to ensure the accuracy of the breath analysis. To mitigate potential noise in the measurements and enhance the stability of breath fingerprint recognition, these multiple acquisitions are aggregated into a single robust mass spectrum. This is achieved by focusing on the plateau zone of the TIC curve, i.e., the region where the signal stabilizes, indicating that the breath sample flux has reached a steady state. The plateau zone is composed of the most stable acquisitions, which are indicative of a consistent breath sample. To accurately combine these acquisitions, the first step is to identify the plateau in the TIC curve.
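The peak-alignment step can be sketched as follows. This is a minimal illustration on a synthetic spectrum, not the authors' implementation: the use of `scipy.signal.find_peaks` and the fixed anchors at the range endpoints are our assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def align_to_integer_peaks(mz, intensity):
    """Snap each detected peak to the nearest integer m/z and
    stretch/compress the segments between consecutive peaks by
    linear interpolation, preserving the curve shape."""
    peaks, _ = find_peaks(intensity)
    if len(peaks) == 0:
        return intensity
    # Anchors: range endpoints stay fixed, peaks move to integers.
    src = np.concatenate(([mz[0]], mz[peaks], [mz[-1]]))
    dst = np.concatenate(([mz[0]], np.round(mz[peaks]), [mz[-1]]))
    # Warp the m/z axis through the anchors, then resample the
    # intensities back onto the original uniform grid.
    warped_mz = np.interp(mz, src, dst)
    return np.interp(mz, warped_mz, intensity)

# Example: a peak recorded at 28.34 m/z is moved to 28 m/z.
mz = np.arange(10.0, 52.0, 0.1)
spectrum = np.exp(-((mz - 28.34) ** 2) / (2 * 0.25))
aligned = align_to_integer_peaks(mz, spectrum)
```

Piecewise-linear warping keeps the segments between anchors monotone, so no intensity information is discarded, only redistributed along the m/z axis.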
A plateau-searching procedure is implemented, which detects acquisitions that show minimal variation from one another. This is done by computing the gradient of the signal: acquisitions within the plateau zone are those where the gradient is minimal, indicating a stable signal. Once the plateau zone is identified, the acquisitions within this region are averaged to produce a single, robust mass spectrum. This averaging reduces the influence of outlier data points and transient fluctuations, resulting in a more reliable representation of the patient's breath fingerprint, and ensures that the VOC profile obtained is both stable and reflective of the patient's true metabolic state. The plateau-searching procedure was implemented as follows:

For each acquisition, we computed the gradient of the signals.

The plateau is defined as a zone that is nearly flat, ideally where the gradient is zero or where the gradient does not vary significantly from zero. To identify this flat zone, we compute a tolerance guard-band, denoted as \(\epsilon\), which allows us to classify a region as “flat” if the absolute value of the gradient remains below \(\epsilon\). The value of \(\epsilon\) is determined based on the \(q\)-th quantile of the gradient distribution, where \(q\) is a parameter within the range \([0, 1]\). This parameter \(q\) controls the stringency of the requirement for a constant slope in the plateau region; a lower \(q\) value indicates a stricter requirement, leading to a narrower definition of the plateau, while a higher \(q\) value allows for more variation in the gradient, resulting in a broader plateau definition.

The TIC curve may present more than one plateau: the first one lies in the region in which the breath sample has not yet flowed into the MS machine, and can be composed of 1–3 acquisitions. Thus, to avoid potential errors, we considered the plateau of maximum length, that is, the one in the region where the ion flow stabilizes.

Once the plateau of maximum length is found (which varies from 3 to 5 acquisitions for each patient), we computed the standard deviations of acquisitions in this region by deploying a rolling window of size 4. We then chose the 4 acquisitions that minimized the standard deviation, and we computed the mean among these, obtaining a single spectrum for each patient.
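The steps above can be sketched as follows with synthetic data. The array shapes, the default quantile parameter `q`, and the run-finding details are our assumptions where the text leaves them open.

```python
import numpy as np

def select_plateau_spectrum(spectra, q=0.5, window=4):
    """Average the most stable acquisitions on the TIC plateau.

    spectra: array of shape (n_acquisitions, n_mz_bins).
    """
    tic = spectra.sum(axis=1)                 # Total Ion Current
    grad = np.abs(np.gradient(tic))
    eps = np.quantile(grad, q)                # tolerance guard-band
    flat = grad <= eps
    # Longest run of consecutive "flat" acquisitions.
    best_start, best_len, start = 0, 0, None
    for i, f in enumerate(np.append(flat, False)):
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    # Rolling window of size `window` inside the plateau: keep the
    # window with minimum TIC standard deviation, then average it.
    tic_p = tic[best_start:best_start + best_len]
    if len(tic_p) < window:
        idx = np.arange(best_start, best_start + best_len)
    else:
        stds = [tic_p[i:i + window].std()
                for i in range(len(tic_p) - window + 1)]
        i0 = best_start + int(np.argmin(stds))
        idx = np.arange(i0, i0 + window)
    return spectra[idx].mean(axis=0)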

Computing the average of the 4 acquisitions with minimum standard deviation yields a single robust spectrum for each patient.

An alternative approach, used to artificially increase the dataset, is to not average the selected 4 acquisitions but instead insert all of them into the dataset. This increases both the number of training samples (by a factor of 4) and the variability in the data, which can lead to more accurate models. During the testing phase, instead, we average the acquisitions to obtain a single spectrum for each tested patient.

Some samples may present high noise in the mass spectrum, which can adversely affect the analysis. To address this issue, we identified outlier samples as those with a \(z\)-score greater than 8 for at least one feature. Additionally, for some patients it was not possible to identify a plateau in the TIC curve, leading to their exclusion from the dataset.

To overcome noise in the measurements and possible variations in the machine's settings, a signal filtering and smoothing procedure was applied to the remaining patient samples. The steps involved are as follows:

1. Normalization: Each spectrum was normalized by dividing by its TIC value to obtain relative information about the breath composition. This step scaled each intensity by the sum of all intensities, bringing the features within the range \((0,1)\).

2. Initial High-Pass Filtering: A high-pass filter was applied, treating as zero any intensity below 0.0001, which was considered noise. This filtered out low-intensity signals that might contribute to noise.

3. Savitzky–Golay Smoothing and Differentiation: The Savitzky–Golay smoothing and differentiation filter28,29 was used to reduce high-frequency noise and align the signals to the baseline. This filter is effective in spectral analysis as it smooths the data while preserving important spectral features.

4. Secondary High-Pass Filtering: After smoothing, the high-pass filter was reapplied, treating as zero any intensity below 0.001. This step removed any artifacts that may have been introduced during the smoothing process.
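The four steps can be sketched as follows. The Savitzky–Golay window length and polynomial order shown here are illustrative assumptions, not the authors' settings; only the two thresholds come from the text.

```python
import numpy as np
from scipy.signal import savgol_filter

def filter_spectrum(intensity, low_cut=1e-4, high_cut=1e-3,
                    window=11, polyorder=3):
    """Sketch of the four-step filtering and smoothing procedure."""
    # 1. TIC normalization: scale by the sum of all intensities.
    s = intensity / intensity.sum()
    # 2. First high-pass: zero out intensities below 1e-4.
    s[s < low_cut] = 0.0
    # 3. Savitzky-Golay smoothing to suppress high-frequency noise.
    s = savgol_filter(s, window_length=window, polyorder=polyorder)
    # 4. Second high-pass: remove smoothing artifacts below 1e-3.
    s[s < high_cut] = 0.0
    return s
```

Note that the second threshold also clips any small negative values the smoothing step may introduce.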

The filtering and pre-processing procedures were applied separately to each mass range. Once these steps were completed, the spectra obtained from the 4 mass ranges could be combined to produce a single, comprehensive spectrum spanning the range 10–351 m/z. If different acquisitions were previously retained for each mass range, merging them involved computing all possible combinations of acquisitions across the mass ranges. This approach effectively augmented the dataset by creating new combinations of the acquisition spectra for each patient; it can be likened to generating artificial patients, where each new patient varies in one of the 4 segments of the spectrum. An example of the resulting augmented dataset is shown in Table 1. Finally, the entire spectrum was normalized again by dividing it by the total sum of the intensities, ensuring that only relative information was retained.

Table 1 An example of the dataset augmentation procedure.

Machine-learning models

The mass spectrum analysis was conducted using a comprehensive pipeline comprising several key steps: data normalization, feature selection, dimensionality reduction, and classification. Each stage in this pipeline contributes to the development of a complete ML model. The results of these models are presented in the following sections. Initially, a Variance Threshold filter was applied to the dataset. This filter removes features with zero variance, thereby eliminating \(\text {m/z}\) values for which no intensities were measured post-filtering. Subsequently, each feature was individually normalized using one of two methods: the Standard Scaler (SS) or the Robust Scaler (RS). The Standard Scaler normalizes features by subtracting the mean and scaling according to the variance.
In contrast, the Robust Scaler subtracts the median and scales based on the interquartile range (the range between the first and third quartiles), which mitigates the impact of outliers.

Further feature reduction was performed to retain only the most informative features. Some of the experiments involved a supervised feature selection method, SURF*, a Relief-based algorithm30, used to select 100 \(\text {m/z}\) features. To further reduce dimensionality, Principal Component Analysis (PCA)31,32 was applied to linearly combine the selected features, retaining 20 principal components.

Finally, a range of ML classification models was trained to distinguish between COVID-19 positive and negative patients. Various ML techniques have been explored in the context of COVID-19 detection9,10,15,33,34. We utilized a combination of state-of-the-art models from24, including K-Nearest Neighbors (KNN), Random Forest (RF), Logistic Regression (LR), Gradient Boosting (xGB), and Support Vector Machine (SVM) with an RBF kernel. Additionally, we implemented an ensemble model that integrates all the aforementioned classifiers using a soft-voting approach (Ens). In soft voting, the predictions of the individual classifiers are combined by considering the probabilities they assign to each class; the final prediction is the class with the highest cumulative probability across all classifiers in the ensemble.
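The cross-range merging step described earlier (one retained acquisition chosen per mass range, with all combinations kept) can be sketched as a cartesian product; variable names and shapes here are illustrative.

```python
from itertools import product
import numpy as np

def augment_patient(range_acquisitions):
    """Build every full spectrum obtainable by picking one filtered
    acquisition per mass range, concatenating the 4 segments, and
    re-normalizing by the total intensity."""
    combined = []
    for parts in product(*range_acquisitions):
        spectrum = np.concatenate(parts)
        combined.append(spectrum / spectrum.sum())
    return np.stack(combined)

# 4 mass ranges, 2 retained acquisitions each -> 2**4 = 16 spectra.
rng = np.random.default_rng(0)
ranges = [rng.random((2, 5)) + 0.1 for _ in range(4)]
augmented = augment_patient(ranges)
```

Each combined spectrum is an "artificial patient" differing from the others in at least one of the 4 segments.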
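A scikit-learn sketch of this classification pipeline is shown below. The SURF* step (available in the separate `skrebate` package) is omitted here, hyperparameters are library defaults, and `GradientBoostingClassifier` stands in for the gradient-boosting model; none of these choices are confirmed by the text.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import (RandomForestClassifier, VotingClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Soft voting averages the class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("svm", SVC(kernel="rbf", probability=True)),  # proba needed for soft voting
    ],
    voting="soft",
)

pipeline = Pipeline([
    ("variance", VarianceThreshold()),   # drop zero-variance m/z features
    ("scale", StandardScaler()),         # or RobustScaler()
    ("pca", PCA(n_components=20)),       # 20 principal components
    ("clf", ensemble),
])

# Smoke test on synthetic data (40 samples, 100 m/z features).
rng = np.random.default_rng(0)
X, y = rng.random((40, 100)), np.arange(40) % 2
pred = pipeline.fit(X, y).predict(X)
```

Swapping `StandardScaler` for `RobustScaler` reproduces the SS/RS alternative described above without changing the rest of the pipeline.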
