Application of untargeted liquid chromatography-mass spectrometry to routine analysis of food using three-dimensional bucketing and machine learning

LC-HRMS analysisThe honey samples were analyzed using three LC-HRMS devices comprising Thermo Scientific™ UltiMate™ 3000 systems coupled to Thermo Scientific™ Q Exactive™ Hybrid Quadrupole-Orbitrap™ High-Resolution Mass Spectrometers. Each sample was measured with two different Liquid Chromatography (LC) approaches. Due to preliminary tests, hydrophilic interaction liquid chromatography (HILIC, Accucore-150-Amide-HILIC 150 × 2,1) in negative ion mode was applied to analyze polar compounds while non-polar compounds were analyzed by reverse phase (RP, Hypersil Gold C18* 150 × 2,1) chromatography in positive ion mode. This setup generated two different types of data for each sample, which will be referred to as HILIC and RP data. For both, the mobile phases water and acetonitrile with acetic acid as modifier were used. For the normalization strategy ISTD_add (see following subsection), a third mobile phase with 2% sorbic acid in positive ion mode (RP method) and 10% in negative ion mode (HILIC method) in an acetonitrile–water-mixture (50/50, v/v) with acetic acid as additive was isocratically injected. The different sorbic acid concentrations were chosen due to missing signals in the spectra obtained in negative ion mode.The ionization of the chromatographically separated compounds was performed using electrospray ionization (ESI) and the data was acquired in profile mode. The mass spectrometric analysis was conducted using the variable data-independent acquisition (vDIA) method developed by Thermo Scientific™ Orbitrap™. vDIA is a data acquisition method that utilizes MS/MS precursor isolation windows of differing mass ranges, covering the entire mass range of the preceding full scan33. For the data analyzed in this study, the full scan comprised 100–1500 Da (MS1). This range was covered by six isolation windows, in which ions of the masses 150, 250, 350, 450, 750 and 1250 Da were fragmented respectively (MS2). The respective fragments were detected in the ranges 50–225, 50–330, 50–430, 50–535, 69–1045 and 104–1555 Da4.Data processing with BOULSThe developed BOULS approach is schematically illustrated in Fig. 1 and published at https://github.com/AGSeifert/BOULS (requires Linux OS). It is based on the established xcms workflow in R and uses the same functions for data import and chromatographic peak detection34.Figure 1(a) Overview of the novel data processing approach for the analysis of untargeted LC-HRMS data in routine analysis: A new step called BOULS replaces the correspondence step in previous workflows. To introduce this step, an example LC-HRMS spectrum after retention time alignment is shown in (b). Subsequently, the same spectrum after the application of BOULS summing up the signals in the respective buckets is depicted in (c).For data import (step 1 in Fig. 1), the thermo-specific raw files are first converted to open-format mzML files using MSConvert, which is part of the ProteoWizard software package (Version: 3.0.21078-7da1f1136 (developer build)). Here, the filter peakPicking is used to convert the profile mode data into centroided data35. Subsequently the data is imported into R using the Bioconductor package mzR35,36,37,38,39 (version 2.26.0) storing information about the samples and analyses, such as name, geographical origin, instrument and method. The package MSnbase40,41 (version 2.18.0) is then used to load and store the data in an object that is compatible with the xcms package21,22,23 (version 3.14.0). The chromatographic peak detection (step 2 in Fig. 1) is performed by the centWave algorithm23 using the parameters peakwidth of 15 s and ppm of 5 Da, which is individual for each experimental set up and which was set according to the extracted ion chromatogram of the internal standards. The retention time alignment (step 3 in Fig. 1) is performed by using the obiwarp method42, whereby the parameters binSize and localAlignment are set to 0.1 and TRUE, respectively. The spectrum with the most peaks was used as center spectrum.Subsequently, the novel BOULS step (step 4 in Fig. 1) is conducted and the data is divided into 3 dimensional buckets summing up the total intensity of the signals (see Fig. 1b and c). The parameters of this step, namely the bucket size in the retention time (RT) and m/z dimensions, are optimized in a DOE, which is described and evaluated below.Various approaches have been applied for normalization (step 5 in Fig. 1). For the first normalization strategy, the internal standard in the third mobile phase (ISTD_add) sorbic acid was used and the total intensity of an RT-bucket was divided by the total intensity of all sorbic acid signal intensities. For total ion current (TIC) and base peak chromatogram (BPC) normalization, the summarized intensities of each bucket were divided by the sum of intensities of all signals and the most intensive signal, respectively. For TIC_RT and BPC_RT, the individual RT-buckets were considered separately and the intensity of a bucket was divided by either the sum (TIC_RT) or the maximum (BPC_RT) of the intensities of the respective RT‑bucket. Also, the different normalization approaches were evaluated in a DOE.For classification (step 6 in Fig. 1), RF was applied using the R package ranger43 (version 0.14.1, CRAN) with the parameters ntree = 5000 and the respective default parameters for mtry and min.node.size (square root of the total number of variables and 1, respectively). To compensate for class imbalance, the parameter case.weights was chosen according to the size of the respective classes43.DataFor the development, validation and application of the BOULS approach, four data sets compiled from samples of customers of a routine laboratory were used. As the successful analysis and comparison over a long period of time are crucial, different time periods were chosen for obtaining the training data and between obtaining the training and test data. For the analysis of each sample set, both, HILIC and RP, were used and the samples were measured in approximately equal proportions using one of the three instruments.Data set 1To optimize the BOULS approach parameters by a DOE (see section DOE1: Initial optimization), a data set obtained from 123 honey samples (246 spectra) from Argentina (18), Brazil (19), Canada (18), China (8), Ukraine (40) and the USA (20), was analyzed. The measurements were performed over a period of six weeks and the performance was evaluated by the OOB error of the obtained random forests.Data set 2To test the robustness of the developed approach over time, training and test data measured with a time offset of seven months were analyzed. For training, data set 1 was used. However, since they were not included in the test data, the data from the Chinese samples were omitted. The test set consisted of 27 spectra from 5, 5, 4, 6 and 7 samples from Argentina, Brazil, Canada, Ukraine and the USA, respectively.Data set 3The third data set was used for the initial implementation of the developed approach in routine analysis. For training, data of 565 samples from Argentina (93), Brazil (65), Canada (9), India (69), Ukraine (250) and the USA (79) were used. The test set consisted of 126 spectra from 8, 11, 65, 1, 13 and 28 samples from Argentina, Brazil, India, Canada, Ukraine, and the USA, respectively.Data set 4The fourth data set was used to analyze, whether the prediction accuracy improves by extending the model in long-term application. The extended training data set contained data from 835 samples from Argentina (170), Brazil (133), Canada (16), India (126), Ukraine (264), and the USA (126). The resulting model was used to classify the same test data as in the previous section to compare the performance of the two models directly.Design of experiment (DOE) for the optimization of the BOULS parametersIn order to optimize the parameters of the BOULS approach, two full factorial DOEs described in the following two subsections were conducted. In each case, the expand.grid function of the R base package44 (version 4.2.2) was used to generate a data frame with all possible combinations of the factor levels. After the execution of the workflow, Analysis of Variance (ANOVA) (aov function of the base stats package) in combination with Tukey’s honest significant difference (TukeyHSD function in base R package) was applied to compare the different factors and calculate p-values.DOE1: Initial optimizationTable 1 shows the applied factors together with their respective levels of the first DOE. Bucket sizes for the RT and m/z dimension, as well as the different normalization approaches introduced in the previous section were evaluated.Table 1 Factors and levels of DOE 1 for the initial optimization of applied parameters of the BOULS approach.DOE2: Validation and optimization for long-term applicationIn order to establish the BOULS approach for long-time application, a second DOE was applied with the factors and levels shown in Table 2. In addition to the already applied factors in DOE 1, which were validated for long-term application, also the addition of fragment (MS2) data and the combination of HILIC and RP data was evaluated. For the former, the fragment RP data was processed with the BOULS step with either small buckets (5 Da and 5 s) or large buckets (20 Da and 80 s) and combined with the full RP data or not. Since the individual isolation windows vary in size (see section LC-HRMS analysis), the exact size of the buckets was adjusted to achieve divisibility. For the latter, the HILIC data was either included with or without the fragment data that was processed as just described or not.Table 2 Factors and levels of the second DOE for validation and optimization for long-term application of the BOULS approach.

Hot Topics

Related Articles