Disease progression associated cytokines in COVID-19 patients with deteriorating and recovering health conditions

Data summary and preprocessingCytokine expression profiles and EHR samples of 145 healthy (i.e., non-infectious) and 444 COVID-19 patients were collected. For the COVID-19 infectious group, the omics and clinical data were sampled in a longitudinal manner, with each contributing data from 1 to 7 different time points. The COVID-19 dataset is comprised of patients infected with various SARS-CoV-2 strains, predominantly Delta (AY.69, 47.9%) and Beta (B.1.497, 45.5%) variants. Strain information was available for 396 out of 444 patients. They were collected from three institutions located in South Korea: Chungnam National University Hospital, Seoul Medical Center, and Samsung Medical Center. The dataset includes cytokine exampression samples from plasma and laboratory data (e.g., neutrophil and lymphocyte counts) from blood samples, alongside the patient matched clinical data. Cytokine profiles were measured by the Korea National Institute of Health (KNIH) using the Luminex MAGPIX system with a customized panel, following a standardized protocol. The EHR data, detailing clinical therapy and patient information, was recorded at various stages of their hospital stay, with durations ranging from 5 to 25 days. Blood samples, providing cytokine data, were collected at different intervals during and post-hospitalization. Cytokine measurements were taken at regular intervals during each patient’s hospital stay and after discharge. On average, these measurements were taken every 5.7 days, with intervals ranging from 1 to 35 days. Such sampling frequency was chosen based on previous studies of cytokine dynamics in acute viral infections, which suggest that significant changes in cytokine levels typically occur over a period of days rather than weeks18. As a result, the dataset is comprised of a total of 1844 time point specific EHR samples, 1159 time points specific cytokine profile. The cytokine profile consists of 191 cytokines, including IL17, IL25, and CXCL11, associated with SARS-CoV-2 infection. Table 1 presents a summary of the demographics and clinical characteristics of the COVID-19 cohort collected between February 2020 and July 2022 in the Republic of Korea. The initial dataset included 191 cytokines, selected based on their known association with SARS-CoV-2 infection and relevance to immune response pathways. Given the presence of 13,203 missing data points, a stringent criterion was applied: only cytokines with less than 15% missing values were included for further analysis. Consequently, 166 cytokines met this criterion. Missing values were imputed using a Random Forest regressor via the MissForest package (version 1.5) in R (version 4.2.0)19, trained over ten iterations to ensure the dataset’s completeness and reliability.Table 1 Summarized statics of the COVID-19 cohort samples.Constructing pathological progression groupsTo reflect the pathological progression of a patient, we defined the concept PPG, which encapsulates the dynamic shifts in a patient’s pathological progression status post SARS-CoV-2 infection. Under the natural immune response to the virus, it is typical for patients to experience an exacerbation during the disease’s initial phase, known as the pathogenic burden20. In this early stage, the innate immune system is activated, discerning and counteracting the pathogen. Subsequently, as strategies to combat the virus are solidified, the adaptive immune system takes precedence, heralding a transition towards restoring the immune system’s homeostasis and alleviating the pathogenic burden. PPGs are used to delineate a patients pathological progression so that biomarkers specific to recovery or deterioration can be detected. For simplicity, the PPG was categorized into two groups: the DP and the RP. The conceptual depiction of a standard pathological progression is demonstrated in Fig. 1, which assumes that a patient gets ill but recovers afterwards.Fig. 1The conceptual illustration of the DP and RP PPGs. The transition of COVID-19 patients through different stages of the disease is shown. The region colored in red refers to the duration of a patient with deteriorating health conditions, whereas the blue region represents the recovery of the patient.The challenge lies in gauging the disease’s dynamism, as its severity can fluctuate. A prevalent metric employed for this purpose is the World Health Organization (WHO) ordinal score system21, which is based on clinical features with a specific emphasis on oxygen therapy-related information. However, a significant fraction of patients display static clinical patterns throughout their hospitalization, making the demarcation of PPG challenging in the absence of discernible differences (Fig. 1). Hence, we incorporated the Neutrophil-to-Lymphocyte Ratio (NLR) and lactate dehydrogenase (LDH) levels as markers, which are used for labeling PPGs22,23. These biological markers not only furnish insights into immune responses but also facilitate the classification of the disease’s progression, especially in cases presenting limited change clinical status. Laboratory markers were measured multiple times during the patients’ hospitalization, including at admission, during hospitalization, and at discharge. Given that these indicators are integral to routine laboratory data and are consistently procured during hospitalization, they serve as reliable tools in tracing the disease’s phases. For each patient, DP and RP intervals were manually labeled by observing the longitudinal NLR and LDH levels. An interval (i.e. in units of days) of a hospitalized patient was labeled as DP if the NLR and LDH levels elevated. Similarly, an interval was labeled as RP if the NLR and LDH levels decreased. This sample-wise classification allowed us to capture the dynamic changes in patient status more sensitively. Consequently, the same patient could be classified into both DP and RP if their condition exhibited both upward and downward trends during their hospitalization.The reason for such manual labeling was because the NLR and LDH levels did not always show consensus with a patient’s severity level as shown in Supplementary Fig. S1. Generally, an interval is labeled as RP if either NLR or LDH levels started to decrease. However, if one marker showed a significant increase while the other decreased, we maintained the DP classification to avoid premature categorization of recovery.However, we showed that the quality of our manually labeled PPGs were sufficient to capture DP and RP specific cytokines.The PPG was further split into severity level specific groups: moderate Deterioration Phase/Recovery Phase (mDP/mRP) and severe Deterioration Phase/Recovery Phase (sDP/sRP). The severe DP and RP groups correspond to samples with pronounced disease severity at any instance during hospitalization, typified by a WHO scale of 6 or above, which is a conventional demarcation of severe illness21. Similarly, the moderate DP and RP groups correspond to samples with a WHO scale less than 6. The WHO scale can have a range of one to ten depending on the type of pathological state or burden of a patient as described in Supplementary Table S1. As described, the PPGs were further split to by the WHO scale to distinguish moderate and severe status specific cytokines in terms of progression. As a result, cytokine samples that were collected within a DP or RP interval were labeled as the interval’s PPG. The statistics of PPGs are summarized in Table 2, including the number of samples, patients, gender distribution, age, WHO scale and related clinical features. After the PPG labeling process, we focused on 112 patients (81 moderate and 31 severe) for downstream analysis, as these patients exhibited clear DP or RP patterns. Since time-course samples were available per patient, the total number of moderate and severe samples were 276 and 128, respectively. This approach, focusing on patients with clearly identifiable DP or RP phases, allows us to capture the dynamic nature of disease progression more effectively and ensures that our analysis is based on samples with well-defined progression characteristics.Table 2 Patient statistics by PPG categories.Supervised clustering and PPG analysisThe objective of our study is to identify differentially expressed cytokines between the different PPGs. Instead of directly comparing the raw cytokine levels of the PPGs, it is advantageous to measure the feature importance of cytokines and compare them between the PPGs for enhanced interpretation of the result. SHAP values offer a consistent and fair metric for assessing the importance of features across various machine learning models24 and transcriptomic studies25. These values distribute the contribution of each cytokine to individual predictions, enhancing the interpretability of the model. Often, the computed SHAP values are used to cluster samples that exhibit similar feature importance profiles26,27. Hence, for robust results, they are computed based on a classification model that was built using a well curated dataset with clear difference between the target prediction groups.Accordingly, a RF model was constructed using the expression of 166 cytokines, targeting the WHO ordinal scale to distinguish between two distinct groups: Healthy (n = 25) and Severe (n = 35). The healthy category represents non-infected healthy samples, indicative of the zero pathological burden. The RF model for discriminating the healthy from severe COVID-19 patients constructed using 100 trees with Gini impurity as the splitting criterion. The minimum number of samples required to split an internal node was set to 2 and the minimum number of samples required to be at a leaf node to 1. These parameters were chosen to balance model complexity with performance and to mitigate overfitting. We focused on these two extreme categories to establish clear boundaries of the severity spectrum, allowing us to model severity as a continuum rather than discrete classes. The selected 25 healthy donors were to match the time period of the 35 severe patients, ensuring consistency and avoiding periodic differences among the training data. In contrast, the severe category encapsulates the peak severe status of COVID-19 infection. Severe cases were defined as those with a WHO ordinal scale score 6 or above, in line with established clinical criteria. This selection was made to ensure that the RF model could learn from well-defined extremes of disease severity, thereby improving the model’s accuracy and reliability. By using these clear-cut cases, the model could effectively compute SHAP values, which were then applied to the broader and more variable exploration cohort for computing the SHAP values, or feature importance, of cytokines, which are used for the supervised clustering of the exploration dataset that is composed of cytokines samples from 81 moderate and 31 severe COVID-19 infected patients. The selection of the two extremes are expected to capture the associated cytokines characteristics so that the cytokines SHAP values from the exploration dataset will be placed between them according to their severity level, and thus serving as a continuum of the disease progression. The schematic workflow of analysis is depicted in Fig. 2.Fig. 2The analysis workflow. First, a severity classifier using healthy and severe COVID-19 patients was built. Then, the model was used to compute the SHAP values of additional input data (i.e., non-training data). The computed SHAP values were then subject to clustering to identify PPGs associated with disease progression.For a given cytokine sample, the prediction output of the RF model is decomposable into a sum of SHAP values for all cytokines and a base value \(f(x_{0})\). Thus, \(f(x_{0})\) represents the model’s output in the absence of any cytokines, essentially the expected model output across the dataset and computed as follows:$$\begin{aligned} f(x) = f(x_{0}) + \sum _{i=1}^{M} \gamma _{i}. \end{aligned}$$
(1)
Here, f(x) and \(f(x_{0})\) refer to the model’s severity and to the base value, or the model’s average prediction over the dataset, respectively. \(\gamma _{i}\) denotes the SHAP value for cytokine i, reflecting the cytokine’s contribution to the prediction deviation from the base value. At last, M is the number of total cytokines in the data, which is 166 in our case. In essence, f(x) is the severity prediction of the model for a cytokine sample x, and each \(\gamma _{i}\) signifies the impact of the respective cytokine. It implies that a model’s prediction is the cumulative effect of each cytokine expression level, starting from the average model prediction as the reference point \(f(x_{0})\). This additive explanation model is essential for interpreting SHAP values and is fundamental to our analysis of feature importance. The SHAP value \(\gamma _{i}\) of a cytokine i is computed as follows:$$\begin{aligned} \gamma _{i}(v) = \sum _{S \subseteq M \setminus \{i\}} \frac{|S|! (M – |S| – 1)!}{M!} \left[ v(S \cup \{i\}) – v(S) \right] , \end{aligned}$$
(2)
where v refers to the coalition value function that assigns a prediction value to each coalition of cytokine, S refers to a subset of cytokines, excluding cytokine i and M refers to the total number of cytokines, which is 166 in our study. Collectively, \(\gamma _{i}(v)\) represents the average marginal contribution of cytokine i in the prediction model across all possible cytokine subsets S. The coalition value function v(S) for a subset S of cytokines is defined as the expected output of the model when only the cytokines in S are known:$$\begin{aligned} v(S) = \mathbb {E}[f(x)| x_S], \end{aligned}$$
(3)
where \(\mathbb {E}[f(x)|x_s]\) represents the expected value of the model’s prediction given that the cytokines in S are known. This involves averaging the model’s predictions over all possible values of the cytokines not in S, according to their distribution.To identify samples that showed significantly different SHAP values, the density based clustering algorithm DBSCAN was applied on the UMAP of the SHAP value samples. DBSCAN was selected for its capacity to identify high-density areas separated by regions of low density, thus distinguishing clusters with dissimilar SHAP profiles. From the DBSCAN result, clusters were evaluated for homogeneity in terms of PPG. Clusters with high homogeneity were selected for downstream analysis, which are expected to well capture the differences between the DP and RP groups. The optimization of the DBSCAN algorithm was contingent on the calibration of two parameters: the minimum number of samples (MinPts) required to establish a dense region, and the neighborhood size (\(\epsilon\)), which dictates the proximity required for points to be considered part of a cluster. Iterative refinement led to the determination that an \(\epsilon\) value of 0.8 and a MinPts threshold of 20 were optimal for our dataset, resulting in the formation of distinct and meaningful clusters. This process was facilitated by the DBSCAN implementation available in the sklearn Python package28. To ascertain clusters of analytical relevance, particularly those reflecting the nuances of the RP and DP, including their respective severity, we performed a purity assessment. To ensure the quality and robustness of the selected clusters, we performed a manual review process. Each cluster was carefully examined to assess the homogeneity and purity of the samples in terms of PPG representation. This manual review allowed us to identify and address any potential outliers or anomalies that could affect the subsequent analysis. By applying this quality control step, we aimed to select clusters that were representative of the specific PPG subgroups and provided a reliable basis for understanding the immunological responses in COVID-19 progression. To provide a comprehensive comparison between cytokine expression and cytokine SHAP value clustering, we also applied DBSCAN to the preprocessed cytokine expression value. For comparison, we used the same 77 cytokines that showed significant SHAP values from the measurement. To address the high variance in raw cytokine expression value, we applied two preprocessing steps: a) log transformation and b) quantile normalization to standardize the scale across different samples.Finally, we focused on the statistical analysis of the Shapley values derived from the DBSCAN clusters. This analysis identified significant cytokines influencing the PPGs subgroups. To achieve this, we employed the t-test, a robust statistical tool, to evaluate the differences in feature importance across the PPG subgroups. The t-test was performed using the pingouin.ttest function, which automatically applies Welch’s correction for unequal variances when sample sizes are unequal, as recommended by Zimmerman29. To account for multiple testing, we applied the Benjamini–Hochberg procedure to control the false discovery rate at 0.05 across all t-tests comparing SHAP values between DP and RP groups. The results from these t-tests provided insights into the cytokine set that significantly contributes to the differentiation of disease severity in COVID-19 patients. Furthermore, to explore the interactions and pathways involving the key cytokines identified in the analysis, we performed a protein-protein interaction (PPI) network analysis using the STRING database (version 12.0)30. The analysis was based on the statistical results obtained from the t-test result.

Hot Topics

Related Articles