Affordable and real-time antimicrobial resistance prediction from multimodal electronic health records

Insights from FIDDLE

Since FIDDLE is a flexible framework, we adjust it to fit the AMR task. In this study, the focus is on building the dataset per antibiotic rather than per pathogen; however, the same steps apply when creating the P. aeruginosa dataset. Generally, when preparing for the AMR task, the pipeline extracts a new ICU stays dataset. First, we use the original MIMIC-IV microbiology events table and select patients with antibiotic susceptibility test results, keeping only records labeled as sensitive or resistant and excluding those with pending or intermediate statuses. Then, we extract the records belonging to a single antibiotic and use them as resistance labels for the classification task; in our case, Gentamicin is chosen since it has the largest amount of data. After that, the microbiology table is merged with the ICU stays table on the patient and admission IDs. For the Gentamicin cohort, this process reduces the number of ICU stays from around 76,000 to 13,658, of which 12,442 had a sensitive result and 1,216 a resistant result, an imbalance ratio of 91/9, or approximately 10:1. We then attach the label and the onset hour from the microbiology events table to each ICU stay.

Studying the cohort with a Gentamicin susceptibility result further, most patients receive their results during the first hours of their stay, as shown in Supplementary Fig. 2. Regarding the length of stay, the most frequent ICU stays last approximately 25 to 30 hours, while around 24% of the patients stayed in the ICU for over 200 hours.

Moving to the dataset of patients infected with P. aeruginosa, a similar analysis is performed. The result is a dataset covering 2,103 ICU stays with an imbalance ratio of 3.14, much smaller than the Gentamicin dataset. The prescriptions of patients infected with P. aeruginosa include five types of antibiotics: Cefepime (\(\approx \)9%), Meropenem (\(\approx \)89%), Piperacillin/Tazobactam (\(\approx \)1.4%), Ciprofloxacin (\(\approx \)0.38%), and Ceftazidime (\(\approx \)0.05%). Similar to the Gentamicin cohort, most culture results in the P. aeruginosa cohort are also obtained during the first hours of the ICU stay.

The next step applies the inclusion criteria, in which we filter the data according to the prediction hour set for each task. We start by removing patients under 18 years old, followed by patients who die or are discharged before the prediction hour. After excluding death and discharge cases, the summary statistics for the Gentamicin \(T=4\) dataset are shown in Supplementary Table 2. Then, patients who have a resistant result during the period [0, T], where T is the onset hour, are given a label of 1, and 0 otherwise. As presented in Supplementary Table 1, patients had their lab cultures taken within the first hours of their stay, which guided the choice of T for the tasks.
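As a rough illustration of this cohort-construction pipeline, the pandas sketch below merges the microbiology events with the ICU stays, keeps definitive Gentamicin results, and applies the age, discharge, and [0, T] labeling criteria. The file paths, the use of the discharge time to approximate the death/discharge exclusion, and the exact thresholds are our assumptions; the column names follow the public MIMIC-IV schema.

import pandas as pd

# Illustrative cohort construction; paths and some details are assumptions.
micro = pd.read_csv("microbiologyevents.csv", parse_dates=["charttime"])
icu = pd.read_csv("icustays.csv", parse_dates=["intime", "outtime"])
patients = pd.read_csv("patients.csv")

# Keep only definitive Gentamicin susceptibility results (sensitive/resistant).
micro = micro[(micro["ab_name"] == "GENTAMICIN")
              & (micro["interpretation"].isin(["S", "R"]))]
micro["label"] = (micro["interpretation"] == "R").astype(int)

# Merge with the ICU stays on the patient and admission identifiers.
cohort = micro.merge(icu, on=["subject_id", "hadm_id"], how="inner")

# Onset hour of each culture result relative to ICU admission.
cohort["onset_hour"] = (cohort["charttime"] - cohort["intime"]).dt.total_seconds() / 3600.0

# Inclusion criteria for a prediction hour T (here T = 4): adults only, and
# stays lasting beyond T (a simple proxy for excluding early death/discharge).
T = 4
cohort = cohort.merge(patients[["subject_id", "anchor_age"]], on="subject_id")
cohort = cohort[cohort["anchor_age"] >= 18]
cohort = cohort[(cohort["outtime"] - cohort["intime"]).dt.total_seconds() / 3600.0 > T]

# Label 1 if any resistant result falls within [0, T], else 0, per ICU stay.
labels = cohort[cohort["onset_hour"].between(0, T)].groupby("stay_id")["label"].max()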
As discussed, the approach in11 combines three different modalities from EHR data: time-invariant, time-dependent, and clinical notes. Setting B as the batch size, the time-invariant data is denoted by \({\textbf {I}}_{ti} \in \mathbb {R}^{B \times D_{ti}}\), the time-series data by \({\textbf {I}}_{ts} \in \mathbb {R}^{B \times L \times D_{ts}}\), and the token IDs of the clinical notes by \({\textbf {I}}_{nt} \in \mathbb {R}^{B \times D_{nt}}\). Here, L refers to the number of timestamps and equals T/dt, \(D_{ti}\) is the dimension of the time-invariant features, \(D_{ts}\) the dimension of the time-series features, and \(D_{nt}\) the length of the clinical notes. To encode the time-invariant data, a linear fully-connected layer is used, followed by a ReLU activation, \({\textbf {E}}_{ti} = ReLU(Linear({\textbf {I}}_{ti}))\), where \({\textbf {E}}_{ti} \in \mathbb {R}^{B \times D'_{ti}}\) and \(D'_{ti}\) is the dimension of the encoded time-invariant features. To encode the time-series data, different approaches are proposed in11, all following the equation

$$\begin{aligned} {\textbf {E'}}_{ts}&= ENC({\textbf {I}}_{ts}), \nonumber \\ {\textbf {E}}_{ts}&= ReLU(Linear({\textbf {E'}}_{ts})). \end{aligned}$$
(1)
Here, \({\textbf {E'}}_{ts} \in \mathbb {R}^{B \times L'}\), where \(L'\) is the hidden size, and \({\textbf {E}}_{ts} \in \mathbb {R}^{B \times D'_{ts}}\), where \(D'_{ts}\) is the number of neurons in the network. The time-series encoder can be an LSTM, a StarTransformer, or the original transformer encoder. Finally, the weights of the ClinicalBERT model presented in13 are used to encode the clinical note tokens. The encoded notes are described as \({\textbf {E}}_{nt} = ClinicalBERT({\textbf {I}}_{nt})\), where \({\textbf {E}}_{nt} \in \mathbb {R}^{B \times D'_{nt}}\). Even though other large language models trained on medical data have been developed since ClinicalBERT, ClinicalBERT remains useful in our context as it relies on the MIMIC-III dataset, which directly fits our tasks. At this stage, all modalities are ready to be fused.
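As an illustration, the following PyTorch sketch shows how the three unimodal encoders could be assembled. The layer sizes, the use of the pooled BERT output, and the Bio_ClinicalBERT checkpoint name are our assumptions rather than details taken from11 or13.

import torch
import torch.nn as nn
from transformers import AutoModel

class EHREncoders(nn.Module):
    """Sketch of the three unimodal encoders; all dimensions are illustrative."""

    def __init__(self, d_ti, d_ts, d_ti_enc=64, d_ts_enc=128, hidden=128,
                 bert_name="emilyalsentzer/Bio_ClinicalBERT"):
        super().__init__()
        # Time-invariant data: one fully-connected layer followed by ReLU.
        self.ti_enc = nn.Sequential(nn.Linear(d_ti, d_ti_enc), nn.ReLU())
        # Time-series data: recurrent encoder (LSTM here), then Linear + ReLU.
        self.lstm = nn.LSTM(d_ts, hidden, batch_first=True)
        self.ts_proj = nn.Sequential(nn.Linear(hidden, d_ts_enc), nn.ReLU())
        # Clinical notes: a pretrained ClinicalBERT-style encoder.
        self.bert = AutoModel.from_pretrained(bert_name)

    def forward(self, x_ti, x_ts, note_ids, note_mask):
        e_ti = self.ti_enc(x_ti)                                   # (B, D'_ti)
        _, (h_n, _) = self.lstm(x_ts)                              # h_n: (1, B, hidden)
        e_ts = self.ts_proj(h_n[-1])                               # (B, D'_ts)
        e_nt = self.bert(input_ids=note_ids,
                         attention_mask=note_mask).pooler_output   # (B, D'_nt)
        return e_ti, e_ts, e_nt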
Fusion mechanisms

The four fusion mechanisms discussed in this section were originally established for multimodal sentiment analysis (MSA). In MSA, there are three modalities: text, visual, and audio. All three are synchronous and have a high impact on understanding human sentiment. Since this is not the case for EHR data, where different modalities have different impacts on the medical task, we adapt the fusion strategy accordingly. We consider the fusion mechanism reported as the best-performing one in11 as the baseline, namely multimodal attention gating using BERT (MAGBERT)15. We also adopt three other fusion mechanisms: tensor fusion17,18, attention fusion19, and Multimodal InfoMax (MMIM)20. Attention fusion captures the long-range dependencies between the modalities by having one transformer per modality and computing the cross-attention between them. Tensor fusion investigates the unimodal, bimodal, and trimodal interactions by calculating the outer product between the modalities. We provide more details on fitting MMIM to the EHR data since it obtains the best results; further explanation of the other three mechanisms can be found in their original works or in11.

Fusion with hierarchical mutual information maximization

MMIM works by maximizing the mutual information between inter- and intra-modality data20. Since the approach relies on mutual information (MI), the issue of intractable MI bounds has to be tackled for computational efficiency. The authors of20 utilize a variety of parametric and non-parametric methods to estimate the true values of the MI bounds. The model takes the unimodal representations of the three modalities and passes them into two parts: the first is the fusion network that produces the prediction, while the second implements MI maximization at the input and fusion levels. The losses produced by the prediction and MI tasks are then back-propagated to improve the learning of the model.

Maximizing MI at the input level: Assuming a correlation between two modalities X and Y, the mutual information is estimated as follows

$$\begin{aligned} I(X,Y)&= \mathbb {E}_{p(x,y)} \left[ \log \frac{q(y|x)}{p(y)} \right] + \mathbb {E}_{p(x)} \left[ KL(p(y|x) \,||\, q(y|x)) \right] \nonumber \\&\ge \mathbb {E}_{p(x,y)} \left[ \log q(y|x)\right] + H(Y) \nonumber \\&\triangleq I_{BA}, \end{aligned}$$

(2)
where H(Y) is the differential entropy of Y. In20, it is shown that the text modality dominates and gives higher performance when considered as the main modality, as it has a higher dimension. Thus, we consider the clinical notes as X and pair them with Y, which is either the time-series or the time-invariant data. In this case, we optimize two MI bounds for the two pairs of modalities. To approximate q(y|x), we follow23, which models it as a multivariate Gaussian distribution \(q_\theta ({\textbf {y}}|{\textbf {x}}) = \mathcal {N} ({\textbf {y}}| \varvec{\mu }_{\theta _1}({\textbf {x}}), \varvec{\sigma }^2_{\theta _2}({\textbf {x}}) {\textbf {I}})\), where \(\theta _1\) and \(\theta _2\) are the parameters of the two neural networks that predict the mean and the variance, respectively. To optimize the approximation of q(y|x) for the two pairs of modalities through likelihood maximization, we utilize the same loss function as in20. To compute the entropy term H(Y), we use a Gaussian Mixture Model (GMM) by constructing two normal distributions for the two classes. In the case of the antibacterial resistance task, these are \(\mathcal {N}_S(\varvec{\mu }_1, \varvec{\Sigma }_1)\) and \(\mathcal {N}_R(\varvec{\mu }_2, \varvec{\Sigma }_2)\), where \(\varvec{\mu }\) is the mean, \(\varvec{\Sigma }\) is the covariance matrix, and S and R denote the sensitive and resistant classes. For a sufficiently large sample, the entropy of a multivariate normal distribution is given by

$$\begin{aligned} H = \frac{1}{2} \log \left( (2\pi e)^k det(\varvec{\Sigma }) \right) , \end{aligned}$$
(3)
where k denotes the dimensionality of the GMM vectors and \(det(\varvec{\Sigma })\) the determinant of the covariance matrix \(\varvec{\Sigma }\). In the MSA task, the two classes, positive and negative, can be assumed to have almost equal weights (prior probabilities), which leads to the entropy computation adopted in20. However, since the EHR tasks are highly imbalanced, we cannot assume equal weights for the two distributions. Moreover, for some training iterations it is not possible to calculate \(det(\varvec{\Sigma }_2)\), since the small number of occurrences of the underrepresented class in one batch can lead to a negative determinant. Therefore, assuming imbalanced weights and disjoint class distributions, we take the lower bound as an approximation and calculate the entropy as

$$\begin{aligned} H(Y) = \omega _S \log (det(\varvec{\Sigma }_1)), \end{aligned}$$
(4)
where \(\omega _S\) is the weight (prior probability) of the sensitive class. More details of the derivation are presented in20,24. Finally, the MI lower-bound maximization at the input level is optimized with the following loss function

$$\begin{aligned} \mathcal {L}_{BA} = - I_{BA}^{nt, ts} - I_{BA}^{nt, ti}. \end{aligned}$$
(5)
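To make this input-level term concrete, the following PyTorch sketch computes the loss for one modality pair (the clinical notes paired with either the time-series or the time-invariant encoding). The network sizes, the diagonal-covariance parameterization of \(q_\theta (y|x)\), and the small regularizer added to the covariance estimate are our own simplifying assumptions rather than the exact implementation of20.

import torch
import torch.nn as nn

class InputLevelMI(nn.Module):
    """Sketch of the input-level MI lower bound for one pair (notes -> Y)."""

    def __init__(self, d_x, d_y, hidden=64):
        super().__init__()
        # Two small networks predict the mean and log-variance of q(y|x).
        self.mu_net = nn.Sequential(nn.Linear(d_x, hidden), nn.Tanh(),
                                    nn.Linear(hidden, d_y))
        self.logvar_net = nn.Sequential(nn.Linear(d_x, hidden), nn.Tanh(),
                                        nn.Linear(hidden, d_y))

    def gaussian_loglik(self, x, y):
        # log q_theta(y | x) for a diagonal Gaussian (up to an additive constant).
        mu, logvar = self.mu_net(x), self.logvar_net(x)
        return (-0.5 * (logvar + (y - mu) ** 2 / logvar.exp())).sum(-1).mean()

    def entropy_term(self, y, labels, w_sensitive):
        # H(Y) ~ w_S * log det(Sigma_S), estimated from the sensitive-class
        # samples in the batch (label 0); the 1e-4 jitter is an assumption
        # added to keep the determinant well defined.
        y_s = y[labels == 0]
        cov = torch.cov(y_s.T) + 1e-4 * torch.eye(y.size(-1), device=y.device)
        return w_sensitive * torch.logdet(cov)

    def forward(self, e_nt, e_y, labels, w_sensitive):
        i_ba = self.gaussian_loglik(e_nt, e_y) \
               + self.entropy_term(e_y, labels, w_sensitive)
        return -i_ba  # contribution to L_BA, to be minimized

The full \(\mathcal {L}_{BA}\) of Eq. (5) is then the sum of this loss over the (notes, time-series) and (notes, time-invariant) pairs.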
Maximizing MI at the fusion level: The main goal of fusing the three modalities is to capture modality-invariant information. This is done by maximizing the MI between the modalities and their fused outcome. Following25 and20, we adopt Contrastive Predictive Coding (CPC) to predict the unimodal representations from the outcome of the fusion. A neural network \(F_\phi \) with parameters \(\phi \) is applied to the fused vector \(Z = g(E_{ti}, E_{ts}, E_{nt})\), where g is the fusion network, and both the prediction \(F_\phi (Z)\) and the true representation \(h_m\), with \(m \in \{ti, ts, nt\}\), are normalized by their Euclidean norm. We then estimate their correlation as follows

$$\begin{aligned} s(h_m, Z) = \exp \left( \frac{h_m}{||h_m||_2} \left( \frac{F_\phi (Z)}{||F_\phi (Z)||_2}\right) ^T \right) . \end{aligned}$$
(6)
This score is then used to compute the Noise-Contrastive Estimation (NCE) loss between the fusion outcome and each modality. The matching representation is treated as the positive sample, while all other representations in the same batch serve as negatives. Denoting the representations in one batch by \({\textbf {H}}_m\), we calculate the NCE loss according to

$$\begin{aligned} \mathcal {L}_N(Z, {\textbf {H}}_m) = - \mathbb {E}_{{\textbf {H}}_m} \left[ \log \frac{s(Z, h_m^i)}{\sum _{h_m^j\in {\textbf {H}}_m} s(Z, h_m^j)} \right] . \end{aligned}$$
(7)
Therefore, the CPC loss at the fusion level is calculated as

$$\begin{aligned} \mathcal {L}_{CPC} = \mathcal {L}_N(Z, {\textbf {H}}_{ti}) + \mathcal {L}_N(Z, {\textbf {H}}_{ts}) + \mathcal {L}_N(Z, {\textbf {H}}_{nt}). \end{aligned}$$
(8)
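A compact PyTorch sketch of this fusion-level term is given below; the single linear predictor standing in for \(F_\phi \) is a simplifying assumption on our part.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCLoss(nn.Module):
    """Sketch of the fusion-level NCE term (Eqs. 6-8) for one modality."""

    def __init__(self, d_z, d_m):
        super().__init__()
        self.predictor = nn.Linear(d_z, d_m)  # stands in for F_phi

    def forward(self, z, h_m):
        # Normalized similarity between every fused vector in the batch and
        # every modality representation (Eq. 6, up to the exponential).
        pred = F.normalize(self.predictor(z), dim=-1)   # (B, d_m)
        h = F.normalize(h_m, dim=-1)                    # (B, d_m)
        logits = pred @ h.T                             # (B, B)
        # NCE (Eq. 7): the matching pair on the diagonal is the positive,
        # all other representations in the batch act as negatives.
        targets = torch.arange(z.size(0), device=z.device)
        return F.cross_entropy(logits, targets)

# L_CPC (Eq. 8) is the sum of this loss over the three modalities, e.g.
# l_cpc = cpc_ti(z, e_ti) + cpc_ts(z, e_ts) + cpc_nt(z, e_nt)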
Combining all three losses, the total training loss is computed as

$$\begin{aligned} \mathcal {L}_{total} = \mathcal {L}_{task} + \alpha \mathcal {L}_{CPC} + \beta \mathcal {L}_{BA}, \end{aligned}$$
(9)
where \(\alpha \) and \(\beta \) control the contribution of the MI maximization terms (both kept at 0.1), and \(\mathcal {L}_{task}\) is the cross-entropy loss. Figure 1 illustrates the overall approach.
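For completeness, a minimal sketch of how Eq. (9) might be assembled in a training step; the variable names are illustrative.

import torch.nn.functional as F

def total_loss(logits_pred, y_true, l_cpc, l_ba, alpha=0.1, beta=0.1):
    # Eq. (9): task cross-entropy plus the weighted MI-maximization terms.
    return F.cross_entropy(logits_pred, y_true) + alpha * l_cpc + beta * l_ba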
