Development and transfer learning of self-attention model for major adverse cardiovascular events prediction across hospitals

Study dataset

Study population and outcomes

Data was previously extracted from the anonymized EMR system of AMC. The inclusion criteria were defined as patients who visited AMC with suspected cardiac disease between January 1st, 2000, and December 31st, 2017. For the target dataset for transfer learning, records of patients who visited CNUH for cardiac disease between January 1st, 2012, and December 31st, 2018, were extracted. The following information was extracted from both centers: patient encounters, demographics, diagnoses, medications, treatments, laboratory tests on human-derived materials, physical information, vital signs, digital tests regarding coronary angiography (CAG), echocardiography/computed tomography (CT), myocardial single photon emission CT (SPECT), cardiopulmonary function tests, procedures, surgeries, blood transfusions, and smoking history.

In this study, a high-risk cardiovascular disease (HRCVD) cohort is defined as patients diagnosed with critical conditions, including MI, chronic ischemic heart disease, stroke, transient ischemic attack (TIA), angina, and heart failure. The primary objective is to predict MACE within three years during the study period, with MACE defined as MI, stroke, TIA, heart failure, and death from all causes. The operational definition of each MACE was chosen to be identifiable in the EMR system, considering diagnosis, treatment, and patient encounter types including inpatient, outpatient, and emergency room (ER) visits. The details of the operational definitions are provided in Supplementary Table S1 online. The model is primarily designed to learn from time-series data based on tables in the EMR, as its architecture does not incorporate multimodal inputs such as medical images or signals.

Data preprocessing for individual hospitals

We manually categorized the codes of diagnoses, medications, and laboratory tests into categories of disease, drug ingredient, and lab test, respectively. Specific coding systems for diagnoses, medications, and laboratory tests were used following the common data model (CDM): International Statistical Classification of Diseases and Related Health Problems (ICD) codes for diagnoses, RxNorm for medications, and Logical Observation Identifiers Names and Codes (LOINC) for laboratory tests. For the AMC dataset, collected over 18 years, the coding system was institutionally maintained, including version changes of the coding systems such as the upgrade of ICD codes from version 9 to 10. Mapping between codes used the guidelines or vocabularies provided by the systems, or the guidelines for mapping insurance claim codes provided by the National Health Insurance Sharing Service (NHISS) and the HIRA Bigdata Open Portal.

The data of a patient is structured into a table in which each row represents a visit and the columns are status variables. To account for the irregularity in patient records, instead of choosing a fixed interval, we organized the data by visit, forming each visit into a row of status variables. The variables for the initial dataset were manually selected from the extracted information mentioned in the previous section. For categorical variables, such as records of treatments, procedures, or laboratory tests holding codes, the 50 most frequent codes were encoded with one-hot encoding. For continuous variables, the mean value per visit was calculated with outliers removed.
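As a rough illustration of this per-visit preprocessing, the sketch below one-hot encodes the 50 most frequent codes and averages continuous variables per visit. The column names, the IQR-based outlier rule, and the table layout are assumptions for readability, not the exact pipeline used in the study.

```python
import pandas as pd

def preprocess_visits(events: pd.DataFrame, top_k: int = 50) -> pd.DataFrame:
    """Aggregate raw EMR events into one row per (patient, visit).

    `events` is assumed to have columns: patient_id, visit_date,
    code (diagnosis/medication/lab code), and value (continuous measurement).
    """
    # Keep only the top-k most frequent codes and one-hot encode them per visit.
    top_codes = events["code"].value_counts().nlargest(top_k).index
    onehot = (
        events[events["code"].isin(top_codes)]
        .assign(flag=1)
        .pivot_table(index=["patient_id", "visit_date"],
                     columns="code", values="flag",
                     aggfunc="max", fill_value=0)
    )

    # Remove outliers in continuous values (a simple IQR rule, an assumption),
    # then take the per-visit mean.
    q1, q3 = events["value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    inliers = events[events["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    continuous = inliers.groupby(["patient_id", "visit_date"])["value"].mean()

    return onehot.join(continuous.rename("value_mean"), how="outer").reset_index()
```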
The binary label was assigned to each row, indicating 1 when MACE was reported within three years and 0 otherwise, as described in Fig. 1. A report with a minimum 28-day difference from the last visit date was considered for labeling purposes. The constructed source dataset has 1,101 features with a maximum of 100 visits per patient. A maximum of 100 visits covers the entire visit history of a patient with 99.9% confidence for the cohort. For patients with fewer than 100 visit dates, the data was padded from the beginning with the values of the first record and a label of -1. For patients with more than 100 visit dates, only the records of the most recent 100 visits were used.

Fig. 1 A schematic example of constructing time-series data of a patient and the overall structure of the prediction model, SA-RNN.
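A minimal sketch of this sequence construction is shown below, assuming each patient's visits are already ordered chronologically. The fixed length of 100, front-padding with the first record, and the -1 padding label follow the description above, while the array layout is an assumption.

```python
import numpy as np

MAX_VISITS = 100
PAD_LABEL = -1

def build_sequence(visits: np.ndarray, labels: np.ndarray):
    """Fix a patient's record to exactly MAX_VISITS rows.

    `visits` has shape (n_visits, n_features) in chronological order;
    `labels` has shape (n_visits,) with the 3-year MACE label per visit.
    """
    n = visits.shape[0]
    if n >= MAX_VISITS:
        # Keep only the most recent 100 visits.
        return visits[-MAX_VISITS:], labels[-MAX_VISITS:]

    # Front-pad with copies of the first record and the padding label.
    pad_rows = np.repeat(visits[:1], MAX_VISITS - n, axis=0)
    pad_labels = np.full(MAX_VISITS - n, PAD_LABEL, dtype=labels.dtype)
    return np.vstack([pad_rows, visits]), np.concatenate([pad_labels, labels])
```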
Data processing for multi-source study

Because the recording system differs per medical center, the schema of the EMR system was written with detailed descriptions to maximize the coverage of variables in the AMC data. We considered the differences in feature and code availability between centers and systemic differences in EMR recording time during medical procedures. The variables were chosen from the schema written from the categorized information. Unrecorded variables were filled with null values, and missing values were replaced with averages calculated within the same inpatient period. In addition, the time gaps between inpatient admissions, receptions, tests, and treatments were ignored, consolidating all related records under a single inpatient number. For instance, records of inpatient admissions within 2 days of an ER visit were combined.

In addition, to ensure the two datasets have matching codes, institutional codes of CNUH were mapped to the CDM codes the datasets share, assuring corresponding data inputs after one-hot encoding. Codes in the diagnosis, surgery, digital/laboratory test, and procedure/treatment tables were mapped to SNOMED-CT codes, except for some numerical data in laboratory tests, which were mapped based on LOINC. Medication codes were mapped to the corresponding RxNorm or RxNorm Extension codes. The categorized codes above were mapped to the applicable categories. The manual mapping process was based on the CDM, and for gaps between the codes used in each center, codes were mapped based on computerized text similarity and subsequent manual curation.

Experimental ethics

This study protocol was approved and conducted under the supervision of the Institutional Review Board (IRB) following the Declaration of Helsinki in medical research. All patient-identifiable information was removed following the policy of the Health Insurance Portability and Accountability Act (HIPAA), and data access was limited to authorized researchers. In addition, the protocol regarding the multicenter data was approved by the IRB of CNUH. Consent from individual subjects was waived by the IRB due to the retrospective nature of the study protocol. Access to the anonymized data was limited to authorized researchers.

Design of prediction model

Development of sequential model

Unlike the basic Multi-Layer Perceptron (MLP), also known as the feed-forward DNN, the RNN was developed to predict succeeding sequences by referring to previous information. The key point of the RNN in learning sequential data is the recursive update of the model weights using the gradient descent algorithm. However, this structure makes the RNN vulnerable to the vanishing gradient problem, as gradients grow smaller over the learning process. It also limits the ability to maintain long-term dependencies when a prediction needs to be inferred from data far in the past.

Consequently, LSTM32 was developed to learn sequential data without vanishing gradients or the information loss caused by long sequences. A memory cell and gate units are designed to learn a sequence at time step \(t\) with the information passed from the cell state at \(t-1\). Each memory cell has input \(x_t\) and output \(h_t\) and learns a cell state \(C_t\) to maintain or discard information learned from the entire sequence while learning information from new data. Each gate unit consists of several types of layers and holds weights and biases for each layer.

The learning process starts with a forget gate \(f_t\) that controls the information passed from the previous cell output \(h_{t-1}\) with a sigmoid layer. An input gate decides the information learned from the present sequence at \(t\) with a sigmoid and a tanh layer, denoted as \(i_t\) and \(\tilde{C}_t\), respectively. The new cell state is updated following \(C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\). The output \(h_t\) of the cell is decided from the input \(x_t\) and the previous output \(h_{t-1}\) by the sigmoid layer and multiplied by \(\tanh(C_t)\).

Since the emergence of LSTM, further variations have been developed, such as the Bi-directional LSTM (Bi-LSTM)33. It was introduced to learn from both previous and subsequent data. This model first trains forward, as in the vanilla LSTM explained above, denoted as \(\overrightarrow{h_t}\). The backward states \(\overleftarrow{h_t}\) learn from the cell state at \(t+1\). At the end, the outputs of each direction are concatenated and returned.

Later, as the state-of-the-art model, the transformer34 was introduced, based only on multiple self-attention layers in an encoder-decoder structure. The attention mechanism is known to be effective in modeling sequences, where sequences are modeled based on dependencies between tokens instead of the sequential distance between them. Scaled dot-product attention is used to calculate the attention value for query \(Q\), key \(K\), and value \(V\) as below.

$$\mathrm{Attention}\left(Q,K,V\right)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$

The attention distribution captures the dependency of the tokens in \(K\) on the input \(Q\), and the attention value over \(V\) is calculated based on it. The attention distribution is obtained by applying Softmax to the dot product between \(Q\) and \(K\), scaled by \(\sqrt{d_{k}}\), where \(d_{k}\) is the dimension of \(K\). The attention value is the weighted sum of this distribution and \(V\).
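As a compact illustration of the formula above, the following sketch computes scaled dot-product attention with NumPy. The batch-free shapes and the explicit softmax are simplifications for readability, not the implementation used in the model.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Return the attention value and distribution for Q, K, V.

    Assumed shapes: Q is (n_q, d_k), K is (n_k, d_k), V is (n_k, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k)
    # Softmax over the key axis gives the attention distribution.
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights                           # weighted sum of V
```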
Transfer learning

Transfer learning involves adapting a model, pretrained on abundant data, to perform the same task on a target source with limited data8. Because access to healthcare datasets is limited, transfer learning in EMR has been studied to compensate for the lack of data9,10,11. For transfer learning across different sources, the differences in data distribution between sources are adapted to in the training stage12,13 or before training via the calculation of deviation14 or of feature importances15. In this study, the adapted model was pretrained on the AMC dataset and transfer learning was performed for the target dataset of CNUH. We added a feature selection stage to adjust for changes in the importance of a feature.

Feature selection

The EMR features were selected by the Shapley value35, which calculates importance by excluding any bias of the current method through recursive feature elimination and by evaluating the feature importance of the ML model. The feature importances were obtained from the XGBoost model. A tree ensemble consists of classification models, each assigned trainable weights to compensate for the errors of the previous stage. In this study, we used XGBoost as the classifier; XGBoost is an ensemble method that combines the results of multiple classifiers to achieve better performance. The model is trained by gradient boosting with the regularization parameters λ and γ to avoid overfitting, and performance is evaluated with the root mean square error (RMSE). Via grid search, the XGBoost model yielding the lowest RMSE was chosen to predict outcomes for 20% of randomly selected patients. The feature selection results are shown in Sect. 4.2. Features ranking in the top 50 for feature importance were selected, discarding those with lower feature importance and frequency. Furthermore, we validated the model by comparing its predicted score with a survival analysis from a traditional statistical model.

Model architecture

The prediction model uses the previous time-series data of the selected features of a patient to predict MACE within three years. We propose a self-attention recurrent neural network (SA-RNN) to learn serial data with an attention mechanism. Convolution layers are often used as feature extractors together with other modules suited to the task31; for example, the convolutional model from previous work, EEGNet30, was combined with an attention module. Likewise, the SA-RNN model consists of convolution blocks as a feature extractor combined with LSTM and the attention module.

Figure 1 illustrates a schematic example of constructing time-series data and the overall architecture of SA-RNN, segmented into three principal sections between an input layer and an output layer. Details of the architecture and data flow of SA-RNN are depicted in Fig. 2. The initial section is a feature extraction section composed of three 1-dimensional convolutional (1D-CONV) blocks, represented in green. Each 1D-CONV block consists of two 1-dimensional convolutions with added latencies, dropout, and LeakyReLU as the activation function. With the constructed time-series data as input, the 1-dimensional convolution is applied to each row of features to learn a feature-level representation for each row in the sequence.

Succeeding the convolutional blocks are two layers of Bi-LSTM units. These blocks are designed to capture the progression in the time-series data from both the forward and reverse directions16. Concluding the architecture is an attention layer, which is employed to weigh the significance of the hidden states learned from the previous sections. Finally, the output layer is trained to classify the occurrence of MACE learned from the patient-level information.

Fig. 2 Architecture and data flow of SA-RNN.
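A minimal Keras sketch of this architecture is given below, using the hyperparameters reported in the experimental setting (kernel size 5; filters 64/128/256; two Bi-LSTM layers of 50 units; dropout 0.2). The block internals, the attention layer choice, and the pooling before the output layer are simplified assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sa_rnn(n_visits: int = 100, n_features: int = 50) -> tf.keras.Model:
    """Sketch of SA-RNN: 1D-CONV feature extractor -> Bi-LSTM -> attention -> output."""
    inputs = layers.Input(shape=(n_visits, n_features))

    # Feature extraction: three 1D-CONV blocks (kernel size 5, filters 64/128/256).
    x = inputs
    for filters in (64, 128, 256):
        x = layers.Conv1D(filters, kernel_size=5, padding="same", name=f"conv_{filters}_a")(x)
        x = layers.LeakyReLU()(x)
        x = layers.Conv1D(filters, kernel_size=5, padding="same", name=f"conv_{filters}_b")(x)
        x = layers.LeakyReLU()(x)
        x = layers.Dropout(0.2)(x)

    # Two Bi-LSTM layers (50 units each) capture forward and backward progression.
    x = layers.Bidirectional(layers.LSTM(50, return_sequences=True), name="bilstm_1")(x)
    x = layers.Bidirectional(layers.LSTM(50, return_sequences=True), name="bilstm_2")(x)

    # Self-attention over the hidden states, followed by the classification head.
    x = layers.Attention(name="attention")([x, x])
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(1, activation="sigmoid", name="output")(x)

    model = models.Model(inputs, outputs, name="sa_rnn")
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
    return model
```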
Experimental setting

For internal validation, 20% of the patient data was randomly selected, and predictions were made using the chosen feature set. The data distribution of the train and test datasets from each source is provided in Supplementary Fig. S1 online. The model was trained with grid search to determine the most suitable parameters (for details, see Supplementary Table S2 online). The three 1D-convolutional blocks used a kernel size of 5 with filter sizes of 64, 128, and 256, respectively. LeakyReLU is used as the activation function throughout the model with a dropout rate of 0.2. For the two Bi-LSTM layers, the dimensionality of the output space was set to 50 each. The Adam optimizer was used with a learning rate of 1e-04.

For transfer learning on the small target dataset, the model was trained for 10 epochs with only a small portion of its weights set trainable. The 1D-CONV feature extraction was set trainable, because that is where the differences between the two hospitals need to be learned. In addition, the attention module and the output layers were set trainable to adjust the weights for classification. All weights were set trainable except for the two Bi-LSTM layers, which were frozen to preserve the weights that learned the sequential patterns. The learning rate started at 1e-04 and was adjusted when the performance did not improve per epoch.
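The following sketch illustrates this fine-tuning scheme for a model such as the one built above, freezing the two Bi-LSTM layers while leaving the convolutional, attention, and output layers trainable. The layer names, the pretrained-weights path, and the use of ReduceLROnPlateau to adjust the learning rate on stalled epochs are assumptions for illustration, not the authors' exact setup.

```python
import tensorflow as tf

def prepare_for_transfer(model: tf.keras.Model) -> tf.keras.Model:
    """Freeze the Bi-LSTM layers and fine-tune the rest on the target-hospital data."""
    for layer in model.layers:
        # Keep the sequential knowledge: only the two Bi-LSTM layers stay frozen.
        layer.trainable = not layer.name.startswith("bilstm")
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# Usage sketch (x_target / y_target are the CNUH tensors; names are placeholders):
# model = build_sa_rnn()
# model.load_weights("amc_pretrained.weights.h5")   # hypothetical checkpoint path
# model = prepare_for_transfer(model)
# reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=1)
# model.fit(x_target, y_target, epochs=10, validation_split=0.2, callbacks=[reduce_lr])
```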
