Synthesis and quality assessment of combined time-series and static medical data using a real-world time-series generative adversarial network

The study protocol was in accordance with the ethical guidelines of the 1975 Declaration of Helsinki. This retrospective study was approved by the Institutional Review Board of Samsung Medical Center (IRB file No. SMC 2021-02-128). The requirement for informed consent was waived by the ethics committee of Samsung Medical Center because the study involved a retrospective review of anonymized medical data.

Datasets

In this study, the real data used for data synthesis consisted of records of patients with colorectal cancer who received their initial treatment at Samsung Medical Center from January 1, 2008, to July 31, 2021. During this period, 30,527 patients with colorectal cancer were registered at Samsung Medical Center. The total patient population included those who initially received treatment at outside institutions and were subsequently transferred to Samsung Medical Center; however, our data did not include records of treatments received at other institutions. Therefore, the following process was undertaken to select only patients who received their initial treatment at Samsung Medical Center. First, we calculated the "Diff" value, defined as min(date of first surgery, date of first anti-cancer treatment) − date of first diagnosis. The distribution of the Diff value among all patients is shown in Fig. 4. Next, we investigated the number of patients selected when the Diff cutoff was set at three months and at six months (Table 4). Typically, treatment begins within three months of diagnosis. Therefore, after consultation with clinical experts, we decided on a three-month cutoff for the Diff value to select the study subjects. As a result, 15,899 of the 30,527 patients, those with a Diff value of three months or less, were selected.

Figure 4. Diff distribution (bin size = 1 month).
The x-axis represents the Diff value, and the y-axis represents the number of patients.

Table 4. Selection of subjects according to the Diff value cutoff.

After screening the collected real data, we identified data errors that did not align with clinical scenarios. We applied logic based on actual clinical situations to exclude these erroneous records; omitting them before synthesis enhanced the quality of the synthetic data. The logic applied is shown in Appendix Textbox 1. Finally, a real dataset comprising 15,799 individuals was constructed for data synthesis. The basic statistics of the real data group are shown in Table 5.

Table 5. Demographics of the real data group.

Preprocessing and dataset construction for data synthesis

The National Cancer Center, a public institution in South Korea, built a platform called CONNECT, which contains standardized cancer big data from medical institutions for research purposes. Real-world data were collected according to the data format of the CONNECT platform. In the CONNECT platform, the data are in a multi-table format, divided into separate tables for static patient baseline information that does not change over time and dynamic information with specific occurrence times, such as examinations, pathology, surgeries, chemotherapy, and radiation therapy. The details of these tables are provided in Appendix Table 1. Considering the sequential relationships among events, the date variables were converted into numerical ones for synthesis; this process is described in Appendix Textbox 2. We then joined the event tables to the patient baseline table, whose values (such as the first diagnosis date or sex) do not change over time, generating consolidated data that record how many days after the first diagnosis each event occurred.

Data synthesis

In this study, the RTSGAN was used to synthesize medical data with irregular time intervals and variable time lengths.
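The Diff-based cohort selection and the conversion of event dates to days-since-diagnosis described above can be sketched in pandas as follows. This is a minimal illustration with toy data; the column names (`first_dx`, `first_surgery`, `first_chemo`, `event_date`) are hypothetical and do not reflect the actual CONNECT schema.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for the patient baseline table; columns are illustrative.
patients = pd.read_csv(StringIO(
    "patient_id,first_dx,first_surgery,first_chemo\n"
    "1,2010-01-05,2010-02-10,2010-03-01\n"
    "2,2012-06-01,2013-01-15,2012-12-20\n"   # Diff far above three months
), parse_dates=["first_dx", "first_surgery", "first_chemo"])

# Diff = min(first surgery date, first anti-cancer treatment date)
#        - first diagnosis date
patients["diff_days"] = (
    patients[["first_surgery", "first_chemo"]].min(axis=1) - patients["first_dx"]
).dt.days

# Keep patients whose Diff is three months (here approximated as 92 days) or less
cohort = patients[patients["diff_days"] <= 92]

# Convert an event table's dates to "days since first diagnosis"
events = pd.DataFrame({
    "patient_id": [1, 1],
    "event_date": pd.to_datetime(["2010-02-10", "2010-08-01"]),
})
merged = events.merge(cohort[["patient_id", "first_dx"]], on="patient_id")
merged["days_since_dx"] = (merged["event_date"] - merged["first_dx"]).dt.days

print(cohort["patient_id"].tolist())       # patient 2 is excluded
print(merged["days_since_dx"].tolist())
```

In the toy data, patient 1 has a Diff of 36 days and is retained, while patient 2 (202 days) is excluded; the inner join then attaches each retained patient's diagnosis date so every event can be expressed as a day offset.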
The RTSGAN is a model specialized for synthesizing medical data with irregular generation cycles; it consists of an encoder–decoder module and a generation module (Fig. 5). A detailed description of the RTSGAN is provided in Appendix Textbox 3. The main parameters used to train the model are listed in Appendix Table 2.

Figure 5. Architecture of the RTSGAN. The blue line represents the learning process of the encoder–decoder module, the red line represents the learning process of the generation module, and the green line represents the process of generating synthetic data after learning.

Evaluation of the quality of synthetic data

Quantitative and qualitative evaluations were conducted to ensure that the synthetic data were accurately generated. Furthermore, we applied the synthetic data to a real medical AI model to evaluate its performance and confirm the feasibility of utilizing synthetic data.

The quantitative methods included the Hellinger distance, TSTR, TRTS, and propensity MSE. The Hellinger distance numerically measures how similar two probability distributions are and is calculated from the Bhattacharyya coefficient, which is similar in nature17. The Hellinger distance is 0 when the two probability distributions match and approaches 1 as they diverge. The calculation of the Hellinger distance is described in detail in Appendix Textbox 4.

TSTR (train on synthetic, test on real) is a method of training a classifier model with synthetic data and validating it with real data. The model's performance on the synthetic training data is compared with its performance on the real test data18. If the synthetic data are well generated and follow the characteristics of the real data, a model trained on synthetic data will behave normally when real data are applied to it, and the difference between the two performances will be small.
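As an illustration of the Hellinger distance described above, a minimal sketch over binned relative frequencies might look like the following. The shared bin edges and the Gaussian toy data are assumptions for demonstration; the paper's exact computation is given in Appendix Textbox 4.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q.

    H(p, q) = sqrt(1 - BC), where BC = sum(sqrt(p_i * q_i)) is the
    Bhattacharyya coefficient: 0 means identical, values near 1 mean
    the distributions barely overlap.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    bc = np.sum(np.sqrt(p * q))                 # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - bc)))

# Bin a "real" and a "synthetic" variable onto shared edges, then compare
rng = np.random.default_rng(0)
real = rng.normal(0, 1, 5000)
synth = rng.normal(0, 1, 5000)
edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=30)
p, _ = np.histogram(real, bins=edges)
q, _ = np.histogram(synth, bins=edges)
d = hellinger(p / p.sum(), q / q.sum())
print(round(d, 3))   # small, since both samples come from the same distribution
```

Using shared bin edges for both samples matters: otherwise the two histograms describe incomparable partitions of the variable's range.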
Conversely, in TRTS (train on real, test on synthetic), the classifier model is trained with real data and validated with synthetic data. A limitation of this method is that mode collapse in the synthetic data does not degrade the TRTS result, so TRTS alone cannot detect it. However, the WGAN-GP19 used in the generation module of this study mitigates the mode collapse problem; therefore, we also performed TRTS verification.

Propensity MSE is a method that measures how well a model can distinguish real from synthetic data: real and synthetic records are labeled accordingly, mixed in a 1:1 ratio, and split into training and test sets, and a classifier is trained and tested on the task of separating them20. The propensity MSE originally takes a value between 0 and 0.25; in this study, it was scaled to the range 0 to 1 for easier interpretation, as follows:
$$\text{propensity MSE (scaled)} = \frac{1}{N}\sum_{i=1}^{N}\frac{(p_i - 0.5)^2}{0.25}$$
where N is the size of the dataset and p_i is the model-generated pseudo-probability for sample i. If the classification model cannot distinguish between real and synthetic data, the propensity MSE converges to zero.

The qualitative evaluation methods included histograms and t-SNE. Histograms were used to examine and compare the distributions of the real and synthetic data. Although a histogram cannot numerically quantify the similarity between real and synthetic data, it can provide a first indication that the probability distributions of the two datasets are similar. t-SNE is a method for reducing high-dimensional, complex data to a low-dimensional representation in which similar data points remain close together21.
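The scaled propensity MSE above can be sketched as follows. The logistic-regression discriminator, the 50/50 train–test split, and the Gaussian toy data are assumptions for illustration; the paper does not specify which classifier was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def propensity_mse_scaled(real, synth, seed=0):
    """Scaled propensity MSE: mean((p_i - 0.5)^2) / 0.25 on a held-out mix."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synth))])  # 1:1 labels
    X_tr, X_te, y_tr, _ = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]        # pseudo-probability of "real"
    return float(np.mean((p - 0.5) ** 2) / 0.25)

rng = np.random.default_rng(0)
real = rng.normal(0, 1, size=(1000, 5))
good_synth = rng.normal(0, 1, size=(1000, 5))   # same distribution as "real"
bad_synth = rng.normal(3, 1, size=(1000, 5))    # clearly different distribution

print(propensity_mse_scaled(real, good_synth))  # near 0: indistinguishable
print(propensity_mse_scaled(real, bad_synth))   # near 1: trivially separable
```

When the classifier cannot separate the two sources, every p_i hovers around 0.5 and the score approaches 0; perfect separation pushes p_i toward 0 or 1 and the score toward 1, matching the stated range.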
Considering these features, if the synthetic data are well generated, the t-SNE embedding of the synthetic data should not differ significantly from that of the real data.

Application of synthetic data to a real-world medical AI model

We applied the synthetic data to a real-world medical AI model to evaluate its usefulness. After applying real and synthetic data to a medical AI model, the quality of the synthetic data can be evaluated by comparing the results obtained using the synthetic data with those obtained using the real data. As the medical AI model, we used a random survival forest22 that predicts five-year survival from colon cancer patient data; the survival model's evaluation indicators include the C-index, Brier score, and integrated Brier score (IBS)23,24,25. Details of the C-index, Brier score, and IBS are described in Appendix Textbox 5.

Evaluation of the disclosure risk of synthetic data

The distance to closest record (DCR) is a method for assessing the likelihood of personal information exposure by measuring the distance between synthetic and real data, typically the distance from each synthetic record to its nearest real record26. The closer this distance is to zero, the more similar the synthetic data are to the real data, and hence the higher the probability of personal information disclosure.

The membership inference test (MIT) is a method of inferring whether a particular data point belongs to the training dataset14,15. It consists of a target model trained on the actual training dataset, an attack model that infers whether a particular data point belongs to that dataset, and shadow training, which is used to train the attack model. Shadow training is based on the idea that training two models of the same structure on similar datasets increases the likelihood that the two models will make similar predictions.
If a shadow model trained on synthetic data that resembles the real data learns to infer the presence of the data points an attacker wants to extract, an attack model can be created that also successfully infers which data points exist in the target model's real training data. In other words, the attack model analyzes the shadow model's predictions to determine whether a particular data point belongs to the target model's training dataset; if it can do so accurately, the data point is likely to belong to that dataset. In this paper, we used the membership inference test to verify whether the synthetic data could be identified as belonging to the real data used to train the model.
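The DCR computation described above can be sketched with a nearest-neighbor search. This is a simplified illustration on purely numeric toy features; applying it to real medical records would additionally require encoding and normalizing mixed-type variables, which the sketch does not cover.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(real, synth):
    """Distance from each synthetic record to its closest real record."""
    nn = NearestNeighbors(n_neighbors=1).fit(real)
    dist, _ = nn.kneighbors(synth)
    return dist.ravel()

rng = np.random.default_rng(0)
real = rng.normal(0, 1, size=(500, 4))

# Near-duplicates of real records: DCR ~ 0, i.e. high disclosure risk
copied = real[:100] + rng.normal(0, 1e-6, size=(100, 4))
# Independent draws from the same distribution: clearly larger DCR
fresh = rng.normal(0, 1, size=(100, 4))

print(dcr(real, copied).mean())   # effectively memorized records
print(dcr(real, fresh).mean())    # safer, non-trivial distances
```

A DCR distribution concentrated near zero signals that the generator is reproducing (near-)copies of training records, which is the disclosure scenario this metric is designed to flag.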
