Predicting graft and patient outcomes following kidney transplantation using interpretable machine learning models

All methods were carried out in accordance with relevant guidelines and regulations. This study, referenced under IRAS project ID 304542, has received approval from the Health Research Authority and Health and Care Research Wales (UK research ethics committee). All UK transplant recipients provide consent to the use of their data in the mandatory national registry at the time of addition to the transplant waiting list. This project uses anonymised data from the national registry, so individual patient consent was not required.

Data

Our work is based on the analysis of a data set from the UK Transplant Registry, provided by NHSBT. It describes 36,653 accepted kidney transplants performed between the years 2000 and 2020 across 24 UK transplant centres. All transplants are from deceased donors, and the total follow-up duration is around 22 years. Each transplant is originally described by 3 identifiers, 12 immunosuppression follow-up indicators, 143 donor, recipient, and transplant characteristics, and 7 entries describing targeted outcomes. Considering transplants as independent, we exclude the transplant, donor, and recipient identifiers. Information regarding post-transplant immunosuppression is discarded, as it is not available at the time of the offer decision. The donor, recipient, and transplant characteristics serve as input features for modelling: 24 describe the recipient, 109 the donor, and 10 the overall transplant. Both recipient and donor characteristics contain generic information such as gender, ethnicity, age, blood group, height, weight, and body mass index (BMI). More specific information is also available, such as the transplant centre, number of previous transplants, waiting time, ease of matching, and dialysis status. Donor data include the cause of death, past medical history, and results of blood tests, including kidney function (estimated glomerular filtration rate, eGFR). Transplant data include the donor-recipient immunological match.

Duplicate rows are removed, as are values outside a plausible clinical range. Categorical values are checked by clinicians and simplified (or removed) where needed. BMI is recomputed from weight and height, and both weight and height are then discarded to limit redundant information. Blood measurements are harmonised across the data set by selecting the first measurement available (generally at donor registration) and the maximum value recorded during the donation process. Since the calculation of eGFR varies across hospitals, this metric is recomputed over the whole data set using a consistent definition (see appendix, section A.1). Recipient dialysis status is simplified into a dialysis duration and a dialysis modality at the time of transplant (predialysis, haemodialysis, or peritoneal dialysis); the time on dialysis for predialysis recipients is set by default to 0. Transplant offers not meeting the inclusion criteria, such as dual and multi-organ transplants, are discarded.

Outcomes present in the data set include information about graft failure, patient death, and transplant failure. Graft failure excludes death with a functioning graft, whilst transplant failure denotes either graft failure or death. In this work, we focus on predicting graft failure and patient death. Each outcome is represented as a pair containing an event time and a right-censoring indicator. Right-censoring, a common form of censoring in survival analysis, describes the loss of follow-up on the event of interest; it can occur for various reasons, such as the end of the study or the occurrence of a competing event. A right-censored observation thus carries only partial information about the survival time, which is known solely to be greater than the censoring time. A minimal sketch of this representation is given below.
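The following sketch illustrates this pairing on hypothetical values, using a generic structured-array layout (as expected by libraries such as scikit-survival) rather than the registry's actual schema:

```python
import numpy as np

# Hypothetical follow-up records for three transplants (times in years):
# graft failure observed at 4.2 years; follow-up lost (right-censored) at
# 10.0 years; graft failure observed at 1.5 years.
event_observed = np.array([True, False, True])  # False => right-censored
event_time = np.array([4.2, 10.0, 1.5])

# Structured array pairing each censoring indicator with its event time.
y = np.empty(3, dtype=[("event", bool), ("time", float)])
y["event"] = event_observed
y["time"] = event_time
```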
Transplant outcomes are recomputed for the sake of consistency. After removing the features presenting more than 50% missing values across the whole data set, the data is described through 50 input variables; at this stage, it contains 8% missingness. A summary of this data-cleaning process is given in Fig. 1, and an exhaustive list of the features and targets retained at the final stage of this process is given in the appendix (section A.2).

Figure 1: End-to-end data processing pipeline, from raw data to model testing. Data cleaning is detailed on the left. Cross-validation is performed before and after feature selection.

Model training and validation

In this article, we compare the Cox PH model, DeepHit, and random survival forests in a single-risk setting. The different models are interpreted a posteriori, and their performances are discussed.

The following methodology is applied. First, the data is split in a stratified manner with regard to censoring indicators: 80% of the data is reserved for training and the remaining 20% for testing. Due to matching policy changes and differences in follow-up time between old and recent offers, we do not split the data according to transplant dates. Numerical values are then standardised and categorical ones one-hot-encoded, with means and variances computed over the training data only; standardisation proved more suitable than normalisation given the presence of outliers in the data. Next, we impute missing features with MissForest, an iterative imputation technique relying on random forests [17]. MissForest is first trained on the training data and then applied to all the data. This solution was selected from among several imputation techniques: MIDAS, a variational autoencoder-based technique [18]; MICE, an iterative method for multi-column imputation [19]; MissForest itself, a variant of MICE; and a naive imputer simply returning average values. These methods were compared on a sample of the data in which missingness was introduced by randomly masking known values. In order to simplify the end-to-end data processing pipeline and alleviate the burden on data requirements, we use the same training data for both pre-processing and model training; prior tests showed no particular difference with more partitioned data management. After the imputation step, survival analysis models are trained through 5-fold cross-validation. This process is performed a first time for feature selection, which is achieved by injecting Gaussian noise as an additional feature: we select any feature whose importance is higher than that attached to the noise. Based on this subset of features, 5-fold cross-validation is then repeated for final model training. Model calibration is then performed: predictions are adjusted a posteriori to match observed outcome ratios by training a logistic regression model. Model evaluation is undertaken by computing concordance and AUROC scores over 100 bootstraps of the testing data. Finally, the survival models are clinically interpreted using SHAP: we fix a particular time point (1, 5, or 10 years) and consider how the models predict event occurrence up to that point. The coefficients of the Cox PH model are also provided. Illustrative sketches of these steps are given below.
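As an illustration of the split and encoding steps, here is a minimal scikit-learn sketch; the feature names and values are synthetic placeholders, not registry fields:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "donor_age": rng.normal(50, 15, n),        # hypothetical feature names
    "recipient_bmi": rng.normal(26, 4, n),
    "blood_group": rng.choice(list("ABO"), n),
})
event = rng.random(n) < 0.3                    # censoring indicator (~30% events)

# 80/20 split, stratified on the censoring indicator.
X_train, X_test, ev_train, ev_test = train_test_split(
    X, event, test_size=0.2, stratify=event, random_state=0)

# Standardise numerical columns and one-hot-encode categorical ones;
# scaler statistics (mean/variance) come from the training data only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["donor_age", "recipient_bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["blood_group"]),
])
X_train_enc = preprocess.fit_transform(X_train)
X_test_enc = preprocess.transform(X_test)
```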
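The paper uses MissForest [17]; a readily available stand-in (an assumption on our part, not the authors' implementation) is scikit-learn's IterativeImputer configured with a random-forest base estimator:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_train[rng.random(X_train.shape) < 0.08] = np.nan  # ~8% missingness, as in the cleaned data

# MissForest-style imputation: iteratively predict each missing column
# from the others with a random forest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_train_imputed = imputer.fit_transform(X_train)  # fit on training data only
# imputer.transform(...) would then be applied to the rest of the data set.
```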
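The noise-based feature selection can be sketched as follows. For illustration we use a random forest classifier on synthetic data, whereas the paper applies the same idea to its survival models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                           # synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300)) > 0

# Append a pure-Gaussian-noise column, then keep only the features whose
# importance exceeds the importance attributed to the noise.
X_noise = np.column_stack([X, rng.normal(size=300)])
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_noise, y)

noise_importance = forest.feature_importances_[-1]
selected = np.where(forest.feature_importances_[:-1] > noise_importance)[0]
print("selected features:", selected)
```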
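The calibration step amounts to Platt-style scaling: a logistic regression is fitted to map raw risk scores onto observed outcome frequencies. A minimal sketch on synthetic scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
raw_risk = rng.random(500)             # a model's predicted risk at a fixed horizon
outcome = rng.random(500) < raw_risk   # observed event indicator

# Fit a logistic regression on the raw scores (Platt-style scaling) ...
calibrator = LogisticRegression().fit(raw_risk.reshape(-1, 1), outcome)
# ... and use its probabilities as the calibrated predictions.
calibrated = calibrator.predict_proba(raw_risk.reshape(-1, 1))[:, 1]
```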
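Evaluation can be sketched as follows, computing a censoring-aware concordance index (here via scikit-survival, as one possible implementation) and AUROC over 100 bootstrap resamples of the test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sksurv.metrics import concordance_index_censored

rng = np.random.default_rng(0)
n = 400
risk = rng.random(n)           # predicted risks on the test set (synthetic)
event = rng.random(n) < 0.4    # observed event indicators
time = rng.exponential(5.0, n) # follow-up times (years)

c_scores, auc_scores = [], []
for _ in range(100):           # 100 bootstraps of the testing data
    idx = rng.integers(0, n, n)  # resample with replacement
    c_index = concordance_index_censored(event[idx], time[idx], risk[idx])[0]
    c_scores.append(c_index)
    auc_scores.append(roc_auc_score(event[idx], risk[idx]))

print(f"concordance: {np.mean(c_scores):.3f}, AUROC: {np.mean(auc_scores):.3f}")
```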
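For the SHAP analysis, one fixes a horizon and explains the predicted probability of an event occurring by that time. A model-agnostic sketch with a stand-in classifier (the authors' exact setup lives in their repository):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(size=300)) > 0  # stand-in for "event within 5 years"
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def predict_event(data):
    # Probability of the event occurring by the chosen time point.
    return model.predict_proba(data)[:, 1]

explainer = shap.Explainer(predict_event, X[:100])  # background sample
shap_values = explainer(X[:20])                     # per-feature attributions
```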
The choice of interpreting the Cox model through its coefficients rather than SHAP is motivated by the fact that the Cox model's inherent interpretability is a key factor in model selection; notably, this is how the model is usually interpreted. Fig. 1 illustrates the overall methodology, and the code used for our experiments can be found at https://github.com/AchilleSalaun/Xamelot.

Following this processing pipeline, we compare the Cox PH model, random survival forests, and neural networks. Since both DeepHit and survival forests require time to be discretised, we restrict transplant outcome prediction to 1, 5, and 10 years; this step follows the discretisation process described in [8]. We rely on grid search to tune hyperparameters. As a result, Breslow's estimator is used to derive the Cox PH model's baseline [20], and a regularisation parameter, set to \(10^{-4}\), is introduced to deal with collinearities in the data. The random survival forest is given 300 trees. Finally, we train DeepHit in a single-risk fashion. For graft failure prediction, the model is instantiated with two hidden layers of 100 neurons each and 10% dropout; the network used to predict patient death has one hidden layer of 200 neurons followed by two layers of 100 neurons. For both graft failure and patient death, training runs for 50 epochs, with batches of size 64 and a learning rate of \(10^{-2}\). Illustrative configurations are sketched below.
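As an illustration of the Cox PH and random survival forest configurations, here is a scikit-survival sketch on synthetic data (parameter names follow that library's API, which also uses Breslow's estimator for the Cox baseline; this is not the authors' code):

```python
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # synthetic features
y = np.empty(200, dtype=[("event", bool), ("time", float)])
y["event"] = rng.random(200) < 0.5                  # synthetic outcomes
y["time"] = rng.exponential(5.0, 200)

# Cox PH with a ridge penalty of 1e-4 to handle collinearities.
cox = CoxPHSurvivalAnalysis(alpha=1e-4).fit(X, y)

# Random survival forest with 300 trees.
rsf = RandomSurvivalForest(n_estimators=300, random_state=0).fit(X, y)
```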
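A hedged sketch of the single-risk DeepHit configuration for graft failure, using the pycox library on synthetic data (the library choice and the training scaffolding are our assumptions; the hyperparameters follow the text):

```python
import numpy as np
import torchtuples as tt
from pycox.models import DeepHitSingle

rng = np.random.default_rng(0)
n, in_features = 500, 20
x = rng.normal(size=(n, in_features)).astype("float32")
durations = rng.exponential(5.0, n).astype("float32")  # synthetic follow-up (years)
events = (rng.random(n) < 0.5).astype("float32")       # synthetic event indicators

# Discretise time to the 1-, 5-, and 10-year horizons used in the paper.
labtrans = DeepHitSingle.label_transform(np.array([1.0, 5.0, 10.0], dtype="float32"))
y = labtrans.fit_transform(durations, events)

# Two hidden layers of 100 neurons with 10% dropout (graft-failure variant).
net = tt.practical.MLPVanilla(in_features, [100, 100], labtrans.out_features,
                              batch_norm=True, dropout=0.1)
model = DeepHitSingle(net, tt.optim.Adam, duration_index=labtrans.cuts)
model.optimizer.set_lr(1e-2)                           # learning rate of 1e-2
model.fit(x, y, batch_size=64, epochs=50, verbose=False)
```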
