Indirect reference interval estimation using a convolutional neural network with application to cancer antigen 125

Deep learning for indirect reference interval estimation

Data preparation

To create training data for the deep learning model, we simulate mixtures of reference and pathological distributions covering a wide range of characteristics. By varying parameters that affect the accuracy of indirect methods, such as reference distribution shape, fraction of pathological results, pathological overlap, total sample size, and precision of measured values, we aim to train a robust model that is applicable to many real-world datasets.

Our simulation method expands upon that presented in RIBench, a recently proposed synthetic dataset for evaluating indirect methods10. The RIBench dataset is intended as a benchmark for comparing the performance of different indirect methods, using carefully defined parameters to model 10 real-world analytes. To generate synthetic data suitable for training ML models, we generalize the data simulation approach to enable random sampling of mixture distribution parameters while maintaining real-world fidelity. The main assumption of the method is that there is an underlying distribution of healthy results which accounts for the majority of the data. Superimposed on the healthy distribution are satellite pathological distributions with varying sizes and degrees of overlap with the central reference component. As with several modern indirect methods, reference distributions were modeled as inverse-Box-Cox-transformed Gaussians3,4,5,6,14. Because the exact reference parameters (i.e. mean, standard deviation, and Box-Cox skewness) are known, the exact percentiles are also available, with the central 95% interval typically used as the RI. Along with enabling high variability, this knowledge of exact target RIs makes synthetic datasets advantageous for training and evaluating indirect methods: the size of the training data is not limited, and errors in ground-truth target RIs due to limited sample sizes and mislabeling in real-world data are avoided.

For each simulated sample, the target outputs were an array of percentiles of the reference distribution, including 1%, 2.5%, 5%, 95%, 97.5%, and 99%, and otherwise evenly distributed between 10 and 90% in intervals of 10%. Additionally, the fraction of the total sample taken from the reference distribution was added as a target output. The simulated dataset was partitioned into training and evaluation portions, with 9000 samples for training and 1000 for validation. The model was further evaluated on 1000 random samples from 8 analytes modeled in the RIBench benchmark dataset, namely hemoglobin, calcium, free thyroxine, aspartate transaminase, lactate, gamma-glutamyl transferase, thyroid-stimulating hormone, and immunoglobulin E. Further details about data preparation are provided in the supplementary material.

Feature extraction

To extract features for model input, samples are first transformed to a standard scale with mean 0 and standard deviation 1. In practice, the sample statistics are used to convert predictions back to the original scale. A histogram of 100 bins evenly spaced within the range [−4, 4] is then computed from the input sample. This array describes the shape of the mixture distribution across the majority of its range after standardization, with the possible exception of outlying values. The format also satisfies the basic requirements that model inputs have a constant shape and that the information quantified by each element of the feature vector be consistent. Histogram magnitudes are normalized between 0 and 1.
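To make the simulation and featurization steps concrete, the following Python sketch draws one synthetic mixture sample and converts it into the model input described above. The component parameters and the single pathological satellite are illustrative assumptions, not the actual sampling scheme of our pipeline.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_mixture(n_total=5000, path_frac=0.2, mu=2.0, sigma=0.3, lam=0.5):
    """Draw one synthetic sample: an inverse-Box-Cox-transformed Gaussian
    reference component plus one pathological satellite component.
    All parameter values here are illustrative, not our sampling ranges."""
    n_path = int(n_total * path_frac)
    # Reference component: Gaussian in Box-Cox space, mapped back to the original scale.
    ref = inv_boxcox(rng.normal(mu, sigma, n_total - n_path), lam)
    # Pathological component: shifted and widened so it overlaps the reference tail.
    path = inv_boxcox(rng.normal(mu + 2.0 * sigma, 1.5 * sigma, n_path), lam)
    return np.concatenate([ref, path])

def featurize(sample, bins=100, lo=-4.0, hi=4.0):
    """Standardize, histogram into 100 bins on [-4, 4], and normalize
    magnitudes to [0, 1] (the CNN input format described above)."""
    z = (sample - sample.mean()) / sample.std()
    counts, _ = np.histogram(z, bins=bins, range=(lo, hi))
    return counts / counts.max()

x = featurize(simulate_mixture())  # one (100,) feature vector

# The exact target percentiles follow from the known reference parameters.
qs = [0.01, 0.025, 0.05] + [i / 10 for i in range(1, 10)] + [0.95, 0.975, 0.99]
targets = inv_boxcox(norm.ppf(qs, 2.0, 0.3), 0.5)
```

Because the reference parameters are known exactly, the target percentiles carry no sampling error, which is what makes synthetic data attractive for supervised training.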
Model training

We used the Tensorflow-Keras framework for model development15. A deep CNN architecture was trained using backpropagation and gradient descent16. The hidden layers of the model consisted of three convolutional layers with 32, 64, and 64 nodes, respectively, along with a dense layer of 64 nodes. The rectified linear unit (ReLU) was used as the activation function in the hidden layers17. ReLU is a computationally efficient non-linear function that simply sets negative values to 0; aside from its simplicity, it promotes sparsity of learned representations, which acts as a form of regularization and can improve model generalization. The output layer contained 16 nodes for predicting various quantiles of the target reference distribution, concentrated near the limits as described in “Data preparation”. Additionally, one output node predicted the fraction of the sample size contributed by the reference distribution, which provided extra information for guiding model optimization. To avoid the influence of target feature scale and to ensure each target contributed equally to model optimization, each was converted to standard scale based on the training set statistics. In practice, the outputs are converted back to the original scale using the saved statistics. The mean absolute error was used as the loss function for guiding model learning. The adaptive Adam optimizer18 was used with an initial learning rate of 0.001. The model was trained for 20 epochs with a batch size of 8. The best model weights were selected based on the epoch with the lowest loss on a held-out validation set consisting of 1000 samples.
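The Keras sketch below mirrors this architecture and training configuration. Kernel sizes, the absence of pooling, and the checkpoint filename are assumptions, since they are not specified in the text; the layer widths, activation, loss, optimizer, and schedule come from the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_bins=100, n_outputs=16 + 1):  # 16 quantile nodes + reference-fraction node
    return tf.keras.Sequential([
        layers.Input(shape=(n_bins, 1)),           # standardized 100-bin histogram
        layers.Conv1D(32, kernel_size=3, activation="relu"),  # kernel size assumed
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_outputs),                   # linear outputs; targets are standard-scaled
    ])

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mae")  # mean absolute error, as described above

# Keep the weights from the epoch with the lowest validation loss.
ckpt = tf.keras.callbacks.ModelCheckpoint("best.weights.h5",
                                          monitor="val_loss",
                                          save_best_only=True,
                                          save_weights_only=True)
# Training data (X_train, y_train, X_val, y_val) comes from the simulation step:
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=20, batch_size=8, callbacks=[ckpt])
```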
Evaluation

One measure previously proposed for RI estimation error is the absolute z-score deviation metric, which quantifies error on the standard deviation scale10. The z-scores of the lower and upper limits, \(r_l, r_u\), of the true or estimated RI are:

$$z_{l/u}=\frac{\mathrm{BoxCox}\left(r_{l/u}-S,\,\lambda\right)-\mu}{\sigma}$$

(1)

where \(\lambda\) is the power parameter, \(S\) the shift, \(\mu\) the mean, and \(\sigma\) the standard deviation of the parametric reference distribution. The absolute z-score deviation is then defined as:

$$z_{err}=\frac{1}{2}\sum_{i\in \left(l,u\right)}\left|\widetilde{z}_{i}-z_{i}\right|$$

(2)

where \(\widetilde{z}\) and \(z\) are the z-scores for the estimated and true RIs, respectively.
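Equations (1) and (2) translate directly into Python; the helper names and the example parameter values below are ours.

```python
import numpy as np

def boxcox_tf(x, lam):
    """Box-Cox transform; lam = 0 falls back to the log case."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def z_scores(ri, lam, shift, mu, sigma):
    """Eq. (1): z-scores of RI limits under the parametric reference model."""
    return (boxcox_tf(np.asarray(ri) - shift, lam) - mu) / sigma

def z_err(ri_est, ri_true, lam, shift, mu, sigma):
    """Eq. (2): mean absolute z-score deviation over the two limits."""
    z_est = z_scores(ri_est, lam, shift, mu, sigma)
    z_true = z_scores(ri_true, lam, shift, mu, sigma)
    return float(np.mean(np.abs(z_est - z_true)))

# Illustrative reference parameters (not values from the paper):
err = z_err([2.8, 5.5], [2.9, 5.3], lam=0.5, shift=0.0, mu=2.0, sigma=0.3)
```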
We argue that an alternative, simpler score we call “normalized error” gives a better indication of model performance. The normalized error is simply a fraction of the true RI’s range, making it easily interpretable. It is calculated as follows:

$$n_{err}=\frac{1}{4}\sum_{i\in \left(l,u\right)}\left|\widetilde{r}_{i}^{\,N}-r_{i}^{\,N}\right|$$

(3)
where \(r^{N}=\left[r_{l}^{N}, r_{u}^{N}\right]=\left[-1, 1\right]\) is the true RI in standard scale, \(\widetilde{r}=\left[\widetilde{r}_{l}, \widetilde{r}_{u}\right]\) is the estimated RI, and \(\widetilde{r}^{N}\) is the standardized \(\widetilde{r}\), defined as

$$\widetilde{r}_{l/u}^{\,N}=\frac{\widetilde{r}_{l/u}-\mu_{r}}{\sigma_{r}}$$

(4)

where \(\mu_{r}\) and \(\sigma_{r}\) are the mean and standard deviation of the true RI, respectively.
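The normalized error is equally direct to compute and, unlike the z-score deviation, needs only the two interval limits; no reference distribution parameters are required. A minimal sketch:

```python
import numpy as np

def n_err(ri_est, ri_true):
    """Eqs. (3)-(4): normalized error, a fraction of the true RI's range.
    Standardizing by the true RI's mean and SD maps it to [-1, 1]."""
    mu_r = np.mean(ri_true)
    sigma_r = np.std(ri_true)          # population SD of the two limits = half the range
    est_std = (np.asarray(ri_est, dtype=float) - mu_r) / sigma_r
    return 0.25 * float(np.sum(np.abs(est_std - np.array([-1.0, 1.0]))))

# Classify one prediction against the 0.1 threshold described below.
err = n_err([9.5, 31.0], [10.0, 30.0])  # illustrative values
acceptable = err <= 0.1
```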
This metric achieves the same scale independence, with the added advantage of not requiring the parameters of a reference distribution modeled as a transformed Gaussian. Moreover, the z-score error exaggerates performance for highly skewed distributions: compressing the tail reduces the distance between true and predicted limits there. We chose a normalized error threshold of 0.1 (10% of the RI range) for classifying predictions as acceptable or unacceptable and calculating the accuracy of each method.

We compared the performance of the neural network against a recently published expert-based algorithm, refineR6. Two synthetic datasets were used for quantitative evaluation: one created by our own simulation method, and the other consisting of 1000 random samples from 8 analytes in RIBench, as listed in “Data preparation”.

Estimating age-specific reference intervals for CA-125 in a Puerto Rican cohort

Data collection and preprocessing

The dataset is supplied by Abartys Health, a health data analytics company operating in the Puerto Rican market. The clinical lab results originate from laboratories in Puerto Rico. The data are de-identified by removing all personally identifiable information (PII) such as names, dates of birth, and addresses. Nevertheless, separate individuals can be distinguished by a unique person ID within the dataset. Non-PII demographic information, such as age and gender, is available. Each lab result is identified by its Logical Observation Identifiers Names and Codes (LOINC) code and the timestamp of when the sample was taken.

We curated CA-125 data measured between 2018 and 2023. Results from male patients, accounting for roughly 3% of all samples, were filtered out of the dataset. Females less than 18 years of age were also filtered out. To avoid an overrepresentation of diseased patients, who are more likely to be tested multiple times, we kept only the most recent result per patient.

The laboratory test results were retrieved from laboratory information systems serving the clinical laboratories in Puerto Rico. As stipulated by the US Code of Federal Regulations, 45 CFR 46.104(d), the analysis of test results does not require patients’ explicit informed consent if the identity of the human subjects cannot readily be ascertained directly or through identifiers linked to the subjects. The use of the datasets has been reviewed and approved by the Institutional Review Board of the Office of Human Research Subjects Protection at the University of Puerto Rico—Medical Sciences (reference number 2301072914). All methods were performed in accordance with the relevant guidelines and regulations.

Age-specific RI estimation

Outliers were removed via Tukey’s method after normalizing the data with a Box-Cox transformation14,19. The dataset was stratified by age using intervals of 5 years; a sketch of this preprocessing is shown below. Along with the CNN, we applied refineR to each age group in the CA-125 data.
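The following sketch assumes the conventional 1.5 × IQR Tukey fences and a hypothetical data frame `df` with `age` and `ca125` columns; both are illustrative stand-ins for the curated dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

def remove_outliers(values):
    """Normalize via Box-Cox (lambda fit by maximum likelihood), then drop
    values outside Tukey's fences; 1.5 * IQR is the assumed fence width."""
    pos = values[values > 0]                # Box-Cox requires positive inputs
    transformed, _ = boxcox(pos)
    q1, q3 = np.percentile(transformed, [25, 75])
    fence = 1.5 * (q3 - q1)
    keep = (transformed >= q1 - fence) & (transformed <= q3 + fence)
    return pos[keep]

# Hypothetical example frame; in practice this is the curated CA-125 data.
rng = np.random.default_rng(1)
df = pd.DataFrame({"age": rng.integers(18, 90, 1000),
                   "ca125": rng.lognormal(2.5, 0.6, 1000)})

# Stratify by age in 5-year bins, then clean each group separately.
bins = np.arange(18, 98, 5)                 # hypothetical bin edges
df["age_group"] = pd.cut(df["age"], bins=bins, right=False)
cleaned = {g: remove_outliers(grp["ca125"].to_numpy())
           for g, grp in df.groupby("age_group", observed=True)}
```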
Similar to several other modern, expert-based indirect methods, refineR assumes that the input data consist of a mixture of pathological and healthy (i.e. reference) results, with the reference component modeled as an inverse-Box-Cox-transformed normal distribution6,14. refineR first applies statistical analysis to identify the range of values that best characterizes the main peak of the input data distribution. Based on this range, search regions for the parameters of an inverse-Box-Cox-transformed normal distribution are derived, and a grid search is performed in this parameter space to identify the best-fitting parameters. RI estimates can then be extracted from the percentiles of the fitted parametric distribution. For more details, see6. We used the refineR package for the R statistical computing platform to implement the method20.

As the lower limit of CA-125 is generally considered to be 0, we estimated the 95th and 99th percentiles of the reference distributions as upper limits. To estimate 95% confidence intervals for the upper limits from each method, we ran 1 prediction on each full sample, followed by 199 more predictions on bootstrapped samples. The 2.5th and 97.5th percentiles of the resulting 200 estimates were used as 95% confidence intervals.
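A sketch of this bootstrap procedure, with a simple empirical quantile standing in for the CNN or refineR estimator:

```python
import numpy as np

def upper_limit_ci(sample, estimate_fn, q=0.95, n_boot=199, seed=0):
    """One estimate on the full sample plus 199 bootstrap replicates;
    the 2.5th/97.5th percentiles of all 200 estimates give a 95% CI.
    `estimate_fn` maps (sample, quantile) to an upper-limit estimate,
    e.g. the CNN or refineR wrapped as a Python callable."""
    rng = np.random.default_rng(seed)
    ests = [estimate_fn(sample, q)]
    for _ in range(n_boot):
        boot = rng.choice(sample, size=len(sample), replace=True)
        ests.append(estimate_fn(boot, q))
    lo, hi = np.percentile(ests, [2.5, 97.5])
    return ests[0], (lo, hi)

# Illustrative stand-in estimator: the empirical quantile of the sample.
point, ci = upper_limit_ci(np.random.default_rng(2).lognormal(2.5, 0.6, 2000),
                           lambda s, q: float(np.quantile(s, q)))
```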
