A better performing algorithm for identification of implausible growth data from longitudinal pediatric medical records

In this report, we provided new methods for removing biologically implausible values for height and weight in pediatric populations. We also provided free and open-source R28 and SAS27 code to implement the algorithms. Both a visual demonstration and a simulation study were used to evaluate the utility of the novel algorithms. The visual demonstration showed smoother height trajectories after removal of biologically implausible values with the Harrall algorithm. The simulation study compared the novel algorithm to three published algorithms1,2,3. The novel and published algorithms had similar sensitivity values for both height and weight, across all experimental conditions examined. However, the Harrall algorithms had better specificity for height, and much better specificity for weight. Run times were similar for the novel and existing algorithms. Because algorithms were compared on similar testbeds, with known correct and known biologically implausible data, it was possible to compare the algorithms head-to-head against a gold standard.

Figures 1 and 2 visually demonstrate the appeal of the Harrall height algorithms. The graph of height data, with biologically implausible values removed, shows smoother spaghetti plots of height growth over development, both individually (Fig. 1) and for all participants together (Fig. 2). Because the EPOCH data were observed, and not simulated, there is no gold standard. The better-looking spaghetti plots are aesthetic improvements, but may not represent truly cleaner data.

We did not produce spaghetti plots illustrating the EPOCH29 weight data for the following reason. Height increases monotonically with age. By contrast, weight may fluctuate between consecutive time points, and from month to month. Both increases and decreases are plausible. Looking at growth curves for height provides information, because it shows the removal of implausible decreases in height. 
Looking at growth curves for weight for a particular example shows little information, as small increases or decreases may reflect changes between true measurements, or may reflect one or more errors in measurement. Thus, we reasoned that results from the simulation study would provide better overall information about algorithmic performance.

The Monte Carlo study (Tables 3 and 4) shows statistically significant, but not clinically relevant, differences between the novel and published algorithms in sensitivity for both height and weight. The statistical significance of the hypothesis test results reflects the number of replicates chosen for the simulation design. Because there are 9000 replicates in the Monte Carlo study, the standard deviations are very small, and the p-values achieve significance. Differences in sensitivity of a few tenths or hundredths probably make no practical difference for scientists. The results for specificity for weight demonstrate both statistical and clinical significance. The Harrall algorithm achieved specificity of 0.908 for weight, compared to 0.309 for Shi et al.2 and 0.127 for Phan et al.3. These differences are both statistically and operationally significant. The Harrall algorithm detects about nine in ten incorrect values for weight, compared to about one in three for the algorithm of Shi et al.2, and about one in ten for the algorithm of Phan et al.3. For height, the Harrall (0.928) and Daymont et al.1 (0.903) algorithms are relatively close in specificity, with both having higher specificity than the algorithms of Phan et al.3 (0.400) and Shi et al.2 (0.332).

We used sensitivity and specificity as metrics for the simulation study. Because both sensitivity and specificity are proportions, they describe the proportion of correct values which are correctly kept, and the proportion of incorrect values which are correctly removed. For any study, the proportions of incorrect and correct values for weight and height may differ. 
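As a concrete illustration of these definitions, the sketch below computes sensitivity (the proportion of correct values kept) and specificity (the proportion of implausible values removed) on a simulated testbed where the truth is known. This is an illustrative Python sketch with invented names; the study's released code is in R28 and SAS27.

```python
def keep_metrics(truth_implausible, flagged):
    """Sensitivity and specificity as defined in the text.

    truth_implausible: bools, True where the simulated value is an
        injected (biologically implausible) error.
    flagged: bools, True where the algorithm removed the value.
    """
    kept_correct = sum(1 for t, f in zip(truth_implausible, flagged)
                       if not t and not f)
    n_correct = sum(1 for t in truth_implausible if not t)
    removed_errors = sum(1 for t, f in zip(truth_implausible, flagged)
                         if t and f)
    n_errors = sum(1 for t in truth_implausible if t)
    sensitivity = kept_correct / n_correct    # correct values correctly kept
    specificity = removed_errors / n_errors   # implausible values correctly removed
    return sensitivity, specificity
```

For example, with three correct values of which two are kept, and two injected errors of which one is removed, the function returns a sensitivity of 2/3 and a specificity of 0.5.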
It is impossible to know a priori how many, and what percentage, of each sort of measurement will be removed.

Removing biologically implausible values can reduce analytic problems for modeling series of repeated measurements of height and weight. Removing large peaks and dips can improve convergence of models. Removing noise improves precision and can remove bias in model results.

The source of errors in measurements of height and weight is not clear, although one can speculate. Errors can arise from unit errors, such as measurements in pounds or inches instead of kilograms or centimeters. Errors can arise from medical staff skipping height measurements in a busy clinic. Errors can arise from wiggly children, bouncing on scales and squirming while being measured. Errors can arise from infrequent calibration of scales. Measurement errors can arise from digit preferences by clinic or study staff. And errors can arise from transcription or translation mistakes.

Scientists must make affirmative and transparent choices as to whether to clean or not to clean data. In any observed set of data, the probability structure which generated the true data and the errors is impossible to detect. Because the probability structure is unknown, it is unclear whether removing biologically implausible values will create bias in models fit to the data. Without cleaning data, it is difficult to fit models with pediatric height or weight as predictors or outcomes. The high variance of dirty data may lead to difficulties with convergence of models. Even if modeling is possible, conclusions from hypothesis testing with dirty data may or may not match conclusions from hypothesis testing with cleaned data. If the conclusions do not match, we advocate using the clean data, because conclusions drawn from biologically implausible values may be incorrect. 
If the conclusions from clean and dirty data match, we still recommend using the clean data, because the lower variance of cleaned data will give tighter confidence intervals. In some cases, it may be reassuring to report the conclusions from the dirty data as a sensitivity analysis. Another possible approach is to fit measurement error models, such as the nonlinear models of Carroll et al.30.

The manuscript used Monte Carlo simulation to generate both clean and dirty data. Using simulated data to assess algorithm performance relies on subjective judgment. Subjective questions that must be addressed include: Is this an appropriate model for the underlying data? Are these appropriate parameters? Do the frequency and regularity of the data represent real-world data? Is this capturing not just underlying patterns of growth but typical variation (from minor illnesses, etc.), and if not, is that important? Does the mechanism for introducing errors mimic the type, variety, and severity of errors found in real-world data? Do the features varied during simulation represent all of the important aspects of underlying data or errors that can impact algorithm performance? Sometimes the answer to one of these questions is clearly no. But there are no objective criteria to determine when the answer is yes.

The manuscript had several limitations. First, good performance in a Monte Carlo study reflects good performance on data generated under certain assumptions, and thus may not guarantee good performance on other data. In particular, the simulated data matched observed data from the EPOCH study29 in terms of the average and variance of the number of observations per year of life. The algorithm may not have performed as well in a set of data with less frequent measurements. In addition, a Monte Carlo study can only demonstrate superiority on the experimental conditions considered, rather than all possible experimental conditions. 
However, we sought to mirror as accurately as possible both normal human growth, and the errors added to the data. We assumed a Preece-Baines25 nonlinear growth model, and added repeats, cups and caps at the frequency observed in the EPOCH cohort29. The conditions we selected were representative of real observed data in terms of the percentage of participants with at least one repeat or biologically implausible value, the size of the errors, and the total number of errors. Second, the algorithm did not use an error tolerance to limit removal of small decreases in height. Because of errors in measurement, such decreases may be common. It is possible that future improvements of the algorithm will add this feature. Third, we used percentiles from the CDC5 clinical growth charts for age to flag unreasonable changes in weight across time. Weight thresholds were calculated as the absolute magnitude of the difference between the 97th percentile of weight at the age one year after the index age, and the 97th percentile of weight at the index age, irrespective of the time difference between the two consecutive measurements. The threshold used is in some sense arbitrary, as the choice of percentiles was made to remove the most extreme data. Fourth, the Harrall algorithms require at least three data points for every participant. Fifth, if there are several measurements at the same time point, the algorithm will select the first one, which may subsequently be removed if anomalous. A future version of the algorithm may permit better error handling for multiple measures at the same time point. 
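The weight-velocity threshold described above can be sketched as follows. The 97th-percentile values in this Python sketch are hypothetical placeholders, not actual CDC growth chart entries, and the function names are invented; the released implementations are in R28 and SAS27.

```python
# Hypothetical 97th-percentile weights (kg) by age in years.
# Real values would come from the CDC clinical growth charts.
P97_WEIGHT = {2: 15.5, 3: 18.0, 4: 20.5, 5: 23.5, 6: 27.0}

def weight_jump_threshold(index_age_years):
    """Threshold as described in the text: the absolute difference between
    the 97th percentile of weight one year after the index age and the
    97th percentile of weight at the index age."""
    return abs(P97_WEIGHT[index_age_years + 1] - P97_WEIGHT[index_age_years])

def flags_weight_pair(w_prev, w_next, index_age_years):
    """Flag a pair of consecutive weights whose absolute change exceeds
    the threshold, irrespective of the actual time elapsed between the
    two measurements."""
    return abs(w_next - w_prev) > weight_jump_threshold(index_age_years)
```

With these placeholder percentiles, a jump from 16 kg to 26 kg at an index age of 3 years would be flagged, while a change of 1 kg would not.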
Finally, the algorithm is not appropriate to use in cases of severe illness, when skeletal height may decrease, or when rapid weight gain or loss is expected, as after bariatric surgery, treatment with GLP1-R agonists, or in the context of severe illnesses like cancer.

The Harrall algorithms do not use population quantiles from the CDC to remove single extreme points for each participant, although the CDC weight-for-age quantiles are used to generate cutpoints for implausible growth velocity in weight. The use of CDC quantiles may represent a limitation, particularly for researchers studying populations outside of the United States. Using source data collected outside the study to define biologically implausible values is often called an externally driven cleaning approach4. Both the CDC5 and the World Health Organization (WHO)31 provide percentiles of pediatric height and weight by age and sex. Our rationale for not using population quantiles to declare single points implausible was based on results from Freedman et al.32, who showed that this method often removed true values. Freedman et al.32 tested the validity of using WHO percentiles to remove extreme values of childhood height, weight, and BMI in the National Health and Nutrition Examination Survey (NHANES) study. The group found two major issues when assessing outliers flagged by the WHO standards. First, almost all of the outliers sat at the upper range of the measurement’s distribution. By removing outliers from only the upper range of the distribution, cleaning using quantiles may lead to bias in model inference32. 
Second, at least 75% of the participants who were identified as having high biologically implausible values of height also had leg lengths above the 95th percentile32, suggesting that the removed values may in fact have correctly represented real data from proportionally large youth.

Since the Harrall algorithm cannot identify anomalies at the first or last point of the trajectory, researchers may consider using the CDC or WHO population quantiles to flag improbable values prior to running the algorithm. In general, we discourage doing that. Our rationale is that the Harrall algorithm removes anomalous values in the context of the entire trajectory of growth for a single child. Thus, any values that remain are contextually valid, even if they appear improbably large or improbably small in comparison to population quantiles.

The manuscript also has several strengths. The algorithm is based on very few assumptions. The algorithm for height is based on the assumption that skeletal height increases throughout childhood. In addition, the code is available, both in R28 and SAS27, and released as open-source, copy-left software under the GNU33 public license. This means that the code can be used or modified by others, as long as derivative works remain open-source under the same license. The goal is to allow wide use of the software, and application to other datasets involving pediatric height and weight.

Another strength of the Harrall algorithms is that they are computer-based. Because computer-based algorithms are fast, they are increasingly useful in studies with large sample sizes. The increasing use of electronic medical records and the rise of large epidemiological groupings of cohorts, such as ECHO34 and HELIX35, mean that sample sizes are increasing. Large sample sizes mean that manual curation takes more time, and costs more. In addition, computer-based algorithms provide replicable results. 
By contrast, since manual curation is based on expert opinion, rather than an algorithm, two curators, or even the same curator making a second review, may arrive at different answers. In addition, manual curation is sometimes conducted by reviewers with little training, and thus may not even represent expert opinion.

It is our hope that computer-based algorithmic curation of pediatric height and weight data will improve accuracy and reduce bias in studies of growth. Understanding which data are biologically implausible will allow researchers to design systems to limit the sources of measurement error. We hope to extend the algorithmic approach to other longitudinal biological measures in humans, in which rapid changes are improbable.
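To make the height algorithm's core assumption concrete, the sketch below flags interior measurements that break the expectation that skeletal height only increases, using each point's neighbors for context. This is an illustrative Python sketch, not the published algorithm (which is released in R28 and SAS27); like the Harrall algorithm, it requires at least three measurements and cannot assess the first or last point of a trajectory.

```python
def flag_height_dips_and_spikes(heights):
    """Flag interior height measurements that break monotonic growth.

    heights: measurements for one child, ordered by age.
    Returns a list of bools, True where the value looks implausible.
    The first and last points are never flagged, because each has
    only one neighbor for context. Illustrative sketch only.
    """
    n = len(heights)
    if n < 3:
        raise ValueError("at least three measurements are required")
    flags = [False] * n
    for i in range(1, n - 1):
        # A "dip": the point falls below its predecessor, while the
        # surrounding points are consistent with growth on their own.
        dip = heights[i] < heights[i - 1] and heights[i + 1] >= heights[i - 1]
        # A "spike": the point rises above its successor, while the
        # surrounding points are consistent with growth on their own.
        spike = heights[i] > heights[i + 1] and heights[i - 1] <= heights[i + 1]
        if dip or spike:
            flags[i] = True  # removing this point restores local non-decrease
    return flags
```

For example, in the trajectory 100, 105, 95, 110, 115 cm, only the 95 cm dip is flagged; the neighboring values remain contextually valid.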
