Scalable incident detection via natural language processing and probabilistic language models

All methods were performed in accordance with the relevant guidelines and regulations. The study was approved by the institutional review board (IRB) at Vanderbilt University Medical Center (VUMC) with a waiver of informed consent (IRB #151156), given the infeasibility of obtaining consent for EHR-driven analyses spanning a health system and a large retrospective dataset.

Cohort generation

Data were extracted from the Vanderbilt Research Derivative, an EHR repository including Protected Health Information (PHI) for those receiving care at VUMC28. PHI were necessary to link to ongoing operational efforts to predict and prevent suicide pursuant to the suicidality phenotypic work here29,30. Patient records were considered for the years 1998 through 2022. For both suicide attempt and sleep-related behaviors, we focused on adult patients aged over 18 years at the time of healthcare encounters with any clinical narrative data in the EHR.

While the technical details of the Phenotype Retrieval (PheRe) system adapted here have been published elsewhere23,27, the algorithm’s retrieval method determined which records were included in this study. In brief, after query formulation to establish key terms for each phenotype (see “Automatic extraction” below), the algorithm assigned scores to every adult patient record. Records with any non-zero NLP score, i.e., containing at least one term from the query lists, were included in subsequent analyses.

Phenotype definitions

Our team has published extensively in the suicide informatics literature on developing, validating, and deploying scalable predictive models of suicide risk into practice. As a result, suicide attempt was defined based on prior work using published diagnostic code sets for the silver standard27 and domain knowledge-driven definitions for the gold standard annotation (see below for details on both).

For sleep-related behaviors, our team reviewed the literature and sleep medicine textbooks for structured diagnostic codes31,32. We also consulted with clinical experts in sleep-related behaviors and sleep disorders in the Department of Otolaryngology at VUMC (see Acknowledgement). This expertise informed both the silver and gold standards for this phenotype. The specific standards are detailed below; in brief, the silver standard was hypothesized to be less specific and a less rigorous performance test than the gold standard, yet easier to implement since it relied on structured data.

Early in the study, we considered various levels of granularity, e.g., “parasomnias” (more general) versus “sleepwalking” (less general). We prioritized sleep-related behaviors that might be associated with black-box warnings if found to be associated with a hypothetical medication or other intervention. We selected a subset of sleep-related behaviors – sleepwalking, sleep-eating, sleep-driving – as the resulting foci for this investigation.

Temporality

In prior work, we applied NLP to ascertain evidence of any suicide attempt or record of suicidal ideation from EHRs across entire healthcare records23,27. In this work, the intent was to ascertain evidence of new, clinically distinct incidents of these phenotypes. To move from prevalence-focused to incidence-focused algorithms, we constrained the time windows for input data processed by NLP.
For example, rather than ascertaining suicidal ideation from every note in an individual’s health record, we considered only the notes documented on a single calendar day. The team discussed numerous potential temporal windows to focus the NLP, including: (i) healthcare visit episodes (time of admission to time of discharge); (ii) set time windows, e.g., twenty-four-hour periods, seven days, thirty days, or multiple months; and (iii) combinations of the two, e.g., clinical episodes plus/minus a time window to capture documentation lag.

After discussion and preliminary analyses, we selected a 24-hour period, midnight to the following midnight, as the window for this incident detection NLP approach. This window was chosen for clinical utility, simplicity, and agnosticism to vendor EHR or documentation schema. Operationally, this meant that all of a patient’s notes on a given day were considered together to encode a potential incident phenotype.

Automatic extraction of phenotypic profiles from clinical notes

We developed a data-driven method to extract relevant text expressions for each phenotype of interest (see Fig. 1). The method involved processing large collections of clinical notes from EHRs (including tokenization and extraction of n-gram representations such as unigrams and bigrams) and unsupervised training of Google’s word2vec33 and transformer-based NLP models such as Bidirectional Encoder Representations from Transformers (BERT)34 to learn context-independent and context-sensitive word embeddings. The extraction of phenotypic profiles consisted of iteratively expanding an initial set of highly relevant expressions (also called ‘seeds’), such as ‘suicide’, as follows. First, we ranked the learned embeddings by their similarity to the seed embeddings. Then, we manually reviewed the top-ranked expressions and selected the relevant ones as new seed expressions. The final sets of text expressions corresponding to each phenotype of interest are listed in the eSupplement.

Fig. 1: Overview of the automatic extraction process enabling incident detection.
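To make the expansion step concrete, the following is a minimal sketch of a single ranking round, assuming a word2vec model has already been trained on tokenized clinical notes with the gensim library. The model file name, seed list, and top-20 cutoff are illustrative assumptions rather than the configuration used in this study.

```python
# Minimal sketch of one round of the seed-expansion loop (gensim word2vec).
# Model path, seeds, and cutoff are illustrative, not the study's settings.
from gensim.models import Word2Vec

model = Word2Vec.load("clinical_notes_w2v.model")  # hypothetical pretrained model

seeds = ["suicide"]  # initial highly relevant expression(s)

# Rank vocabulary terms by embedding similarity to the current seed set.
candidates = model.wv.most_similar(positive=seeds, topn=20)
for term, similarity in candidates:
    print(f"{term}\t{similarity:.3f}")

# In the study's workflow, annotators review these ranked expressions,
# promote the relevant ones into the seed set, and repeat the ranking
# until the phenotypic profile stabilizes.
```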
Large-scale retrieval of incident phenotypes

We implemented a search engine to identify incident phenotypes in all the notes from the Vanderbilt Research Derivative and to rank them by relevance to their profile. In this context, each phenotypic profile corresponds to an input query for the search engine, while each meta-document, comprising all the notes of a patient on a given day, encodes a potential incident phenotype. In the implementation framework, we represented the meta-documents and input queries as multidimensional vectors, where each vector element is associated with a single- or multi-word expression from the corresponding phenotypic profile. The relevance of a patient meta-document to a phenotype was measured as the similarity between the meta-document and input query vectors using the standard term frequency-inverse document frequency (TF-IDF) weighted cosine metric. The final NLP-based score was a continuous value ranging from low single digits (< 10) to the hundreds (typically < 500). Higher scores indicated more similarity and therefore more evidence of the phenotype. To further improve the performance of our search engine, we performed query reformulation based on relevance feedback, iteratively assessing the top 20 retrieved incident phenotypes of each run23. The selection and ranking of the incident phenotypes were performed using the Phenotype Retrieval (PheRe) software package in Java, which is available at https://github.com/bejanlab/PheRe.git.
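As an illustration of this scoring scheme (not the PheRe implementation itself), the sketch below groups notes into patient-day meta-documents and ranks them against a phenotype query by TF-IDF weighted cosine similarity using scikit-learn. The notes and query terms are invented placeholders, and scikit-learn’s cosine similarity is bounded in [0, 1], so the exact weighting in PheRe, whose scores range into the hundreds, evidently differs from this normalized toy version.

```python
# Sketch: patient-day meta-documents scored against a phenotype query with
# a TF-IDF weighted cosine similarity. Illustrative only; not PheRe.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (patient_id, calendar_date, note_text) triples; illustrative records only.
notes = [
    ("p1", "2020-03-01", "Patient endorses suicidal ideation after recent overdose."),
    ("p1", "2020-03-01", "Psychiatry consulted; safety plan documented."),
    ("p2", "2020-03-01", "Reports sleepwalking episodes several nights per week."),
]

# Concatenate all of a patient's notes on a given day into one meta-document.
meta_docs = defaultdict(list)
for patient_id, date, text in notes:
    meta_docs[(patient_id, date)].append(text)
keys = list(meta_docs)
corpus = [" ".join(meta_docs[k]) for k in keys]

# The phenotypic profile acts as the query; unigrams and bigrams mirror the
# n-gram representations described above. These query terms are illustrative.
query = "suicide suicidal ideation suicide attempt overdose self harm"

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Rank patient-day meta-documents by similarity to the phenotype query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for key, score in sorted(zip(keys, scores), key=lambda pair: -pair[1]):
    print(key, round(float(score), 3))
```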
Silver standard generation

A silver standard represents a source of truth that may be less precise or more error-prone than ground truth, a gold standard. An advantage of silver standards is their relative efficiency to generate and validate compared with more labor-intensive gold standards. To generate silver standards for sample size calculations for all phenotypes, we used ICD-9-CM and ICD-10-CM (International Classification of Diseases, Clinical Modification) diagnostic code sets to estimate preliminary NLP performance in identifying the presence/absence of structured phenotypic data (i.e., ICD codes)35. For suicide attempt, we used validated code sets from published literature29,36. For sleep-related behaviors, we adapted code sets from the literature with clinical experts in sleep medicine at VUMC. Code sets for all phenotypes used in this project are available in the eSupplement.

To evaluate preliminary performance, we used the presence of a single ICD code from the validated lists within thirty days of the date of NLP score calculation as a positive label (label = 1) and its absence as a negative label (label = 0). Thirty days was chosen as a common time period in which clinical risk is close enough to warrant documentation and intervention but not so close as to be imminent and require emergent care. The continuous NLP scores served as a univariate ‘predictor’ of presence/absence. Performance was measured with typical information retrieval and model discrimination metrics: area under the receiver operating characteristic curve (AUROC), precision (P), recall (R), P-R curves, and F-measures (F1-score). These metrics were clinically relevant to the phenotypes under study, as both phenotypes are rare at health-system scale. That is, algorithms might achieve adequate c-statistics at the expense of low precision (high false positive rates). For a hypothetical post-market safety surveillance system, such false positives would be problematic for both review burden and accuracy. Similarly, high recall (true positive rate) ensures that cases of potential adverse events in such a system would not be missed.

Gold standard generation

The intent in gold standard generation was to generate corpora of charts across all NLP score bins (e.g., NLP scores of 5–10, 10–15, 15–20, …) to evaluate the performance of the NLP incident detection system. Unlike the top-K precision used in information retrieval, we sought insight into performance with confidence intervals across the score spectrum to plan for downstream clinical decision support (CDS) applications. In top-K precision, error analysis concentrates on the highest-scored records. Here, for broader clinical application, we sought error estimation for low scores (i.e., a low number of assertions supporting the phenotype) as well as high ones, as we anticipated that health systems considering adopting similar tools would want error estimates for all possible scores.

For sample size calculation, we used preliminary performance from the silver standards to calculate the number of chart-days to extract for multi-reviewer chart validation. The rationale for this key step was (1) the need for efficient chart review within a selected marginal error and (2) the intent to understand NLP performance for all possible scores – not simply performance in the highest-ranked charts, i.e., the “top-K” performance typical in this literature, where K is some feasible number of highly ranked records, e.g., K = 200. To determine the number of encounters for chart review, we used the marginal error method, which involves the half-width of the 95% confidence interval for the performance metrics in the investigated cohort. The process involved the following steps: (1) dividing the predicted risk scores into 5-point intervals and calculating the number of encounters in each interval; (2) estimating the precision and recall for each interval using ICD-9/ICD-10 codes; and (3) computing the sample size by setting the marginal error for the probability estimate in each interval to 0.05. We assumed that the number of positive encounters selected for chart review follows a hypergeometric distribution, which we approximated with a normal distribution. The larger of the calculated sample sizes for the precision and recall estimates determined the required sample size for chart review (a numerical sketch of this calculation appears at the end of this section).

To conduct chart review, annotation guides were developed and revised after initial chart validation training (fifty chart-days for each phenotype). These guides included instructions on labeling and factors contributing to label decisions. All reviewers (K.R., J.K., C.W. [adjudicator]) were trained on the required annotation and participated in a training phase using fifty chart-day examples for each phenotype. Chart labels included: Positive – phenotype documented in the notes, e.g., “reports suicidal ideation”; Negative – phenotype documented as not present, e.g., “denied suicidal ideation”; and Unknown – insufficient evidence to determine a label. In subsequent analysis, after the third reviewer adjudicated disagreements and Unknown labels, these three labels were collapsed into two: Positive and Negative. Annotation guides are available as eSupplements.

Evaluation metrics

Metrics to evaluate NLP performance mirrored those used in the preliminary analyses above, including P-R metrics and curves and the F1-score. We also calculated error by score bin to understand how well the NLP score performed across all thresholds. The intent was to replicate a common clinical implementation challenge – discretizing a continuous algorithm output into a binary event, e.g., a decision or an intervention that cannot remain continuous in practice.

Threshold selection

Because the NLP produces a continuous score amenable to precision-recall metrics, users might also select optimal performance thresholds through traditional means. For example, thresholds might be chosen to maximize F-scores such as the F0.5-, F1-, or F2-score, which emphasize precision, balanced precision/recall, or recall, respectively. We use the F1-score here to select optimal thresholds for these NLP algorithms.
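As a concrete illustration of this selection step, the sketch below computes precision and recall at every candidate threshold and picks the F1-maximizing cutoff with scikit-learn. The labels and scores are illustrative placeholders, not study data.

```python
# Sketch: F-score-based threshold selection over a continuous NLP score.
import numpy as np
from sklearn.metrics import precision_recall_curve

# Adjudicated chart labels and continuous NLP scores; illustrative values only.
labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([2.0, 4.5, 8.0, 9.5, 15.0, 22.0, 3.0, 41.0])

precision, recall, thresholds = precision_recall_curve(labels, scores)

def f_beta(p, r, beta):
    """F-beta score; beta = 0.5, 1, and 2 emphasize precision, balance,
    and recall, respectively."""
    denom = beta**2 * p + r
    return np.where(denom > 0, (1 + beta**2) * p * r / denom, 0.0)

# The final precision/recall pair has no associated threshold, so drop it.
f1 = f_beta(precision[:-1], recall[:-1], beta=1.0)
best = int(np.argmax(f1))
print(f"optimal threshold = {thresholds[best]:.1f}, F1 = {f1[best]:.3f}")
```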
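Returning to the sample-size step under “Gold standard generation” above, the following numerical sketch shows one standard formulation consistent with that description: a normal approximation with a finite-population (hypergeometric) correction and a 0.05 marginal error. The bin size and preliminary proportion are illustrative assumptions, and the study’s exact formulation may differ.

```python
# Sketch: per-bin sample size under the marginal error method, using the
# normal approximation with a finite-population (hypergeometric) correction.
import math

def bin_sample_size(N, p, margin=0.05, z=1.96):
    """Charts to review in one score bin of N encounters, given a preliminary
    (silver standard) proportion estimate p."""
    n0 = z**2 * p * (1 - p) / margin**2        # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / N))  # finite-population correction

# e.g., a 5-point score bin with 1,200 encounters and preliminary precision 0.30;
# repeat for recall and take the larger of the two results.
print(bin_sample_size(N=1200, p=0.30))  # -> 255
```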
