Finding Long-COVID: temporal topic modeling of electronic health records from the N3C and RECOVER programs

Topic usage and coherence across sites

Of 75 sites with data in N3C, 63 passed initial quality filtering, representing 12,486,133 patients with at least one condition recorded between 1/1/2018 and 8/2/2022. The topic model training set contained 7,992,339 patients and 387,401,304 conditions, while the topic model validation set contained 1,996,380 patients and 96,738,753 conditions, representing a corpus of 48,372 unique condition identifiers. Mean topic coherence on validation data improved as the number of generated topics increased from 150 to 300, but not beyond (Supplementary Fig. 2), so we selected the model with 300 topics for final analysis.

Figure 1 illustrates selected topics as word clouds, displaying the top conditions of each by weight. Topics are named T-1 to T-300 in order of their usage \(U\) (rounded to nearest 0.1%, see “Methods”), font size is proportional to condition weight in each topic, and color indicates condition relevance to the topic. Supplementary materials include word clouds for all topics (Supplementary Fig. 3). Jensen-Shannon distance indicates that topics have little overlap (Supplementary Fig. 4), with a median distance of 0.82 (range 0.39–0.83). The last 10 topics, however, T-290 to T-300, form a group with increased co-similarity and many generic, low-relevance conditions mixed with a small number of high-relevance conditions.

Fig. 1: Word clouds illustrating top-weighted conditions for selected topics. Conditions are sized according to probability within each topic and colored according to relevance, with positive relevance indicating conditions more probable in the topic than overall. Each condition displays the numeric OMOP concept ID encoding the relevant medical code used for clustering, as well as the first few words of the condition name.
Per-topic statistics in panel headers show usage of each topic across sites (\(\rm{U}\), rounded to nearest 0.1%), topic uniformity across sites (\(\rm{H}\), 0–1, higher values being more uniform), and relative topic quality as a normalized coherence score (\(\rm{C}\), z-score, higher values being more coherent).

Coherence scores follow a roughly normal distribution across topics (Supplementary Fig. 5), and overall coherence tends to increase with rarer, more specific topics except for the last 10. Topic coherence varies by site, more so for rarer topics (Supplementary Fig. 6). All sites exhibit low coherence for the final 10 topics, and most of the final ~35 topics have low coherence at most sites, with one site as an exception. Two sites report low coherence for most topics. Topic usage also varies by site, though most sites and topics follow similar patterns of usage (Supplementary Fig. 7). T-4 was used almost exclusively by a single site and has very low coherence with only a few high-relevance terms, although this site uses other topics similarly to other sites.

N3C sites contribute data from one of several source common data models (CDMs). The source CDM used by sites is not strongly correlated with coherence or usage (Supplementary Figs. 6 and 7), except for two sites in the PEDSnet network specializing in pediatric care and another using TriNetX. These three sites exhibit distinctive patterns, for example lower coherence and usage for T-153 pertaining to Gout (not typically associated with pediatric patients) and higher usage for T-127 pertaining to male pediatric conditions such as Phimosis and Undescended testicle.

Individual conditions significant for PASC and COVID

From the 2,495,414-patient assessment set, 4386 PASC, 105,967 COVID, and 335,841 Control patients met cohort eligibility requirements for individual-condition tests. Amongst PASC patients, 36% had a strong primary infection indicator at least 45 days prior to their PASC indication.
After removing duplicates from the top entries for each topic, we tested 4794 individual conditions for new onset post-infection. Of these, 213 are significant for the PASC cohort, 208 for COVID, and 89 for both with p < 0.05 after multiple-test correction. The complete list of significant results is available in Supplementary Table 4, and Fig. 2 labels a subset of these. The PASC cohort shows higher rates for most significant conditions, although several conditions are represented in the COVID cohort as well, such as Pneumonia caused by SARS-CoV-2, Viral pneumonia, Postviral fatigue syndrome, Loss of sense of smell, and Abnormal menstrual cycle. Additionally, the following conditions have significant estimated odds ratios (ORs) greater than 2 in both cohorts: Loss of sense of smell, Disorder of respiratory system, Acute lower respiratory tract infection, Upper respiratory tract infection due to Influenza, Telogen effluvium, and Non-scarring alopecia.

Fig. 2: Increased and decreased new-onset conditions in PASC and COVID patients compared to Controls post-infection. The x-axis shows estimated odds ratios and the y-axis shows the adjusted p-values for new incidence of top-weighted, positive-relevance terms from all topics amongst COVID (left) and PASC (right) cohorts compared to Controls, in the 6-month post-acute period compared to the previous year. Many known PASC-associated conditions increased in both cohorts, while some conditions are cohort-specific. Additionally, in the COVID cohort, incidence of many conditions associated with regular care or screening is reduced compared to controls.

Several conditions are strongly increased in the PASC cohort, including Chronic fatigue syndrome, Malaise, Finding related to attentiveness, Headache, Migraine (with and without aura), and Anxiety disorder.
Neurosis is also present, although the site-labeled source codes for it are almost entirely ICD-10-CM F48.9 (Non-psychotic mental disorder, unspecified) or similar (F48.8 and ICD-9 300.9). Notably, Impaired cognition is more common in PASC patients (OR 4.26) but less common in COVID patients (OR 0.53) compared to Controls. Other neurological conditions increased in PASC include Inflammatory disease of the central nervous system, Disorder of autonomic nervous system, Polyneuropathy, Orthostatic hypotension, and Familial dysautonomia (a genetic condition; see “Discussion”).

The significant results for PASC also highlight a variety of symptoms related to the cardiovascular, pulmonary, and immune systems. Cardiac conditions such as Tachycardia, Palpitations, Congestive heart failure, Myocarditis, Cardiomyopathy, and Cardiomegaly are observed. Pulmonary issues are well represented with Pulmonary embolism, Bronchiectasis, Fibrosis of lung, and various generic labels for respiratory failure or disorder. Amongst immunological conditions are Reactive arthritis triad, Elevated C-reactive protein, Lymphocytopenia, Hypogammaglobulinemia, Systemic mast cell disease, and generic Immunodeficiency disorder. In addition, bacterial, viral, and fungal infections are increased, including Bacterial infection due to Pseudomonas, Aspergillosis, and Pneumocystosis. Other common themes include musculoskeletal issues (Fibromyalgia, Muscle weakness, various types of pain) and hematological issues (Blood coagulation disorder, Anemia, Hypocalcemia, Hypokalemia).

The analysis also reveals estimated odds ratios less than 1, indicating decreased incidence post-infection compared to Controls, for 219 conditions in one or both cohorts.
Most of these (174) were significant only for the larger COVID cohort, and several are related to routine screening or elective procedures potentially disrupted by a COVID-19 infection or lack of care access during the pandemic, such as Pre-operative state, Nicotine dependence, Radiological finding, Gonarthrosis, and Hypertensive disorder [24]. Preoperative state was largely coded as SNOMED CT 72077002 or ICD-10-CM Z01.818, both widely used across sites and indicative of pre-surgical examination. Unable to Assess Risk appears to be a custom code used by a single site, mapped to OMOP concept ID 42690761 by N3C. Other conditions may be more difficult to identify in the 6 months after a COVID-19 infection due to symptom masking or altered care-seeking behavior. Examples include Diverticulosis of large intestine and Esophageal dysphagia [25,26]. In addition to Pre-operative state, five conditions are significantly decreased for PASC patients, all related to late-term pregnancy, while Third trimester pregnancy is increased in COVID patients (see “Discussion”).

Topics significant for PASC and COVID by demographic

From the assessment set, 2859 PASC patients, 89,374 COVID patients, and 303,017 Control patients met cohort eligibility criteria for per-topic logistic models; Supplementary Table 5 provides per-group patient counts. Baseline contrasts broadly reflected expected trends by life stage and sex (Supplementary Fig. 3). T-2, for example, pertains to pregnancy, with an estimated female/male OR of 45, pediatric/adult OR of 0.06, adolescent/adult OR of 0.2, and senior/adult OR of 0.03. Similarly, T-3 highly weights neonatal conditions and generates a pediatric/adult OR of 43, but no significant female/male trend.

Our primary contrasts considered life stage, sex, and infection-wave demographic groups, evaluating post-vs-pre topic odds ratios for PASC or COVID patients compared to corresponding odds ratios for Controls.
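The structure of these primary contrasts, a post-vs-pre odds ratio for a case group divided by the same odds ratio for Controls, can be sketched in a few lines. This is a minimal illustration only: the function names and the input proportions are hypothetical, and it works from raw proportions, whereas the estimates reported here come from per-topic logistic models that include additional covariates.

```python
def odds(probability):
    """Convert a probability in (0, 1) to odds."""
    return probability / (1.0 - probability)


def ratio_of_odds_ratios(case_pre, case_post, control_pre, control_post):
    """Post-vs-pre odds ratio for a case group divided by that of Controls.

    Each argument is the (hypothetical) proportion of patients in that group
    and phase who generate conditions from a given topic.
    """
    case_or = odds(case_post) / odds(case_pre)
    control_or = odds(control_post) / odds(control_pre)
    return case_or / control_or


# Made-up proportions for illustration only; these are not values from the study.
example = ratio_of_odds_ratios(
    case_pre=0.02, case_post=0.10, control_pre=0.02, control_post=0.02
)
print(f"Illustrative contrast: {example:.2f}")
```

On the log scale this quantity is a difference of differences in log-odds, so in a logistic model containing only cohort, phase, and their interaction it corresponds to the exponentiated interaction coefficient. A value of 1 means the case group changed no more than Controls did.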
For example, the contrast ((PASC adult post)/(PASC adult pre))/((Control adult post)/(Control adult pre)) results in an OR estimate of 9.89 for T-23, suggesting that post-infection, adult PASC patients increase their odds of generating conditions from this topic nearly 10 times more than Controls do over a similar timeframe. Figure 3 illustrates this result and others for the subset of topics with significant OR estimates >2 for more than one demographic group. All baseline and primary contrast results are listed in Supplementary Table 6 and visualized in Supplementary Fig. 4.

Amongst the 5400 sex, life-stage, and wave-specific contrasts, 314 are significant after multiple-test correction, representing 68 distinct topics. Of these, 130 are represented by the final 10 low-quality topics with OR ~ 0.6 for all patient groups, potentially reflecting broad healthcare access patterns driven by their shared similarity and few high-relevance terms. Most contrasts have small ORs, with only 30 contrasts across 9 topics having an OR of 2 or higher. The majority of strong effects are seen for the PASC cohort, and while topic coherence was largely uncorrelated with PASC or COVID association, topics with the strongest significant increases in the PASC cohort were less coherent than average (Supplementary Fig. 8). PASC confidence intervals were larger due to this cohort’s much smaller size, a trend also seen across relative group sizes.

T-23 stands out as a topic with strong migration among PASC patients, with all subgroups having significant estimated ORs of 5–10. High-weight, high-relevance conditions in T-23 include Fatigue, Malaise, Loss of sense of smell, and other well-known PASC symptoms, as well as the diagnosis code for PASC itself (Post-acute COVID-19).
By contrast, COVID patients do not show statistically significant migration to this topic, with the exception of adults, with a small OR of 1.2.

T-19 shows significant OR estimates for several PASC and COVID groups with similar magnitudes. This topic includes several variants of pneumonia and acute respiratory infection symptoms (Disorder of respiratory system, Dyspnea, Hypoxemia, Cough), suggesting significant long-term COVID-19 or secondary infections at least 45 days post-primary-infection. For both PASC and COVID cohorts, these increases are most associated with early-wave infections.

Topics 86 and 137 show increases for several PASC groups, especially pediatric and adolescent patients. While T-86 is characterized by Pleural and Pericardial effusion and related pain, T-137 describes skin conditions, particularly hair loss, including Non-scarring alopecia and Telogen effluvium, both identified individually above. While effusion is a known factor for severe COVID-19 pneumonia, especially in older patients [27], these results highlight similar outcomes in young patients. A systematic review of alopecia in COVID-19 patients by Nguyen and Tosti found that Anagen effluvium was associated with younger patients compared to other types of alopecia, but few of the reviewed studies included young patients [28].

Figure 4 displays additional results for selected topics with cohort or demographic-specific patterns. T-8 represents cardiovascular conditions, and shows a mild but significant increase for adult COVID patients compared to controls. T-43 (not shown) is also significant for PASC adult patients, and encompasses pulmonary conditions. Several of the top-weighted conditions within these topics were individually significant, such as Palpitations, Cardiac arrhythmia, Chronic obstructive lung disease, and Pulmonary emphysema for both cohorts, and, for PASC, Dizziness and giddiness and Tachycardia.
While all of these were individually increased in the PASC cohort, Cardiac arrhythmia, Chronic obstructive lung disease, and Pulmonary emphysema were decreased in the COVID cohort relative to controls.

T-72 is increased for both COVID and PASC pediatric patients compared to Controls, though this is only statistically significant for the larger COVID cohort. It covers a range of non-specific PASC-like conditions, including Illness, Neurosis (also discussed above), Ill-defined disease, Mental health problem, and Disease type and/or category unknown. Brain fog and Neurocirculatory asthenia are additionally found in this topic.

T-77 is increased in female PASC patients compared to controls. This topic is diffuse and has no particularly highly weighted conditions, although many had high relevance scores to the topic. Several of these are laboratory-based, such as Hypokalemia, Anemia, and Hyponatremia. Tachycardia, Pleural effusion, Deficiency of macronutrients, and Adult failure to thrive syndrome are also present. The low specificity and coherence of T-77 make it difficult to interpret, although many of these conditions were individually significant above. T-20 (not shown) was increased for COVID adults and COVID delta-wave patients and also has few high-weight terms, but relevant conditions include Acute renal failure syndrome, Sepsis, and Acidosis.

T-36 is strongly decreased for both pediatric and senior PASC patients, and covers only a few conditions with high weights and relevance scores, including Acquired hypothyroidism and Autoimmune thyroiditis. This result is paradoxical, as these conditions are common long-term outcomes of COVID-19 infection [29]. Another paradoxical result is a strong (OR 11.7) increase in T-92 for adolescent PASC patients, which covers a variety of physical contusions, lacerations, and abrasions.
The highest-weighted condition in this topic, however, is Traumatic and/or non-traumatic injury, all instances of which were originally coded as ICD-10 T14.8 (Other injury of unspecified body region) for these patients.

Adolescent PASC patients are increased in four topics: T-23, T-86, and T-137 already discussed, and T-174, which highly weights Thyrotoxicosis, C-reactive protein abnormal, and Polymyalgia rheumatica. PASC pediatric patients increase significantly in T-23 and T-137 already discussed, as well as T-57, covering a variety of pulmonary issues such as Chronic cough, Bronchiectasis, and Hemoptysis. On the other hand, PASC adolescent patients were reduced in seven topics and PASC pediatric patients showed a reduction in sixteen, covering a broad range of conditions. These assessment cohorts are small, with 49 pediatric and 66 adolescent patients. Chart reviews revealed that they were distributed across 18 and 20 sites, respectively, and had a similar mean number of conditions recorded in the year prior to infection as other cohorts in the same life stages. However, mean condition counts for these PASC patients were nearly 50% higher in the 6-month post-infection phase (Supplementary Table 5).

These models included covariates to account for site-level differences in topic usage, percentage of PASC patients, and source common data model. To assess the importance of these, we also ran models without them for the subset of topics shown in Figs. 3 and 4. Results are highly similar (Supplementary Fig. 9), with models without site-level covariates showing slightly higher (<6%) odds ratios for topics 23, 36, and 72.

Fig. 3: Topics with significant OR estimates > 2 for at least two demographic groups. The top row illustrates topics using the same color and size scales as Fig. 1; OR estimates are shown for demographic-specific contrasts of PASC or COVID pre-vs-post odds ratios compared to similar Control odds ratios.
For example, adult PASC patients increase odds of generating conditions from T-23 post-infection nearly 10 times more than Controls do over a similar timeframe (see “Results”). Lines show 95% confidence intervals for estimates; semi-transparent estimates are shown for context but were not significant after multiple-test correction.

Fig. 4: Other select topics with demographic or cohort-specific trends. T-8 is statistically significant only for COVID adults compared to controls. Topics 72 and 77 include diffuse sets of conditions, while T-36 is reduced for PASC pediatric and senior patients, despite representing known PASC outcomes (see “Discussion”).
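
For reference, the Jensen-Shannon distance used above to gauge topic overlap can be computed as in the following sketch. The two toy topics are hypothetical distributions over a shared vocabulary of conditions, not values from the fitted model, and the use of natural logarithms is an assumption. The upper end of the reported range (0.83) is consistent with that assumption, since the distance between two topics with no shared conditions is then the square root of ln 2, roughly 0.833, but the text does not state the convention.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) using natural logarithms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    mixture = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Hypothetical topics over a four-condition vocabulary (illustration only).
topic_a = [0.7, 0.3, 0.0, 0.0]
topic_b = [0.0, 0.0, 0.4, 0.6]
topic_c = [0.6, 0.3, 0.1, 0.0]

print(js_distance(topic_a, topic_b))  # no shared conditions: maximal distance
print(js_distance(topic_a, topic_c))  # overlapping conditions: smaller distance
```

Under this convention a median of 0.82 sits very close to the maximum, which is what "little overlap" means in practice: most pairs of topics place nearly all of their probability on different conditions.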
