“Ask” or “Inquire”: operationalizing speech formality in psychosis and its risk states using etymology

Study design and setting

This analysis draws from two studies of language production spanning the schizophrenia spectrum. Participants were individuals with ROP (symptom onset within 5 years), youth at CHR, and HC individuals with similar demographics. Data in the single-site study (R01MH107558) were collected between 2016 and 2023 in New York, USA. In the multisite study (R01MH115332), data were collected between 2018 and 2022 at clinical research programs in New York, USA; Melbourne, Australia; and Toronto, Canada. In both studies, open-ended interviews were collected for computational analysis. All interviews followed the same protocol, based on qualitative research methods described previously, and each lasted approximately 30 min [11,12,13]. Interviewers first asked participants “How have things been going for you lately?” and then asked open-ended follow-up questions to encourage further discussion. Studies were approved by the Institutional Review Boards of the Icahn School of Medicine at Mount Sinai and the New York State Psychiatric Institute at Columbia University, as well as at Orygen, The National Centre of Excellence in Youth Mental Health at the University of Melbourne, and the Centre for Addiction and Mental Health in Toronto. The studies are now also approved under the Clinical and Translational Sciences (CaTS) BioBank by the Research Ethics Board of the Centre intégré universitaire de santé et de services sociaux (CIUSSS) de l’Ouest-de-l’Île-de-Montréal – Mental Health and Neuroscience subcommittee. All participants (and their parents or guardians if minors) provided written informed consent.

Participants

Across the studies, language samples were collected from 92 individuals with ROP, 144 individuals at CHR, and 173 HC. Exclusion criteria included risk of harm to self or others incompatible with research participation, medical or neurological disorders that might affect language, IQ under 70, and for HC individuals only, a DSM Axis I diagnosis.
All sites used the Structured Clinical Interview for DSM disorders [14] to determine diagnoses. CHR status was determined using the Structured Interview for Psychosis-Risk Syndromes (SIPS) [15] in North America or the Comprehensive Assessment of At-Risk Mental States (CAARMS) [16] in Australia. We chose to study individuals with ROP to minimize confounding from chronicity and antipsychotic exposure.

Assessments

Self-reported age, sex, and racial identity were collected. Medication use, including antipsychotic use, was recorded as yes/no. Symptom severity was assessed in individuals with ROP using the Positive and Negative Syndrome Scale (PANSS) [17], and in those at CHR using the SIPS in New York/Toronto and the CAARMS in Melbourne. HC participants in R01MH107558, in New York, were administered the SIPS. HC participants in R01MH115332 were administered the CAARMS in Melbourne, the SIPS in Toronto, and the PANSS in New York. Functioning was assessed with the Global Functioning: Role and Social scales (GF-R and GF-S), which were developed for use in individuals at CHR [18]. In R01MH115332, IQ was estimated using the Wechsler Abbreviated Scale of Intelligence (WASI) vocabulary and matrix reasoning sub-sections [19].

Analysis of etymology content

Recorded interviews were transcribed by the HIPAA-compliant transcription service TranscribeMe! (www.transcribeme.com; Fig. 1). Transcripts were converted to lower case, then lemmatized using the NLP package Stanza [20]. Lemmatization converts the different inflections of a word to its base form, or lemma (e.g., “is” and “am” both become “be”). Special characters and punctuation were removed. Since the etymology of many words is contested, we determined etymologies using two resources: Etymonline.com [21], which is based on a curated set of etymological dictionaries; and a database derived from Wiktionary.com [22], which is open-source and has many contributors.

Fig. 1: Etymology analysis pipeline. The pipeline of language data collection, preprocessing, and analysis for etymology content is outlined here. Participants engaged in a recorded Zoom interview, which was then transcribed by the HIPAA-compliant transcription service TranscribeMe!. From there, participant speech was isolated, converted to lower case, and lemmatized using the Stanza NLP package for Python. To calculate etymology proportions, word lemmas were searched on Etymonline.com, and the names of origin languages were pulled from the etymology description on the returned webpage. Separately, word lemmas were searched in a database derived from Wiktionary.com, and word origins with the relation types “inherited from”, “derived from”, etc. were retrieved.

We determined for each lemma whether its etymology contained a Germanic or Old French language origin. Because determiners (e.g., “which”, “that”) and other common structural parts of speech (“the”, “she”, “it”) are predominantly Germanic in origin [23], analysis of Germanic vs. Old French content of the speech was restricted to nouns, adjectives, verbs, and adverbs, where there are more Germanic and Old French synonyms (e.g., “ask” vs. “inquire”). The quantity of Germanic and Old French word use was calculated as the proportion of these parts of speech that have Germanic or Old French origin in their etymology. Some words had both Germanic and Old French origin, e.g., due to prefixes or suffixes (“talkative”), or neither Germanic nor Old French origin (“karaoke”). Only words of exclusively Germanic or exclusively Old French origin were included in these calculations.

Calculation of lexical diversity

Lexical diversity was determined for each preprocessed and lemmatized language sample using Honoré’s Statistic [8]. This statistic captures variance in vocabulary, normalized by transcript length.
The formula for Honoré’s Statistic is:

$$R=\frac{100\log \left(N\right)}{1-{V}_{1}/V}$$

where N is the total text length, V1 is the number of words that appear exactly once, and V is the number of unique words. Larger values of R indicate greater lexical diversity.

In keeping with past reports on this calculation [24,25,26], Honoré’s Statistic was calculated from the whole text of each transcript and, unlike the etymology proportion and perplexity calculations, was not filtered by part of speech.

Calculation of Perplexity (Rarity)

Within each transcript, all word lemmas assessed for etymology were also assessed for their rarity as a measure of first-order perplexity. First-order perplexity assumes that the probability of a string of words is simply the product of each individual word’s unigram probability in a language. This model was chosen to match the context-independent nature of the etymology content analysis. To quantify perplexity, we identified each lemma’s frequency in the Google N-grams database, which contains over one trillion words of text derived from publicly accessible webpages [27]. We derived a perplexity score for each transcript according to the following formula:

$$\log \left(\Pi \right)=\frac{1}{n}\sum _{i=1}^{n}\log \left({w}_{i}\right)$$

where Π is the perplexity of the entire transcript, n is the number of lemmas (counting only nouns, verbs, adjectives, and adverbs), and wi is the probability of the i-th lemma based on its frequency in the Google N-grams database [28].

Statistical analyses

Assessing and adjusting for potential covariates

We tested sequentially for associations of lexical variables with sex, age, recruitment site, education, racial identity, socioeconomic status, and IQ (in the subset where available). Where we found associations with etymology content, we adjusted all lexical variables for association with that covariate. Socioeconomic status was approximated using maternal education, a common proxy [10].
For categorical variables (sex, recruitment site, racial identity), we tested for associations using t tests or ANOVAs and adjusted lexical variables by the difference of HC median scores in each category. Because the cohorts of individuals identifying as Black and “Other/more than one race” were small and did not differ from one another on any lexical variable, these cohorts were combined for adjustment. For continuous variables (age, education, socioeconomic status, IQ), we tested for associations using Spearman correlations and adjusted scores using residuals from linear regression models trained on HC data.

Antipsychotic use (dichotomized as yes/no) was assessed as a covariate in the CHR and ROP cohorts separately. No HC used antipsychotic medications.

Group differences

After adjustments for covariates, ANOVA was used to test for significant differences between HC, CHR, and ROP cohorts in etymology proportion, as well as for lexical diversity (Honoré’s Statistic) and perplexity. To identify cohorts driving significant group differences, we calculated pairwise independent t tests. Tests were repeated within recruitment sites for significant whole-dataset group differences. We hypothesized differences between HC and ROP across linguistic variables.

Correlates and predictors of etymology

We calculated a Pearson correlation matrix between etymology proportions, lexical diversity, and perplexity to determine the proportion of variance in etymology content explained by perplexity and lexical diversity.

To determine the relative contributions of diagnosis, lexical diversity, perplexity, age, sex, recruitment site, education, and racial identity to etymology content, we generated two multiple linear regression models, one with Germanic origin word frequency as the dependent variable, the other with Old French, and compared standardized regression coefficients.
These variables were chosen because significant correlations or group differences were found in the analysis of covariates; IQ was excluded because it was available only in a subset. These analyses were performed using Jamovi [29].

Clinical relevance of etymology patterns

To determine whether lexical features were related to symptom severity in the clinical cohorts, we calculated Spearman’s correlations between etymology proportions and total positive/total negative symptom scores from the SIPS (New York/Toronto) and CAARMS (Melbourne) for the CHR cohort and the PANSS for the ROP cohort. We determined associations of etymology proportions with concurrent functioning by calculating Spearman’s correlations between etymology proportions and GF-R and GF-S scores in the combined CHR and ROP cohort. When correlations were significant, we also tested for associations with lexical diversity and perplexity. We additionally re-tested significant correlations within recruitment sites.

A Bonferroni-corrected significance level of alpha < 0.0025 (0.05 divided by 20) was chosen based on the twenty tests, which comprise the multiple linear regression models and the correlations with clinical measures.
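
As an illustration of the two transcript-level formulas given above (Honoré’s Statistic and the first-order perplexity score), the following is a minimal Python sketch and not the authors’ code. Several details are assumptions: the function names and the `unigram_prob` lookup table are hypothetical, the natural logarithm is used because the text does not state a base, lemmas missing from the frequency table are skipped, and a transcript in which every word occurs exactly once returns NaN because the denominator of R is zero.

```python
import math
from collections import Counter


def honore_statistic(lemmas):
    """Honoré's R = 100 * log(N) / (1 - V1 / V) for one transcript.

    `lemmas` is the full list of lemmatized tokens (no part-of-speech filter).
    Natural log is an assumption; the source does not state the base.
    """
    counts = Counter(lemmas)
    n_tokens = len(lemmas)                              # N: total text length
    n_types = len(counts)                               # V: unique words
    n_once = sum(1 for c in counts.values() if c == 1)  # V1: words seen once
    if n_types == 0 or n_once == n_types:
        # Denominator is zero (or the text is empty): R is undefined.
        return float("nan")
    return 100.0 * math.log(n_tokens) / (1.0 - n_once / n_types)


def log_perplexity(content_lemmas, unigram_prob):
    """log(Pi) = (1/n) * sum_i log(w_i), exactly as written in the formula.

    `content_lemmas` holds only nouns, verbs, adjectives, and adverbs.
    `unigram_prob` is a hypothetical dict mapping lemma -> corpus probability.
    Lemmas absent from the table are skipped (an assumption).
    """
    logs = [math.log(unigram_prob[w]) for w in content_lemmas if w in unigram_prob]
    if not logs:
        return float("nan")
    return sum(logs) / len(logs)


if __name__ == "__main__":
    # Toy data for illustration only; these probabilities are made up.
    transcript = ["i", "ask", "she", "inquire", "i", "ask"]
    content = ["ask", "inquire", "ask"]
    toy_probs = {"ask": 1e-4, "inquire": 1e-6}
    print(honore_statistic(transcript))
    print(log_perplexity(content, toy_probs))
```

Because the score is a mean of log probabilities, it does not grow with transcript length, which keeps it comparable across interviews of different durations.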
