Genetic and multi-omic resources for Alzheimer disease and related dementia from the Knight Alzheimer Disease Research Center

Demographics of Knight-ADRC participants with -omic data
The Institutional Review Board (IRB) of Washington University School of Medicine in St. Louis approved the study (IRB number 201109148), and research was performed in accordance with the approved protocols. Participants are asked to sign the informed consent by the staff at the Knight-ADRC. To allow for family member recruitment, interested participants are consented in person during community events, while others are consented over the phone, followed by the shipment of two paper copies of the consent. One copy is signed and sent back to the Knight-ADRC, and the other is retained by the participant. Individuals with memory issues must have a Legally Authorized Representative present at the time of enrollment. When no legal guardian or attorney-in-fact is present, a spouse, adult child, parent, sibling, or relative may sign.
Knight-ADRC participants must be at least 40 years old and have no memory problems or, at most, mild dementia at the time of enrollment. They are recruited through the Memory and Aging Project and Memory Diagnostic Clinic participant pool. Flyers, word of mouth, and referrals are the main recruitment tools. To characterize participants, the Knight-ADRC uses a combination of clinical, psychometric, biochemical, and imaging information. Clinical diagnosis is performed by neurologists at the clinical core, who use the Clinical Dementia Rating® (CDR®) to determine memory impairment. A CDR® score of 0 represents no cognitive impairment, 0.5 corresponds to very mild dementia, and 1 to mild dementia. This classification is re-evaluated for all participants at every assessment. At death, autopsy provides a neuropathological diagnosis, although brain donation is not a requirement for enrollment. 
To date, 1,182 participants have donated their brains, which are stored at the Knight-ADRC Neuropathological core.
Out of the 6,625 Knight-ADRC participants (Table 1) recruited during the last 30 years, 2,746 are classified at the last assessment as AD (29 of them being Autosomal Dominant AD), 2,369 are considered cognitively unimpaired, and 1,510 are considered to suffer from other dementias (1,285 AD Related Dementias (ADRD), 153 FTD, and 72 DLB; Table 1). Overall, most of the participants self-report as European Americans or Non-Hispanic Whites (NHW, 83.6%) and African American (AA, 12.3%); the remaining 4.1% self-report as Hispanic White (HW), Asian, American Indian, Alaskan Native, Native Hawaiian, and Pacific Islander (Table 1). The cohort is almost even regarding sex, with 57.3% identifying as females (Table 1). However, females have a higher representation among AA participants (73%) than among NHW participants (56%). As reported in previous studies, the APOE genotype distribution differs between ethnic groups (Table 2).
Table 1 Basic demographics of Knight-ADRC participants by disease status at last assessment.
Table 2 APOE Genotype differences across self-reported ethnicity from the participants of the Knight-ADRC.
Resources of Knight-ADRC participants
The GHTO Core leverages the resources from the Knight-ADRC to further understand the pathobiology of AD. It aims to identify novel risk and protective molecular factors along with biomarkers for AD. To do so, the GHTO relies on access to high-quality biological samples and accurate and detailed phenotypes. The GHTO stores nucleic acids (DNA and RNA) extracted mainly from blood, but also from brain, CSF, and plasma. To date, the GHTO has banked more than 9,633 blood DNA samples from 4,787 unique participants (Table 3). DNA and RNA have been extracted from 1,178 brain samples (817 unique participants; Table 4), and RNA from 3,022 blood samples from 1,828 unique participants (Table 3). 
Recently, the GHTO core initiated the isolation and storage of PBMCs (675 samples from 609 participants) as well as the storage of CSF cell pellets (252 samples from 250 participants). The CSF and brain samples are not stored long term at the GHTO; they are obtained via collaboration with the Liquid Biomarker and the Neuropathological Knight-ADRC cores, respectively. Finally, the GHTO also collects and banks non-fasted plasma (8,299 samples from 4,088 participants; Table 3; Fig. 1a). There are 1,701 participants who have donated plasma more than once, some of them with up to nine longitudinal plasma samples spanning 12 years (Fig. 1a,b). Similarly, DNA and RNA have been extracted from individuals at different time points. DNA is available for a total of 5,269 visits, with up to 14 visits for some participants, spanning more than 10 years (Fig. 1a,b). Similarly, RNA is available for 1,831 blood samples, with up to 5 visits for some participants (Fig. 1a,b).
Table 3 Samples currently stored at the GHTO.
Table 4 Brain samples currently at the GHTO distributed by brain region.
All these resources are being leveraged to generate state-of-the-art high-throughput data. As of October 2023, -omic data from more than 26,000 biological samples (blood, several brain regions, plasma, and CSF) has been generated from 5,283 Knight-ADRC participants (Table 5). Genomic data has been generated for the 5,283 Knight-ADRC participants: 4,843 with genome-wide array data (GWAs), 1,069 with whole exome sequencing (WES), and 2,074 with whole genome sequencing (WGS). Transcriptomic data (bulk RNA-Seq, ribodepletion) is available for 2,334 whole blood samples from 1,485 unique participants, and 630 brain samples from 490 unique participants. Additionally, small RNA sequencing is available for 330 parietal brain samples and 164 plasma samples (Fig. 2a,b). Finally, plasma cell-free RNA (cfRNA) sequence data is also available for 293 participants. 
High throughput proteomic data is available for 412 brain samples (Fig. 2a), 1,159 CSF samples (1,064 unique participants; Fig. 2c,d), and 4,084 plasma samples (3,137 unique participants; Fig. 2e). Metabolomic and lipidomic data has been generated from 455 brain, 948 CSF, and 3,169 plasma samples (Fig. 2f). Additionally, methylation data from 464 brain samples (from 444 unique individuals) and 20 blood samples is also available. As shown in Fig. 2, there is a high level of overlap, with many Knight-ADRC participants having multiple data layers available. In summary, blood transcriptomic, proteomic, metabolomic, and lipidomic data is available for 1,106 participants, and brain transcriptomic, proteomic, methylomic, and metabolomic data is available for 404 participants.
Table 5 Knight-ADRC participants with high-throughput data available distributed by data type and tissue of origin.
Fig. 2 UpSet plots summarizing the -omic data layers generated by GHTO investigators and available to the scientific community. This representation does not contain longitudinal samples or brain regions; all graphs represent unique individuals with at least one data point. (a) Summary of all the high-throughput data available for brain samples; (b) Distribution of the transcriptomic data by RNA tissue of origin; (c) Fasted CSF proteomic and metabolomic data available; (d) High-throughput protein measurements performed across tissues (brain, CSF, and plasma); (e) High-throughput data availability from whole blood (bulk transcriptomics) and plasma; and (f) Metabolomic data generated in brain, CSF, and plasma samples. Transcriptomic data is depicted in dark pink, proteomic data in yellow, metabolomic data in blue, and genomic data in green.
Genomic data
Overview of genomic data: Array, WES and WGS data
Over the years, three types of genomic data have been generated for the Knight-ADRC samples: array-based genotyping, Whole Exome Sequence (WES), and Whole Genome Sequence (WGS). 
There are 4,843 participants with array-based genotyping data, 1,069 with WES completed, and 2,074 with WGS (Table 5). Array-based genotyping and either WES or WGS is available for 2,421 participants (Fig. 3a).
Fig. 3 Genomic data available at the GHTO. (a) Distribution of genomic data from Knight-ADRC individuals across technologies; (b) Ancestry distribution of Knight-ADRC participants with genomic data available, calculated using the first two principal components; (c) Distribution of polygenic risk scores for AD, PD, and FTD across Knight-ADRC participants by diagnosis at last assessment. WES = Whole Exome Sequencing; WGS = Whole Genome Sequencing; GWAS = Genome Wide Association Study; AD = Alzheimer Disease; ADAD = Autosomal Dominant Alzheimer Disease; ADRD = Alzheimer Disease Related Dementia; DLB = Dementia with Lewy Bodies; FTD = FrontoTemporal Dementia.
Array-based genotyping data has been generated since 2008 using nine chips (Illumina Human660W-Quad, Infinium OmniExpressExome-8, Illumina Omni1-Quad, Illumina Human1M-Duo, Infinium Neuro Consortium Array, Infinium CoreExome-24, Infinium Global Screening Array-24, Human610-Quad, and Affy UK Biobank Axiom). Similarly, whole exome sequencing (WES) data generation started in 2010 using Nimblegen VCRome v2.1 with IDT spike-in, then Agilent human Exon V5 (50 Mb), VCR_crome, and later on the Sure Select human all exon 50 Mb kit. Finally, Whole Genome Sequence (WGS) data has recently been generated using Kapa Hyper PCR-free libraries on the Illumina NovaSeq 6000, while in the past the Illumina HiSeq 4000 or HiSeqX was used. To integrate and analyze data from all these platforms and data generation approaches while ensuring homogeneity, the GHTO scientists have established stringent pipelines that generate the genotyping clusters and impute (for array-based genotyping) or align and call variants (for WES and WGS), then perform quality control, and finally integrate all genomic data. 
Please see GitHub (https://github.com/NeuroGenomicsAndInformatics) and the NGI website (https://neurogenomics.wustl.edu/open-science) for further details.
Array-based genotyping data
Sample preparation
The Hope Center DNA/RNA Purification Core at Washington University in Saint Louis (associated with the GHTO) uses Autogen FlexSTAR+ salt precipitation to isolate pure DNA. The FlexSTAR is fully automated to perform high-quality DNA isolation from large volumes (5–10 ml) of whole blood and buffy coat samples. To isolate DNA from small-volume samples (<1 ml), which can originate from a range of sources (blood, buffy coat, saliva, blood cards, tissue, buccal swab, plasma, and CSF, among others), the Core uses a bead-based automated purification system, the Maxwell 48 automated workstation. After extraction, DNA quality and quantity are assessed with the TapeStation 4200, a high-throughput automated electrophoresis platform that processes up to 96 samples in one run. All samples are then normalized to 100 ng/μL and stored in 2D barcoded tubes at −80 °C.

Data processing and quality control
Array-based genotyping data generation, quality control (QC), and imputation are handled separately for each genotyping round. Once the data reaches the GHTO standard of quality, the genotyping rounds are merged. Briefly, after genotyping, all single nucleotide polymorphisms (SNPs) are called using Genome Studio. A two-step QC pipeline is implemented to ensure maximum retention of individuals and variants. First, the GHTO applies a loose QC in which variants and then individuals with a call rate below 80% are removed, in that order. Then, a more stringent second pass applies a 98% threshold to both parameters prior to data export into PLINK format. The PLINK files are used as input for the TOPMed Imputation Server pipeline using the reference genome GRCh38. After imputation, any variant with an imputation quality of Rsq > 0.30 is retained. For the remaining variants, standard QC is applied: retention of variants and individuals with a 98% call rate, removal of SNPs not in Hardy-Weinberg equilibrium (HWE), a concordance check between reported sex and genetic sex, and a concordance check against expected identity-by-descent (IBD) estimates for technical replicates and family members (if present).
Once the different data generation rounds are merged, a final round of stringent QC is applied. Briefly, variants and individuals are filtered by 98% call rate, and autosomal SNPs not in HWE (P < × 10−6) are filtered out. The concordance between reported and genetic sex is assessed a second time, and samples with any additional discordances are removed. Finally, IBD estimates are used to confirm expected duplicated samples and familial relatedness and to remove any potential sample swaps. All the quality control procedures are performed using Plink1.9 or Plink2.021. The full QC report can be found in Appendix 1.
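The two-pass call-rate filter described in this section can be illustrated with a minimal sketch. This is not the GHTO pipeline, which relies on PLINK; it is a plain-Python illustration that assumes a toy data layout (a dictionary of per-individual genotype lists, with None marking a missing call), and the function names are hypothetical. It shows why a loose 80% pass before the 98% pass can retain more variants than applying the stringent threshold directly.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls (missing calls are None)."""
    return sum(c is not None for c in calls) / len(calls) if calls else 0.0


def filter_by_call_rate(genotypes, threshold):
    """Remove variants, then individuals, with call rate below `threshold`.

    genotypes: dict mapping individual ID -> list of calls (one per variant).
    Returns (filtered genotypes, indices of the variants that were kept).
    """
    n_variants = len(next(iter(genotypes.values()))) if genotypes else 0
    # Variant-level filter, computed across all individuals.
    kept = [
        j for j in range(n_variants)
        if call_rate([g[j] for g in genotypes.values()]) >= threshold
    ]
    # Individual-level filter, computed on the surviving variants only.
    filtered = {}
    for individual, g in genotypes.items():
        row = [g[j] for j in kept]
        if call_rate(row) >= threshold:
            filtered[individual] = row
    return filtered, kept


def two_pass_qc(genotypes, loose=0.80, strict=0.98):
    """Loose pass (80%) followed by a stringent pass (98%)."""
    first, kept_loose = filter_by_call_rate(genotypes, loose)
    second, kept_strict = filter_by_call_rate(first, strict)
    # Map the second-pass indices back to the original variant indices.
    return second, [kept_loose[j] for j in kept_strict]
```

In this sketch, a poorly genotyped individual is dropped in the loose pass, so the variants that were missing only in that individual can still reach the 98% threshold in the second pass.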

Data availability
The Knight-ADRC GHTO has fully imputed and quality-controlled array-based genotyping data available for 4,843 participants, including 2,085 sporadic AD cases and 1,999 controls (Table 5). This data is available through the NIAGADS Knight-ADRC collection, specifically the NG00127 dataset. By genetic ancestry, as determined by genetic principal components, 4,173 are NHW, 626 AA, 33 Admixed American, and 11 of Asian genetic background (Fig. 3b). Approximately 30,000,000 high-quality variants passed QC and are present in the fully imputed file. At a 98% call rate, 7,202,789 common variants (minor allele frequency (MAF) greater than 5%) and 8,448,089 rare variants (MAF < 1%) are currently available.
Genomic data (WES & WGS)
Sample preparation
DNA isolation protocols are the same regardless of the genomic data generated downstream. The protocol is described in the Array-based genotyping data section.

Data processing and quality control
All original files, regardless of whether they originate from WES or WGS and regardless of their format (fastq, cram, bam, or sra), are aligned individually to the reference genome GRCh38 following GATK’s Best Practices Workflow standards22 with NVIDIA’s Clara Parabricks optimizations for High Performance Computing23. The most current GHTO pipeline can be found on GitHub (NeuroGenomicsAndInformatics/WXS-Pipelines (github.com)). Briefly, three different implementations have been generated to accept the different types of incoming files (fastq, aligned BAM, or CRAM). The pipeline includes an optimized set of commands to go from sequencing reads to gVCF files. For each input file, the pipeline generates an aligned file, a sorted file, a duplicate-marked cram file, and a gVCF file. Additionally, it provides a set of QC metrics gathered with GATK’s CollectWgsMetrics, VariantCallingMetrics, and VerifyBamID2 to perform a first QC step before joint calling. Samples with a freemix value greater than 0.05 or coverage less than 30x (for WES) or 20x (for WGS) are marked as failed.
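As an illustration of the pre-joint-calling sample check described above, the following minimal sketch flags a sample as passed or failed from its freemix and coverage values. The thresholds are the ones stated in the text; the function name, the argument names, and the use of a single mean-coverage value per sample are assumptions made for this example and are not taken from the GHTO pipeline.

```python
# Thresholds as stated in the text; everything else is illustrative.
MAX_FREEMIX = 0.05
MIN_COVERAGE = {"WES": 30.0, "WGS": 20.0}


def sample_passes_qc(assay, freemix, coverage):
    """Return True if a sample passes the first QC step.

    assay: "WES" or "WGS".
    freemix: contamination estimate (e.g. as reported by VerifyBamID2).
    coverage: coverage in x (assumed here to be a per-sample mean).
    """
    if assay not in MIN_COVERAGE:
        raise ValueError(f"unknown assay: {assay!r}")
    # A sample fails if freemix > 0.05 or coverage is below the assay minimum.
    return freemix <= MAX_FREEMIX and coverage >= MIN_COVERAGE[assay]
```

For example, a sample with 25x coverage and a freemix of 0.01 would pass as WGS but fail as WES under these thresholds.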
Joint Calling is performed by chromosome using GenomicsDBImport for WES and GenotypeGVCFs for WGS. A second round of QC performed at chromosome level includes GATK VQSR, removal of low complexity regions, removal of variants with excessive depth (for WGS only), removal of monomorphic variants, allele balance heterozygosity ratio filter, and hard filtering for SNPs and INDELs separately. After QC, all chromosomes are merged together for a final QC on SNPs and samples using PLINK1.9 following the same criteria described for the array-based genotyping data.

Data availability
WES is currently available for 1,069 participants, 578 of whom were classified as sporadic AD cases and 382 as controls at the last clinical assessment (Table 5). Most of them (1,021) self-report as NHW. There are 1,366,915 variants available with a 98% call rate, 1,255,468 variants with MAF below 5%, and 1,186,359 variants with MAF below 1%. WGS data is available for 2,074 participants, distributed as 928 sporadic AD cases and 907 controls (Table 5). Similar to WES and the overall distribution of the Knight-ADRC participants, most of them identify as NHW (1,807), followed by AA (240). In terms of variants, 85,891,726 high-quality variants with a call rate greater than 98%, 79,568,847 variants with MAF below 5%, and 75,450,819 with MAF below 1% are available.
In-house mutation screening
Autosomal Dominant AD (ADAD) is a rare form of AD characterized by the presence of specific missense mutations in the genes APP, PSEN1, or PSEN2 with high penetrance and autosomal dominant inheritance8. At the Washington University Knight-ADRC, there is one family with a mutation in APP, eleven families with mutations in PSEN1, and one family with a mutation in PSEN2. However, none of these families fully qualify for DIAN given their later age at onset and incomplete penetrance. In fact, Knight-ADRC investigators have already reported that mutations in these three genes can also be found in individuals with later-onset AD and without perfect penetrance18. Other genetic variants also modify AD risk. In addition to the APOE ε4 allele, which is the major genetic risk factor24, TREM2 variants are strongly associated with risk for developing AD25. There are 17 families with mutations in TREM2 followed at the Knight-ADRC. Additionally, specific mutations in MAPT and GRN, and a repeat element in C9orf72, are known to contribute to FTD26. There are three families with mutations in MAPT, five families with mutations in GRN, and three families with the C9orf72 repeat element. Consequently, all samples from Knight-ADRC participants undergo variant screening in the GHTO, and this data is included in the GHTO database.
Polygenic Risk Scores Calculation and availability
Polygenic risk scores (PRSs) estimate the background genetic risk of a given individual for a given trait, provided that a genome-wide association study (GWAS) of that trait is available. At the GHTO, PRSs are calculated as part of the established pipeline prior to data release. The latest available GWAS summary statistics for several traits of interest are used to generate risk scores for all individuals with genotype data using PRSice v2.327. Briefly, PRSs are computed as the sum of risk alleles weighted by the effect size estimates from the GWAS. 
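The weighted sum behind a PRS can be sketched in a few lines. This is only an illustration of the formula stated above and not a reimplementation of PRSice, which additionally handles steps such as clumping and P-value thresholding; the function name and the variant identifiers used in the test data are hypothetical.

```python
def polygenic_risk_score(risk_allele_counts, effect_sizes):
    """Sum of risk-allele counts weighted by GWAS effect sizes.

    risk_allele_counts: dict mapping variant ID -> number of risk alleles
        carried by one individual (0, 1, or 2, or an imputed dosage).
    effect_sizes: dict mapping variant ID -> effect size estimate (beta)
        from the GWAS summary statistics.
    Only variants present in both inputs contribute to the score.
    """
    shared = risk_allele_counts.keys() & effect_sizes.keys()
    return sum(risk_allele_counts[v] * effect_sizes[v] for v in shared)
```

Restricting the sum to shared variants is a design choice for this sketch: variants that were not genotyped or imputed in the individual, or that are absent from the summary statistics, are simply skipped.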
Despite the AD-centric nature of the Knight-ADRC samples, the GHTO calculated PRSs not only for AD, with and without the APOE locus, but also for Parkinson’s disease (PD)28 and frontotemporal dementia (FTD)29, among others (Fig. 3c and Appendix 2). All PRSs are available for all samples with array-based genotyping data and have already been successfully leveraged to investigate the shared genetic structure between the earlier and later familial forms of AD, finding a high overlap among those forms10, and the rate of progression11.
Transcriptomic data
Overview of transcriptomic data: Bulk, single nuclei, spatial transcriptomics data, and tissue of origin
As with genomic data, transcriptomic technologies have evolved and become widely available at highly competitive prices. Unlike genomic data, which is constant across tissues unless somatic mutations are considered, transcriptomic data is highly dependent on the tissue of origin. Bulk transcriptomic data was generated for a total of 630 brain samples from 490 unique participants. Parietal RNA-seq data is available for 487 individuals (total of 525 samples). The remaining 105 brain samples are distributed across frontal cortex (n = 40), temporal cortex (n = 26), and cerebellum (n = 39) and correspond to a total of 42 participants (Table 6).
Table 6 Transcriptomic data available at GHTO by brain region and library type (bulk or small).
Single nuclei transcriptomic data is also currently available for 54 parietal brain samples. Brain spatial transcriptomics is available for a small proportion of samples (eight from temporal cortex and one from parietal cortex). All the data described above covers the coding and long non-coding transcriptome; however, in the context of the Knight-ADRC, there is also interest in the small non-coding transcriptome. Thus, small transcriptome data is also available for parietal brain samples from 330 Knight-ADRC participants.
Brain is the hallmark tissue affected by AD; however, blood is being intensively investigated as a source of biomarkers. 
Recently, PAXgene RNA tubes have been collected as part of each Knight-ADRC assessment. Of those, transcriptomic data is available for 2,334 assessments, corresponding to 1,485 individual participants. Additionally, acellular transcriptomic data is available for 293 participants, and small transcriptomic data for 164 Knight-ADRC participants (Table 5).
Bulk transcriptomic data
Sample preparation
RNA is extracted from 20 mg of brain, 2.5 mL of PAXgene-preserved blood, or 500 μL of plasma. For brain, frozen tissue is homogenized using metal beads, immediately followed by cell lysis. Then, RNA is extracted in a Maxwell® RSC 48 instrument using the Maxwell® RSC simply RNA Tissue kit. For blood and plasma, there is no sample preprocessing; PAXgene-mixed blood or plasma is loaded directly into the Maxwell® RSC simply RNA Blood kit or the Maxwell® RSC miRNA from plasma or serum kit, respectively, for purification. Finally, brain and blood purified RNA are evaluated using RNA Screentapes on a 4200 TapeStation Instrument. Any RNA extraction not meeting quality standards (DV200 > 85%) is not processed for sequencing. Plasma RNA is known to be degraded, so only fluorometric quantification is performed, with a required concentration of at least 1.5 ng/μL. If additional tissue is available, a new extraction is conducted for those samples that fail QC.
Similar to the array-based genomic data, library preparation kits and sequencers have changed and been updated over time. There are three batches of brain transcriptomic data. The first one contains 132 parietal brain samples that were spiked with ERCC RNA ExFold Spike-Ins prior to ribodepletion (Ribo-Zero Gold kit) and library construction with the TruSeq Stranded Total RNA Sample Prep kit. For this batch, 100 million 150 bp paired-end reads per sample were targeted using a HiSeq 4000 (Illumina) instrument. The second batch contains 312 parietal brain samples from 288 unique participants that were ribo-depleted (kit) prior to library preparation using the Tru-Seq Stranded libraries with ERCC ExFold Mix. For this batch, 70 million 150 bp paired-end reads per sample were targeted on an Illumina NovaSeq 6000 using S4 flowcells. The most recent batch is composed of 184 brain samples from 89 participants, for which ribosomal RNA depletion was performed using FastSelect libraries. As in the prior batch, about 70 million 150 bp paired-end reads were targeted on an Illumina NovaSeq 6000.
The blood transcriptomic data corresponding to 2,334 PAXgene-preserved blood samples was generated at the same time. Both ribosomal RNA and globin were blocked using FastSelect, followed by library preparation (QIAGEN) and sequencing of an average of 60 million 150 bp paired-end reads using an Illumina NovaSeq 6000. Finally, acellular plasma RNA was ribo- and globin-depleted prior to library preparation for Illumina sequencing using 1 ng of RNA as input. Due to the low input, adaptor content was removed from the libraries prior to sequencing. Sequencing was performed in two batches: in the first (91 plasma samples), 15 million 100 bp single-end reads were targeted using an Illumina HiSeq 2500; in the second (245 plasma samples), the number of targeted reads increased to 40 million, and an Illumina NovaSeq 6000 was used to generate them30.

Data processing and quality control
To obtain the linear counts, the GHTO team follows standard pipelines. In summary, the initial quality of the data is first evaluated using fastqc; any sample with fewer than 1,000 reads or only library adaptor reads is removed at this stage. The remaining samples are aligned to the reference genome GRCh38 using STAR31, followed by generation of transcript counts using Salmon32. Alignment quality is evaluated using metrics collected with Picard: CollectRNAseqMetrics, MarkDuplicates, and CollectAlignmentSummaryMetrics (http://broadinstitute.github.io/picard). Additionally, transcript integrity numbers (TIN) are calculated with the RSeQC33 package. All QC data is aggregated with multiqc34 for visual inspection. Any sample with multiple fastqc category failures, a low percentage of mapped reads from STAR or Salmon (less than 50%), more than 20% ribosomal RNA content, or low median TIN values is considered of poor quality and thus removed from further analyses. Transcriptomic principal components are also calculated using the normalized counts from DESeq235 (using the vst function for brain and blood, and rlog for plasma). Samples outside two standard deviations from the first two principal components are considered outliers. The complete pipeline for brain and blood transcriptomics can be found here: https://github.com/NeuroGenomicsAndInformatics/RNAseq_pipeline. The pipeline for plasma RNA-seq can be found here: https://github.com/Ibanez-Lab/PlasmaCellFreeRNA-AlzhiemerDisease.
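The principal-component outlier rule described above can be sketched as follows. The text does not specify exactly how the two-standard-deviation criterion is applied, so this minimal example assumes that a sample is an outlier when it lies more than two sample standard deviations from the mean on either of the first two components; the function name and input layout are hypothetical.

```python
from statistics import mean, stdev


def pca_outliers(pc_scores, n_sd=2.0):
    """Flag samples beyond `n_sd` standard deviations on PC1 or PC2.

    pc_scores: dict mapping sample ID -> (PC1 score, PC2 score).
    Returns the set of sample IDs considered outliers.
    """
    outliers = set()
    for axis in (0, 1):  # 0 = PC1, 1 = PC2
        values = [scores[axis] for scores in pc_scores.values()]
        center, spread = mean(values), stdev(values)
        for sample, scores in pc_scores.items():
            # Assumption: deviation on either component flags the sample.
            if abs(scores[axis] - center) > n_sd * spread:
                outliers.add(sample)
    return outliers
```

The scores themselves would come from a principal component analysis of the normalized counts; computing them is outside the scope of this sketch.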
Bulk transcriptomic data (from brain and blood only) is also processed to obtain circular transcript counts. Unlike traditional linear RNAs (which have 5′ and 3′ ends), circular RNA (circRNA) has a closed-loop structure that is unaffected by RNA exonucleases. Thus, circRNA has sustained expression and is less sensitive to degradation. Previous studies from Knight-ADRC investigators identified more than 100 circRNAs differentially accumulated in AD brains compared to controls, that are independent of changes in the linear RNA forms regardless of estimated brain cell-type proportions12,20. Given the importance of circRNA, the GHTO also provides circRNA quantification for those samples with enough sequencing depth (about 40 million reads). To obtain those counts, the original fastq or bam files are aligned to the human reference genome (GRCh38) using STAR31 in chimeric alignment mode. Other parameters are set according to the instruction manual for Detection of Circular RNA from Chimeric read (DCC)36. The collection of chimeric bam files is used as input for DCC that will quantify and collapse circular transcripts by the host gene.

Data availability
Overall, linear and circular transcriptome is available in brain for 371 AD cases, 7 ADAD cases, 37 controls, 14 DLB cases, 23 FTD cases, and 38 ADRD, and in blood for 312 AD cases, 2 ADAD cases, 801 controls, 2 DLB, 9 FTD, and 359 ADRD (Table 5). Linear acellular transcriptome is available for 333 plasma samples corresponding to 293 individual participants (134 AD Cases, 110 Controls, 16 DLB, 12 FTD, and 21 ADRD cases – Table 5).
Bulk small transcriptomic data
Sample preparation
The amount of purified RNA obtained using the protocols described above is enough to generate transcriptomic and small transcriptomic data. Library preparation is performed by RealSeq Biosciences (CA) using the RealSeq®-AC sRNA kit version 2. This is a new sRNA library preparation technology that reduces sequencing bias compared to previously used methods37,38. Briefly, previous methods use two adaptors, while this new technology leverages a single adaptor and a circularization reaction to reduce the sRNA sequence bias.

Data processing and quality control
Upon receipt of the sequencing files, libraries undergo adaptor trimming using cutadapt39, followed by alignment to the available sRNA databases using bowtie240. The small transcriptome is populated by several families of sRNA, each with its own dedicated database that the GHTO team leverages to obtain the counts. MicroRNA (miRNA) counts are obtained by aligning the trimmed sequences to miRbase41. Similarly, PIWI-interacting RNA (piRNA) counts are obtained by aligning to piRBase42. GtRNAdb43 contains information about transfer RNAs (tRNAs), snoDB44 about small nucleolar RNAs (snoRNAs), and Rfam45 about vault RNAs (vtRNAs) and Y-RNAs. Finally, to obtain the counts of small nuclear RNAs (snRNAs), trimmed sequences are aligned to the human genome (GRCh38). Only perfectly aligned reads are quantified and QCed (each class of sRNAs is treated independently) using custom bash and R scripts (GitHub page under development: https://github.com/Ibanez-Lab/).

Data availability
Currently, a total of 517 small RNA (sRNA) libraries have been generated from Knight-ADRC participants (Table 5): 330 from brain (330 unique individuals; 283 AD, 4 ADAD, 34 CO, 3 DLB, 3 FTD, 3 ADRD) and 187 from plasma (164 unique individuals; 94 AD, 29 CO, 13 DLB, 11 FTD, 17 ADRD).
Single nuclei RNA-seq
Sample preparation
Single nuclei isolation was carried out in parietal brain samples collected from 54 Knight-ADRC participants using a previously reported protocol (Neurogenomics And Informatics (protocols.io)). Briefly, a total of 200 mg of brain tissue was manually homogenized using a Dounce homogenizer, and nuclei were isolated using a density gradient. Approximately 10,000 nuclei per sample and 50,000 reads per nucleus were targeted using the 10X Chromium single cell Reagent Kit v346.

Data processing and quality control
Alignment and gene expression quantification were obtained using CellRanger (10X Genomics) following the directions from the 10X Genomics pipelines. GRCh38 was used as the reference genome. Quality control was performed on each sample individually using the Seurat package. Then, raw gene expression per sample was plotted to establish the inflection points of the barcode-rank distribution, which were used to set thresholds to exclude non-uniform regions of that distribution. Nuclei with high mitochondrial content were removed47, along with transcripts not expressed in at least 10 nuclei. Doublets were removed using DoubletFinder46,47. One sample was removed due to low counts.

Data availability
Currently 53 brain samples are available corresponding to 9 healthy control brains, 31 sporadic AD, three presymptomatic AD, and eight corresponding to other dementia forms or mixed pathologies.
Spatial transcriptomics
Sample preparation
Fresh-frozen embedded brain tissue blocks from eight Knight-ADRC participants were cryo-sectioned into 8–10 µm thick slices at optimal cutting temperature. Tissue slices were fixed following the fixation protocol provided by Vizgen and shipped to Vizgen. A panel of 260 genes was selected, focusing on microglia and neurodegenerative disease-related genes and pathways. Vizgen performed spatially resolved, single-cell transcriptomic profiling measurements for the defined panel using Multiplexed Error-Robust Fluorescence in Situ Hybridization (“MERFISH”). Immunostaining for amyloid β (Purified anti-β-Amyloid, 17–24 Antibody, 4G8 clone, Biolegend), phospho-tau (Phospho-Tau (Ser202, Thr205) Monoclonal Antibody (AT8), ThermoFisher Scientific), and TDP-43 (TDP-43 Polyclonal antibody, Proteintech) was performed on the same slides to obtain pathology information. DAPI and polyT staining were performed on the same slides to provide information for better cell segmentation.

Data processing and quality control
Vizgen performed cell segmentation using the CellPose algorithm implemented in the Vizgen Post-processing Tool (VPT) to draw cell boundaries and generate single-cell transcript expression quantification matrix files. The same company performed the initial data quality evaluation. Gene expression measured by MERSCOPE was highly correlated among processed samples (r = 0.90–0.98). In addition, the MERSCOPE measurements were highly correlated with bulk brain cortex tissue RNA-seq data (r = 0.74–0.80). These results suggest that the MERFISH measurements are highly reproducible and therefore reliable for downstream data analysis. Then, the single-cell genomics tool Seurat was used to perform the initial data analysis. Cells with fewer than ten transcripts and transcripts present in fewer than five cells were removed from analyses. An average of 105,320 (73,725–136,388) cells were successfully measured, with a median of 47–194 (range 11–4344) transcripts and a median of 30–90 (range 10–235) genes detected. The immunostaining of protein TDP-43 did not produce the expected detectable signal, and the three samples were excluded from the analysis.
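The cell and transcript filters described above can be illustrated with a minimal sketch. The actual analysis was performed with Seurat; this plain-Python version only demonstrates the two thresholds stated in the text. The order of the two steps (cells first, then transcripts), the function name, and the nested-dictionary layout of the count matrix are assumptions made for this example.

```python
def filter_cells_and_genes(counts, min_transcripts=10, min_cells=5):
    """Apply the two count-based filters to a cell-by-gene count matrix.

    counts: dict mapping cell ID -> dict mapping gene -> transcript count.
    Returns a new matrix with low-count cells and rarely detected genes removed.
    """
    # Remove cells with fewer than `min_transcripts` total transcripts.
    kept_cells = {
        cell: genes for cell, genes in counts.items()
        if sum(genes.values()) >= min_transcripts
    }
    # Count the number of remaining cells in which each gene is detected.
    detected_in = {}
    for genes in kept_cells.values():
        for gene, n in genes.items():
            if n > 0:
                detected_in[gene] = detected_in.get(gene, 0) + 1
    # Remove genes detected in fewer than `min_cells` cells.
    kept_genes = {g for g, n in detected_in.items() if n >= min_cells}
    return {
        cell: {g: n for g, n in genes.items() if g in kept_genes}
        for cell, genes in kept_cells.items()
    }
```

The same pattern applies, with a different threshold, to the single nuclei data, where transcripts not expressed in at least 10 nuclei are removed.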

Data availability
Spatial transcriptomic data has been generated in a small subset of brain samples. Data is available from eight superior temporal gyrus, and one from the parietal region, corresponding to two control participants and seven AD cases (Table 5).

High throughput Proteomics, Metabolomics, and Lipidomics
Overview of proteomic, metabolomic, and lipidomic data
High throughput proteomic, metabolomic, and lipidomic datasets are becoming more common across cohorts. Several technologies are currently available; at the Knight-ADRC, the technologies of choice have been Somalogic for proteomic data and Metabolon for metabolomic and lipidomic data. Proteomic, metabolomic, and lipidomic data is available in 514 parietal brain homogenates (404 unique participants), 1,079 CSF samples (933 unique participants), and 3,574 plasma samples (3,110 unique participants – Table 5, Fig. 2a,c,e).

High throughput proteomics
Sample preparation
Samples were collected following the protocols used at the Knight-ADRC14,48. Briefly, approximately 50 mg of fresh frozen brain tissue from the parietal lobe was homogenized using metallic beads as explained above, and protein extraction was performed as reported previously14,48,49,50 prior to submission to Somalogic, Inc. CSF samples were collected the morning after an overnight fast, processed, and stored at −80 °C until use. Non-fasted plasma samples were collected at the time of the clinical visit, immediately centrifuged, and stored at −80 °C. No preprocessing was needed for CSF and plasma samples prior to shipping.

Data processing and quality control
Somalogic, Inc performed the measurements and the initial data normalization to (i) control for inter-plate variance, using the hybridization controls for intra-plate differences and median signals, and (ii) control for biological variance using an external reference. The GHTO team then performed in-depth QC, described in detail elsewhere49,51,52,53. In short, scale factors are computed for each aptamer in each plate. If the scale factor difference between any pairwise plate comparison is greater than 0.5, the aptamer measurement is considered unreliable and removed from further analysis. A similar process is then followed, computing the coefficient of variation by aptamer and plate and performing all pairwise plate comparisons. If the difference is greater than 0.15 in any comparison, the aptamer is considered low quality and removed. With the remaining aptamers, outlier measurements are identified based on inter-quartile range (IQR) thresholds. Finally, two consecutive missing-data thresholds (65% and then 85%) are applied to remove aptamers with missing data and samples with missing aptamer values (Appendix 3 and 4).
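The pairwise plate comparisons described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the GHTO pipeline: the function name and the nested-dictionary input are hypothetical, and the per-plate scale factors and coefficients of variation are assumed to have been computed beforehand.

```python
from itertools import combinations


def flag_unreliable_aptamers(per_plate_values, max_diff):
    """Return aptamers whose per-plate values differ by more than `max_diff`.

    per_plate_values: {aptamer: {plate: value}} (hypothetical layout), where
    `value` is a scale factor or a coefficient of variation for that plate.
    """
    flagged = set()
    for aptamer, by_plate in per_plate_values.items():
        # Compare every pair of plates; one failing pair is enough to flag.
        for a, b in combinations(by_plate.values(), 2):
            if abs(a - b) > max_diff:
                flagged.add(aptamer)
                break
    return flagged
```

Under these assumptions the same helper covers both checks: it would be called with `max_diff=0.5` on scale factors and with `max_diff=0.15` on coefficients of variation. The IQR-based outlier step and the missing-data thresholds are not shown.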

Data availability
Proteomic data (Table 5) for brain homogenates (n = 412 | 320 AD cases, 6 ADAD cases, 27 controls, 9 DLB, 18 FTD, and 32 ADRD) was generated using the SomaLogic SOMAscan1.3k platform, which provides 1,305 aptamers, of which 1,300 were retained after QC. For plasma (n = 3,137 | 1,266 AD cases, 10 ADAD cases, 1,432 controls, 35 DLB, 45 FTD, and 349 ADRD) and CSF (n = 1,064 | 295 AD cases, 4 ADAD cases, 642 controls, 4 DLB, 10 FTD, and 109 ADRD) samples, proteomic data was generated using the SOMAscan7k platform, which measures the abundance of 7,584 aptamers54; 6,905 aptamers were retained for plasma and 7,006 for CSF after QC. For some participants, more than one time-point measurement (plasma and CSF) or brain region is also available (not accounted for in Table 5 and Fig. 2). Olink HT1 data for 1,064 CSF samples has recently been generated and will be available soon.
High throughput metabolomic and lipidomic data
Sample preparation
Samples are prepared as described for the proteomic data.

Data processing and quality control
As with the proteomic data, Metabolon, Inc performed the preliminary QC on the metabolomic and lipidomic data and provided data normalized by processing batch and volume. Additional QC was then conducted by the GHTO team as described elsewhere55. The nature of metabolomic and lipidomic data was considered during QC pipeline design. For example, non-xenobiotics (metabolites and lipids innate to the human body) are expected to be present in many samples, while xenobiotics (those that are foreign to the human system) can be largely missing due to their foreign nature. Thus, a first QC round assesses overall missingness and removes samples with more than 50% of metabolomic and lipidomic measurements missing. Missingness is then computed for non-xenobiotic analytes; any analyte with more than 80% missingness across samples is removed. Missing values for the remaining non-xenobiotics are imputed as the minimum value observed for the given analyte55. Xenobiotic missingness is not assessed in this step, nor is any imputation performed. Then, all values are log10-transformed, and analytes with an IQR of zero or low variance (less than 0.001) are considered non-informative and removed. Finally, the IQR is used to remove outlier measurements, and the standard deviation of the first two metabolomic and lipidomic principal components is used to remove outlier samples (Appendix 5 and 6).
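The missingness, imputation, and transformation steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the GHTO pipeline: the function name and the nested-dictionary layout are hypothetical, missing values are represented as `None`, all samples are assumed to share the same analytes, and all observed values are assumed to be positive. The variance, IQR, and principal component filters are not shown.

```python
import math


def qc_metabolites(data, xenobiotics, max_sample_missing=0.5, max_analyte_missing=0.8):
    """data: {sample: {analyte: value or None}} (hypothetical layout)."""
    # Round 1: drop samples missing more than 50% of all measurements.
    kept = {
        s: vals for s, vals in data.items()
        if sum(v is None for v in vals.values()) / len(vals) <= max_sample_missing
    }
    analytes = list(next(iter(kept.values()))) if kept else []
    result = {s: {} for s in kept}
    for a in analytes:
        observed = [kept[s][a] for s in kept if kept[s][a] is not None]
        if a in xenobiotics:
            # Xenobiotics: no missingness filter and no imputation.
            for s in kept:
                v = kept[s][a]
                result[s][a] = None if v is None else math.log10(v)
            continue
        # Round 2: drop non-xenobiotics missing in more than 80% of samples.
        if 1 - len(observed) / len(kept) > max_analyte_missing:
            continue
        # Impute with the minimum observed value, then log10-transform.
        floor = min(observed)
        for s in kept:
            v = kept[s][a]
            result[s][a] = math.log10(floor if v is None else v)
    return result
```

Treating xenobiotics separately keeps a largely absent drug or dietary compound from being filtered out or filled in with an artificial minimum, which is the rationale given in the text.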

Data availability
Metabolomic and lipidomic data from brain, CSF, and plasma was generated via Metabolon’s HD4 untargeted Precision Metabolomics™ LC-MS (liquid chromatography–mass spectrometry) platform. High quality data is available in 441 brain homogenates (342 AD cases, 7 ADAD cases, 29 controls, 11 DLB, 19 FTD, and 33 ADRD) with 797 analytes passing QC, 3,169 plasma samples (1,285 AD cases, 9 ADAD cases, 1,431 controls, 37 DLB, 48 FTD, and 359 ADRD) with 1,508 metabolites and lipids passing QC, and 948 CSF samples (286 AD cases, 4 ADAD cases, 575 controls, 4 DLB, 10 FTD, and 69 ADRD) with 456 analytes remaining after QC (Table 5). For some participants, more than one time-point measurement (plasma and CSF) or brain region is also available (not accounted for in Table 5 and Fig. 2).

Epigenomic data
Overview
AD etiology is complex and not specific to a single genetic factor or to the dysregulation of one protein or transcript. Epigenetic changes could help explain the missing heritability not captured by genomic data and help determine functional variants in genome-wide significant loci that might lead to transcriptomic, proteomic, or metabolomic changes. We have generated DNA methylation data from 444 parietal brain samples and 20 samples from whole blood.

Sample preparation
DNA is obtained from brain samples (50 mg) or whole blood (8 mL) using standard methods as described in the genomics section.

Data processing and quality control
The raw Illumina EPIC data is processed using the ENmix package56 followed by stringent quality control. In brief, the GHTO pipeline contains the following steps: (i) calculation of the bisulphite conversion statistic and exclusion of samples that fail three or more control metrics; (ii) multidimensional scaling of probes on the X and Y chromosomes to confirm concordance between reported sex and genetic sex, with exclusion of mismatches; (iii) exclusion of poorly performing samples; (iv) removal of samples in which more than 1% of probes have a non-significant detection p-value (p > 0.05); (v) exclusion of samples falling outside 2 standard deviations on the first three methylome principal components; (vi) dye bias correction using the RELIC function in ENmix57; (vii) quantile normalization; (viii) exclusion of cross-hybridizing and SNP-related probes. Finally, the methylation level (beta value, β) at a given CpG site is calculated as the ratio of the methylated probe intensity to the sum of the methylated and unmethylated probe intensities.
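The beta value definition above amounts to a simple ratio, shown here as a minimal sketch. The function name is hypothetical, and returning `None` when both intensities are zero is an assumption; array-processing packages may handle that case differently, for example by adding a small constant to the denominator.

```python
def beta_value(methylated, unmethylated):
    """Beta value for one CpG site: M / (M + U), as described in the text."""
    total = methylated + unmethylated
    if total == 0:
        return None  # assumed handling when no signal is detected
    return methylated / total
```

For example, a site with a methylated intensity of 300 and an unmethylated intensity of 100 has β = 300 / 400 = 0.75.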
The ComBat function from the sva package is then used to evaluate and remove the potential effects of technical variables (sample plate, array, and slide)58.

Data availability
Methylomic data was generated using the Illumina Infinium MethylationEPIC array, which interrogates over 850,000 CpG and non-CpG sites across open chromatin, enhancers, DNase hypersensitive sites, and promoters. Parietal cortex brain methylation data is available for 444 unique individuals (344 AD cases, 7 ADAD cases, 29 controls, 11 DLB, 19 FTD, and 34 ADRD), along with 20 samples from whole blood (15 AD cases and 5 controls – Table 5).
