Single-cell somatic copy number variants in brain using different amplification methods and reference genomes

To optimize the workflow for detection of large CNVs by scWGS, we compared three different WGA methods: (1) PicoPLEX, (2) PTA and (3), dMDA. We used single-nuclei isolated from the cortex of up to three different brains, two with MSA and one control (Fig. 1a). We had previously reported scWGS from other regions of the same MSA brains16. We also used non-brain control nuclei in selected experiments: fibroblasts carrying a germline SNCA triplication, and lymphocyte nuclei from NA12878. In total, we amplified and performed Illumina sequencing (paired-end) across 93 brain cells, as well as 3 fibroblasts and 5 NA12878 Β-lymphocytes (mean coverage ~0.64x, 0.17x and 0.71x respectively) (Fig. 1b, Supplementary Fig. 1a).Fig. 1: Experimental overview and scWGS preliminary analysis from human post-mortem brain samples.a Methodology overview created with BioRender (RRID:SCR_018361; BioRender.com; agreement KB25LPBTEK). b Mean depth of coverage for each WGA method. PicoPLEX vs PTA ns (adj. p > 0.99), PicoPLEX vs dMDA ** (adj. p = 0.008) and PTA vs dMDA * (adj. p = 0.04). Kruskal–Wallis test with Dunn’s multiple comparisons correction. c Bases potentially covered if sequenced deeper. P value for all pairwise comparisons <0.001. Brown-Forsythe and Welch ANOVA tests with Dunnett’s T3 multiple comparisons correction. d Median Absolute Pairwise Deviation (MAPD) scores at 500 kb bin size. P value for all pairwise comparisons <0.001 (adj. p < 0.001). Kruskal–Wallis test with Dunn’s multiple comparisons correction. For (b–d) PicoPLEX (n = 33), PTA (n = 21), dMDA (n = 39); Mean ± SD shown.PTA provides the broadest amplification, but PicoPLEX provides the most evenAlthough these are “whole” genome amplification methods, some regions may not be amplified in a given cell (locus dropouts)5. In the absence of high coverage WGS of each cell, the maximum number of bases which could be retrieved by deeper sequencing can be deduced genome-wide using Preseq30. We found that PTA provides efficient capture of the genome of brain nuclei, consistent with other data27,28 (Fig. 1c; mean ± SD: 2.84 ± 0.56 Gb), with PicoPLEX the next best performer (1.71 ± 0.48 Gb). dMDA provided the lowest breadth of coverage (0.75 ± 0.29 Gb), even lower than the report of only one-third of the genome covered with 100 million reads after dMDA26.CNV calling by read depth is hampered by uneven amplification, as regions which are over- or under-amplified could appear as CNVs. We compared a key metric, the median absolute pairwise deviation (MAPD) between adjacent bins, between the three technologies at 500 kb bin size using Ginkgo6, after adapting this widely used validated tool to hg38 (see methods). We noted clear differences across methods, with PicoPLEX performing best and dMDA worst (Fig. 1d, mean ± SD: PicoPLEX 0.15 ± 0.03, PTA 0.24 ± 0.06, dMDA 0.57 ± 0.07), which persisted after downsampling to the lowest coverage sample (Supplementary Fig. 1b), although the MAPD absolute values were slightly higher after downsampling (Supplementary Fig. 1c–e). The difference between methods is reflected by visual review of sequencing traces (Fig. 2a), and Lorenz curves, which indicate the amplification variation by plotting the cumulative fraction of reads as a function of the cumulative fraction of the genome (Fig. 2b). In MSA1, where we had used different PicoPLEX versions, and different library preparation for PTA, we verified that the MAPD was not affected by this (Supplementary Fig. 1f, g). Furthermore, the MAPD for each WGA method was similar between different brains, and non-brain samples (Supplementary Fig. 1h–j). The MAPD values we obtained for PTA are similar to the ones reported (~0.25 for bin size range 100–1000 kb)27. The uneven coverage of dMDA in particular cannot be explained by GC content variation, although there was a slight drop in coverage of high GC regions for both dMDA and PTA (Fig. 2c). Indeed, others have reported no major GC effect on MDA bias23. Although previous reports had shown superiority of PicoPLEX or MALBAC over MDA for CNV calling18,19,20,21,22,23,31,32, there had been no prior comparison to dMDA. The previous study of dMDA, conducted in lymphoblasts, reported more even genome coverage compared to MDA performed in parallel but not in droplets, although MAPD values were not reported26. As the MAPD values we obtained using dMDA were poor, we recalculated MAPD after gradually increasing the bin size up to 5 Mb, which improved the values as expected9. We also found that MAPD can be further improved by removing noise using principal component analysis (see methods)33 (Supplementary Fig. 2). CNV calling in dMDA can be performed for very large aberrations after denoising.Fig. 2: scWGA method comparison from brain analyzed by Ginkgo at 500 kb bin size.a Visual view of copy number profiles of single-nuclei amplified by PicoPLEX, PTA, and dMDA from the control (top) and MSA1 (bottom) brains. b Lorenz curves. The black lines with slope 1 represent perfect coverage uniformity. Increasing divergence of the curve of each cell from this indicates lower overage uniformity. c Effect of GC content in scWGA. For b-c PicoPLEX n = 33, PTA n = 21, dMDA n = 40.Each WGA method has propensity for different types of chimerasAll WGA methods can lead to chimeras. These could be misinterpreted as SVs, but also, if they cluster in certain parts of the genome, could impact CNV calls. We therefore aimed to compare the frequencies of key types between the three WGA methods (Fig. 3). For this comparison, PTA cells which had a different library preparation method were analyzed separately, as they had a divergent profile presumably related to this (Supplementary Fig. 3). In paired-end sequencing, the read pairs should both be pointing inwards, towards each other. Outward read pairs, indicative of tandem duplications, were most frequent in PicoPLEX, and least frequent in PTA. This observation is consistent with a previous report that over half of PicoPLEX artifacts appear as duplications17. Apparent translocations were most frequent in PTA and least frequent in dMDA. Other orientations, which include inversions, were most frequent in PTA, and hardly ever seen in PicoPLEX, but relatively common in dMDA, as previously reported in MDA, where they had comprised almost all the chimera signatures17.Fig. 3: Comparison of discordant read pairs of brain nuclei amplified with each method.a Outward pairs. Differences analyzed using Brown-Forsythe and Welch ANOVA test with Dunnett’s T2 multiple comparisons test. PicoPLEX vs PTA *** (adj. p < 0.001), PicoPLEX vs dMDA *** (adj. p < 0.001), PTA vs dMDA * (adj. p = 0.02). b Pairs on different chromosomes indicating translocations. Differences analyzed by Kruskal-Wallis test with Dunn’s multiple comparisons test. PicoPLEX vs PTA ns (adj. p > 0.99), PicoPLEX vs dMDA ** (adj. p = 0.002), PTA vs dMDA *** (adj. p = 0.02). c Pairs in other orientations. Differences analyzed by Kruskal-Wallis test with Dunn’s multiple comparisons test. PicoPLEX vs PTA *** (adj. p < 0.001), PicoPLEX vs dMDA *** (adj. p < 0.001), PTA vs dMDA ns (adj. p = 0.18. a-c PicoPLEX (n = 33), PTA (n = 10). dMDA (n = 38).Realignment to T2T-CHM13 and liftover to hg38 affects CNV callingNext, we investigated the impact of the reference genome for the analysis of single-cell CNV. We realigned all data to the T2T-CHM13 genome34 and performed liftover of the read alignments back to hg38 using levioSAM2, which improves calling on the original reference, both for small variants, and for structural variants using long reads35. We compared relevant metrics between the different alignments. We found a marginal improvement in the percentage of reads aligned, which was highest in T2T-CHM13 and partly maintained in hg38 after liftover (Supplementary Fig. 4a). For both amplification methods, the MAPD slightly improved in the T2T-CHM13 and liftover alignments compared to the original hg38 (Supplementary Fig. 4b; see also methods). The confidence score was not affected by genome changes (Supplementary Fig. 4c; see also methods). The breadth of sequencing (bases potentially covered using Preseq) was improved using T2T but was lower in the liftover genome (Supplementary Fig. 4d).To determine which brain cells were suitable for CNV calling by Ginkgo, we set strict thresholds. We discarded cells with MAPD score >0.3 as before16, although even higher values have been considered acceptable8,9. We also calculated the confidence score, which indicates the extent to which genomic segments have integer copy numbers, rather than intermediate copy number states which may indicate uneven amplification11, and retained only cells with confidence score ≥0.7. To determine the appropriate bin size, we analysed the effect of bin sizes from 10 kb to 500 kb in PicoPLEX and PTA data (Supplementary Fig 5). We used 500 kb bins for both to enable direct comparisons, which left 23 of 34 and 11 out of 25 PicoPLEX and PTA brain cells respectively (67.611% vs 44.0%; p = 0.11; Table 1). We initially compared unfiltered PicoPLEX and PTA CNV calls between the different genome alignments (Supplementary Table 1). T2T-CHM13 alignment followed by liftover to hg38 led to 59.3% fewer losses called than hg38. Liftover did not alter gains in a material way (only 2.0% less), but T2T-CHM13 gains were 45.4% higher than hg38. We reviewed hg38-specific losses and noted that they were mostly shared between cells and individuals, did not appear robust (as they appeared to be non-integer, intermediate between copy numbers 1 and 2), and were often around centromeres. We thus decided to focus on CNV-calling with hg38-liftover genome, limited to autosomes, as this is better annotated, and the results are easier to compare than T2T-CHM13.Table 1 Success rates and overview of filtered CNV calls across all PicoPLEX and PTA brain samples (hg38-liftover, 500 kb bins)Megabase-scale CNVs are detected in a fifth of brain cells using 500 kb bins in PicoPLEX and PTA dataWe proceeded to somatic CNV calling in PicoPLEX and PTA-amplified cells using 500 kb bins in both for consistency; the profiles of all cells and CNVs called are given in Supplementary Data 2. To perform stringent filtering, we first removed all calls that were shared between at least half of the cells, or any cells between different individuals. To assign a numerical threshold, we calculated the median copy number of each segment assigned by Ginkgo and plotted the distribution of segments with each copy number, from which we set copy number thresholds of 1.29 for a loss and 2.80 for a gain (methods and Supplementary Fig. 6). We confirmed that in males, excluding one cell with possible chromosome X aneusomy, CN was below that number in both PicoPLEX and PTA.This led to a total of 17 CNVs across 7 cells from the three brains (9 gains, 8 losses; Table 1; all CNVs listed in Supplementary Data 1, profiles shown in Supplementary Data 2). CNVs were found in 20.6% of cells (7/34 overall; 3/23 PicoPLEX, 4/11 PTA; p = 0.18). The proportion of cells with CNVs was around 20% in both MSA (5/24) and controls (2/10); The apparent excess of CNVs in MSA1 (5/16) compared to MSA2 (0/8) is not significant p = 0.13), and the difference could be partly due to the higher call rate in PTA, which was not used in MSA2. Our previous study showed similar CNV prevalence in the substantia nigra of these two brains16. For controls it is similar to previous estimates in cortical neurons, with 10–25% reported to carry Mb-scale CNVs7, and more recently only 2 of 52 PTA-amplified cortical neurons carrying CNVs >~5 Mb28 Among the brain cells with CNVs, four had one each, one had two, one had three, but notably one MSA1 cell had 8 (6 gains and 2 losses; cell A24; Supplementary Data 2). To ensure that CNVs are not associated with low mapping quality, we conducted the same analyses after filtering for reads with mapping quality below 10. All CNVs were detected, except a 4.8 Mb gain in A26.Examining CNVs further, one control brain cell (L11) had two large losses which essentially added up to a loss of the entire chromosome 13, with only 2.3 Mb spared, presumably an error, and this chromosome is most likely lost in its entirety. This was the only aneusomy seen in a brain cell, consistent with estimates of brain aneuploidy of 0.7–5% derived by scWGS1. This cell was excluded from further CNV analysis. CNV sizes were similar in PTA and PicoPLEX brain cells (median 5.7 Mb v 8.5 Mb, Mann–Whitney p = 0.33; Supplementary Data 1). Gains were generally larger than losses (medians 13.4 v 4.2 Mb in PicoPLEX data and 9.5 v 3.7 Mb in PTA). We noted that 6 brain CNVs (35.3%; 4 PicoPLEX, 2 PTA) were sub-telomeric. This observation is in agreement with previous work suggesting enrichment in these regions which are rich in segmental duplications, but considerably higher than the 9.1% in MSA neurons from other brain regions which we previously reported. To understand the nature of genes affected by CNVs in MSA1, we performed gene ontology analysis using PANTHER36. The most clear enrichment was for DNA demethylation but due to the sample size we cannot draw any conclusions (Supplementary Table 2, Supplementary Note 1).We also investigated whether CNV calls are supported by another algorithm for scWGS CNV calling, Copykit37, also based on circular binary segmentation, which uses hg38 as default (Supplementary Data 1). We allowed for a smaller CNV called by Copykit to be classed as supportive of a CNV called by Ginkgo if it was encapsulated by the Ginkgo CNV region. The chromosome13 loss was also called by Copykit, and most sub-chromosomal Ginkgo CNV calls in brain were supported (5/9 for PicoPLEX and 4/6 for PTA). There was no significant difference between CNV sizes depending on whether they were also reported or not by Copykit, although those detected tended to be larger (median 9.50 v 4.9 Mb; Mann-Whitney p = 0.39) For CNVs called by both, Copykit sizes were smaller (mean size 16.1 Mb for Ginkgo, 11.3 Mb Copykit, paired t-test p = 0.04). As Copykit has not been formally benchmarked to our knowledge, while Ginkgo has been found to be accurate in calling breakpoints38, we gave it the benefit of the doubt for estimating CNV sizes.We next addressed the question of whether the alignment to T2T-CHM13 would allow detection of specific gains which could not be detected in less complete assemblies, defined as gains called only in the T2T-CHM13 alignments, with at least 50% of their span being novel T2T-CHM13 sequence). There were two CNVs which passed filtering, both centromeric gains in MSA1, one from a PicoPLEX and one from a PTA cell (chromosome 3: 90.2–98.2 Mb and chromosome 6: 56.6–62.9 Mb). When repeating the analysis after filtering for mapping quality, both CNVs disappeared at MAPQ1, but the chromosome 6 centromeric gain reappeared at MAPQ10 (Supplementary Fig. 7). This call could therefore be correct, but overall we cannot confidently report additional CNVs in these new genomic regions.One single cell CNV has tentative support in bulk WGSIf a somatic CNV is limited to a single-cell, orthogonal validation is by definition impossible. A clonal CNV, arising during cell division and shared amongst cells of the same lineage, however, could be detectable in other cells, from the same or other brain regions. It may also be detectable by bulk sequencing of adequate depth. This could, however, be compromised by a number of issues, including imprecise boundaries in single-cell CNV calling, possible non-amplification of the breakpoints due to allelic drop-out, and the intrinsic limitations in SV calling if short-read sequencing is used. To investigate this in the two MSA brains, we first reviewed the presence of CNVs with similar boundaries in previously reported different regions from the same brains: substantia nigra in both (n = 99), and the pons and putamen in MSA1 (n = 70)16, but did not find any.To seek support for somatic CNV calls in the MSA1 brain, we also analyzed deep bulk short-read WGS from the cingulate cortex (where the single-cells were obtained from)39. Events which are truly single cell would be impossible to validate in bulk, but if cells share a CNV due to a common developmental origin from the same progenitor clone, their breakpoint might be visible in bulk WGS of adequate depth. As the single cell CNV boundaries are very imprecise, and any support in bulk for somatic events would likely be limited to one or very few read pairs, we visually reviewed discordant read pairs within ~1 Mb of each breakpoint for possible support. We found tentative support (one read pair) for one loss in MSA1 (Fig. 4). We also reviewed these genomic regions in bulk WGS from the adjacent cingulate white matter, as well as the cingulate cortex and white matter, and cerebellar cortex and white matter, from MSA2 (mean coverage across all 83.6x; Supplementary Table 3). We found no support in bulk WGS from the adjacent white matter of the same brain or the MSA2 regions tested (Supplementary Fig. 8).Fig. 4: Integrative Genomics Viewer (IGV) views demonstrating tentative support for a single cell deletion in MSA1 cingulate cortex bulk short read WGS.The 9.5 Mb single cell deletion (chromosome 6: 161,264,702–170,805,979; cell L21, PTA) is supported by one read pair (in red) in correct orientation, but spaced 9.8 Mb apart. Read mapping quality 60 and 6 respectively. Both reads are within LINE1 elements, as shown in blue in the bottom track.Exploration of different bin sizes allows smaller CNVs to be called with PicoPLEX and very large ones with dMDAWe also investigated whether smaller bin sizes could be used in PicoPLEX. The SNCA germline triplication (1.85 Mb), which we had also used as a positive control before, requires the use of smaller bin sizes, and was detected in all three PicoPLEX-amplified fibroblasts at 250 kb bin size, including one which failed confidence score filter, regardless of reference used (Supplementary Fig. 9). The SNCA triplication was also detected in all three fibroblasts using Copykit with 220 kb bins, although the copy number was given as 3 (rather than 4) in the one that failed confidence score. Using Ginkgo at 250 kb bins for brain cell analysis, all cells used for 500 kb CNV calling passed threshold, and we therefore performed CNV calling and similar filtering in these. We now detected CNVs in 5 more cells (two MSA1, three MSA2), although a gain in cell A26 (MSA1) at the 500 kb bin analysis, which was did not pass mapping quality 10 filtering. was now not reported (all PicoPLEX at 250 kb bin CNVs listed in Supplementary Data 3). We thus detected CNVs in 6/19 in MSA PicoPLEX cells at 250 kb bins. This is very similar to the ~30% of cells we estimated to carry somatic CNVs in other regions of the same MSA brains at the same bin size. CNV calling at 100 kb bin size would also be possible, but as only 27 of 34 cells which originally passed QC would still be suitable, this would reduce sample size further.For the 39 dMDA cells, where only very large CNVs are potentially detectable, we determined the number of cells that pass QC after denoising by the percentage of the variance explained by the first principal component at a given bin size. The number of cells passing QC was zero at 1 Mb, four at 2.5 Mb and eight at 5 Mb, although three of these were borderline and had aberrant ploidy calls (Supplementary Data 2). We therefore also excluded these and performed CNV calling and filtering at 5 Mb for the five remaining cells, and found a whole chromosome gain in one MSA2 cell (A58).

Hot Topics

Related Articles