Characterizing the allele-specific gene expression landscape in high hyperdiploid acute lymphoblastic leukemia with BASE

asCN calling evaluationBy using a built-in copy number analysis reference panel, BASE provides a solution to estimate asCN changes using WGS data from tumor samples. To evaluate the precision of BASE in asCN calling, we first collected WGS data from fifteen extensively studied cancer cell lines (Supplementary Table 1) and compared the asCN calls made by BASE with those derived from matched SNP arrays, performed by the Cancer Cell Line Encyclopedia (CCLE)19. On average, BASE detected asCN changes with a recall rate of 98–100% in the eleven cell lines with ploidy levels ranging from 1.6 to 2.7 (Table 1). For the remaining four cell lines, exhibiting ploidy levels from 2.8 to 3.3, the BASE asCN calling result initially exhibited a low overall agreement (< 5%) with CCLE calls using default settings. However, after adjusting the tumor ploidy parameter in BASE, the recall rate was restored to > 98% (Table 1).Table 1 Copy number calling results of 15 cancer cell lines and three simulated COLO 829 cancer cell lines with different tumor cell contents.Tumor samples never contain 100% tumor cells but also normal cells, e.g., infiltrating immune cells. This poses a problem for the accurate detection of asCN, as the reads from the tumor containing copy number aberrations are diluted with the reads from the normal diploid cells. To evaluate BASE’s asCN calling performance in samples with normal cell contamination, we obtained WGS data from the COLO 829 cell line and modelled normal cell contamination by mixing WGS reads from COLO 829R, a B lymphoblast cell line derived from the same patient. In this way, samples containing 75%, 50%, and 25% tumor cell content were modeled. For samples with a 75% tumor purity, BASE detected most asCN events with a calling accuracy rate of 96%. However, it is worth noting that chromosomal regions with loss of heterozygosity (LOH) in COLO 829 were not called correctly in the simulated samples with less than 50% tumor cell fraction due to the increase of heterozygosity within those segments (Table 1; Supplementary Fig. 1).Evaluation of ASE calling based on phased and unphased SNV InformationIn the evaluation of BASE’s precision in identifying allelic expression events, we analyzed three replicates of HepG2 cell line RNA-seq data, where heterozygous SNPs data had been obtained from using either a phasing or non-phasing WGS analysis pipeline20. By using the BASE, we identified 4,933 informative expressed genes shared by the ASE calling results using either phased or un-phased genomic data as input. Of these, 639 and 600 ASE genes, respectively (Table 2), were identified, with 599 ASE genes being consistently detected in both calling results (Supplementary Fig. 2); this suggests substantial equivalence between the ASE calling results based on unphased and phased data. The relationship between the ASE state and gene expression level was investigated, revealing no gene expression profile difference between ASE and non-ASE genes, indicating that our pipeline can detect ASE regardless of the gene expression level (Fig. 2a). To further BASE’s performance in ASE calling, we compared its results with those from MBASED and aSCAN, revealing a strong correlation between the methods (Fig. 2b,c). MBASED identified 5,497 informative genes, with 364 and 336 exhibiting ASE under phased and unphased conditions, respectively. aSCAN detected 5,546 informative genes, with ASE identified in 721 genes using phased data and 495 with unphased data (Table 2). Our analysis showed that over 93% of ASE genes identified by BASE were confirmed by MBASED or aSCAN with phased data, and over 75% with unphased data (Fig. 2d,e). These findings demonstrate that BASE’s ASE calling is highly consistent with the published methods. Notably, ASE detection was slightly more accurate when based on phased SNV data compared to unphased data across all three programs, underscoring the potential benefits of using phased genetic information to enhance the precision and reliability of ASE gene detection.Table 2 Overview of ASE Gene calling results across three programs in cell lines.Fig. 2a. Violin plots displaying the expression of ASE genes (top) and non-ASE genes (bottom). No significant differences in gene expression levels between ASE genes and non-ASE genes were observed (P = 0.79, Mann–Whitney two-sided test). Gene expression levels were normalized to CPM values and then log2-transformed as shown on the x-axis. The median is denoted by the center of the boxplot, while the lower and upper hinges correspond to the first and third quartiles, respectively. Whiskers represent 1.5 times the interquartile range, with data beyond this range plotted as individual points. Spearman’s rank correlation between the allelic expression ratio derived from aSCAN and BASE (b), and from MBASED and BASE (c), respectively. The Venn diagram displays the overlap of ASE genes identified using phased (d) and unphased (e) genotype data by three tools: MBASED, aScan and BASE, respectively.A similar approach was applied to three lung cancer cell lines, KNS62, NCIH2122, and SW900, with matched WGS and RNA-seq data derived from the CCLE (Supplementary Table 1). Our results show that the BASE program identifies over 59% of ASE genes detected across all programs, with more than 95% of the ASE genes called by BASE also recognized by at least one of the other published methods (Table 2). This suggests that BASE is robust in ASE gene calling across different tumor types.Application: ASE in HeH ALLTo explore the impact of CNVs on allelic expression imbalance in aneuploid tumors, we applied the BASE framework to a previously generated HeH ALL dataset (Supplementary Table 2)5,8,21. This dataset comprised 12 samples with matched diagnosis/remission WGS data, along with RNA-seq data from the diagnostic samples. The selection of these 12 samples was based on the criteria that all samples were sequenced using the Illumina system, ensuring consistency in sequencing technology. Additionally, to minimize batch effects and enhance comparability, all selected samples had RNA-seq data generated from sequencing libraries constructed using the same kit. Details of sample characteristics and NGS library preparations are described in the original articles. To examine ASE genes in HeH ALL, we considered only expressed genes (CPM ≥ 1) that had at least one heterozygous site covered by 10 or more RNA-seq reads. In total, we identified 8,210 expressed genes in the twelve HeH ALL samples, with 3,653 (44.5%) genes showing ASE. The majority of genes exhibited ASE in only one sample (2,372/3,653; 64.9%; Fig. 3a; Supplementary Table 3). No significant difference in the percentage of ASE genes was found between informative samples with < 80% and ≥ 80% blast cells (P = 0.143, two-tailed unpaired Mann–Whitney U-test; Supplementary Table 2). Additionally, we found 1,281 genes that displayed ASE in at least two HeH ALL cases, i.e., were recurrent (Fig. 3b; Supplementary Table 3). Gene ontology analysis of these recurrent ASE genes revealed enrichment of the biological processes cellular response to DNA damage stimulus (GO: 0006974; P = 3.5 × 10−7), chromatin organization (GO: 0006325, P = 6.3 × 10−7), and RNA splicing (GO: 0008380, P = 1.7 × 10−5) (Fig. 3c).Fig. 3a. Recurrence of ASE genes across 12 HeH ALL samples. b. The Circos plot shows ASE genes and copy number alterations of HeH ALL samples. From outer to inner circle: chromosome numbers, chromosome ideograms, scatter plot of recurrent ASEs genes across 12 HeH ALL samples (grey, recurrence = 2; purple, recurrence = 3; blue, recurrence = 4; orange, recurrence = 5; red, recurrence > 5), and average copy number of a particular segment of 12 HeH ALL samples. c. Gene Ontology analysis of 1,549 recurrent ASE genes in 12 HeH ALL samples. d. ASE states of mutated genes in HeH ALL. ASE genes are marked in purple. The circle, square and triangle indicate the missense, nonsense and silent mutations, respectively. The x-axis indicates the odds ratio of aggregated allelic ratios from RNA-seq compared to those from WGS. The y-axis indicates the log10 transformed binomial test P value.Proportion of ASE genes on trisomic versus disomic chromosomesTo investigate the relationship between non-random chromosomal gains and ASE in HeH ALL, we restricted our analysis to the 4,089 informative genes located within gained chromosomes (trisomy, tetrasomy with a 3:1 allele ratio, and pentasomy). We adapted the BASE framework to identify ASE genes that were associated with the chromosomal gains using binomial exact test under the null hypothesis model 1) each allele is equally expressed; or the alternative hypothesis model 2) two alleles are unequally expressed with the odds ratio based on asCN analysis result (Supplementary Fig. 3). Approximately 1,574 ASE genes were significantly correlated with gained chromosomes, with 22.8% (932/4,089) exhibiting recurrent ASE in at least two samples. Chromosomes 4, 6, 10, 14, 17, 18, and 21 were gained in more than 70% of HeH ALL samples and over 30% of the informative genes on these chromosomes showed allelic expression imbalance (Fig. 3b, Supplementary Table 4 and 5), indicating that the genomic imbalance caused by the copy number gains lead to widely unequal transcription of the two gene alleles in HeH ALL.We have previously shown that genes in the gained chromosomes X, 14, and 21 are associated with the strongest dosage effects and that gains of chromosome 21 had the highest selective advantage in HeH ALL5,8. In line with this, we observed a higher frequency of ASE in chromosomes 14 and 21 (over 48%) in contrast to the other commonly gained chromosomes (range 38.4% to 45.9%), indicating that allelic imbalance in the expression of genes on these two chromosomes may be a key hallmark in HeH ALL (Supplementary Table 5). In contrast, when examining informative genes outside of the gained chromosomes, only 4.6% (294/6,441) showed recurrent ASE, highlighting the predominant influence of copy number gain events on ASE in HeH ALL (Fig. 3b, Supplementary Table 3).To explore further whether the leukemia-related ASE events could be driven by chromosomal gains, we investigated ASE events involving known cancer driver genes reported in OncoKB (https://www.oncokb.org/cancer-genes). A total of 54 oncogenes, including four that are known to be mutated in leukemia (NSD2, CCND3, U2AF1, and JAK2)22, showed ASE in 7/12 HeH ALL samples in trisomic chromosomes, suggesting that the expression of a set of well-known leukemia-related genes could also be altered by gene dosage in HeH ALL, although other causes of ASE cannot be excluded. However, only a few leukemia-related ASE events were observed in disomic chromosomes. For example, KMT2C is frequently mutated in multiple human cancers and heterozygous germline mutations in KMT2C have been reported to be enriched in infants with leukemia23. Here we observed ASE of KMT2C in three samples with disomy 7, indicating that the selective expression of specific alleles of KMT2C may play a role in the development and progression of HeH ALL.ASE associated with somatic mutationsWe next examined ASE involving mutated genes in HeH ALL by analyzing somatic SNVs from paired tumor/normal WGS data. Only coding region mutations (annotated as “silent” “missense” or “nonsense”) were included. We observed 14 somatic mutation sites that were expressed across 12 genes. Among these, the genes FLT3 and WEE1 showed ASE in two HeH cases (Fig. 3d). FLT3 overexpression is a known phenomenon in several leukemias, such as HeH ALL, irrespective of mutational status24. In this study, we found the HeH sample L5 that had a hemizygous deletion near the FLT3 promoter region, displayed cis-dysregulation of FLT3 expression and increased expression of the mutated FLT3 allele as previously reported25. Overexpression of WEE1, a nuclear kinase that is involved in cell cycle G2-M checkpoint signaling, is associated with a poor outcome in various hematological malignancies and inhibition of WEE1 is considered as a potential therapeutic target in AML26. Here we identified a heterozygous G > T mutation in HeH sample L11, resulting in a premature stop codon and triggering nonsense-mediated decay in WEE1. In line with this, a significant allelic expression imbalance favoring the overexpression of the wild-type allele of WEE1 was observed. However, the functional consequence of this nonsense mutation in HeH is unknown and further investigations of larger cohorts are needed to elucidate the function of WEE1 mutations in HeH ALL. Collectively, these results suggest that ASE can also be associated with somatic mutations in HeH, in line with previous observations in other types of tumors2.

Hot Topics

Related Articles