Transcriptome profiling of pediatric extracranial solid tumors and lymphomas enables rapid low-cost diagnostic classification

Continued advances in sequencing technologies have allowed for genomic and transcriptomic characterization of diverse tumor types. These tools have further advanced diagnostic capability, prognostication, and specificity of core genomic types and subtypes of tumors, making sequencing-based characterization the gold standard for tumor type confirmation and the final step in the pediatric cancer diagnostic testing cascade. The capital and operational costs of the full array of tools required for accurate diagnosis of pediatric solid tumors, including FISH, karyotyping, immunohistochemistry, and more established short-read sequencing platforms (e.g., Illumina), prohibit their use in resource-limited settings, where these tools are unavailable, incomplete, or unaffordable for patients. Using a low-cost sequencing platform such as ONT’s MinION to sequence FFPE-derived cDNA and accurately classify solid tumors is worth further development because it could eliminate the need for stepwise testing and increase access to diagnostic tools in resource-constrained settings, helping to bridge the existing cancer diagnostic gap.

Classification accuracy and size of training data

Despite the fragmented nature of FFPE-derived transcriptomes and the higher sequencing depth required to improve mapping certainty and therefore accuracy, we observed overall accuracies of 95.6%, 89.7%, and 97.4% for solid tumor, lymphoma, and rhabdomyosarcoma subtype classification, respectively, while multiplexing 12 specimens on a single MinION flow cell. Tumor types with more specimens available for training our model tended to have higher accuracies and prediction probabilities, while those with fewer specimens had lower accuracies. This effect is expected and is illustrated in Figs. 1 and 2. As an example, T-LBL, for which we had only 3 specimens for testing, showed 33% accuracy, with all prediction probabilities < 0.4.
In contrast, 21 of the 23 Burkitt lymphoma specimens tested had a prediction probability > 0.5, and all of these specimens were correctly classified.

Biological and technical replicates

We included in this study several replicates representing different sampling and preservation methods for pediatric solid tumors, as well as technical replicates for RNA extraction, library preparation, and sequencing. We processed matched fresh frozen and FFPE samples from xenografts for rhabdomyosarcoma, Ewing sarcoma, and neuroblastoma. Among this subset, only one specimen was incorrectly classified: a fresh frozen neuroblastoma specimen. Additionally, we observed no significant difference between prediction probabilities for fresh frozen and FFPE samples. While we would normally expect higher-quality results from fresh frozen samples than from FFPE samples, there is a compensatory effect because our prediction model is trained predominantly on FFPE samples (87%). These results suggest that we are able to model tumor-specific features that contribute to accurate diagnosis and are preserved across both fresh frozen and FFPE samples.

FOXO1 fusion and MYCN amplification

Classifying tumor genomic subtypes at the time of primary diagnosis has the potential to avoid stepwise molecular testing, where such testing is available. Determining the FOXO1 fusion status of rhabdomyosarcoma is an essential distinction given the differences in disease prognosis and treatment regimens between fusion-positive and fusion-negative subtypes [23]. Fusion-positive samples reflect the chromosomal translocations t(1;13) or t(2;13), which correspond to PAX7::FOXO1 and PAX3::FOXO1 fusions, respectively [23]. FOXO1 fusions are correlated with more aggressive disease and poorer outcomes. Conversely, fusion-negative specimens lack these fusions and are associated with more favorable clinical outcomes. Chemotherapeutic agent combinations, prognostication, and treatment approaches differ by fusion status.
MYCN oncogene amplification is the most important gene marker of neuroblastoma severity, as it leads to unrestricted tumor growth and proliferation, indicating a poorer prognosis that requires a different treatment regimen compared to neuroblastoma without MYCN amplification [24].

Previously reported differentially expressed genes are not robustly recapitulated in our model, in part because we filter out genes that are not broadly expressed across our dataset. Low-coverage transcriptome sequencing samples the transcriptome relatively sparsely and, together with our previous work [17], we show that our prediction models perform better and avoid overfitting when the majority of sparsely sequenced genes are excluded. We considered the expression of MYCN itself, which is expected to correlate with genomic MYCN amplification. MYCN expression is strongly correlated with FISH-based MYCN amplification status (Supplemental Fig. S8, Supplemental Table S5), but is excluded from our model because its observed expression is zero in 14 of 31 neuroblastoma samples with known MYCN amplification status (two neuroblastoma samples are not characterized). These 14 are all negative for MYCN amplification. This exclusion of a very strong marker gene, a consequence of our model's architecture, leaves room to improve the model in the future if features like these can be included without contributing to overfitting.
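Because MYCN expression tracks amplification status this closely, a rule applied directly to normalized MYCN expression is a natural point of comparison for the classifier. The following is a minimal sketch of such a rule, not the study's code. It assumes that normalization means a simple ratio of MYCN to the housekeeping gene NAGK and that a ratio of 5 or more is called amplified; the function names and the handling of samples with no detected NAGK are illustrative assumptions.

```python
from typing import Optional, Sequence

# Assumed cutoff: a MYCN/NAGK ratio of 5 or more is called MYCN-amplified.
MYCN_RATIO_CUTOFF = 5.0


def call_mycn_amplified(
    mycn_expr: float,
    nagk_expr: float,
    cutoff: float = MYCN_RATIO_CUTOFF,
) -> Optional[bool]:
    """Call MYCN amplification from housekeeping-normalized expression.

    Returns None when NAGK is undetected, because the ratio is undefined
    (an illustrative choice; such samples could also be failed at QC).
    """
    if nagk_expr <= 0:
        return None
    return (mycn_expr / nagk_expr) >= cutoff


def accuracy(calls: Sequence[Optional[bool]], truth: Sequence[bool]) -> float:
    """Fraction of callable samples whose call matches the reference status."""
    pairs = [(c, t) for c, t in zip(calls, truth) if c is not None]
    if not pairs:
        return float("nan")
    return sum(c == t for c, t in pairs) / len(pairs)
```

A sample with zero observed MYCN expression is always called non-amplified under this rule, which is consistent with the 14 zero-expression samples all being negative for MYCN amplification.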
In fact, a trivial heuristic model that stratifies our neuroblastoma samples by MYCN expression (normalized by expression of the housekeeping gene NAGK [25], where ≥ 5 is considered MYCN amplification) produces slightly better aggregate results (90% accuracy) than our machine learning model.

Cost effectiveness

Very low capital cost, coupled with the ability to run multiplexed barcoded samples at multiple cost scales, suggests that whole transcriptome sequencing of FFPE specimens for solid tumor diagnosis has the potential to reduce health costs and shorten the time to complete diagnosis. Capital costs, including the MinION sequencer, operating computer, and basic equipment such as a PCR machine, total less than $5000 USD. With multiplexing, up to 12 samples can be run on one consumable MinION flow cell ($500–$1000) while ensuring adequate depth and throughput for each specimen, bringing the cost of classifying each specimen to just under $100 including reagents. We used the higher-capacity P2 sequencer ($10,000) for retrospective sequencing of up to 96 samples at once, but observed no differences other than throughput across platforms. We previously established that suitable data are produced by a single Flongle flow cell (~$100) [17], allowing economies of scale and turnaround times to be matched to clinical needs. While the per-nucleotide sequencing costs of traditional next-generation sequencing-by-synthesis platforms (notably, Illumina) continue to drop and are typically lower than those of ONT sequencing, the capital costs of machines that achieve this economy of scale are orders of magnitude higher, and achieving a similar per-sample cost would require multiplexing many hundreds of samples simultaneously. The ability to run small batches with a short turnaround time is a critical consideration for potential molecular diagnostics applications.
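The per-sample arithmetic above can be sketched as follows. This is an illustrative calculation only: the flow cell prices and the 12-sample multiplex come from the text, while the function names and the `reagents_usd` parameter are assumptions, since the reagent cost per sample is not itemized here.

```python
def flow_cell_share(flow_cell_usd: float, samples_per_run: int) -> float:
    """Consumable flow cell cost attributed to each multiplexed sample."""
    if samples_per_run < 1:
        raise ValueError("samples_per_run must be at least 1")
    return flow_cell_usd / samples_per_run


def per_sample_cost(
    flow_cell_usd: float,
    samples_per_run: int,
    reagents_usd: float = 0.0,  # hypothetical; not itemized in the text
) -> float:
    """Flow cell share plus an assumed per-sample reagent cost."""
    return flow_cell_share(flow_cell_usd, samples_per_run) + reagents_usd


if __name__ == "__main__":
    # Flow cell price range and multiplexing level quoted in the text.
    for price in (500, 1000):
        share = flow_cell_share(price, 12)
        print(f"${price} flow cell / 12 samples = ${share:.2f} per sample")
```

At 12 samples per run, the flow cell contributes roughly $42 to $83 per sample before reagents. The same function shows how quickly small batches raise the cost: six samples on a $1000 flow cell already come to about $167 per sample from the flow cell alone.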
In-context implementation studies will be necessary to firmly establish the practical cost of this approach relative to standard-of-care molecular diagnostics, but a nanopore sequencing-based solid tumor diagnosis assay has the potential to obviate the need for other cytologic and chromosomal tests in areas where they are unavailable, at a fraction of the cost.

Quality control, validation, and implementation

Developing an implementation strategy at LMIC sites will allow for validation and setting of QC parameters to standardize procedures while testing the robustness of this approach under diverse laboratory conditions. This will involve setting parameters such as the prediction probability cutoffs that maximize accuracy, the minimum read N50 required (the read length such that reads of that length or longer contain half of all sequenced cDNA bases), the read alignment rate, the proportion of aligned reads, and the Shannon entropy cutoff, choosing the values that are maximally discriminative for classification accuracy. Refining these QC criteria, along with expanding the training dataset, promises to increase the accuracy and the calibration of prediction probabilities of this approach in subsequent studies. Subsequent validation of the proposed approach and machine learning model will require additional cross-validation and an independent validation cohort to assess possible overfitting and biases in this model, its extensibility to independent datasets, and potential variation in preparation and sequencing methodology. Only leave-one-out cross-validation was feasible in this study because of the limited sample size, especially in under-represented tumor types (e.g., T-LL, DSRCT); however, expansion of the training dataset will permit additional validation under more robust cross-validation splits.

Future directions

This approach requires extensive knowledge of bioinformatics and genomics to operationalize in a routine clinical setting.
This can be overcome in the future by integrating informatic processing and classification into user-friendly local or cloud computing infrastructure. Technical training is required for procedures including RNA extraction, RT-PCR, library preparation, and sequencing in settings with limited molecular biology experience. Further implementation and validation studies in resource-limited settings will help clarify technical and informatic barriers to adoption. Ongoing and future work in LMICs will additionally provide orthogonal data to validate the performance of our proposed machine learning-based classifier. Sequencing additional normal/non-tumor tissues, especially infection-related growths commonly observed in low-resource settings, will improve our model’s ability to distinguish malignant from non-malignant tissue in clinically relevant contexts. Continuous integration of additional sequenced samples into our machine learning model will continue to improve classification confidence and accuracy, especially across rarer tumor types.

Our results show that whole transcriptome sequencing-based classification of pediatric extracranial solid tumors and lymphomas may be applicable and practical in settings where the full spectrum of tests required for pediatric solid tumor diagnosis is inaccessible. Nanopore sequencing platforms represent a cost-effective and accessible technology to enable molecular cancer diagnostics in low-resource settings. We further demonstrated that RNA can be effectively extracted from FFPE specimens, the primary diagnostic sample available in many LMICs, and efficiently sequenced on nanopore platforms. The resulting expression profiles can discriminate common pediatric solid tumors and lymphomas, permitting timely diagnosis and assignment of an appropriate treatment regimen, which may correspondingly improve cancer outcomes and help bridge the cancer disparity gap between LMICs and HICs.
