CNVoyant a machine learning framework for accurate and explainable copy number variant classification

CNVoyant’s target audience comprises individuals determining the role of a CNV in a suspected genomic disorder. This diverse group of clinical genomicists includes variant scientists, geneticists, clinical molecular geneticists, and laboratory directors. CNVoyant is designed to require minimal bioinformatics expertise and enables easy interpretation of results. It accepts a list of CNVs, specified by chromosomal coordinates and variant type (deletion or duplication), and returns probabilities for pathogenic, VUS, and benign clinical significance. For each CNV entry, CNVoyant classifies the variant by determining the maximum probability among the clinical significance classes.The capacity of CNVoyant to classify the clinical significance of CNVs was tested in 21,574 CNVs curated from DECIPHER after training on 52,176 CNVs published in ClinVar (Fig. 1). This approach is consistent with previously reported comparisons of pathogenicity, where a set of CNVs are examined in a general context rather than focusing on individual patients. This also aligns with the previously cited ACMG technical guidelines, that recommend uncoupling CNV pathogenicity classification from the implications for a specific patient24. Features were generated to capture information related to genomic position, variant composition, overlapping functional annotation, population frequency, conservation, and dosage sensitivity. Finally, to demonstrate the efficacy of CNVoyant in a real-world clinical diagnostic setting, CNVoyant predictions were obtained from five genomic disease patients with diagnostic CNVs.Fig. 1CNVoyant development framework. The final CNVoyant models are a result of the illustrated machine learning pipeline and are designed to predict the pathogenicity of copy number variations (CNVs). The training set is comprised of 52,176 CNVs (24,965 duplications, 27,211 deletions) parsed from the January 2023 version of ClinVar, and the test set is comprised of 21,574 CNVs (10,509 duplications, 11,065 deletions) from DECIPHER v11.18. Features are generated from annotations related to genomic position, variant composition, clinical significance, and dosage sensitivity. Two models were trained to classify deletion and duplication events independently. Training data for each CNV type was partitioned into 5 cross folds. Accuracy metrics observed in each fold were utilized to (1) select the optimal architecture from 29 candidates, (2) select an optimal set of hyperparameters from 10,000 permutations, and (3) calibrate outputted probabilities to class distributions in the training data. The resulting models were used to generate probabilities of benign significance (Pr (Benign)), VUS (Pr (VUS)), and pathogenic significance (Pr (Pathogenic)) for CNVs in the test set. A clinical significance prediction is also provided by taking a maximum over the set of benign, VUS, and pathogenic probabilities. The CNVoyant output generated from the test set was later used for benchmarking.Training dataset curationCNVoyant is trained on CNVs included in the January 2023 XML release of ClinVar8. This XML file was parsed, and extracted variants were limited to CNVs (variant type of “copy number gain” or “copy number loss”) that did not have duplicated variant positions. The Reference ClinVar Accession Number (RCV) entry was chosen to represent each CNV to avoid training on duplicates that can arise in cases of multiple submitters. 40,837 of the extracted CNVs were aligned to the GRCh37 reference genome, all of which required a genomic coordinate mapping via the UCSC liftOver command line tool32 to be combined with the 12,641 CNVs that were aligned to GRCh38. Following liftOver, 1,126 variant entries were identified as duplicates of entries originally aligned to GRCh38, and they were omitted. 850 CNVs were omitted due to ambiguous clinical significance labeling or conflicting clinical significance annotation in entries with matching genomic coordinates. 20 variants were removed for having a size of less than 50 bp. The remaining ClinVar CNVs with at least one pathogenic or likely pathogenic designation were labeled as pathogenic for training purposes. Non-pathogenic variants with at least one VUS designation were labeled as VUS, and remaining variants containing only benign or likely benign classifications were labeled as benign. Altogether, 52,176 CNVs were included in model training (Fig. 2a, Table 1; DEL: 6,886 pathogenic, 7,191 VUS, 13,134 benign; DUP: 3,028 pathogenic, 10,792 VUS, 11,145 benign).Fig. 2Training and test set curation. CNVoyant was trained with copy number variants (CNVs) curated from ClinVar and tested on variants curated from DECIPHER. The flowcharts indicate the reasoning for omitting 2,002 variants from the training set (a) and 7,809 variants from the test set (b). For ClinVar, 6 CNVs were mapped to contigs other than autosomes or sex chromosomes, 1,126 had matching genomic coordinates and clinical significance, 572 had ambiguous clinical significance labels, 278 variants had matching genomic coordinates and conflicting clinical significance labels, and 20 spanned less than 50 base pairs. For DECIPHER, 712 CNVs had variant types other than “duplication” or “deletion”, 5,138 had matching genomic coordinates and clinical significance, 1,003 had matching genomic coordinates and conflicting clinical significance labels, 118 overlapped with values in the training set, and 38 spanned less than 50 base pairs.Table 1 Distribution of variant type and clinical significance in training and test sets.Testing dataset curationA test set of CNVs was curated from the web interface of the v11.18 release of the DECIPHER database33. All 29,453 DECIPHER CNVs are aligned to the GRCh38 reference genome and have been assigned clinical significance following manual review. 712 variants had variant types other than CNV deletions or duplications and as such were removed from the test set. The remaining 28,132 variants were lifted to GRCh37 to accommodate the expected input for comparator algorithms; 800 failed to map to an autosomal or sex chromosome contig and were omitted. 1,003 entries were removed for having shared genomic coordinates and conflicting clinical significance annotation, while 5,138 entries were removed for having duplicated genomic coordinates and clinical significance annotation. To guard against data leakage between train and test sets, 118 variants whose coordinates matched a ClinVar submission were omitted from the test set. 38 variants were removed for having a size of less than 50 bp. 21,574 CNVs were included in the final test set, which was used to benchmark CNVoyant against comparator algorithms (Fig. 2b; DEL: 6,183 pathogenic, 3,785 VUS, 1,097 benign; DUP: 3,360 pathogenic, 5,595 VUS, 1,554 benign).CNVoyant feature selectionCNVoyant utilizes 17 features to classify the clinical significance of candidate CNVs. These features were selected based on publicly available, peer-reviewed resources and were chosen for their relevance and proven utility in the context of clinical significance classification for RGDs. Several feature distributions are right-skewed in the training data (counts of overlapping genes, exons, promoter regions, diseases, pathogenic ClinVar SNVs/indels, and bp length). These features are log-transformed to reflect more normal distributions. All features are normalized via the sklearn preprocessing.MinMaxScaler Python class prior to training or prediction34.Genomic position (2 features)

Centromere distance: The number of bp separating the centromere from the candidate CNV. Distance from the centromere to the CNV is determined by selecting the CNV boundary closest to the centromere: the end coordinate on the P arm or the start coordinate on the Q arm.

Telomere distance: The number of bp separating the telomere from the candidate CNV. Distance from the telomere to the CNV is calculated by selecting the CNV boundary furthest from the centromere: the start coordinate on the P arm or the end coordinate on the Q arm.

CNV composition (2 features)

GC Content: Percentage of nucleotides in the genomic region encompassed by the candidate CNV that are guanine or cytosine.

BP Length: Total bp spanning the candidate CNV.

Functional annotation (5 features)

Count of genes: The gene count is defined as the total number of genes that overlap the candidate CNV. Genes overlap the candidate CNV if at least one bp is shared between the candidate CNV’s location and the gene’s genomic coordinates drawn from the RefSeq database1. All overlap calculations were made via the Bedtools intersect function35.

Count of diseases: The disease count is defined as the total number of diseases associated with the gene (s) that overlap the candidate CNV. This allows genes with more disease associations to be distinguished from genes with only a single or no disease association. Disease-gene associations are referenced from the curated annotations provided by Online Mendelian Inheritance in Man (OMIM)36.

Count of exons: Overlapping exon count is calculated by summing exons, across all genes, that overlap the candidate CNV. The exonic boundaries were padded by 10 base pairs to account for canonical splice regions.

Count of promoter regions: CNVs may overlap the promoter region of a gene rather than the gene itself. To address this, we include a count of promoter regions, defined as the interval between the transcription start site (TSS) and 1,000 base pairs upstream of the TSS.

Count of ClinVar pathogenic SNVs and indels: A sum of overlapping pathogenic SNV/indels is included to capture potentially relevant pathogenicity interpretation. To obtain a set of overlapping pathogenic SNV/indels, ClinVar37 is intersected with the candidate CNV and limited to variants interpreted as having “Pathogenic” or “Likely Pathogenic” significance.

Population frequency (1 feature)

GnomAD SV popmax: To estimate population frequency, we identify the highest frequency across all gnomAD SV (V4)38 entries that match the CNV’s variant type (deletion or duplication) and exhibit at least 50% reciprocal overlap in genomic coordinates. This means that the candidate CNV and the gnomAD SV entry it overlaps with must share at least half of their span, ensuring a significant genomic coverage overlap between the observed CNV and those described in gnomAD SV.

Conservation (2 features)

PhyloP: To estimate the conservation of a candidate CNV, PhyloP scores are referenced. PhyloP scores are available at single nucleotide resolution. Single nucleotide scores are highly variable and are thus correlated with the size of the candidate CNV. To mitigate this correlation, we employ a centered moving average that considers all scores within a specified reading frame (Supplemental Fig. 1). This effectively smooths the otherwise volatile conservation score curve. The maximum value of this smoothed curve within the genomic coordinates of the candidate CNV is returned as the PhyloP feature. A maximum was chosen considering that higher PhyloP scores indicate higher estimated conservation.

phastCons: The same procedure used to calculate the PhyloP feature was used to estimate conservation according to the phastCons score. Similar to PhyloP, the maximum was chosen as the aggregate function considering that higher phastCons scores indicate higher estimated conservation.

Dosage sensitivity (5 features)

HI Score: Dosage sensitivity was estimated by overlapping the candidate CNV with a curated set of dosage sensitive regions described by ClinGen39. Manually curated HI scores are available for all curated regions. HI scores were one-hot encoded to handle each unique value and split into binary features. In the case of multiple overlapping regions, HI score features were summed.

TS Score: As with HI Score, one-hot encoded binary features were generated from the curated TS scores of annotated dosage sensitive regions. The sum is also taken across all binary features when multiple dosage-sensitive regions overlap with the candidate CNV.

HI Index: The minimum HI Index score40 observed in overlapping dosage-sensitive regions is obtained to represent the HI Index value for the candidate CNV. HI Index estimates the probability of loss intolerance, where lower values predict haploinsufficiency.

pLI: The maximum pLI score3 observed in overlapping dosage-sensitive regions is obtained from gnomAD to represent the pLI value for the candidate CNV. pLI estimates the probability of loss intolerance, where higher values predict haploinsufficiency.

LOEUF: The minimum LOEUF score41 observed in overlapping dosage-sensitive regions is obtained to represent the LOEUF value for the candidate CNV. LOEUF estimates the probability of loss intolerance, where lower values predict haploinsufficiency.

Training procedure and model selectionFor model training, features were generated for CNVs included in the ClinVar training set before partitioning into deletion and duplication sets. Separate multi-class models were trained for duplication and deletions, each predicting whether a candidate CNV has benign, uncertain, or pathogenic clinical significance. 29 common ML architectures were tested via fivefold cross-validation for each CNV type before selecting the top performers according to multi-class F1 score (Supplemental Table 1). Following architecture selection, fivefold cross-validation was again employed to hyperparameter tune and find the most accurate model. 10,000 sets of hyperparameter values were tested for each CNV type via the sklearn.model_selection.RandomizedSearchCV class. The top-performing models were calibrated to class distributions in the training set via the isotonic method implemented in the sklearn.calibration.CalibratedClassifierCV class.Comparator algorithm selectionCNVoyant predictions were compared to five published ML-based CNV pathogenicity classifiers, X-CNV27, TADA28, dbCNV29, StrVCTVRE30, and ISV31, as well as the algorithmic implementation of the ACMG technical standards for CNV interpretation, ClassifyCNV26. The test set was passed to CNVoyant and all comparator algorithms to test for generalizability and generate accuracy metrics for benchmarking. Given that X-CNV, TADA, dbCNV, and ClassifyCNV take GRCh37-aligned variants as input, UCSC liftOver was again called to lift DECIPHER variants from GRCh38 to GRCh37. X-CNV pathogenic probabilities yielded only three unique values across the entirety of the test set, which was unexpected given the generation of a relatively heterogeneous feature set. Rather than attempting to amend the source code to produce more continuous output, X-CNV was omitted from benchmarking.Measuring performanceTo generate granular performance data in the test cohort, accuracy metrics are reported for each of the three CNV prediction classes: benign, VUS, and pathogenic. The classification probability for each class was utilized to sort the list of CNVs and generate precision-recall (PR) and receiver operating characteristic (ROC) curves. The area under these curves (PR AUC, ROC AUC) is referenced to measure model accuracy, in addition to the average F1 score and overall accuracy for multi-class predictions. F1 and overall accuracy are only reported for CNVoyant, dbCNV, and ClassifyCNV, as these algorithms are the only three that provide multi-class output. dbCNV provides likely pathogenic and likely benign classification designations in addition to pathogenic, VUS, and benign designations. Likely pathogenic predictions were mapped to pathogenic classification and likely benign predictions were mapped to benign classification to ensure a fair comparison to CNVoyant and ClassifyCNV.TADA, ISV, and StrVCTVRE all output a single score to estimate CNV pathogenic probability. The complement of the pathogenic probability score (1-Pr (pathogenic)) was calculated to estimate benign significance scores for these comparator algorithms. The ClassifyCNV output is a score rather than a probability, but the complement was still chosen to represent benign significance, as higher scores represent a higher pathogenicity. CNVoyant is the only algorithm that provides benign and VUS probabilities; these values were used in plotting corresponding benign and VUS classification curves. dbCNV does not provide probabilities or a continuous confidence score for classification, so there is no value to reference in plotting ROC and PR curves. As such, dbCNV was not included in the ROC and PR curve comparisons.Subgroup analyses was conducted to evaluate the relative performance of CNVoyant when candidate CNVs are partitioned by either bp length or their overlap with known segmental duplications (SD). This approach grants a more granular assessment of CNVoyant’s classification accuracy across different genomic contexts, assessing whether CNVoyant maintains its predictive power across short and long CNVs. Pathogenic CNVs can be relatively small if the overlapping sequence holds critical biological significance, as is observed in cases of exon deletion. As such, it is important to ensure that CNVoyant does not make less accurate predictions in one group versus the other. In addition to partitioning subgroup analysis by bp length, CNVoyant accuracy was compared between CNVs overlapping with SDs and those not. SDs are regions within the genome that share at least 90% of their genomic sequence42. Clinically, these regions introduce complexity during significance interpretation43. As such, CNVoyant prediction accuracy was compared, and feature value distributions were generated between these groups to understand characteristic differences between CNVs overlapping with SDs and those not.Deciphering feature influence on CNV classification with SHAP analysisSHAP (SHapley Additive exPlanations) values offer a qualitative analysis tool for understanding how each feature influences the clinical significance prediction for distinct CNV classes. For CNVoyant, we generated SHAP beeswarm plots across all classes to visualize the effect of training features on model prediction44. These plots rank features by their importance and use color coding to depict the direction of their influence on the model’s output. Each point on a plot represents a feature’s SHAP value for an individual observation, quantifying its contribution to moving the model’s prediction from the base value—the dataset’s average prediction—toward the actual prediction.Application of CNVoyant in diagnostic CNV casesTo demonstrate the capacity to prioritize likely pathogenic CNVs, CNVoyant predictions were obtained for CNVs detected from five patients. Patients with diagnostic CNVs were selected from the rapid genome sequencing (rGS) study in the Institute for Genomic Medicine at Nationwide Children’s Hospital45. Inclusion criteria for the rGS study required that a medical geneticist determine that rapid genome sequencing could inform medical care. Out of 218 patients, four patients had one diagnostic CNV and one patient had two diagnostic CNVs. CNV calls were generated by the GATK GermlineCNVCaller46 and manually validated via a locally developed, clinically validated CNV user interface. To measure the performance of CNVoyant, we report the prioritized rank of the known diagnostic variant(s) for each case, along with the predicted probabilities of benign, uncertain, and pathogenic clinical significance.

CNVoyant a machine learning framework for accurate and explainable copy number variant classification

SpeciateIT and vSpeciateDB: novel, fast, and accurate per sequence 16S rRNA gene taxonomic classification of vaginal microbiota | BMC Bioinformatics

DiscovEpi: automated whole proteome MHC-I-epitope prediction and visualization | BMC Bioinformatics

Utilization of a natural language processing-based approach to determine the composition of artifact residues | BMC Bioinformatics

GenRCA: a user-friendly rare codon analysis tool for comprehensive evaluation of codon usage preferences based on coding sequences in genomes | BMC Bioinformatics

Facilitating integrative and personalized oncology omics analysis with UCSCXenaShiny

Hot Topics

SpeciateIT and vSpeciateDB: novel, fast, and accurate per sequence 16S rRNA gene taxonomic classification of vaginal microbiota | BMC Bioinformatics

DiscovEpi: automated whole proteome MHC-I-epitope prediction and visualization | BMC Bioinformatics

Utilization of a natural language processing-based approach to determine the composition of artifact residues | BMC Bioinformatics

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

SpeciateIT and vSpeciateDB: novel, fast, and accurate per sequence 16S rRNA gene taxonomic classification of vaginal microbiota | BMC Bioinformatics

DiscovEpi: automated whole proteome MHC-I-epitope prediction and visualization | BMC Bioinformatics

Utilization of a natural language processing-based approach to determine the composition of artifact residues | BMC Bioinformatics

GenRCA: a user-friendly rare codon analysis tool for comprehensive evaluation of codon usage preferences based on coding sequences in genomes | BMC Bioinformatics

Popular Articles

SpeciateIT and vSpeciateDB: novel, fast, and accurate per sequence 16S rRNA gene taxonomic classification of vaginal microbiota | BMC Bioinformatics

DiscovEpi: automated whole proteome MHC-I-epitope prediction and visualization | BMC Bioinformatics

Utilization of a natural language processing-based approach to determine the composition of artifact residues | BMC Bioinformatics