CNVDeep: deep association of copy number variants with neurocognitive disorders | BMC Bioinformatics

Associations of the regions with the target disorders

Our model was trained on ~195,500 CNVs from patients and healthy individuals (nearly 60 percent from patients and 40 percent from healthy individuals). We use the start and end points of the case and control CNVs to build the smallest possible regions for investigating associations with disease, separately for each chromosome and variation type. This yields a list of regions for each chromosome, delimited by CNV boundaries. The regions are depicted in Fig. 2. As a result, many of the problems discussed in the Background section are resolved. We then compute the amount of overlap between the CNVs of each individual (healthy or patient) and the regions. Each individual is labeled one if a patient and zero otherwise. This step converts the case–control study into a format suitable for feeding into our model. Next, we train a multi-layer perceptron (MLP). For each target disease, we first pretrain on the CNVs for all brain disorders. In the fine-tuning phase, we use only the CNV data for the target disease (with labels of the target disease). In this second phase, training adds a regularization term, Group LASSO, to the first layer of the MLP. Using this term, we can identify possible disease-causing regions. The details are discussed in the Method section.

Fig. 2 We build the regions from the starts and ends of the CNVs in cases and controls. To create the regions, we sort the starts and ends of the case/control CNVs on each chromosome and use these points as region boundaries. In the figure, the blue line represents case CNVs, and the green line represents control CNVs.
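The region construction and overlap computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the half-open `(start, end)` convention, and the choice to express overlap as a fraction of the region length are all assumptions made for the example. It handles one chromosome and one variation type at a time.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in base pairs; half-open is an assumption


def build_regions(case_cnvs: List[Interval], control_cnvs: List[Interval]) -> List[Interval]:
    """Split a chromosome into the smallest regions bounded by CNV starts and ends."""
    # Collect every distinct start and end point, then sort them.
    points = sorted({p for start, end in case_cnvs + control_cnvs for p in (start, end)})
    # Each pair of consecutive breakpoints delimits one region.
    # Note: gaps between disjoint CNVs also become regions in this sketch.
    return list(zip(points[:-1], points[1:]))


def overlap_features(cnvs: List[Interval], regions: List[Interval]) -> List[float]:
    """Fraction of each region covered by one individual's CNVs (hypothetical feature)."""
    features = []
    for r_start, r_end in regions:
        covered = sum(max(0, min(end, r_end) - max(start, r_start)) for start, end in cnvs)
        features.append(covered / (r_end - r_start))
    return features


# One case CNV and one control CNV that partially overlap, as in Fig. 2.
regions = build_regions(case_cnvs=[(100, 300)], control_cnvs=[(200, 400)])
print(regions)
# Feature vector for an individual carrying a single CNV from 150 to 250.
print(overlap_features([(150, 250)], regions))
```

The feature vector assumes that the CNVs of one individual do not overlap each other; otherwise a fraction could exceed one.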
Three regions are formed: (start_CNV_case, start_CNV_control), (start_CNV_control, end_CNV_case), and (end_CNV_case, end_CNV_control)

Comparison with machine learning methods

We selected several machine learning methods and evaluation benchmarks to assess the algorithm’s performance from a machine learning viewpoint. The three methods chosen for comparison are described below.

The permutation feature importance algorithm [16] measures the drop in a model's performance when the values of a feature are randomly shuffled. The random forest algorithm [17] employs bagging and feature randomness with multiple decision trees. In gradient boosting [18], each classifier improves on its predecessor by fitting the predecessor's residual errors. The results for ROC AUC and accuracy are reported in Table 1 and Fig. 3. The procedure is as follows: we fed the data of each disorder to every method (assigning label one to cases and label zero to controls) and then evaluated the accuracy and ROC AUC of the results. CNVDeep achieves better results than the other methods (for every disease, Table 11 lists the top regions discovered by CNVDeep).

Table 1 Comparison with different machine learning methods in terms of machine learning criteria

Fig. 3 ROC curves; yellow curves are for CNVDeep, red for random forest, green for gradient boosting, and blue for permutation feature importance; the diagonal is the random association (Y = X). The top left chart is for SCZ, the top right is for ASD, and the bottom chart is for DD

Overrepresentation of brain-enriched genes in the candidate regions

Brain disorders are the target diseases for which we seek CNV associations; a deficiency in brain development characterizes this group. As a result, genes that overlap with candidate regions may be overrepresented in the brain [19].
We used the set of brain-enriched genes provided in [10] to measure the percentage of brain-enriched genes that overlap with the candidate regions. Some brain-enriched examples are GABRG3 and GABRA5 duplications for ASD, FAM178B and ANKRD39 deletions for SCZ, and SNHG14 and DIP2C duplications for DD. We compare the percentages of coding and noncoding genes for each disease to those found in previous studies. We compared our results to the most extensive study on developmental delay [11], the state-of-the-art results on ASD, and the most commonly used CNV tool (PLINK). They all covered lower percentages of brain-enriched genes than our list. Table 2 lists the results.

Table 2 Comparison of the brain enrichment of various models in coding and noncoding genes. The method is compared with highly cited and state-of-the-art methods for each dataset

Among the chromosomes, chromosome 22 has the largest number of brain-enriched genes for brain disorders. Some regions we identified overlap with many brain-enriched genes (coding or noncoding). They are listed in Table 3.

Table 3 Some regions overlap with many coding and noncoding brain-enriched genes. The column #Coding_OV is the number of brain-enriched coding genes that overlap with the region. Noncoding_OV is the number of brain-enriched noncoding genes that overlap with it

Analysis of mouse gene homologs associated with nervous system phenotypes

The study of animal models helps us understand disease mechanisms in similar organisms. Mutant mouse models with phenotypic defects in the nervous system are among the models available for exploring neurocognitive disorders.

Our proposed method achieves better results than the other significant methods on these datasets; the details of the results are presented in Table 4.
In our method, the overlap of coding genes with the candidate regions is associated with a higher percentage of gene homologs with nervous system traits.

Table 4 Comparison of the fractions of overlaps with mouse mutant genes with nervous system phenotypes. Here, we report the percentage of gene homologs that cause nervous system phenotypes in mice. The compared tools are state-of-the-art and highly cited methods. The percentage is reported separately by variation type

Some regions overlap with numerous mouse mutant genes, such as the ones listed in Table 5. Notably, some genes overlap substantially with the candidate regions; examples are GABRA5 and DSCAM for ASD. Among the chromosomes, chromosome 22 contains most of the genes with such characteristics for ASD, SCZ, and DD.

Table 5 Regions with the most overlap with the mouse mutant genes. #OV is the number of genes that overlap with the region and cause nervous system phenotypes in mice

Phenotypes associated with the candidate regions

To analyze phenotypes associated with the candidate regions of each disease, we use the DECIPHER [15] data source, which contains genotype–phenotype information for ~12,600 patients and ~16,600 CNVs with ~2,600 phenotypes. Specifically, for each region–phenotype pair, we compute the fraction of patients with that phenotype whose CNVs overlap the target region and compare it with the expected fraction. For ASD, 1,748 patients with 1,031 phenotypes overlapped with significant regions. For DD, the number of overlapped patients was 2,434, with 1,283 phenotypes. For SCZ, these numbers were 976 patients with 688 phenotypes. A heatmap shows the relationship between phenotypes and candidate regions for each target disease. Figures 4, 5, and 6 show the results for DD, ASD, and SCZ, respectively. The detected regions are in the rows, and DECIPHER phenotypes are in the columns. The bold points are regions with overrepresented phenotypes.

Fig. 4 The heatmap for DD.
The top labels represent DECIPHER phenotypes, and the left labels are candidate regions for developmental delay. The bolder the dots, the stronger the relationship between region and phenotype. Some associated phenotypes are seizures, abnormal facial shape, and specific learning disabilities

Fig. 5 The heatmap for ASD. The left labels are candidate regions for autism. The top labels are DECIPHER phenotypes. Some significant phenotypes for ASD are behavioral abnormality, intellectual disability, and cognitive impairment

Fig. 6 The heatmap for SCZ. The horizontal and vertical labels are the same as in the previous heatmaps. Some of the highlighted phenotypes are autistic behavior and abnormal social behavior

As shown in the heatmaps, the following DECIPHER phenotypes were highlighted as associated: for ASD, ‘intellectual disability,’ ‘global developmental delay,’ ‘delayed speech and language development,’ ‘autism,’ ‘seizures,’ ‘microcephaly,’ ‘obesity,’ ‘muscular hypotonia,’ ‘short stature,’ ‘behavioral abnormality,’ ‘cognitive impairment,’ and ‘autistic behavior’; for developmental delay (DD), ‘intellectual disability,’ ‘delayed speech and language development,’ ‘autism,’ ‘seizures,’ ‘microcephaly,’ ‘behavioral abnormality,’ ‘short stature,’ and ‘obesity’; and for SCZ, ‘intellectual disability,’ ‘global developmental delay,’ ‘delayed speech and language development,’ ‘microcephaly,’ ‘autism,’ ‘seizures,’ ‘short stature,’ ‘behavioral abnormality,’ and ‘cognitive impairment.’

In addition, some regions have the most associations with phenotypes: for ASD, a deletion in a region in 16p11.2 (Footnote 1); for DD, a deletion in a subregion in 15q11.2 (Footnote 2); and for SCZ, a deletion in a subregion in 15q11.2 (Footnote 3).

Genes common to all three disorders and those overrepresented in only one gender

Next, we conduct a cumulative analysis to identify the regions shared by all target disorders and the associated genes.
According to our investigation, and considering the type of variation (deletion or duplication), some of the genes common to the three disorders are deletions in PRKAB2, CRKL, GJA5, and SLC7A4 and duplications in FAM57B and BCL7B. Some genes common to ASD and DD are deletions in GTF2IRD1, SNAP29, and AC083884 and a duplication in ACP6; common to ASD and SCZ are duplications in BCL7B, GDPD3, TMEM219, and PRKAB2 and a deletion in TANGO2; and common to SCZ and DD are deletions in CDC45, FBXO45, and LINC00624 and a duplication in WBSCR22.

We performed another analysis for each target disorder using the datasets for which gender was available. We compared the percentages of male and female patients who had a variation in each region. Accordingly, for ASD, the duplication in 16p11.2, in the subregion from 30,194,353 to 30,199,805, is dominated by males, and the duplication in 21q22.13, in the exact subregion from 38,735,314 to 38,909,325, is dominated by females. Finally, for DD, the following list can be proposed for males and females:

Male: deletion in 3q29, in the exact region from 197,072,247 to 197,300,214.

Female: duplication in 1q21.1, in the exact region from 146,852,473 to 146,989,699.

Female: deletion in 15q11.2, in the subregion from 22,833,499 to 22,873,941.

Gene ontology analyses of the candidate regions

To conduct gene ontology analyses on the overlapped genes, we used WebGestalt [20]. Several analyses were performed, including gene ontology, human phenotype ontology, and disease terms (DisGeNET and GLAD4U), and several brain codes were used as background genes. The other parameters were those preset on the website (Footnote 4). Tables 6, 7, and 8 report the results for each target disease. In these tables, FDR stands for False Discovery Rate. For ASD, some of the results, such as autistic behavior and autism, were trivial. Nontrivial results were obsessive–compulsive behavior, axon development, cognition, regulation of membrane, abnormal social behavior, and hyperactivity, some of which were also mentioned in [21].

Table 6 ASD analysis results. Three types of analyses were performed on ASD candidate genes using WebGestalt. This table highlights obsessive–compulsive behavior, axon development, and cognition

Table 7 DD analysis results. Three types of analyses were performed on DD candidate genes using WebGestalt. Some highlighted terms are axon development, regulation of synapse structure or activity, and failure to thrive in infancy

Table 8 SCZ analysis results. The results of two types of analyses are listed in this table

Results of the DD analysis include obsessive–compulsive behavior, cognition, neuron projection organization, regulation of membrane potential, regulation of neuron projection development, regulation of synapse structure or activity, positive regulation of signaling receptor activity, and axon development, as exhibited in [22].

Statistical analysis

We also conducted an independent analysis of the regions of different chromosomes. We used Fisher’s exact test (Table 9) to evaluate each region’s relative amount of case and control overlaps. The significance threshold was determined using 100,000 random permutations of case and control labels to ensure that the results did not arise by chance.
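As a rough illustration of this step, the sketch below computes a two-sided Fisher's exact p-value from the four counts of a case/control by overlap/non-overlap table and derives a threshold from label permutations. The text does not specify how the permutations are turned into a threshold, so taking a quantile of the smallest p-value per permutation is an assumption here, as are all function names and the small default number of permutations.

```python
import random
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a: cases overlapping the region,    b: cases not overlapping,
    c: controls overlapping the region, d: controls not overlapping.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Sum the tables that are at least as unlikely as the observed one.
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7)))


def region_p_value(overlap_column, labels) -> float:
    """p-value for one region, given per-individual overlap flags and 0/1 labels."""
    a = sum(1 for o, y in zip(overlap_column, labels) if o and y == 1)
    c = sum(1 for o, y in zip(overlap_column, labels) if o and y == 0)
    n_cases = sum(labels)
    return fisher_exact_two_sided(a, n_cases - a, c, len(labels) - n_cases - c)


def permutation_threshold(overlaps, labels, n_perm=1000, alpha=0.05, seed=0) -> float:
    """Assumed scheme: alpha-quantile of the minimum p-value over shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    columns = list(zip(*overlaps))  # one column of overlap flags per region
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        min_ps.append(min(region_p_value(col, shuffled) for col in columns))
    min_ps.sort()
    return min_ps[int(alpha * n_perm)]
```

A region would then be reported when its observed p-value falls below the returned threshold; with the 100,000 permutations mentioned in the text, `n_perm` would be raised accordingly.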
The sample diagrams for the three chromosomes are shown in Fig. 7.

Table 9 The matrix for computing Fisher’s exact test; four numbers are needed for each region to calculate the p-value: cases/controls by overlaps/non-overlaps

Fig. 7 P-values for three chromosomes; the Y-axis is −log10(P-value). The X-axis is the chromosome coordinate in base pairs

Analysis with synthetic data

The three datasets of available disorders were used to design a new dataset. A random sample of 25,000 patients from cases and 20,000 healthy individuals from controls was selected.

Let src_cnv be (src_ch, src_type, src_strt, src_end) for one of the three data sources. The CNVs of each patient and healthy individual were subjected to a random perturbation to produce new_cnv = (new_ch, new_type, new_strt, new_end), where:

$$new\_ch = src\_ch \tag{1}$$

$$new\_type = \begin{cases} del, & p = 1/2 \\ dup, & p = 1/2 \end{cases} \tag{2}$$

$$new\_strt = \begin{cases} src\_strt - 10\,\text{kbp}, & p = 1/3 \\ src\_strt, & p = 1/3 \\ src\_strt + 10\,\text{kbp}, & p = 1/3 \end{cases} \tag{3}$$

$$new\_end = \begin{cases} src\_end - 10\,\text{kbp}, & p = 1/3 \\ src\_end, & p = 1/3 \\ src\_end + 10\,\text{kbp}, & p = 1/3 \end{cases} \tag{4}$$
Here, p is the probability of each case, so each option is drawn from a discrete uniform distribution. The new CNV is constructed so that its chromosome matches that of the source CNV, its type is randomly chosen as deletion or duplication, and its start and end are each randomly shifted by 10 kbp, or left unchanged, relative to the source CNV. When producing these new CNVs, the CNVs of an individual should not overlap.

Table 10 shows the results of evaluating our synthetic dataset using machine learning criteria and measuring the percentages of brain-enriched and mouse-mutant genes.

Table 10 Performance Percentage for Synthetic Data
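The perturbation in Eqs. (1) to (4) can be sketched as follows. This is an illustrative reading rather than the authors' code: the `CNV` tuple, the function names, and in particular the retry loop used to satisfy the non-overlap requirement are assumptions, since the text does not say how overlapping results are avoided.

```python
import random
from typing import List, NamedTuple


class CNV(NamedTuple):
    chrom: str
    cnv_type: str  # "del" or "dup"
    start: int
    end: int


SHIFT = 10_000  # 10 kbp, as in Eqs. (3) and (4)


def perturb(src: CNV, rng: random.Random) -> CNV:
    """Apply Eqs. (1)-(4) to one source CNV."""
    return CNV(
        chrom=src.chrom,                                    # Eq. (1): chromosome kept
        cnv_type=rng.choice(["del", "dup"]),                # Eq. (2): each with p = 1/2
        start=src.start + rng.choice([-SHIFT, 0, SHIFT]),   # Eq. (3): each with p = 1/3
        end=src.end + rng.choice([-SHIFT, 0, SHIFT]),       # Eq. (4): each with p = 1/3
    )


def is_valid(cnvs: List[CNV]) -> bool:
    """True if every CNV has start < end and no two CNVs on a chromosome overlap."""
    ordered = sorted(cnvs, key=lambda c: (c.chrom, c.start))
    if any(c.start >= c.end for c in ordered):
        return False
    return all(
        prev.chrom != nxt.chrom or prev.end <= nxt.start
        for prev, nxt in zip(ordered, ordered[1:])
    )


def perturb_individual(cnvs: List[CNV], rng: random.Random, max_tries: int = 100) -> List[CNV]:
    """Assumed strategy: redraw the perturbation until the individual's CNVs are valid."""
    for _ in range(max_tries):
        candidate = [perturb(c, rng) for c in cnvs]
        if is_valid(candidate):
            return candidate
    raise ValueError("could not produce non-overlapping CNVs")


rng = random.Random(0)
source = [CNV("chr1", "del", 1_000_000, 1_200_000), CNV("chr1", "dup", 5_000_000, 5_300_000)]
print(perturb_individual(source, rng))
```

Source CNVs shorter than about 20 kbp can end up with a start at or beyond the end after shifting, which is why the validity check also rejects such cases in this sketch.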
