FindCSV: a long-read based method for detecting complex structural variations | BMC Bioinformatics

The design of FindCSV

The detection of CSVs faces two significant challenges, as discussed in the previous section. The first challenge is the absence of a precise definition for CSVs. To address this issue, we propose the following definition: multiple SVs from the same parent with breakpoint distances of less than 1000 bp are considered a CSV. As shown in Fig. 4, when there are no other SVs near an SV, it is a simple SV. When two SV breakpoints from either the father's or the mother's genome are within a distance of less than 1000 bp, these two SVs are considered a CSV; otherwise, they are considered two independent simple SVs. However, if the distance between the SVs is less than 1000 bp but they originate from both parents, they are considered multi-allelic SVs. The second challenge lies in the diversity of CSVs, which makes it difficult for detection methods to identify and report the correct breakpoints accurately. Although there are theoretically infinite types of CSVs, the most common types in real data are DEL+INS and DEL+INV.

To overcome the second challenge, we employ the following approach: (1) identification of candidate SV regions; (2) determination of parental allele presence; (3) clustering of parental reads; (4) generation of consensus sequences; (5) remapping of sequences; (6) analysis of remapping results. Furthermore, to address false positive SVs, we introduce a deep learning model that screens and filters out such SVs. The detailed steps of this method, including the identification of candidate SV regions, clustering of parental reads, generation of consensus sequences, remapping, and analysis of remapping results, are outlined in the next section.

Overview of FindCSV

The input for FindCSV comprises two components: (i) a sorted long-read BAM file and (ii) a FASTA file. FindCSV consists of three main steps: (1) identify candidate SV regions, (2) cluster reads and construct consensus sequences, and (3) filter and output SVs.
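The definition above can be sketched as a small classification routine. This is a minimal illustration, assuming each SV is a (start, end, parent) tuple; the `parent` labels are a hypothetical encoding, not part of FindCSV's interface:

```python
def classify_sv_pair(sv1, sv2, max_dist=1000):
    """Classify two nearby SVs following FindCSV's definition.

    Each SV is a (start, end, parent) tuple, where `parent` is
    'father' or 'mother' (an illustrative encoding).
    """
    # Breakpoint distance: gap between the two SV intervals.
    dist = max(sv1[0], sv2[0]) - min(sv1[1], sv2[1])
    if dist >= max_dist:
        return "two simple SVs"       # too far apart to be one event
    if sv1[2] == sv2[2]:
        return "CSV"                  # close and on the same parental haplotype
    return "multi-allelic SV"         # close, but from both parents
```

For example, two SVs 300 bp apart on the father's haplotype form a CSV, while the same pair split across both parents is a multi-allelic SV.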
Figure 5 illustrates the steps involved in FindCSV, and the detailed steps are explained in the following sections.

Identify candidate SV regions

The first step of FindCSV is to identify candidate SV regions, i.e., regions where SVs may exist. In contrast to other methods, this allows the subsequent analysis to focus on these candidate regions, which streamlines the detection process and improves efficiency compared to analyzing the entire genome.

Parameter estimation

Before detecting candidate SV regions, FindCSV preprocesses the BAM file by estimating the average coverage, denoted as \(\bar{C}\). This estimation is achieved by randomly selecting 1000 positions from the reference genome and calculating their average coverage using the following formula:$$\begin{aligned} \bar{C}=\frac{1}{1000}\sum _{i=1}^{1000} C_{i} \end{aligned}$$Certain regions in the genome exhibit exceptionally high coverage, sometimes exceeding dozens of times the average, which makes it challenging to accurately detect SVs there. Fortunately, these high-coverage regions contain very few SVs. Leveraging this observation, FindCSV filters them out: if a region has coverage that surpasses ten times the average coverage, it is excluded from further analysis. By discarding these regions, FindCSV mitigates the complexity introduced by read similarity between different regions and focuses on regions with more reliable read assignments and a higher likelihood of containing SVs.

Detect candidate SV regions

To identify candidate SV regions, FindCSV follows a two-step approach. In the first step, it searches for all potential SVs and generates a list of candidate SVs. The second step merges these candidate SVs to obtain candidate SV regions.
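The coverage estimate and the ten-times filter might look like the following sketch, where `coverage_at` is a hypothetical callback standing in for a depth query against the BAM file (e.g., backed by pysam's pileup machinery):

```python
import random

def estimate_average_coverage(coverage_at, genome_length, n_samples=1000, seed=0):
    """Estimate the mean coverage C-bar from n randomly sampled positions.

    `coverage_at` is an assumed callback returning read depth at a
    reference position; the sampling mirrors the formula in the text.
    """
    rng = random.Random(seed)
    positions = [rng.randrange(genome_length) for _ in range(n_samples)]
    return sum(coverage_at(p) for p in positions) / n_samples

def is_high_coverage(region_coverage, avg_coverage, factor=10):
    """Regions exceeding 10x the average coverage are filtered out."""
    return region_coverage > factor * avg_coverage
```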
FindCSV applies three criteria to filter out uninformative alignments, retaining only those alignments that satisfy the following conditions:

1. The MAPQ of the alignment is greater than 20.
2. The alignment is a primary alignment.
3. The alignment length is greater than 1000 bp.
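The three criteria reduce to a single predicate. The function below is a sketch; the parameter names (`mapq`, `is_primary`, `aln_length`) are chosen for illustration rather than taken from FindCSV's code:

```python
def is_reliable_alignment(mapq, is_primary, aln_length,
                          min_mapq=20, min_length=1000):
    """Apply FindCSV's three alignment filters.

    With pysam, these values would come from `read.mapping_quality`,
    `not (read.is_secondary or read.is_supplementary)`, and
    `read.query_alignment_length` (an assumed mapping, for illustration).
    """
    return mapq > min_mapq and is_primary and aln_length > min_length
```

Note that all three thresholds are strict inequalities, so an alignment with MAPQ exactly 20 or length exactly 1000 is discarded.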

By applying these filtering criteria, FindCSV ensures that only alignments meeting specific quality and length requirements are considered as potential SV evidence, enhancing the reliability and accuracy of the subsequent steps in the SV detection process. To detect candidate SVs, FindCSV employs two complementary approaches: one based on CIGAR strings and the other based on split reads. These approaches complement each other and enhance the accuracy of SV detection.

CIGAR strings. FindCSV examines the CIGAR strings of aligned reads to identify potential SVs. The CIGAR string represents the alignment pattern between the read and the reference genome. By analyzing variations in the CIGAR strings, such as insertions, deletions, or mismatches, FindCSV can identify regions where SVs may be present.

Split reads. FindCSV utilizes split reads to detect candidate SVs. Split reads are reads that span an SV breakpoint, with a portion of the read aligning to one genomic location and another portion aligning to a different location. By identifying split reads, FindCSV can infer the presence of SVs and determine their breakpoints. More detailed steps are outlined in Sect. 3 of the supplementary file.

By combining these two approaches, FindCSV maximizes its ability to detect candidate SVs. A length-6 tuple (Chr, S, E, L, T, R) is used to record the characteristics of each candidate SV: Chr is the chromosome name, S and E are the start and end positions of the SV, L is the SV length, T is the type of SV (DEL, INS, INV), and R is the name of the read in which the SV is located.

After obtaining the tuple for each candidate SV, FindCSV merges tuples that are close together to generate candidate SV regions. This merging is performed according to the following approach: (i) For each tuple, FindCSV creates a length-4 tuple (Chr, S, E, 1).
This new tuple serves as the initial region, where the value ‘1’ represents the initial read count in the region. (ii) Let region1 = (Chr1, S1, E1, 1) and region2 = (Chr2, S2, E2, 1) be two initial regions. If the two regions are located on the same chromosome and the distance between their breakpoints is less than 1000 bp, FindCSV merges them into a single region, recorded as (Chr1, min(S1, S2), max(E1, E2), 2), where ‘2’ is the combined read count of the two initial regions. (iii) This merging process continues iteratively until no further regions can be merged. (iv) To eliminate the effects of noise, FindCSV retains only regions with final read counts greater than three; these filtered regions are the candidate SV regions. By merging nearby tuples, FindCSV aggregates candidate SVs into candidate SV regions, providing a more comprehensive and accurate representation of the SVs present in the analyzed data.

Convert regions to images

In this section, the regions obtained from the previous steps are converted into images. To gather more comprehensive information about SVs, FindCSV introduces a parameter called “flank_len”, denoted as F. For each candidate SV region (Chr, S, E), FindCSV extracts all CIGAR strings within an extended region (Chr, S-F, E+F). The extended region ensures that adequate genomic context around the SV breakpoints is included. The value of F is the minimum of (200, 2*(E-S)), where 200 is a fixed length and 2*(E-S) makes the extended region cover twice the length of the SV region.

To represent the CIGAR strings of each candidate SV region, FindCSV uses a five-color image. In this representation, each row corresponds to a read, and each column corresponds to a position on the reference genome. Each character of the CIGAR strings is converted into a pixel according to the following rules:
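Steps (i)-(iv) can be sketched as follows. This is a simplified single-pass variant over tuples sorted by start position, assuming the input is a list of per-read (chr, start, end) candidates; the actual iterative merging in FindCSV may differ in detail:

```python
def merge_candidate_svs(tuples, max_dist=1000, min_reads=4):
    """Merge per-read candidate SV tuples (chr, start, end) into regions.

    Returns (chr, start, end, count) regions supported by more than
    three reads, following the merging rules described in the text.
    """
    regions = []  # each entry: [chr, start, end, read_count]
    for chrom, s, e in sorted(tuples):
        if (regions and regions[-1][0] == chrom
                and s - regions[-1][2] < max_dist):
            # Same chromosome and breakpoints within 1000 bp: merge.
            regions[-1][2] = max(regions[-1][2], e)
            regions[-1][3] += 1
        else:
            regions.append([chrom, s, e, 1])
    # Keep only regions with read count > 3 (i.e., at least min_reads).
    return [tuple(r) for r in regions if r[3] >= min_reads]
```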

1. An M (match) in a CIGAR string on the plus strand (+) is represented as a yellow pixel.
2. An M (match) in a CIGAR string on the minus strand (-) is represented as a blue pixel.
3. A D (DEL) in a CIGAR string is represented as a black pixel.
4. An I (INS) in a CIGAR string is represented as a red pixel.
5. An X (mismatch) in a CIGAR string is represented as a green pixel.

By converting each character of the CIGAR strings into pixels following these rules, FindCSV creates a five-color image that visually represents the variations and alignment patterns of the reads within the candidate SV region. However, representing INSs in the image poses a challenge because they do not occupy a position on the reference genome: if columns represent reference positions, insertions will not be displayed in the image and their information will be lost. To address this issue, FindCSV uses the following method. In the BAM file, the majority of positions are occupied by matches (M), and FindCSV leverages this fact to replace a portion of the matches with insertions (I) based on the insertion length. To avoid excessive information loss for matches, for every 10 bp of insertion, FindCSV replaces one match (M) on the reference with an insertion (I). For example, if the length of an insertion is 50 bp, FindCSV identifies five consecutive match characters (M) on the reference genome near the insertion and replaces them with five insertion characters (I). After this replacement, the region is converted into an image, referred to as an SV image, in which a vertical red line indicates the presence of an insertion. Figure 6 illustrates the SV images representing three different types of SVs, showcasing the ability of FindCSV to visually capture and differentiate various types of SVs.

Train LeNet model

In this section, FindCSV employs a Convolutional Neural Network model to filter the candidate SV regions. Specifically, FindCSV selects the LeNet model as the filtering model. The LeNet model consists of three convolutional layers, two subsampling layers, and two fully connected layers.
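The replacement rule (one M becomes I for every 10 bp of insertion) might be sketched as follows; the per-position row encoding and the helper name are illustrative assumptions:

```python
def embed_insertion(ref_row, ins_pos, ins_len):
    """Mark an insertion in a row of per-position CIGAR characters.

    One 'M' is replaced with 'I' for every 10 bp of insertion, starting
    at the column nearest the insertion site (a simplified sketch).
    """
    row = list(ref_row)
    n_pixels = ins_len // 10          # e.g. a 50 bp INS -> 5 'I' pixels
    for i in range(ins_pos, min(ins_pos + n_pixels, len(row))):
        if row[i] == "M":
            row[i] = "I"
    return "".join(row)
```

Rendered as pixels, the run of consecutive 'I' characters produces the vertical red line that marks the insertion in the SV image.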
These layers are designed to extract features and learn representations from the input data. The detailed parameters of the seven layers in the LeNet model are provided in Fig. 7.

To accommodate the fixed input size required by the LeNet model, the previously obtained SV images are normalized by resizing them, using the resize function of a Python image-processing library, to a uniform size of 224*224*3 pixels. The training dataset plays a crucial role in training the LeNet model. FindCSV uses the HG002_SVs_Tier1_v0.6 dataset, which is widely accepted as a benchmark in SV research. However, this benchmark only includes two types of SVs: DELs and INSs. To improve the performance of the model, FindCSV augments the training data by simulating additional types of SVs. Specifically, FindCSV generates 4000 INVs and 3258 regions with no SVs (abbreviated as noSVs). The noSVs are regions randomly selected from the reference genome and confirmed to contain no SVs. The generation process for simulated INVs is slightly more complex: first, a region is randomly selected from the reference genome and confirmed to contain no SVs; then, the region is reversed to create an inverted mutation; finally, the sequencing data is remapped onto the inverted reference genome to obtain the mapping results for the simulated INV. In this way, FindCSV obtains a training dataset comprising 20,000 regions of four types: 5463 DELs, 7279 INSs, 4000 INVs, and 3258 noSVs.

Subsequently, FindCSV converts these regions into two types of images, based on ONT data and HiFi data, yielding a total of 40,000 images in the training dataset. FindCSV inputs these normalized SV images into the LeNet model and trains it using ten-fold cross-validation. The dataset of 40,000 images is divided into ten groups, each containing 4000 images.
The training and validation data are alternated, with nine groups used for training and one group used for validation in each iteration. Training is performed for 5000 iterations, the accuracy of the model is recorded, and the model with the highest accuracy is retained for further use.

After training, FindCSV applies the trained LeNet model to the SV images of the candidate SV regions obtained in the previous steps. The model outputs the probability of each SV class occurring within a region, and the label with the highest probability is taken as the judgment. If the model determines that no SV is present in a region, FindCSV discards that candidate SV region. If the model identifies a DEL, INS, or INV, FindCSV proceeds to calculate the precise length and breakpoints of the SV in the subsequent step.

Cluster reads and construct consensus sequences

After obtaining the filtered candidate SV regions, FindCSV clusters the reads within these regions based on their alignments. The goal of this clustering step is to identify potential multi-allelic SVs originating from both parents: by clustering the reads from both parents, FindCSV can distinguish different alleles and their corresponding alignments within the SV region, allowing the identification of variations that may exist between the two parental genomes. Once the reads are grouped, FindCSV constructs a consensus sequence from the reads within each cluster; the consensus sequence represents the most likely sequence of the allele present in that cluster. Finally, FindCSV obtains the final SV by remapping the consensus sequence onto the reference genome.
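The ten-fold split of the image set can be sketched with index bookkeeping alone (the training loop and LeNet weights are omitted; the function name is illustrative):

```python
def ten_fold_splits(n_images=40000, n_folds=10):
    """Yield (train_indices, val_indices) for each cross-validation fold.

    The images are divided into ten equal groups; each fold uses nine
    groups for training and one for validation, as described in the text.
    """
    fold_size = n_images // n_folds
    for fold in range(n_folds):
        lo, hi = fold * fold_size, (fold + 1) * fold_size
        val = list(range(lo, hi))
        train = [i for i in range(n_images) if not (lo <= i < hi)]
        yield train, val
```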
This remapping process aligns the consensus sequence to the reference genome, providing the precise location and sequence information of the identified SV.

Cluster reads in candidate SV regions

FindCSV extracts all reliably aligned reads for each candidate SV region and refines them to obtain a candidate SV. However, in candidate SV regions, and particularly in repeat regions (Chr, S, E), some reads may have incorrect alignments. Figure 8 illustrates examples of read alignments in repeat regions; the alignments of Read4, Read6, and Read7 in Fig. 8 produce two small false positive SVs, which can affect the overall detection results. To mitigate the impact of wrong alignments, FindCSV employs a filtering strategy and retains only two types of long reads. The first type includes reads that span the entire region (Chr, S-1000, E+1000), such as \(Read1-Read4\) in Fig. 8. The second type comprises reads that have two or more alignments, such as Read5 in the figure. By retaining these two types of long reads, FindCSV ensures that the majority of the retained reads are less susceptible to incorrect alignments caused by repeat regions. In Fig. 8, the reads (\(Read1-Read5\)) within the green and blue boxes are considered reliable and are retained for further analysis, whereas the reads (\(Read6-Read7\)) within the red box are deemed unreliable and are filtered out. This filtering enables FindCSV to achieve improved results, particularly in repeat regions where incorrect alignments are more prevalent.

For reliable reads, FindCSV performs a merging process if a read contains more than one SV within the candidate SV region. This merging aims to consolidate multiple SVs into a single, correctly identified SV. For instance, Read4 in Fig. 8 exhibits two small false positive DELs, which FindCSV merges into a single, accurate DEL.
Suppose a read contains two SVs, SV1 (Chr, S1, E1, L1) and SV2 (Chr, S2, E2, L2). FindCSV merges them into a single SV, denoted SV_m: (Chr, min(S1, S2), max(E1, E2), L1 + L2). By combining the start positions, end positions, and lengths, FindCSV forms a merged SV with adjusted breakpoints and an updated length. In addition, even for reliable reads that do not contain any SV within the candidate SV region, FindCSV records the read as (Chr, 0, 0, 0) to maintain a comprehensive record of all reliable reads. Subsequently, FindCSV clusters the reads based on the lengths of the SVs within each candidate SV region. For a candidate SV region with N reads, FindCSV calculates the average SV length, denoted \(\bar{L}\), as follows:$$\begin{aligned} \bar{L}=\frac{1}{N}\sum _{i=1}^{N} L_{i} \end{aligned}$$FindCSV classifies candidate SV regions into two types, long SVs and short SVs, depending on whether the average SV length \(\bar{L}\) exceeds 500 bp. The clustering methods for the two types are as follows:

Long SVs (\(\bar{L}>\) 500 bp). Sort the region's SVs by SV length. Each SV is initially recorded as an individual cluster, with the average length of all SVs in the cluster set as the cluster_length. If the difference in cluster_length between two clusters is less than 20%, the two clusters are merged into a single cluster. The clusters are repeatedly compared and merged in this way until no further clusters can be merged under the threshold.

Short SVs (\(\bar{L} \le\) 500 bp). For short SVs, the influence of sequencing noise on SV length is significant.
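The per-read merge and the 20% rule for long SVs might look like this sketch; the function names and the greedy adjacent-merge order are assumptions made for illustration:

```python
def merge_svs_in_read(svs):
    """Merge a read's SVs (chr, start, end, length) into one SV.

    Implements SV_m = (Chr, min(S), max(E), sum(L)) from the text.
    """
    if not svs:
        return None  # a read with no SV is recorded as (Chr, 0, 0, 0)
    chrom = svs[0][0]
    return (chrom,
            min(s for _, s, _, _ in svs),
            max(e for _, _, e, _ in svs),
            sum(l for _, _, _, l in svs))

def cluster_long_svs(lengths, tol=0.20):
    """Greedily merge adjacent length-sorted clusters whose mean
    lengths (cluster_length) differ by less than 20%."""
    clusters = [[l] for l in sorted(lengths)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters) - 1):
            m1 = sum(clusters[i]) / len(clusters[i])
            m2 = sum(clusters[i + 1]) / len(clusters[i + 1])
            if abs(m2 - m1) / max(m1, m2) < tol:
                clusters[i] += clusters.pop(i + 1)
                merged = True
                break
    return clusters
```

For instance, SV lengths 1000 and 1050 bp differ by under 20% and collapse into one cluster, while a 2000 bp SV stays separate.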
Consequently, distinguishing between two short SVs of similar length can be challenging. To overcome this issue, FindCSV uses hierarchical clustering to separate these reads and identify distinct SVs within the cluster: specifically, if a bimodal distribution is observed across all SV lengths within a cluster, it indicates the presence of two heterozygous SVs. The detailed steps for this approach can be found in Sect. 6 of the supplementary file.

Construct consensus sequences

Following the aforementioned steps, FindCSV has clustered the reads of each candidate SV region. The objective of this step is to construct a consensus sequence from the reads within each cluster, which reduces the noise inherent in third-generation sequencing data and improves mapping results. The specific steps to obtain consensus sequences are shown in Sect. 9 of the supplementary file.

Once the consensus sequence is obtained, FindCSV remaps it to the reference genome, enabling the detection of SVs in the new mapping results. If two or more SVs are identified in the new mapping result and their breakpoint distance is less than 1000 bp, FindCSV merges these SVs into a single CSV. The breakpoint positions and types of each SV are recorded for further analysis.

Filter and output SVs

In the previous step, FindCSV generated a set of candidate SVs, which it now filters based on their lengths. Only candidate SVs with lengths greater than 30 bp are retained for further analysis. However, due to limitations in sequencing technology, many false SVs are observed in simple repeat regions. To address this issue, FindCSV implements a stricter criterion to remove false SVs specifically in simple repeat regions. The determination of whether a region is a simple repeat region is described in Sect. 7 of the supplementary file.
When a candidate SV is located in a simple repeat region, FindCSV examines the SV lengths of all reads at that location. If more than half of the reads have SV lengths greater than 40 bp, the candidate SV is considered valid and retained for further analysis. By employing this stricter criterion, FindCSV mitigates the impact of false SVs in simple repeat regions, ensuring that only reliable SVs are included in the final results.

Fig. 1 The recall, precision, and F1-score of FindCSV, DeBreak, cuteSV, Sniffles, SVcnn, and SVision across the CHM13, HG002, and HG00733 datasets
Fig. 2 The F1-score of FindCSV, Sniffles, and SVision for CSVs on the ONT and HiFi datasets
Fig. 3 The F1-score of FindCSV, Sniffles, and SVision for CSVs on the simulated datasets
Fig. 4 The differences between simple SVs, CSVs, and multi-allelic SVs
Fig. 5 The overview of FindCSV. The input consists of a sorted BAM file and a FASTA file. FindCSV consists of three main steps: (1) identify candidate SV regions, (2) cluster reads and construct consensus sequences, and (3) filter and output SVs
Fig. 6 The alignments in (Chr, S-F, E+F) are converted into an image. Each read occupies a row in the image. Yellow and blue pixels represent matches, black pixels represent deletions, red pixels represent insertions, and green pixels represent mismatches
Fig. 7 The detailed parameters of the seven layers in the LeNet model
Fig. 8 The alignments of seven distinct reads. The reads enclosed within the green and blue boxes are classified as reliable reads and retained for further analysis; the reads within the red box are identified as unreliable and filtered out from subsequent processing
Table 1 The run time and memory of different methods

Finally, for each candidate SV retained after the filtering process, FindCSV computes the reliable read count (\(R_c\)) and the support read count (\(R_s\)), which are used to calculate the support rate, denoted Rate, for each candidate SV:$$\begin{aligned} Rate = \frac{R_s}{R_c}\times 100\% \end{aligned}$$If the support rate of a candidate SV is greater than 20% and the number of supporting reads exceeds 3 (default thresholds), FindCSV considers the SV a true positive and includes it in the final output. The output SVs are saved in a Variant Call Format (VCF) file.

Performance measure

To assess the performance of different detection methods, this paper employs three evaluation metrics: Recall, Precision, and F1-score. These metrics quantify a method's ability to correctly identify true positives while avoiding false positives. All three measurements range between 0 and 1 and are defined as follows:$$\begin{aligned} & \text{Recall} = \frac{TP}{TP+FN} \\ & \text{Precision} = \frac{TP}{TP+FP} \\ & {\text {F1-score}} = \frac{2\times \text{Recall}\times \text{Precision}}{\text{Recall}+\text{Precision}} \end{aligned}$$In this context, TP (true positives) is the number of SVs correctly identified by the method that also appear in the benchmark dataset; TP+FN is the total number of SVs present in the benchmark dataset, whether detected or not; and TP+FP is the total number of SVs predicted by the method, including both true positives and false positives.
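The support-rate filter and the evaluation metrics reduce to a few lines of arithmetic; the helper names below are illustrative:

```python
def support_rate(r_s, r_c):
    """Support rate = R_s / R_c * 100 (percent)."""
    return 100.0 * r_s / r_c

def passes_output_filter(r_s, r_c, min_rate=20.0, min_support=3):
    """Default FindCSV output filter: rate > 20% and > 3 supporting reads."""
    return support_rate(r_s, r_c) > min_rate and r_s > min_support

def f1_score(tp, fp, fn):
    """F1-score as the harmonic mean of recall and precision."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)
```

For example, 5 supporting reads out of 20 reliable reads gives a 25% support rate and passes the default filter, whereas 4 out of 20 (exactly 20%) does not.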
