Maptcha: an efficient parallel workflow for hybrid genome scaffolding | BMC Bioinformatics

In this section, we describe in detail all the steps of our Maptcha algorithmic framework for hybrid scaffolding. Let \(\mathcal {C}=\{c_1,c_2,\ldots c_n\}\) denote a set of n input contigs (from prior assemblies). Let \(\mathcal {L}=\{r_1,r_2,\ldots r_m\}\) denote a set of m input long reads. Let |s| denote the length of a string s. We use \(N=\Sigma _{i=1}^n |c_i|\) and \(M=\Sigma _{i=1}^m |r_i|\). Furthermore, for a contig c, let \({\bar{c}}\) denote its reverse complement.

Problem statement: Given \(\mathcal {C}\) and \(\mathcal {L}\), the goal of our hybrid scaffolding problem is to generate a set of scaffolds \(\mathcal {S}\) such that a) each scaffold \(S\in \mathcal {S}\) represents a subset of \(\mathcal {C}\), with no two subsets intersecting (i.e., \(S_i\cap S_j=\emptyset \)); and b) each scaffold \(S\in \mathcal {S}\) is an ordered sequence of contigs \([c_1,c_2,\ldots ]\), with each contig participating either in its direct form c or in its reverse complemented form \({\bar{c}}\). Here, each successive pair of contigs in a scaffold is expected to be linked by one or more long reads \(r\in \mathcal {L}\). Intuitively, there are two objectives: i) maximize recall—i.e., generate as few scaffolds as possible; and ii) maximize precision—i.e., the relative ordering and orientation of the contigs within each scaffold should match the true (but unknown) ordering and orientation of those contigs along the target genome.

Algorithm: The design of the Maptcha scaffolding algorithmic framework is broken down into three major phases.
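To make the problem statement concrete, the following minimal Python sketch shows one way to represent a scaffold as an ordered, oriented sequence of contigs. The function names and the fixed run of Ns used as a gap separator are our own illustrative assumptions, not Maptcha's actual data structures.

```python
# Minimal sketch of the scaffold representation in the problem statement.
# A scaffold is an ordered list of (contig_id, orientation) pairs;
# '+' means the contig is used directly, '-' means its reverse complement.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Return the reverse complement c-bar of a DNA string c."""
    return s.translate(COMPLEMENT)[::-1]

def scaffold_sequence(scaffold, contigs, gap="N" * 100):
    """Concatenate oriented contigs, separated by a fixed gap of Ns."""
    parts = []
    for contig_id, orient in scaffold:
        seq = contigs[contig_id]
        parts.append(seq if orient == "+" else reverse_complement(seq))
    return gap.join(parts)
```

For example, the scaffold `[("c1", "+"), ("c2", "-")]` places contig c1 directly and contig c2 in reverse complemented form.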

contig expansion: In the first phase, using the contigs as seeds, we aim to extend them on either end using long reads that align with those contigs. This extension step is also designed to detect and connect successive pairs of contigs with direct long read links. This yields the first generation of our partial scaffolds.

longread island construction: Note that not all long reads may have contributed to these partial scaffolds, in particular those long reads which fall in the gap regions of the target genome between successive scaffolds. Therefore, in the next phase, we detect the long reads that do not map to any of the first generation partial scaffolds, and use them to build partial scaffolds corresponding to these long read island regions. This new set of partial scaffolds corresponds to the second generation of partial scaffolds.

link scaffolds with bridges: Finally, in the last phase, we aim to link the first and second generation scaffolds using long reads that serve as bridges between them. This step outputs the final set of scaffolds.

This three phase approach has the following advantages. First, it provides a systematic way to progressively combine, in an incremental fashion, the sequence information available from the input contigs (which typically tend to be more accurate albeit fragmented, if generated from short reads) with the input long reads (which may be significantly larger in number). Next, this incremental approach can also reduce the main computational workload within each phase, which is required for mapping long reads. More specifically, we choose to align long reads either to the contigs or to the generated partial scaffolds wherever possible, and in the process restrict the more time consuming long read to long read alignments only to the gap regions not covered by any of the contigs or partial scaffolds. In this paper, we use JEM-mapper, a recently developed fast (parallel) and accurate sketch-based alignment-free long read mapping tool suited for hybrid settings [43, 45]. Finally, by decoupling the contig ordering and orientation step (which is a graph-theoretic problem) from the scaffold generation step (which is an assembly problem), we are able to efficiently parallelize the scaffold generation step. This is achieved through a batching step that splits the input sequences into separate batches, allowing the use of any existing standalone long read assembler to generate the final sequence scaffolds. Our framework is capable of leveraging any off-the-shelf long read assembler. In this paper, we use Hifiasm [8], one of the most widely used state-of-the-art long read assembly tools, as our standalone assembler.

In what follows, we describe the details of our algorithm for each of the three major phases of our approach. Figure 2 provides an illustration of all the main steps within each of these three phases.

Fig. 2 A detailed illustration of the Maptcha pipeline showing the different phases and their steps

Phase: contig expansion

The goal of this phase is to enhance contigs by incorporating long reads that align to them. This process extends contigs by connecting multiple ones into a scaffold using the long reads aligned to them, thereby increasing the overall length of the contigs. This is achieved by first mapping the long reads to contigs, and then using that information to link contigs and extend them into our first generation of partial scaffolds (panel I in Fig. 2).

We use the following definition of a partial scaffold in our algorithm: A partial scaffold corresponds to an ordered and oriented sequence of an arbitrary number of contigs \([c_i,c_j,c_k,\ldots ]\) such that every consecutive pair of contigs along the sequence is linked by one or more long reads.

Step: Mapping long reads to contigs: For mapping, we use JEM-mapper, an alignment-free, distributed memory parallel mapping tool, because it is both fast and accurate [45, 46]. JEM-mapper employs a sketch-based alignment-free approach that computes a minimizer-based Jaccard estimator (JEM) sketch between a subject sequence and a query sequence. More specifically, in a preprocessing step, the algorithm generates a list of minimizing k-mers [47, 51] from each subject (i.e., each contig) and then from that list computes minhash sketches [4] over T random trials (we use \(T=30\) for our experiments). Subsequently, JEM sketches are generated from query long reads. Based on these sketches, for each query the tool reports the subject to which it is most similar. For further details on the methodology, refer to the original paper by Rahman et al.
[45].

One challenge of using a mapping tool is that the subject (contigs) and query (long reads) sequences may be of variable lengths, possibly resulting in vastly different sized ground sets of minimizers from which to draw the sketches. However, it is the minimizers from the aligning region between the subject and query that should ideally be considered for mapping purposes. To circumvent this challenge, in our implementation we generate sketches only from the two ends of a long read. In other words, our mapping step maps each long read to at most two contigs, one corresponding to each end of that long read. Note that this implies a contig may potentially appear in the mapped set for multiple long reads (depending on the sequencing coverage depth). In our implementation, we used a length of \(\ell \) base pairs (\(\ell =2\) Kbp in our experiments) from either end of a long read for this purpose. The intuitive rationale is that since we are interested in a scaffolding application, this approach of involving the ends of long reads (and their respective alignments with contigs) provides a way to link two distantly located contigs (along the genome) through long read bridges.

Using this approach in our preliminary experiments, we compared JEM-mapper with Minimap2 and found that JEM-mapper yielded better quality results for our test inputs (results summarized in the supplementary section, Figure S2).

Step: Graph construction: Let \(\mathcal {M}\) denote the mapping output, which can be expressed as the set of 2-tuples of the form \(\langle c,r\rangle \)—where long read r maps to a contig c—output by the mapper. We use \(L_c\subseteq \mathcal {L}\) to denote the set of all long reads that map to contig c, i.e., \(L_c=\{r \;|\; \langle c,r\rangle \in \mathcal {M}\}\).
Informally, we refer to \(L_c\) as the long read set corresponding to contig c. Using the information in \(\mathcal {M}\), and in \(L_c\) for all \(c\in \mathcal {C}\), we construct an undirected graph G(V, E), where:

V is the vertex set such that there is one vertex for every contig \(c\in \mathcal {C}\); and

E is the set of all edges of the form \((c_i,c_j)\), such that there exists at least one long read r that maps to both contigs \(c_i\) and \(c_j\) (i.e., \(L_{c_i}\cap L_{c_j}\ne \emptyset \)).
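The construction of G from the mapping output \(\mathcal {M}\) can be sketched as follows. This is an illustrative Python sketch under our own naming assumptions (not Maptcha's code): it groups the \(\langle c,r\rangle \) tuples by read ID and then connects every pair of contigs that share a read, recording the shared read set on each edge.

```python
# Sketch of the contig-graph construction: group mapping tuples
# <contig, read> by read ID, then connect every pair of contigs that
# share a read. Edge values are the shared long-read sets L_{i,j};
# their sizes are the edge "support values". Illustrative names only.
from collections import defaultdict
from itertools import combinations

def build_contig_graph(mappings):
    """mappings: iterable of (contig_id, read_id) tuples from the mapper."""
    reads_of = defaultdict(set)       # L_c: long reads mapped to contig c
    contigs_of = defaultdict(set)     # inverse index: contigs per read
    for c, r in mappings:
        reads_of[c].add(r)
        contigs_of[r].add(c)
    edges = defaultdict(set)          # (ci, cj) -> L_{i,j}
    for r, cs in contigs_of.items():
        for ci, cj in combinations(sorted(cs), 2):
            edges[(ci, cj)].add(r)
    return reads_of, dict(edges)
```

Here an edge exists exactly when \(L_{c_i}\cap L_{c_j}\ne \emptyset \), and `len(edges[(ci, cj)])` gives that edge's support value.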

Intuitively, each edge is the result of two contigs sharing one or more long reads in their mapping sets. In our implementation, we store the set of long read IDs corresponding to each edge. More specifically, along an edge \((c_i,c_j)\in E\), we also store its long read set \(L_{i,j}\), given by \(L_{c_i}\cap L_{c_j}\). The cardinality of \(L_{i,j}\) is referred to as the “support value” of the edge between these two contigs. Since the vertices of G correspond to contigs, we refer to G as a contig graph.

Next, the graph G, along with all of its auxiliary edge information as described above, is used to generate partial scaffolds. We perform this in two steps: a) first, enumerate paths in the contig graph that are likely to correspond to different partial scaffolds (this is achieved by our wiring algorithm, described next); and b) subsequently, generate contiguous assembled sequences for each partial scaffold by traversing the paths from the previous step (this is achieved using a batch assembly step, described subsequently).

Step: Wiring heuristic: Recall that our goal is to enumerate partial scaffolds, where each partial scaffold is a maximal sequence of contiguously placed (non-overlapping) contigs along the target genome. In order to enumerate this set of partial scaffolds, we make the following observation about paths generated from the contig graph G(V, E). A partial scaffold \([c_i, c_{i+1}, \ldots , c_{j}]\) can be expected to be represented in the form of a path in G(V, E). However, it is important to note that not all graph paths may correspond to a partial scaffold. For instance, consider a branching scenario where a path has to go through a branching node with more than one viable path out of that node (contig). If a wrong decision is taken to form paths out of branching nodes, the resulting paths could end up having chimeric merges (where contigs from unrelated parts of the genome are collapsed into one scaffold).
While there is no way to check for such correctness during assembly, we present a technique we call wiring, described below, to compute partial scaffolds in a way that reduces the chance of false merges.

The wiring algorithm’s objective is to enumerate maximal acyclic paths in G—i.e., maximality to ensure the longest possible extension of the output scaffolds, and acyclicity to reduce the chance of generating chimeric errors due to repetitive regions in the genome (as illustrated in Fig. 5). This problem is trivial if each vertex in V has at most two neighbors in G, as the graph becomes akin to a linked list of contigs, each with one predecessor contig and one successor contig. However, in practice, we expect several branching vertices with degree greater than two (indicating the potential presence of repeats). Therefore, finding a successor and/or a predecessor vertex becomes a non-trivial path enumeration problem that must carefully resolve paths around branching nodes.

Algorithm: Our wiring algorithm is a linear time algorithm that first computes a “wiring” internal to each vertex, between edges incident on that vertex, and then uses that wiring information to generate paths. First, we describe the wiring heuristic.

Step 1: Wiring of vertices: For each vertex \(c\in V\) that has degree at least two, the algorithm selects a subset of two edges incident on that vertex to be “wired”, i.e., to be connected to form a path through that vertex, as shown in Fig. 3. The two edges so wired determine the vertices adjacent on either side of the current vertex c.

To determine which pair of edges to connect, we use the following heuristic. Let \(L_i\) denote the set of long read IDs associated with edge \(e_i\).
We then (hard) wire two distinct edges \(e_i\) and \(e_j\) incident on a vertex c if \(L_i\cap L_j\ne \emptyset \) and this intersection is maximized over all possible pairs of edges incident on c, i.e., \(\arg \max _{e_i,e_j\in \mathcal {E}(c)} |L_i\cap L_j|\), where \(\mathcal {E}(c)\) denotes all edges incident on c.

The simple intuition is to look for the pair of edges that allows maximum long read-based connectivity in the path flowing through that vertex (contig). This path has the largest support from the long read set and is therefore most likely to stay true to the connectivity between contigs along the target genome. All other possible paths through that vertex are ignored. The resulting wired pair of edges \(\langle e_i,e_j\rangle \) generated from each vertex c is added in the form of a wired-edge 3-tuple \(\langle c_i, c_j, c \rangle \). We denote the resulting set as \(\mathcal {W}\).

There are two special cases to consider here. First, if no pair of edges incident on a vertex c has long reads in common (i.e., \(L_i\cap L_j=\emptyset \) for all pairs of incident edges), then there is no evidence of a link between any pair of edges on that contig. Therefore, our algorithm does not wire any pair of edges for that contig. In other words, if a walk (step 2) should reach this vertex (contig), the walk terminates at this contig.

As another special case, if a vertex c has degree one, then the wiring task is trivial, as there exists only one choice to extend a path out of that contig, namely along the single edge e attached to that vertex, leading to the neighboring contig \(c_e\). We treat this as a special case of wiring by introducing a dummy contig \(c_{d}\) for each such vertex with degree one, and adding the tuple \(\langle c_d, c_e, c \rangle \) to \(\mathcal {W}\).

Note that by this procedure, each vertex c has at most one entry in \(\mathcal {W}\). To implement this wiring algorithm, all we need to store is the set of long read IDs along each edge.
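The per-vertex wiring decision above can be sketched as follows. This is an illustrative simplification under our own naming assumptions (not Maptcha's implementation): the two incident edges with the largest shared long read set are hard wired, degree-one vertices receive a dummy neighbor, and a vertex with no sharing pair is left unwired.

```python
# Sketch of the wiring heuristic at one vertex c: pick the pair of
# incident edges whose long-read sets share the most reads.
# "DUMMY" stands in for the dummy contig c_d of a degree-one vertex.
from itertools import combinations

def wire_vertex(c, neighbors):
    """neighbors: dict mapping each contig adjacent to c to the set of
    long read IDs shared with c along that edge.
    Returns a wired tuple (left, right, c), or None if no wiring."""
    if len(neighbors) == 1:
        (only,) = neighbors
        return ("DUMMY", only, c)            # degree-one special case
    best, best_support = None, 0
    for ci, cj in combinations(sorted(neighbors), 2):
        support = len(neighbors[ci] & neighbors[cj])
        if support > best_support:           # maximize |L_i ∩ L_j|
            best, best_support = (ci, cj, c), support
    return best                              # None if no pair shares a read
```

Because each vertex decides independently, these calls can run in parallel over a partitioned graph, as noted below.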
A further advantage of this approach is that the wiring decision is made independently at each vertex, and therefore this step easily parallelizes into a distributed algorithm that works with a partitioning of the input graph.

Fig. 3 Illustration of the wiring heuristic, shown centered at a contig vertex \(c_i\). On either side of \(c_i\) are shown other contigs (\(c_1\) through \(c_k\)) that each have at least one long read in common with \(c_i\). The long read sets shared between any contig \(c_j\) and \(c_i\) are denoted by \(L_{i,j}\) (same as \(L_{j,i}\)). Out of all possible pairwise connections between the incident edges, the wiring heuristic will select only one edge pair

Step 2: Path enumeration: In the next step, we enumerate edge-disjoint acyclic paths using all the wired information in \(\mathcal {W}\). The rationale behind the edge-disjoint property is to reduce the chance of genomic duplication in the output scaffolds. The rationale for avoiding cycles in paths is two-fold—both to reduce genomic duplication due to repetitive contigs, and to reduce the chance of creating chimeric scaffolds.

The path enumeration algorithm (illustrated through an example in Fig. 4) works as follows.

(i)

Initialize a visit flag at all vertices and set them to unvisited.

(ii)

Initialize a work queue Q of all vertices with degree one (e.g., \(c_a\), \(c_e\), \(c_f\), \(c_g\) and \(c_h\) in Fig. 4).

(iii)

For each vertex \(c\in Q\), if c is still unvisited, dequeue c, start a new path at c (denoted by \(P_c\)), and grow the path as follows. The edge e incident on c connects c to another vertex, say \(c_1\). Then \(c_1\) is said to be the successor of c in the path and is appended to \(P_c\). We now mark the vertex c as visited. Subsequently, the algorithm iteratively extends the path by simply following the wired pairing of edges at each vertex visited along the way—marking each such vertex as visited and stitching together the path—until we arrive at one of the following termination conditions:

Case a)

Arrive at a vertex which has chosen a different predecessor vertex: See for example path \(P_1\) truncated at \(c_b\) because the wiring at \(c_b\) has chosen a different pair of neighbors other than \(c_a\) based on long read support, i.e., \(\mathcal {W}\) contains \(\langle c_g, c_c, c_b\rangle \). In this case, we add the vertex \(c_b\) at the end of the current path \(P_1\) and terminate that path.

Case b)

Arrive at a vertex that is already visited: This again implies that no extension beyond this vertex is possible without causing duplication between paths. This case is handled the same way as Case a, by adding the visited vertex as the last vertex in the path and terminating the path.

Case c)

Arrive at a degree one vertex: This implies that the path has reached its end at the corresponding degree one contig and the path is terminated at this contig.

More examples of paths are shown in Fig. 4.
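The walk described in steps (i) through (iii), with its three termination cases, can be sketched as follows. This is an illustrative simplification with our own names (not Maptcha's code): `wiring` maps each vertex to its wired neighbor pair, with `"DUMMY"` marking the artificial neighbor of a degree-one vertex.

```python
# Sketch of the path-enumeration walk over wired vertices.
def enumerate_paths(wiring, degree_one):
    """wiring: dict vertex -> (left, right) wired neighbor pair, with
    "DUMMY" marking the dummy neighbor of a degree-one vertex.
    degree_one: the degree-one start vertices (the work queue Q)."""
    visited, paths = set(), []
    for start in degree_one:
        if start in visited:                 # already consumed by a path
            continue
        visited.add(start)
        # follow the single real neighbor of the degree-one start vertex
        prev, cur = start, next(n for n in wiring[start] if n != "DUMMY")
        path = [start]
        while True:
            path.append(cur)
            pair = wiring.get(cur)
            if pair is None or prev not in pair:
                break                        # Case a: chose a different predecessor
            if cur in visited:
                break                        # Case b: already visited
            visited.add(cur)
            nxt = pair[0] if pair[1] == prev else pair[1]
            if nxt == "DUMMY":
                break                        # Case c: reached a degree-one end
            prev, cur = cur, nxt
        paths.append(path)
    return paths
```

Mirroring the Fig. 4 discussion, a vertex terminated via Case a is appended to the current path but not claimed by it, so a later path whose wiring does pass through that vertex can still traverse it; the resulting path set is the same regardless of enumeration order.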

Fig. 4 Edge-disjoint acyclic paths generated from walking the contig-contig graph. Also shown below are the likely alignments of the individual paths to the (unknown) target genome \(\mathcal {G}\). Here, since the contig \(c_b\) appears in two paths, it is likely to be contained in a repetitive region (X, \(X^\prime \)), as highlighted

Provable properties of the algorithm

The above wiring and path enumeration algorithms have several key properties.

Prop1

Edge disjoint paths: No two paths enumerated by the wiring algorithm can intersect in edges.

Proof
This is guaranteed by the wiring algorithm (step 1), where each vertex chooses only two of its incident edges to be wired to build a path. More formally, by contradiction let us assume there exists an edge e that is covered by two distinct paths \(P_1\) and \(P_2\). Then this would imply that both paths have to pass through at least one branching vertex c such that there exist \(\langle e_1, e, c\rangle \in \mathcal {W}\) and \(\langle e_2, e, c\rangle \in \mathcal {W}\) (for some \(e_1\ne e_2\ne e\) all incident on c). However, by construction of the wiring algorithm (step 1) this is not possible. \(\square \)

Prop2

Acyclic paths: There can be no cycles in any of the paths enumerated.

Proof
This is guaranteed by the path enumeration algorithm described above (step 2). More specifically, the termination conditions represented by Cases (a) and (b) clip any path before it forms a cycle. By not allowing cycles, our algorithm prevents including the same contig more than once along a scaffold. This is done to prevent chimeric misassemblies of a repetitive contig (for example, the repetitive regions X and \(X^\prime \) illustrated in Fig. 4). \(\square \)

Prop3

Deterministic routing: The path enumeration algorithm is deterministic and generates the same output set of paths for a given \(\mathcal {W}\) regardless of the order in which paths are generated.

Proof
This result follows from the fact that the wiring heuristic at each vertex is itself deterministic, as well as from the conditions represented by Cases (a) and (b) used to terminate a path in the path enumeration algorithm. More specifically, note that each vertex contributes at most one hard-wired edge pair to \(\mathcal {W}\), and none of the other edge pair combinations incident on that vertex can lead to paths. Given this, consider the example shown in Fig. 4, of two paths \(P_1\) and \(P_2\) converging onto vertex \(c_b\). Note that in this example, \(\langle c_g, c_c, c_b\rangle \in \mathcal {W}\). The question here is whether it matters if we start enumerating \(P_1\) first or \(P_2\) first. The answer is no. In particular, if \(P_1\) is the first to get enumerated, then termination condition Case (a) applies, terminating the path at \(c_b\). Therefore, when \(P_2\) starts, it will still be able to go through \(c_b\). On the other hand, if \(P_2\) is the first path to get enumerated, then \(c_b\) will get visited, and therefore termination condition Case (b) applies, again terminating \(P_1\) at \(c_b\). Either way, the output paths are the same. A more detailed example of this order-agnostic behavior is shown in S3. This order-agnostic property allows us to parallelize the path enumeration process without having to synchronize among paths. \(\square \)
As a corollary to Prop1 (on edge disjoint paths) and Prop2 (on acyclic paths), we now show an important property about the contigs from repetitive regions of the genome and how the wiring algorithm handles those contigs carefully so as to reduce the chances of generating chimeric scaffolds.
Corollary 1
Let \(c_x\) be a contig that is completely contained within a repetitive region. Then this contig can appear as a non-terminal vertex in at most one path output by the wiring algorithm.

Proof
Consider the illustrative example in Fig. 5, which shows a contig \(c_x\) that maps to a repeat X and its copy \(X^\prime \). In particular, if there is a trail of long reads linking the two repeat copies (from \([c_x, c_2, \ldots c_{k}, c_x]\)), then it could generate a cycle in the graph G. However, based on Prop2, the cycle is broken by the path enumeration algorithm and therefore \(c_x\) is allowed to appear as a non-terminal vertex only in at most one of the paths that goes through it. Even if there is no trail of long reads connecting the two repeat regions, the same result holds because of the edge disjoint property of Prop1. \(\square \)
An important implication of this corollary is that our algorithm is careful in using contigs that fall inside repetitive regions. In other words, if a contig appears as a non-terminal vertex along a path, then its predecessor and successor contigs are those to which this contig exhibits maximum support in terms of its long read based links. While it is not possible to guarantee full correctness, the wiring algorithm uses long read information in order to reduce the chances of repetitive regions causing chimeric scaffolds.

Fig. 5 A case of repeats (\(X,X^\prime \)) causing cycles branching around contigs

Algorithm 1

Step: Parallelized contig batch assembly: As the next step after wiring and path enumeration, we use the enumerated paths to build the output (partial) scaffold sequences of this phase. To implement this step in a scalable manner, we make the simple observation that the enumerated paths all represent disjoint partial scaffolds. Therefore, we use a partitioning strategy to partition the set of paths into fixed size batches (each containing s contigs), so that these independent batches can be fed, in parallel, into a standalone assembler that can use both the contigs and the long reads of a batch to build the sequences corresponding to the partial scaffolds. We refer to this parallel distributed approach as contig batch assembly.

The assembly of each batch is performed in parallel using any standalone assembler of choice. We used Hifiasm [8] for all our experiments. By processing contig-long read batches in parallel, this methodology yields one or more scaffolds per batch, contributing to enhanced scalability of the assembly process.
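The batching step described above can be sketched as follows. This is an illustrative Python sketch under our own naming assumptions: it packs the enumerated paths into batches of roughly s contigs each and collects the long reads linked along each path, so that each batch can be handed independently to a standalone assembler (such as Hifiasm); the actual assembler invocation is omitted.

```python
# Sketch of contig batch assembly partitioning: pack paths into batches
# of about s contigs each, and gather each batch's supporting long reads
# so batches can be assembled independently and in parallel.
def make_batches(paths, edge_reads, s):
    """paths: lists of contig IDs (from path enumeration).
    edge_reads: (ci, cj) -> set of long read IDs supporting that edge.
    s: target number of contigs per batch."""
    batches, cur_paths, cur_size = [], [], 0
    for path in paths:
        cur_paths.append(path)
        cur_size += len(path)
        if cur_size >= s:                 # close the current batch
            batches.append(cur_paths)
            cur_paths, cur_size = [], 0
    if cur_paths:
        batches.append(cur_paths)

    def batch_reads(batch):
        """Union of long reads along every consecutive contig pair."""
        reads = set()
        for path in batch:
            for ci, cj in zip(path, path[1:]):
                reads |= edge_reads.get((ci, cj), set())
                reads |= edge_reads.get((cj, ci), set())
        return reads

    return [(batch, batch_reads(batch)) for batch in batches]
```

Each `(batch, reads)` pair is self-contained, which is what allows the batches to be assembled concurrently with no shared state.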
Furthermore, the selective use of only the long reads mapped to a specific contig batch significantly reduces memory overhead and mitigates the risk of misassemblies that might arise from using the entire long read set, as is evident in the results. This strategy thus both reduces memory utilization and minimizes the potential for misassembly errors that could occur when unrelated sequences are combined.

Phase: longread island construction

The first phase of contig expansion focuses only on expanding contigs using long reads that map on either side. This can be thought of as a seed-and-extend strategy, where contigs are the seeds and extensions happen with the long reads. However, there could be regions of the genome that are not covered by this contig expansion step. Therefore, in this phase, we focus on constructing “longread islands” to cover these gap regions. See Fig. 1 for an illustration of these long read islands. This is achieved in two steps:

(a)

First, we detect all long reads that do not map to any of the first generation partial scaffolds (generated from the contig expansion step). More specifically, we give as input to JEM-mapper the set of all unused long reads (i.e., unused in the partial scaffolds) and the set of partial scaffolds output by the contig expansion phase. Any long read that maps to a previous partial scaffold is not considered for this phase. Only those that remain unmapped correspond to long reads that fall in the gap regions between the partial scaffolds.

(b)

Next, we use the resulting set of unmapped long reads to build partial scaffolds. This is achieved by inputting the unmapped long reads to Hifiasm. The output of this phase represents the second generation of partial scaffolds, each corresponding to a long read island.
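The selection of island reads in step (a) can be sketched as follows; the function and parameter names are our own illustrative assumptions, not Maptcha's API.

```python
# Sketch of step (a): keep only the long reads that were not used in
# any first-generation partial scaffold and that do not map to one.
# These surviving reads seed the long-read islands.
def island_reads(all_reads, used_in_scaffolds, mappings):
    """all_reads: all long read IDs.
    used_in_scaffolds: read IDs already placed in partial scaffolds.
    mappings: (scaffold_id, read_id) hits from re-mapping the unused
    reads against the first-generation partial scaffolds."""
    unused = set(all_reads) - set(used_in_scaffolds)
    mapped = {r for _, r in mappings}
    return unused - mapped
```

The returned set is exactly what gets handed to the standalone assembler to produce the second generation of partial scaffolds.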

Phase: link scaffolds with bridges

In the last phase, we link the first and second generations of partial scaffolds using any long reads that have been left unused so far. The objective is to bridge these two generations into longer scaffolds if there is sufficient information in the long reads to link them. Note that, from an implementation standpoint, this is the same as contig expansion, where the union of the first and second generation partial scaffolds serves as the “contigs” and the rest of the unused long reads serve as the long read set.

Complexity analysis

Recall that m denotes the number of input long reads in \(\mathcal {L}\), and n is the number of input contigs in \(\mathcal {C}\). Let p denote the number of processes used by our parallel program.

Of the three major phases of Maptcha, the contig expansion phase is the one that works on the entire input sets (\(\mathcal {L}\) and \(\mathcal {C}\)). The other two phases work on a reduced subset of long reads (those unused by the partial scaffolds of the prior phases) and the set of partial scaffolds (which is smaller in size compared to \(\mathcal {C}\)). For this reason, we focus our complexity analysis on the contig expansion phase. In the contig expansion phase we have the following steps:

(i)

Mapping long reads to contigs: JEM-mapper [45] is an alignment-free distributed memory parallel implementation and hence processes load the long reads and contigs in a distributed manner. The dominant step is sketching the input sequences (long reads or contigs). Given that the number of long reads is expected to be more than the number of contigs (due to sequencing depth), the complexity can be expressed as \(O( \frac{{m \ell _l T}}{{p}} )\), where \(\ell _l\) is average long read length and \( T \) denotes the number of random trials used within its minhash sketch computation.

(ii)

Graph construction: Let the list of mapped tuples \(\langle c, r \rangle \) from the previous step contain \(\mathcal {T}\) tuples. These \(\mathcal {T}\) tuples are used to generate the contig graph by first sorting all the tuples by their long read IDs, to aggregate all contigs that map to the same ID. This can be achieved using an integer sort that scans the list of tuples linearly and inserts into a lookup table keyed by long read ID—giving a runtime of \(O(m+\mathcal {T})\). Next, this lookup table is scanned one long read ID at a time, and all contigs in its list are paired with one another to create all the edges corresponding to that long read. The runtime of this step is proportional to the output graph size G(V, E), which contains n vertices (one for each contig) and |E| edges corresponding to all detected contig pairs. Our implementation performs this graph construction in multithreaded mode.

(iii)

Wiring heuristic: For the wiring step, each vertex detects the pair of incident edges with the maximum intersection of long read IDs. This can be achieved in \(O(d^2)\) time, where d is the average degree of a vertex. The subsequent path enumeration step traverses each edge at most once. Since both these steps are parallelized, the wiring heuristic can be completed in \(O(\frac{nd^2+|E|}{p})\) time.

(iv)

Contig batch assembly: In the last step, the set of enumerated paths is partitioned into b batches, and each batch is individually assembled (using Hifiasm). As this step is trivially parallelizable, it takes \( O\left( \frac{b \times a}{p}\right) \) time, where \( a \) is the average time taken to assemble a batch.

In our results, we show that the contig expansion phase dominates the overall runtime of execution (shown later in Fig. 6). The space complexity of Maptcha is dominated by the size needed to store the input sequences and the contig graph—i.e., \(O(N+M+n+|E|)\).
