Efficient and low-complexity variable-to-variable length coding for DNA storage | BMC Bioinformatics

Problem formulationWe focus on variable-to-variable length encoding, which maps a binary input sequence to a DNA output sequence. Specifically, we assume the input binary sequence \({\textbf {X}}\in \{0, 1\}^* = \cup _{k=1}^\infty \{0, 1\}^k\) can have any finite length and is generated following an i.i.d. Bernoulli(1/2) distribution. We implicitly presume that the binary input is already compressed using a suitable compression algorithm, such as gzip. The corresponding output sequence can also have any finite length of DNA bases, with the alphabet represented by \(\mathcal{Y}= \{\texttt {A},\texttt {C}, {\texttt {G}}, {\texttt {T}}\}\). A variable-to-variable length encoder f processes the first \(\ell _X\) bits from \({\textbf {X}}\) and generates a DNA sequence \({\textbf {Y}}\in \mathcal{Y}^*\) with a length of \(\ell _Y\). As both \(\ell _X\) and \(\ell _Y\) are random variables, our objective is to maximize the expected rate \(R = \frac{{\mathbb {E}\left[ \ell _X\right] }}{{\mathbb {E}\left[ \ell _Y\right] }}\) bits per nucleotide (bits/nt).Due to biological limitations, the encoder’s output \({\textbf {Y}}= f({\textbf {X}})\) must satisfy both homopolymer and GC content constraints. Specifically, the maximum homopolymer run-length of \({\textbf {Y}}\) should not exceed h. Furthermore, the GC ratio of \({\textbf {Y}}\) must fall within the range \([0.5-c_{GC}, 0.5+c_{GC}]\). Our goal is to identify an encoder that maximizes the expected rate R while satisfying the homopolymer and GC content constraints.AlgorithmThe proposed encoder f is comprised of two encoders: the homopolymer encoder \(f_H\) and the GC content encoder \(f_{GC}\). Initially, the homopolymer encoder \(f_H\) operates as a variable-to-fixed length encoder, mapping the variable-length binary sequence \({\textbf {X}}\) to a fixed-length DNA sequence \({\textbf {U}}\) (i.e., \(f_H({\textbf {X}}) = {\textbf {U}}\)), where the target sequence length of \({\textbf {U}}\) is predetermined by \(\ell _U\). The primary objective of \(f_H\) is to generate a DNA sequence that complies with the homopolymer constraint. Subsequently, the GC content encoder \(f_{GC}\) functions as a fixed-to-variable length encoder, mapping the fixed-length DNA sequence \({\textbf {U}}\) to a DNA sequence \({\textbf {V}}\) with length \(\ell _V\) (i.e., \(f_{GC}({\textbf {U}}) = {\textbf {V}}\)). Finally, XOR encoder is applied to produce the ultimate DNA output sequence \({\textbf {Y}}\) with details provided in Section XOR Encoder. Note that the proposed scheme does not encompass the encoding of DNA sequence indexes.Homopolymer encoderThe core concept of the proposed homopolymer encoder \(f_H\) is to map every 2 bits to a single nucleotide, unless doing so would violate homopolymer constraints. Specifically, the homopolymer encoder \(f_H\) processes 2 bits \(x_1x_2\) from binary data \({\textbf {X}}\) at each step and maps them to the corresponding base \(U = M(x1x2)\), where \(M(00) = {\texttt {A}}\), \(M(01) = {\texttt {C}}\), \(M(10) = {\texttt {G}}\), and \(M(11) = {\texttt {T}}\). However, if the last h bases are equal to base B (i.e., \(U_{i-h} = \dots = U_{i-1} = B\) at step i), then the next nucleotide generated by \(f_H\) should not be B again (i.e., \(U_i \ne B\) is required). In such cases, \(f_H\) takes a single bit \(x_1\) from binary data \({\textbf {X}}\) and maps it to a base other than B. If \(B \in \{{\texttt {A}}, {\texttt {T}}\}\), then the next base \(U_i = M_{GC}(x_1)\) should be either \({\texttt {C}}\) or \({\texttt {G}}\), where \(M_{GC}(0) = {\texttt {C}}\) and \(M_{GC}(1) = {\texttt {G}}\). It is important to note that we limit \(U_i\) to two possibilities. For instance, when \(U_{i-h} = \dots = U_{i-1} = {\texttt {T}}\), the next base is either \({\texttt {C}}\) or \({\texttt {G}}\), instead of selecting from \(\{{\texttt {C}}, {\texttt {G}}, {\texttt {A}}\}\). This is because: 1) it offers a low-complexity scheme since a single bit maps to a single base, and 2) it also balances the GC ratio, which we will discuss in the following subsection. Similarly, if \(U_{i-h} = \dots = U_{i-1} = B \in \{{\texttt {G}}, {\texttt {C}}\}\), then \(U_i = M_{AT}(x_1)\) where \(M_{AT}(0) = {\texttt {A}}\) and \(M_{AT}(1) = {\texttt {T}}\). We repeat this process \(\ell _U\) times, and then \(f_H\) returns a DNA sequence \({\textbf {U}}= U_1\dots U_{\ell _U}\) with length \(\ell _U\). This procedure is outlined in Algorithm 1.Algorithm 1GC content encoderThe GC content encoder \(f_{GC}\) takes a DNA sequence \({\textbf {U}}\) of length \(\ell _U\) as input and adds dummy bases to meet the GC content constraint \([0.5-c_{GC}, 0.5+c_{GC}]\). Specifically, it adds \({\texttt {G}}\)’s and \({\texttt {C}}\)’s if the GC ratio is too low, or \({\texttt {A}}\)’s and \({\texttt {T}}\)’s if the GC ratio is too high. The objective is to minimize the length of the dummy sequence while still adhering to the homopolymer constraint. Let \(p_{GC}({\textbf {U}})\) represent the GC ratio of \({\textbf {U}}\). If \(p_{GC}({\textbf {U}}) < 0.5 – c_{GC}\), a dummy sequence \({\textbf {D}}\) consisting of \({\texttt {G}}\)’s and \({\texttt {C}}\)’s is generated to compensate for the GC ratio. The length of \({\textbf {D}}\) is the minimum \(\ell _{add}\) that satisfies$$\begin{aligned} \frac{\ell _U \times p_{GC}({\textbf {U}}) + \ell _{add}}{\ell _U + \ell _{add}} \ge 0.5-c_{GC}. \end{aligned}$$
(1)
The output of \(f_{GC}\) is a concatenation of \({\textbf {U}}\) and \({\textbf {D}}\), i.e., \({\textbf {V}}= \text{ CONCAT }({\textbf {U}}, {\textbf {D}})\). Note that \({\textbf {D}}\) can be either \({\texttt {G}}{\texttt {C}}{\texttt {G}}\dots \) or \({\texttt {C}}{\texttt {G}}{\texttt {C}}\dots \) depending on the last base of \({\textbf {U}}\), ensuring that \({\textbf {V}}\) satisfies the homopolymer constraint. Similarly, if \(p_{GC}({\textbf {U}}) > 0.5 + c_{GC}\), a dummy sequence \({\textbf {D}}\) composed of \({\texttt {A}}\)’s and \({\texttt {T}}\)’s can be generated accordingly. This procedure is outlined in Algorithm 2.Algorithm 2XOR encoderThe two-stage encoder with the homopolymer encoder \(f_H\) and GC content encoder \(f_{GC}\) offers a straightforward and effective encoding strategy. However, the nature of variable-to-variable length encoder can result in varying output sequence lengths \({\textbf {V}}\). For instance, if we set \(\ell _U = 150\), most output sequences have lengths between 150 and 155, but some extreme outliers have lengths longer than 165 (see Table 5). This non-uniform length may cause errors during reading and writing. Furthermore, DNA synthesis companies, such as TwistBioscience, often employ a tiered pricing model that is based on the maximum sequence length. Under such circumstances, it becomes crucial to minimize this maximum sequence length. To this end, we introduce a straightforward approach known as XOR encoder, specifically designed to achieve this minimization.Initially, we pre-generate four sufficiently long random binary sequences \({\textbf {Z}}_A\), \({\textbf {Z}}_C\), \({\textbf {Z}}_G\), \({\textbf {Z}}_T\), according to i.i.d. Bernoulli(1/2), which are provided to both the encoder and decoder beforehand. The idea is to apply \(f_{GC}\circ f_H\) to all possible XOR combinations, \({\textbf {Z}}_A\oplus {\textbf {X}}\), \({\textbf {Z}}_C\oplus {\textbf {X}}\), \({\textbf {Z}}_G\oplus {\textbf {X}}\), or \({\textbf {Z}}_T\oplus {\textbf {X}}\), and select the one that results in the shortest output sequence. More precisely, we find \(b^\star \in \mathcal{Y}= \{{\texttt {A}}, {\texttt {C}}, {\texttt {G}}, {\texttt {T}}\}\) such that$$\begin{aligned} b^\star = \mathop {\mathrm {arg\,min}}\limits _{b\in \mathcal{Y}} \left| f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_b))\right| \end{aligned}$$
(2)
where \(|\cdot |\) denotes the length of the sequence. For \(b\in \mathcal{Y}\), the binary sequence \({\textbf {Z}}_b\) generated based on a random seed \({\textbf {S}}_b\), where the random seed is available to both the XOR encoder and decoder.To indicate which random sequence is being used, the encoder adds the leading base \(b^\star \) at the beginning of the sequence. For example, if \(f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A))\) yields the shortest output sequence, the output sequence is a concatenation of \({\texttt {A}}\) and \(f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A))\), i.e., \({\textbf {Y}}= \text{ CONCAT }({\texttt {A}}, f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A)))\). In the following sections, we show that the XOR encoder dramatically reduces the number of long output sequences. This procedure is described in Algorithm 3.Algorithm 3DecoderUpon receiving the output sequence \({\textbf {Y}}\) and the random seed \({\textbf {S}}_b\), the decoder checks the first base \(b^\star =Y_1\), which indicates the pseudorandom sequence in use. Recall that the decoder has the seed \({\textbf {S}}_b\) and can recover the binary sequences \({\textbf {Z}}_b\). Since \(f_{GC}\) only adds the dummy sequence, the decoder recovers \(U = Y_2\dots Y_{\ell _U+1}\) by extracting the first \(\ell _U\) bases. Then, by reversing the \(f_H\) process, the decoder obtains \({\textbf {X}}\oplus {\textbf {Z}}_{b^\star }\). Finally, by applying XOR to \({\textbf {Z}}_{b^\star }\), the decoder retrieves the original binary input \({\textbf {X}}\). In the absence of the XOR encoder, upon receiving the sequence \({\textbf {V}}={\textbf {Y}}\) requiring decoding, the initial step involves removing the redundancy introduced by the GC content encoder \(f_{GC}\). This yields \(U = V_1\dots V_{\ell _U}\) for the decoder. Subsequently, \(f_H\) is reversed to recover the original binary input \({\textbf {X}}\). Throughout this process, no information loss or code bit mismatch occurred. Furthermore, our subsequent experiments did not encounter any decoding failures.

Hot Topics

Related Articles