Efficient and low-complexity variable-to-variable length coding for DNA storage | BMC Bioinformatics

Problem formulationWe focus on variable-to-variable length encoding, which maps a binary input sequence to a DNA output sequence. Specifically, we assume the input binary sequence ${\textbf {X}}\in \{0, 1\}^* = \cup _{k=1}^\infty \{0, 1\}^k$ can have any finite length and is generated following an i.i.d. Bernoulli(1/2) distribution. We implicitly presume that the binary input is already compressed using a suitable compression algorithm, such as gzip. The corresponding output sequence can also have any finite length of DNA bases, with the alphabet represented by $\mathcal{Y}= \{\texttt {A},\texttt {C}, {\texttt {G}}, {\texttt {T}}\}$. A variable-to-variable length encoder f processes the first $\ell _X$ bits from ${\textbf {X}}$ and generates a DNA sequence ${\textbf {Y}}\in \mathcal{Y}^*$ with a length of $\ell _Y$. As both $\ell _X$ and $\ell _Y$ are random variables, our objective is to maximize the expected rate $R = \frac{{\mathbb {E}\left[ \ell _X\right] }}{{\mathbb {E}\left[ \ell _Y\right] }}$ bits per nucleotide (bits/nt).Due to biological limitations, the encoder’s output ${\textbf {Y}}= f({\textbf {X}})$ must satisfy both homopolymer and GC content constraints. Specifically, the maximum homopolymer run-length of ${\textbf {Y}}$ should not exceed h. Furthermore, the GC ratio of ${\textbf {Y}}$ must fall within the range $[0.5-c_{GC}, 0.5+c_{GC}]$. Our goal is to identify an encoder that maximizes the expected rate R while satisfying the homopolymer and GC content constraints.AlgorithmThe proposed encoder f is comprised of two encoders: the homopolymer encoder $f_H$ and the GC content encoder $f_{GC}$. Initially, the homopolymer encoder $f_H$ operates as a variable-to-fixed length encoder, mapping the variable-length binary sequence ${\textbf {X}}$ to a fixed-length DNA sequence ${\textbf {U}}$ (i.e., $f_H({\textbf {X}}) = {\textbf {U}}$), where the target sequence length of ${\textbf {U}}$ is predetermined by $\ell _U$. The primary objective of $f_H$ is to generate a DNA sequence that complies with the homopolymer constraint. Subsequently, the GC content encoder $f_{GC}$ functions as a fixed-to-variable length encoder, mapping the fixed-length DNA sequence ${\textbf {U}}$ to a DNA sequence ${\textbf {V}}$ with length $\ell _V$ (i.e., $f_{GC}({\textbf {U}}) = {\textbf {V}}$). Finally, XOR encoder is applied to produce the ultimate DNA output sequence ${\textbf {Y}}$ with details provided in Section XOR Encoder. Note that the proposed scheme does not encompass the encoding of DNA sequence indexes.Homopolymer encoderThe core concept of the proposed homopolymer encoder $f_H$ is to map every 2 bits to a single nucleotide, unless doing so would violate homopolymer constraints. Specifically, the homopolymer encoder $f_H$ processes 2 bits $x_1x_2$ from binary data ${\textbf {X}}$ at each step and maps them to the corresponding base $U = M(x1x2)$, where $M(00) = {\texttt {A}}$, $M(01) = {\texttt {C}}$, $M(10) = {\texttt {G}}$, and $M(11) = {\texttt {T}}$. However, if the last h bases are equal to base B (i.e., $U_{i-h} = \dots = U_{i-1} = B$ at step i), then the next nucleotide generated by $f_H$ should not be B again (i.e., $U_i \ne B$ is required). In such cases, $f_H$ takes a single bit $x_1$ from binary data ${\textbf {X}}$ and maps it to a base other than B. If $B \in \{{\texttt {A}}, {\texttt {T}}\}$, then the next base $U_i = M_{GC}(x_1)$ should be either ${\texttt {C}}$ or ${\texttt {G}}$, where $M_{GC}(0) = {\texttt {C}}$ and $M_{GC}(1) = {\texttt {G}}$. It is important to note that we limit $U_i$ to two possibilities. For instance, when $U_{i-h} = \dots = U_{i-1} = {\texttt {T}}$, the next base is either ${\texttt {C}}$ or ${\texttt {G}}$, instead of selecting from $\{{\texttt {C}}, {\texttt {G}}, {\texttt {A}}\}$. This is because: 1) it offers a low-complexity scheme since a single bit maps to a single base, and 2) it also balances the GC ratio, which we will discuss in the following subsection. Similarly, if $U_{i-h} = \dots = U_{i-1} = B \in \{{\texttt {G}}, {\texttt {C}}\}$, then $U_i = M_{AT}(x_1)$ where $M_{AT}(0) = {\texttt {A}}$ and $M_{AT}(1) = {\texttt {T}}$. We repeat this process $\ell _U$ times, and then $f_H$ returns a DNA sequence ${\textbf {U}}= U_1\dots U_{\ell _U}$ with length $\ell _U$. This procedure is outlined in Algorithm 1.Algorithm 1GC content encoderThe GC content encoder $f_{GC}$ takes a DNA sequence ${\textbf {U}}$ of length $\ell _U$ as input and adds dummy bases to meet the GC content constraint $[0.5-c_{GC}, 0.5+c_{GC}]$. Specifically, it adds ${\texttt {G}}$’s and ${\texttt {C}}$’s if the GC ratio is too low, or ${\texttt {A}}$’s and ${\texttt {T}}$’s if the GC ratio is too high. The objective is to minimize the length of the dummy sequence while still adhering to the homopolymer constraint. Let $p_{GC}({\textbf {U}})$ represent the GC ratio of ${\textbf {U}}$. If $p_{GC}({\textbf {U}}) < 0.5 – c_{GC}$, a dummy sequence ${\textbf {D}}$ consisting of ${\texttt {G}}$’s and ${\texttt {C}}$’s is generated to compensate for the GC ratio. The length of ${\textbf {D}}$ is the minimum $\ell _{add}$ that satisfies$$\begin{aligned} \frac{\ell _U \times p_{GC}({\textbf {U}}) + \ell _{add}}{\ell _U + \ell _{add}} \ge 0.5-c_{GC}. \end{aligned}$$
(1)
The output of $f_{GC}$ is a concatenation of ${\textbf {U}}$ and ${\textbf {D}}$, i.e., ${\textbf {V}}= \text{ CONCAT }({\textbf {U}}, {\textbf {D}})$. Note that ${\textbf {D}}$ can be either ${\texttt {G}}{\texttt {C}}{\texttt {G}}\dots $ or ${\texttt {C}}{\texttt {G}}{\texttt {C}}\dots $ depending on the last base of ${\textbf {U}}$, ensuring that ${\textbf {V}}$ satisfies the homopolymer constraint. Similarly, if $p_{GC}({\textbf {U}}) > 0.5 + c_{GC}$, a dummy sequence ${\textbf {D}}$ composed of ${\texttt {A}}$’s and ${\texttt {T}}$’s can be generated accordingly. This procedure is outlined in Algorithm 2.Algorithm 2XOR encoderThe two-stage encoder with the homopolymer encoder $f_H$ and GC content encoder $f_{GC}$ offers a straightforward and effective encoding strategy. However, the nature of variable-to-variable length encoder can result in varying output sequence lengths ${\textbf {V}}$. For instance, if we set $\ell _U = 150$, most output sequences have lengths between 150 and 155, but some extreme outliers have lengths longer than 165 (see Table 5). This non-uniform length may cause errors during reading and writing. Furthermore, DNA synthesis companies, such as TwistBioscience, often employ a tiered pricing model that is based on the maximum sequence length. Under such circumstances, it becomes crucial to minimize this maximum sequence length. To this end, we introduce a straightforward approach known as XOR encoder, specifically designed to achieve this minimization.Initially, we pre-generate four sufficiently long random binary sequences ${\textbf {Z}}_A$, ${\textbf {Z}}_C$, ${\textbf {Z}}_G$, ${\textbf {Z}}_T$, according to i.i.d. Bernoulli(1/2), which are provided to both the encoder and decoder beforehand. The idea is to apply $f_{GC}\circ f_H$ to all possible XOR combinations, ${\textbf {Z}}_A\oplus {\textbf {X}}$, ${\textbf {Z}}_C\oplus {\textbf {X}}$, ${\textbf {Z}}_G\oplus {\textbf {X}}$, or ${\textbf {Z}}_T\oplus {\textbf {X}}$, and select the one that results in the shortest output sequence. More precisely, we find $b^\star \in \mathcal{Y}= \{{\texttt {A}}, {\texttt {C}}, {\texttt {G}}, {\texttt {T}}\}$ such that$$\begin{aligned} b^\star = \mathop {\mathrm {arg\,min}}\limits _{b\in \mathcal{Y}} \left| f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_b))\right| \end{aligned}$$
(2)
where $|\cdot |$ denotes the length of the sequence. For $b\in \mathcal{Y}$, the binary sequence ${\textbf {Z}}_b$ generated based on a random seed ${\textbf {S}}_b$, where the random seed is available to both the XOR encoder and decoder.To indicate which random sequence is being used, the encoder adds the leading base $b^\star $ at the beginning of the sequence. For example, if $f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A))$ yields the shortest output sequence, the output sequence is a concatenation of ${\texttt {A}}$ and $f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A))$, i.e., ${\textbf {Y}}= \text{ CONCAT }({\texttt {A}}, f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A)))$. In the following sections, we show that the XOR encoder dramatically reduces the number of long output sequences. This procedure is described in Algorithm 3.Algorithm 3DecoderUpon receiving the output sequence ${\textbf {Y}}$ and the random seed ${\textbf {S}}_b$, the decoder checks the first base $b^\star =Y_1$, which indicates the pseudorandom sequence in use. Recall that the decoder has the seed ${\textbf {S}}_b$ and can recover the binary sequences ${\textbf {Z}}_b$. Since $f_{GC}$ only adds the dummy sequence, the decoder recovers $U = Y_2\dots Y_{\ell _U+1}$ by extracting the first $\ell _U$ bases. Then, by reversing the $f_H$ process, the decoder obtains ${\textbf {X}}\oplus {\textbf {Z}}_{b^\star }$. Finally, by applying XOR to ${\textbf {Z}}_{b^\star }$, the decoder retrieves the original binary input ${\textbf {X}}$. In the absence of the XOR encoder, upon receiving the sequence ${\textbf {V}}={\textbf {Y}}$ requiring decoding, the initial step involves removing the redundancy introduced by the GC content encoder $f_{GC}$. This yields $U = V_1\dots V_{\ell _U}$ for the decoder. Subsequently, $f_H$ is reversed to recover the original binary input ${\textbf {X}}$. Throughout this process, no information loss or code bit mismatch occurred. Furthermore, our subsequent experiments did not encounter any decoding failures.

Efficient and low-complexity variable-to-variable length coding for DNA storage | BMC Bioinformatics

The dynamics of blaTEM resistance genes in Salmonella Typhi

Influence of climate on desert locust (Schistocerca gregaria Forskål, 1775) Plague and migration prediction in tropics

Global profiling of protein complex dynamics with an experimental library of protein interaction markers

Antioxidant activities of Saudi honey samples related to their content of short peptides

LDAGM: prediction lncRNA-disease asociations by graph convolutional auto-encoder and multilayer perceptron based on multi-view heterogeneous networks | BMC Bioinformatics

Hot Topics

The dynamics of blaTEM resistance genes in Salmonella Typhi

Influence of climate on desert locust (Schistocerca gregaria Forskål, 1775) Plague and migration prediction in tropics

Global profiling of protein complex dynamics with an experimental library of protein interaction markers

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

The dynamics of blaTEM resistance genes in Salmonella Typhi

Influence of climate on desert locust (Schistocerca gregaria Forskål, 1775) Plague and migration prediction in tropics

Global profiling of protein complex dynamics with an experimental library of protein interaction markers

Antioxidant activities of Saudi honey samples related to their content of short peptides

Popular Articles

The dynamics of blaTEM resistance genes in Salmonella Typhi

Influence of climate on desert locust (Schistocerca gregaria Forskål, 1775) Plague and migration prediction in tropics

Global profiling of protein complex dynamics with an experimental library of protein interaction markers