HiCMC: High-Efficiency Contact Matrix Compressor | BMC Bioinformatics

Our approach HiCMC is a major extension of CMC [31]. It comprises splitting the genome-wide contact matrix into intra- and inter-chromosomal sub-contact matrices, row and column masking, model-based transformation, row binarization, and entropy coding as shown in Fig. 2. The key idea of CMC is to transform contact matrix values so that in each row of the matrix the number of bits required for each value, i.e. the magnitude of the values, is similar. This facilitates more efficient entropy coding. The main drawback of CMC is that it does not account for structures that exist in an intra-chromosomal contact matrices, such as compartments and domains, which are highly interacting with themselves. These structures cause the interactions in certain regions of the contact matrix to be lower or higher than the expected interactions based on the distance. HiCMC improves intra-chromosomal contact matrix compression by modeling the aforementioned structures in a step called model-based transformation. For the inter-chromosomal contact matrix, no changes were made to the compression pipeline. These processes will be discussed in the following sections.Fig. 2The HiCMC compression pipeline consists of splitting the genome-wide contact matrix into intra- and inter-chromosomal contact matrices, row/column masking, model-based transformation, row binarization, and entropy coding. The type of input sub-contact matrix determines whether Intra or Inter is usedFig. 3Splitting and masking processes of HiCMC. (a) The contact matrix is divided into two different sub-contact matrices based on chromosome-chromosome interactions: intra-chromosomal (Intra) and inter-chromosomal (Inter) sub-contact matrix. We only store the sub-contact matrices that are in the main diagonal and upper triangle of the matrix. (b) The masking process works by marking empty rows/columns in the corresponding mask (left) and then removing them from the original matrix to construct the masked matrix (right)Split contact matrixThe first step in the compression pipeline is to divide a chromosome-wide contact matrix into chromosome-chromosome interaction matrices, hereafter referred to as sub-contact matrices. Due to the symmetry of contact matrices, only sub-contact matrices lying within the upper triangle need to be stored. The contact matrix after splitting is shown in the Fig. 3a.Row-column maskingTo remove redundant information in sub-contact matrices efficiently, we next remove unalignable regions [31] — rows and/or columns with no interactions —  by first marking the rows and columns with the binary masks (see Fig. 3b). The mask entry is set to 1 for the corresponding rows or/and columns containing only zeros, otherwise it is set to 0. The pipeline is branched differently depending on the type of sub-contact matrix: intra- or inter-chromosomal sub-contact matrix.Model-based transformationThe diagonal transformation of CMC assumes that the values in a diagonal of the contact matrix are of approximately similar magnitude. This transformation reflects the observation that the chromosomal interactions serve as an approximation of spatial distance [8]. By placing the entries from the same diagonal in a row in the new matrix, the number of bits required to represent the values in each row can be reduced. However, due to structures such as A/B compartments or TADs, it provides only a basic approximation of the interactions. Interactions within compartments and TADs are enriched, but an abrupt drop in interactions is observed for inter-compartments and inter-TADs [9].Fig. 4Overview of the model-based transformation pipeline and model prediction. (a) The model-based transformation pipeline creates models based on the entries of the sub-contact matrix and domain boundaries, and uses these models to generate predictions (step 1). The pipeline then sorts these predictions in magnitude order, resulting in sorting indices (step 2). Finally, the pipeline rearranges the original interactions according to these indices, starting from the top left to the bottom right (step 3). (b) An example of the model prediction derived from Fig. 1. Domain matrices modeled using the genomic distance function, characterized by color gradients such as the domain on the main diagonal, can be distinguished from those modeled using a constant domain value. The blue lines represent the domain boundaries determined by a TAD callerTo overcome the limitations of existing approaches, we propose a novel model that represents a sub-contact matrix as a set of rectangular intra- and inter-domain matrices. Specifically, we model intra-domain matrices using genomic domain information and inter-domain matrices using a constant value. To derive the domain matrices, we must first determine the domain boundaries of domains using a TAD caller. Figure 4b illustrates an example of the domain boundaries predicted by a TAD caller, denoted by the blue lines. Constructing and efficiently encoding this model is crucial, and various methods can be explored. Moreover, biases in visibility across regions of a chromosome, such as GC-rich regions and regions that are difficult to map, can affect boundary prediction. To improve model accuracy, we construct the model from a balanced matrix, thereby removing experimental bias introduced in the experiments.We divide the sub-contact matrix into two types of rectangular regions, which we refer to as domain matrices: inter-domain matrices and intra-domain matrices. For intra-domain matrices, we model the entries based on the function of genomic distance \(f(j – i)\), where i and j represent the row and column indices, respectively. In contrast, inter-domain matrices are modeled using a single constant value, which is advantageous because average interactions for certain domains are roughly constant and no longer correlate with genomic distance. The decision on how to model each domain matrix depends on the statistical properties of the domain matrix entries and the corresponding threshold, both of which are encoding parameters. Specifically, we compute the standard deviation of the non-zero entries to determine the domain matrix type. We encode this decision as a binary matrix called “domain classes”, where each entry represents the type of domain matrix for each domain.We transform the original sub-contact matrix by sorting its entries based on the modeled matrix entries, as shown in Fig. 4a. This process involves three steps: First, we model the domain matrices and predict the entries of the contact matrix based on our domain matrices. Next, we determine the sorting indices from these predictions. Finally, we sort the contact matrix by placing each entry of the original matrix into its corresponding index. Figure 4b illustrates the predicted domain boundaries and the modeled matrix entries.In detail, the genomic distance function for intra-domain matrices is implemented as a “distance table”, where each entry represents the average value of intra-domain matrix entries at a given genomic distance. The table is organized with columns representing specific genomic distances and rows representing specific domains, grouping values of similar magnitude together. The entries in both the sub-contact matrix and the domain matrix that lie on the same diagonal share the same genomic distance. For each domain matrix, we compute the average value of a particular diagonal and append it to the distance table. This organization enables efficient entropy coding, resulting in a higher compression ratio.$$\begin{aligned} \begin{array}{c} \left[ \begin{array}{c|c} \begin{array}{cc} c_{11} & c_{12} \\ 0 & c_{22} \end{array} & \begin{array}{cc} c_{13} & c_{14} \\ c_{23} & c_{24} \end{array} \\ \hline \begin{array}{cc} 0 & 0 \\ 0 & 0 \end{array} & \begin{array}{cc} c_{33} & c_{34} \\ 0 & c_{44} \end{array} \end{array} \right] \\ \text {Contact Matrix}\:C \end{array} \begin{array}{c} \left[ \begin{array}{c|c} \begin{array}{cc} {{D_{00}}}\\ & \end{array} & \begin{array}{cc} {{D_{01}}}\\ & \end{array} \\ \hline \begin{array}{cc} {{D_{10}}}\\ & \end{array} & \begin{array}{cc} {{D_{11}}}\\ & \end{array} \end{array} \right] \\ \text {Domains Matrices}\:D \\ \end{array} \quad \rightarrow \quad \begin{array}{c} \begin{array}{cccc} 0 & 1 & 2 & 3 \\ \hline d^{(0)}_{00} & d^{(1)}_{00} & d^{(2)}_{01} & d^{(3)}_{01} \\ d^{(0)}_{11} & d^{(1)}_{01} & 0 & 0 \\ 0 & d^{(1)}_{11} & 0 & 0 \end{array}\\ \text {Distance Table}\:T\end{array} \end{aligned}$$To illustrate this process, let us consider a \(4 \times 4\) contact matrix C with entries \(c_{ij}\) at position (i, j). We assume that the domain matrices have a size of \(2 \times 2\) and are indexed with (ab). Due to the symmetrical property of the contact matrix, its lower triangular entries are zero. Each column of the distance table stores the average entries of all domains for a specific genomic distance k. We compute the entries of the distance table \(d^{(k)}_{ab}\) by averaging all contact matrix entries \(c_{ij}\) that belong to a domain matrix \(D_{ab}\) at a distance of k:$$\begin{aligned} d^{(k)}_{ab} = \mathbb {E} \bigl [c_{ij}\bigr ], \forall {c_{ij}\in D_{ab}} \wedge k = j-i \wedge c_{ij} \ne 0 \end{aligned}$$where \(\mathbb {E} \bigl [\cdot \bigr ]\) denotes the averaging operation.For inter-domain matrices, We store the average interactions of each domain in a matrix called “domain values”, as these matrices are modeled using a single constant value. The “domain value” matrix has the same shape as the “domain classes” matrix.Based on the domain classes, the distance table, and the domain values, we predict the entries of the sub-contact matrix. For a domain modeled as a function of genomic distance, we retrieve the entries at a given genomic distance from the distance table. Otherwise, we set all entries of the corresponding domain matrix to the domain value of the corresponding domain retrieved from the domain values matrix, resulting in a predicted domain matrix with a uniform value.Figure 4b illustrates an example of a predicted sub-contact matrix. The model used for prediction must be included in the compressed payload, which introduces an overhead. This leads to a trade-off between the quality of our model and the compression performance. To mitigate the overhead, we reduce the floating-point precision of both the distance table and the domain values, thereby striking a balance between model quality and compression efficiency. It is important to note that the floating-point precision reduction does not render our compression method lossy, as the prediction is used to sort the original sub-contact matrix, and the reduction occurs prior to the prediction step in the encoding process, thereby preserving all information.For the model-based transformation, our primary goal is to infer the sorting order based on the prediction as similar as possible compared to the sorting order based on the original contact matrix, i.e. to predict the underlying relative differences between contact matrix entries (as measured by Spearman’s rank correlation) rather than to predict the magnitudes (which would be similar to minimizing the mean square difference). Furthermore, minimizing the absolute differences would introduce significant overhead for long-range interactions (i.e., entries for which the difference between row and column IDs is large) due to random ligation. We evaluate the quality of the model by examining the overall reduction in size, rather than directly assessing the model’s sorting using Spearman’s rank correlation between the original and predicted matrices. This approach is necessary because of the complex relationship between the model-based transformation and the entropy coding step.Mask-value decompositionFollowing the application of the model-based transformation, we decompose the transformed sub-contact matrix using mask value decomposition. Unlike row binarization in CMC, this decomposition yields comparable compression performance with a significantly simpler process. Mask-value decomposition separates the sub-contact matrix into two components: a binary matrix indicating the positions of non-zero entries, and a separate array containing the corresponding non-zero values. We refer to these two components as the sub-contact matrix mask and the sub-contact matrix values, respectively.Entropy codingIn total, four payloads are required for the model: the domain boundaries, the domain classes, the domain values, and the distance table. The domain boundaries can be represented as a one-dimensional binary array indicating the presence or absence of a boundary for each bin. It can be efficiently encoded using binary run-length encoding [31], since long sequences of zeros (indicating the absence of a boundary) are expected.Both the domain classes and the sub-contact matrix mask are binary matrices. Since there are many 1’s along the main diagonal of the matrix, it is first transformed using the diagonal transformation [31] and then compressed using an encoder conforming to the Joint Bi-level Image Experts Group (JBIG) standard (ISO/IEC 11544 [32]), specialized for lossless compression of bi-level (i.e., binary) images. It takes advantage of the spatial correlation of neighboring binary pixels.The domain values matrix is also transformed using the diagonal transformation, as higher values tend to be placed along the main diagonal. Both the domain values and the distance table matrix are encoded by serializing them into an array, which is also compressed using fpzip [33] with a certain floating point precision, which controls the quality of the model as mentioned in Sect. 2.3. Finally, the sub-contact matrix values are compressed using the prediction by partial patching (PPM) [34]-based technique PPMd [35].

Hot Topics

Related Articles