A Python library for the fast and scalable computation of biologically meaningful individual-specific networks

The perturbation-based computation of ISNs

A network \(G:=(V,E)\) consists of a set V of nodes v and a set of edges \(E \subseteq V\times V\), such that each edge \(e_{ij}\) connects a pair of nodes \((v_i,v_j)\in V\times V\). ISN-tractor utilizes the perturbation-based approach to ISNs proposed by Ref.2. Each individual-specific edge \(e_{ij}^q\), for the \(q^{\text {th}}\) individual, is computed as:$$\begin{aligned} e_{ij}^q=N\ e_{ij}^\alpha - (N-1)\ e_{ij}^{\alpha -q} \end{aligned}$$
(1)
where N is the sample size, \(e_{ij}^\alpha \) is the edge computed on every individual in the dataset, and \(e_{ij}^{\alpha -q}\) is the same edge computed without considering the \(q^{\text {th}}\) sample. The network calculated on every individual is called the population network, while the one with an individual removed is called the leave-one-out network. The edge values can be computed in several ways. One possibility is to compute a correlation score, such as the Pearson correlation, between the values associated with each pair of nodes; the resulting network is then weighted and undirected. This formula capitalizes on the contrast between two networks, differing solely in the presence or absence of a specific individual q, to infer insights regarding the impact on network topology when an individual is added or removed. Furthermore, the rationale behind this is the desire to construct ISNs such that their average is close to the network obtained by pooling all samples together. In Ref.2, the authors show that for large sample sizes and under specific linearity assumptions (see Suppl. 5.2 for an extensive discussion), the average of the ISNs across individuals is equal to the network computed on the population. Moreover, the final result is not influenced by the sample size: while larger sample sizes lead to a smaller difference between \(e_{ij}^\alpha \) and \(e_{ij}^{\alpha -q}\), this is countered by the multiplicative factor N (sample size). Hence, the estimated ISN is driven by the deviation obtained by removing/adding individual q, and not by the sample size.

A novel algorithm for the incremental ISN computation drastically reduces the computational complexity

When ISNs are built using Pearson correlation, Eq. (1) requires computing the Pearson correlation on all the samples (\(e_{ij}^{\alpha }\)) and on all the samples with the exception of the q-th target sample (\(e_{ij}^{\alpha -q}\)).
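As an illustration, Eq. (1) can be evaluated for a single edge with a naive sketch that recomputes the leave-one-out correlation from scratch (variable names such as X and q are placeholders, not ISN-tractor's API):

```python
import numpy as np

# Toy data: N = 50 samples, two genes (columns j = 0 and k = 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
N = X.shape[0]
q = 7  # target individual

# Population edge e_ij^alpha: Pearson correlation on all N samples
e_alpha = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# Leave-one-out edge e_ij^(alpha-q): Pearson correlation without sample q,
# here recomputed from scratch on the reduced dataset
X_loo = np.delete(X, q, axis=0)
e_alpha_minus_q = np.corrcoef(X_loo[:, 0], X_loo[:, 1])[0, 1]

# Eq. (1): the individual-specific edge for individual q
e_q = N * e_alpha - (N - 1) * e_alpha_minus_q
```
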
However, this means that a nearly identical operation is repeated N times. For example, the algorithm used in the LionessR ISNs package2,12 suffers from this redundancy, since it removes the target sample q and recomputes the correlations at each iteration (i.e., for each individual). This is very inefficient, since most of the operations repeated for each sample could be avoided. To overcome this issue, we propose an incremental calculation of the Pearson correlation that dramatically speeds up the computation by avoiding all the redundant operations. In our approach, we store summary statistics of the population and, through addition and subtraction, we calculate the Pearson correlation of the population minus the target sample, without re-running the computation on the entire dataset. Furthermore, we vectorize the procedure to make it more efficient and better suited for GPU computations.

Explaining the intuition behind the incremental Pearson correlation algorithm

In this section, we describe the reasoning and the mathematical steps behind the incremental Pearson correlation algorithm, focusing on one single edge. In the Supplementary Material (Section 1.2.1) we expand it to a vectorized version where an entire network is calculated in one single step. Finally, we discuss the advantages of the proposed incremental Pearson correlation. Given an \(N\times p\) matrix X, where N is the number of samples and p is the number of features (i.e., genes), the Pearson correlation between columns (genes) j and k is computed as:$$\begin{aligned} r_{jk}^\alpha = \frac{\Sigma _{q=1}^N(x_{qj}-\bar{x}_j)(x_{qk}-\bar{x}_k)}{ \sqrt{\Sigma _{q=1}^N(x_{qj}-\bar{x}_j)^2} \sqrt{\Sigma _{q=1}^N(x_{qk}-\bar{x}_k)^2}} \end{aligned}$$
(2)
Here, we follow the notation used in Ref.2, indicating with the superscript \(\alpha \) the Pearson correlation calculated on the whole population (all the N individuals). Moreover,

\(x_{qj}\) and \(x_{qk}\) are the values of genes j and k in individual q,

\(\bar{x}_j=\frac{1}{N}\Sigma _{q=1}^N x_{qj}\) is the arithmetic mean of gene j across all the individuals and, analogously, \(\bar{x}_k\) is the mean across all the individuals for gene k.

For big datasets, i.e., with a high sample size and a large feature set, recalculating the Pearson correlation for each leave-one-out step of the LIONESS procedure can be very time-consuming. To reduce the computation time, we extract sufficient statistics from the population network, so that calculating the leave-one-out edges only requires adding or subtracting quantities derived from them. Sufficient statistics are quantities that summarize all the information from the samples needed to build the Pearson correlation. In this context, the sufficient statistics accumulate the sum of the contribution of each sample, such as the gene abundance, so that the contribution of each sample can easily be singled out. Hence, we rearranged Eq. (2) to highlight these sufficient statistics. The resulting formula is as follows:$$\begin{aligned} r_{jk}^\alpha = \frac{NA_{jk}-\hat{x}_j\hat{x}_k}{ \sqrt{N\hat{a}_k-(\hat{x}_k)^2} \sqrt{N\hat{a}_j-(\hat{x}_j)^2}} \end{aligned}$$
(3)
The sufficient statistics are listed below. The sizes of the corresponding vectors and matrices are indicated as subscripts:

\(\hat{x}_{p\times 1}\), with elements \(\hat{x}_j=\sum _{i=1}^N x_{ij}\), the sum of the j-th gene across all individuals.

\(A_{p\times p}\), with elements \(A_{jk}=\sum _{i=1}^N x_{ij} x_{ik}\), the sum over all individuals of the product of genes j and k (the dot product of the corresponding columns).

\(\hat{a}_{p\times 1}\), with elements \(\hat{a}_j=\sum _{i=1}^N x_{ij}^2\), the sum of the squared j-th gene across all individuals (the diagonal of the dot product matrix A).
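To make the rearrangement concrete, the following sketch (illustrative NumPy code, not ISN-tractor's internals) builds the three sufficient statistics from a data matrix and recovers the Pearson correlation of Eq. (3):

```python
import numpy as np

# Toy data: N samples x p genes
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
N = X.shape[0]

# Sufficient statistics
x_hat = X.sum(axis=0)  # per-gene sums, shape (p,)
A = X.T @ X            # cross-products A_jk, shape (p, p)
a_hat = np.diag(A)     # per-gene sums of squares, shape (p,)

# Eq. (3): Pearson correlation between genes j and k from the statistics
j, k = 0, 2
num = N * A[j, k] - x_hat[j] * x_hat[k]
den = np.sqrt(N * a_hat[k] - x_hat[k] ** 2) * np.sqrt(N * a_hat[j] - x_hat[j] ** 2)
r_jk = num / den

# Matches the textbook Pearson correlation of Eq. (2)
assert np.isclose(r_jk, np.corrcoef(X[:, j], X[:, k])[0, 1])
```
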

Since our goal is to calculate the Pearson correlation for a population composed of all-but-one samples (leave-one-out, indicated with the superscript \(\alpha -q\)), we subtract the contribution of observation q from each sufficient statistic. In detail, for a generic q-th individual:

\(x_{qj}\), the value of gene j in individual q, is subtracted from \(\hat{x}_j=\sum _{i=1}^N x_{ij}\) in both numerator and denominator (similarly, \(x_{qk}\) is subtracted from \(\hat{x}_k\));

\(x_{qj}x_{qk}\), the product of genes j and k in individual q, is subtracted from \(A_{jk}=\sum _{i=1}^N x_{ij} x_{ik}\);

\(x_{qj}^{2}\), the squared value of gene j in individual q, is subtracted from \(\hat{a}_j=\sum _{i=1}^N x_{ij}^2\) (similarly, \(x_{qk}^2\) is subtracted from \(\hat{a}_k\));

1 is subtracted from N, since the sample size is reduced by 1.

Such differences constitute the contribution of individual q to the population Pearson correlation; by eliminating them, the result is the leave-one-out network, based on the population without individual q. Hence, the formula to calculate the leave-one-out edge (correlation) between genes j and k, without individual q, with the subtracted contributions in bold, is:$$\begin{aligned} r_{jk}^{\alpha -q} = \frac{(N \mathbf {-1})(A_{jk} \mathbf {-x_{qj}x_{qk}})-(\hat{x}_j \mathbf {-x_{qj}})(\hat{x}_k \mathbf {-x_{qk}})}{ \sqrt{(N \mathbf {-1})(\hat{a}_k\mathbf {-x_{qk}^2})-(\hat{x}_k \mathbf {-x_{qk}})^2} \sqrt{(N \mathbf {-1})(\hat{a}_j-\mathbf {x_{qj}^2})-(\hat{x}_j-\mathbf {x_{qj}})^2} } \end{aligned}$$
(4)
With the formula above, we calculate the Pearson correlation for the leave-one-out network directly from the sufficient statistics of the entire population, without the need to re-compute the Pearson correlation from scratch, by exploiting the knowledge of the population network. Thus, having both \(r_{jk}^\alpha \) and \(r_{jk}^{\alpha -q}\), as calculated in Eqs. (3) and (4), the LIONESS formula (Eq. (1)) for an ISN edge between genes j and k of individual q is as follows:$$\begin{aligned} r_{jk}^q=Nr_{jk}^\alpha -(N-1)r_{jk}^{\alpha -q} \end{aligned}$$
(5)
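Putting Eqs. (3)-(5) together, a minimal single-edge sketch of the incremental algorithm could look as follows (the helper `isn_edge` is hypothetical, not part of ISN-tractor's public API):

```python
import numpy as np

def isn_edge(X, j, k, q):
    """ISN edge between genes j and k for individual q via Eqs. (3)-(5)."""
    N = X.shape[0]
    # Sufficient statistics of the full population
    x_hat = X.sum(axis=0)
    A = X.T @ X
    a_hat = np.diag(A)

    # Population edge, Eq. (3)
    r_alpha = (N * A[j, k] - x_hat[j] * x_hat[k]) / (
        np.sqrt(N * a_hat[k] - x_hat[k] ** 2) * np.sqrt(N * a_hat[j] - x_hat[j] ** 2)
    )

    # Leave-one-out edge, Eq. (4): subtract q's contribution from each statistic
    M = N - 1
    Ajk = A[j, k] - X[q, j] * X[q, k]
    sj, sk = x_hat[j] - X[q, j], x_hat[k] - X[q, k]
    aj, ak = a_hat[j] - X[q, j] ** 2, a_hat[k] - X[q, k] ** 2
    r_loo = (M * Ajk - sj * sk) / (
        np.sqrt(M * ak - sk ** 2) * np.sqrt(M * aj - sj ** 2)
    )

    # LIONESS formula, Eq. (5)
    return N * r_alpha - (N - 1) * r_loo

# Cross-check against direct recomputation without sample q
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
q = 5
X_loo = np.delete(X, q, axis=0)
direct = np.corrcoef(X_loo[:, 0], X_loo[:, 1])[0, 1]
edge = isn_edge(X, 0, 1, q)
assert np.isclose(edge, 40 * np.corrcoef(X[:, 0], X[:, 1])[0, 1] - 39 * direct)
```

Note that the sufficient statistics are computed once and reused for every individual q, which is where the speed-up over recomputing the correlation N times comes from.
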
A summarization of the various steps of the incremental Pearson correlation algorithm is provided in Fig. S4.

Speeding up further the ISN computation by exploiting the sparsity of biological networks

Existing methods for constructing ISNs typically assume an underlying dense network. This poses challenges when the goal is to construct genome- or proteome-wide ISNs, since the number of edges scales quadratically with the number of genes/proteins considered. In contrast, the library we propose, ISN-tractor, introduces a novel way to reduce this computational burden in a biologically meaningful manner. If the user needs to build particularly large ISNs, they can choose to reduce the computational burden by defining a sparse network as the scaffold underlying the computed ISNs. This underlying sparse network can be selected from established biological networks, from databases such as STRING15, KEGG16, ConsensusPathDB17, or any other source. In this way, ISN-tractor can build ISNs considering only user-defined edges, thereby leveraging the inherent sparsity of biological networks18 to reduce the size and the computational cost of the ISNs. The integration of prior biological knowledge reduces the number of edges to be inferred in a biologically meaningful way, ideally shrinking the network while preserving the most significant interactions, which are likely to contribute meaningful insights. For instance, in the context of proteomics or gene expression data analysis, ISN-tractor can build ISNs based on known gene or protein interactions extracted from the Human Reference Interactome (HuRI)14, one of the most comprehensive protein-protein interaction networks available to date. This allows researchers to construct proteome-scale ISNs in a scalable way, tailoring the network underlying the ISN to their specific analysis objectives, dataset characteristics, or organism of interest.
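A minimal sketch of this idea, assuming a user-supplied edge list as scaffold (variable names are illustrative, not ISN-tractor's API):

```python
import numpy as np

# Toy data: 25 samples x 100 genes
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 100))

# Hypothetical scaffold: allowed edges only, e.g. gene pairs known to
# interact according to a database such as STRING or HuRI
scaffold = [(0, 1), (0, 7), (3, 42), (99, 5)]

# Compute population edges only for the scaffold, not for all pairs
edges = {}
for j, k in scaffold:
    edges[(j, k)] = np.corrcoef(X[:, j], X[:, k])[0, 1]

# Only len(scaffold) correlations computed, instead of 100 * 99 / 2 = 4950
assert len(edges) == len(scaffold)
```
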
Besides drastically improving the scalability of the computation, this biologically meaningful sparsification enables researchers to integrate prior knowledge into the network construction.

ISN-tractor can compute ISNs from genotype array data

ISN-tractor is designed to compute ISNs from a variety of -omics data, including but not limited to genotype array data, transcriptomics, and proteomics. This versatility allows ISN-tractor to be applied across different -omics datasets, ensuring its utility in a wide range of biological and clinical research applications. For instance, when applied to genotype array data, the observed Single Nucleotide Polymorphisms (SNPs) are first mapped to their respective user-defined genes or genomic regions. Each of these genomic regions is associated with a network node. The edge weights between nodes are inferred by evaluating the correlation between the sets of SNPs mapped on each node \(V_i\). In the case of large SNP-array data encompassing many genes, the number of edges to be computed can be reduced by sparsifying the expected network of connections between nodes, as explained in Section “Speeding up further the ISN computation by exploiting
the sparsity of biological networks”. For instance, if each node represents a gene, the user may opt to instruct ISN-tractor to permit edges exclusively between genes known to interact, as evidenced by biological networks such as HuRI14. To determine the weight assigned to each edge, ISN-tractor considers the sets of SNPs mapped on each pair of nodes \(V_i\) and \(V_j\) connected by an edge, obtaining an \(N\times m_i\) and an \(N\times m_j\) matrix respectively, where N is the number of samples and \(m_*\) indicates the number of SNPs mapped on node \(V_*\). In the case of SNP-array data, ISN-tractor computes the correlation between every SNP mapped on \(V_i\) and every SNP mapped on \(V_j\), thus obtaining an \(m_i \times m_j\) matrix \(C_{ij}\). To compute these correlations, ISN-tractor provides 3 commonly used similarity metrics (i.e., Pearson, Spearman, or dot product) and allows any other user-defined metric (see Sect. “Clustering HapMap data with ISN-tractor” for an example). Finally, in order to obtain a scalar value that can be used as the edge weight between nodes \(V_i\) and \(V_j\) in the ISN, the user can choose between several pooling strategies to summarize the matrix \(C_{ij}\); the simplest ones are to take the maximum or the average value of its elements. The computation of this additional matrix is necessary to accommodate the SNPs mapped on genomic regions to the ISN nodes in a flexible way. From this point on, the ISN can be computed conventionally with Eq. (1), similarly to an ISN derived from expression or proteomic data.

Figure 1(A) shows the time and memory usage comparison between ISN-tractor and LionessR. ISN-tractor outperforms LionessR in both benchmarks. We computed dense ISNs using the incremental Pearson algorithm in ISN-tractor for this comparison, because LionessR can only compute dense ISNs with the Pearson correlation.
(B) shows the filtration curves computed by ISN-tractor on osteosarcoma patients with good/poor MFS, in green and red respectively; the overlap is highlighted in yellow. (C) shows the PCA obtained from SNP-array based ISNs for individuals from different populations of the HapMap dataset.
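The SNP-to-node pooling step described above, for one pair of connected nodes, could be sketched as follows (illustrative names and random toy genotypes, not ISN-tractor's API):

```python
import numpy as np

# Toy genotypes (0/1/2 allele counts) for N = 60 samples:
# m_i = 4 SNPs mapped on node V_i, m_j = 6 SNPs mapped on node V_j
rng = np.random.default_rng(4)
N = 60
snps_i = rng.integers(0, 3, size=(N, 4)).astype(float)  # N x m_i matrix
snps_j = rng.integers(0, 3, size=(N, 6)).astype(float)  # N x m_j matrix

# m_i x m_j matrix C_ij: Pearson correlation between every SNP on V_i
# and every SNP on V_j (cross-block of the stacked correlation matrix)
C_ij = np.corrcoef(snps_i.T, snps_j.T)[:4, 4:]

# Pooling strategies to summarize C_ij into one scalar edge weight
edge_max = np.abs(C_ij).max()   # "max" pooling
edge_mean = C_ij.mean()         # "average" pooling
assert C_ij.shape == (4, 6)
```
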
