Ant colony optimization for the identification of dysregulated gene subnetworks from expression data | BMC Bioinformatics

In this section, we present the methods used in the proposed approach. First, we start with a description of the general framework of Ant Colony Optimization. Next, we define the measures of module significance, and we define the quantification of biological influence of dysregulated genes. Then, we explain the workflow steps, starting with the gene expression and interconnection datasets as input, up till the generation of candidate dysregulated modules. A diagram depicting these steps is shown in Fig. 1. The structures of the corresponding input and output matrices are visualized in Fig. 2.Fig. 1An overview of the proposed workflow which takes as input a gene expression dataset and an adjacency matrix representing gene interconnections. A built-in preprocessing step generates distance and rank correlation matrices, in addition to t-statistic scores reflecting the magnitude of differential expressions of gene in each module. Then, for each gene, the proposed Ant Colony Optimization approach identifies the most significantly dysregulated module, based on \(p^{fisher}\) which is derived from \(p^S\), the p value for the sphere of influence of a center gene on its neighbors in a module, and \(p^D\), the p value for the decay of differential expression within that moduleFig. 2The structures of input and output matrices in the proposed approachAnt colony optimizationAnt Colony Optimization (ACO) is one of numerous meta-heuristic algorithms inspired by swarm intelligence, in this case the foraging behavior of ants. It is widely used to tackle combinatorial problems [23]. The inspiration stems from the observed efficiency in which ants conduct their search for food starting from the nest. Biological ants display a type of communication known as stigmergy. The main characteristics of stigmergy arise from the medium employed in this type of communication: pheromones. Pheromone deposits reflect the quality of the achievement and are then a means of indirect communication between ants. Moreover, pheromone deposits are local and transmit information between ants within a locus. Pheromones not only indirectly reflect the end result of an explored path but also its length. Although random fluctuations exist early on during the search, ants usually deposit pheromones faster after returning to the nest from the shorter path. In this way, ants converge to the shortest path. In ACO algorithms (Algorithm 1) [23], a model \(P = (S, \Omega , \textit{f})\) consists of the search space S whereby a feasible solution \(s \in S\) satisfies all constraints in a set \(\Omega\) of constraints over the finite set of discrete variables defining S. A globally optimal solution \(s^* \in S\) additionally minimizes a given objective function \(\textit{f}: S\rightarrow \mathbb {R}_0^+\) (i.e. \(\textit{f}(s^*) \le \textit{f}(s) \forall s \in S\) ). The set of all possible solution components (i.e. all possible variable assignments in S) is denoted by C. Each component of a solution \(s \in S\) is associated with a pheromone value that varies with quality and evaporates at every iteration. A component is represented by either vertices from a set of vertices V or edges from a set of edges E of a construction graph \(G_C{(V,E)}\).Algorithm 1The Ant Colony Optimization MetaheuristicAt the start of an iteration, a set of A ants construct solutions by traversing the graph in a manner that satisfies constrains in \(\Omega\) followed by an optional local search. In Ant Systems, the decisions made by a given ant during its construction walk are governed by a stochastic process influenced by the pheromones allocated to possible components. ACO systems, however, use a pseudorandom mechanism that encourages elitism by deterministically picking the most probable component if a random number ranging from 0 to 1 falls under a user-specified threshold. Otherwise, decisions are made similarly to Ant Systems. Pheromones could be deposited in different ways. In Ant Systems, an offline pheromone update occurs at the end of each iteration after solutions have been constructed. ACO systems additionally employ a local pheromone update that decreases pheromone concentration on the last visited edge after each construction step performed by the ant with the current best solution. This serves to offset the offline pheromone update and encourage diversity. Evaporation is also performed using a user-specified or dynamic parameter to guide ants towards shorter and more frequently explored paths. In our formulation, each ant produces a candidate solution in each iteration, but these are independent from solutions generated at other iterations. In other words, information is not carried over between iterations, and only local pheromone updates are performed. Given the nature of the problem whereby the objective function is module-centric, an offline pheromone update would encourage convergence towards high-quality paths, which could bias the topology of generated modules. For the same reason, the final solution, or module, generated at each iteration is the union of the individual ant solutions (i.e. sets of vertices) explored in this iteration. Finally, we introduce an alloted capacity to each ant at every iteration. This capacity diminishes as the ant makes increasingly unfavorable moves, causing the ant to stop moving when its capacity is too low. Therefore, an iteration automatically terminates when all ants have no more capacity for movement.Module significance quantificationThe Order Statistic Correlation Coefficient (OSCC) [24] is a measure that can be used to detect linear and monotone nonlinear associations. It possesses the same basic characteristics as Pearson’s linear, Spearman’s \(\rho\), and Kendall’s \(\tau\) coefficients. OSCC also exhibits robustness to noise and efficient time complexity (\(O(n \log {n} )\)), when compared to Kendall’s \(\tau\) (\(O(n^2 )\)). The calculation of OSCC is given in equation 1 using two paired input arrays x and y.$$\begin{aligned} OSCC(x,y)\triangleq \frac{\sum _{i=1}^N(x_{(i)}- x_{(N-i+1)} ) y_{[i]} }{\sum _{i=1}^N(x_{(i)}- x_{(N-i+1)} ) y_{(i)} } \end{aligned}$$
(1)
The paired input arrays x and y are first ordered such that the order statistics \(x_{(1)}\le x_{(2)} \le \cdots \le x_{(N)}\) have respective concomitants being \(y_{[1]}, y_{[2]}, \cdots , y_{[N]}\). The order statistics and concomitants of the input y array are similarly defined. As \(N\rightarrow \infty\), \(E\{OSCC(x,y)\}=0\) under the assumption that x and y are mutually independent and both are independently identically distributed.In this application, similarly to how GS uses Kendall’s \(\tau\) to score the decay of DE of modules [22], OSCC is used to score a module centered at a given gene \(G_i\) and denoted by the set module as shown in equation 2. The geodesic distance is the number of edges in the shortest path to the center gene. OSCC outputs a score ranging from -1 to 1. A module having a score of -1 is optimal, since the discordance between absolute moderated t-statistics, representing the magnitude of DE of genes belonging to a given set module, and the geodesic distances of those genes from the center gene \(G_i\) is maximal according to the OSCC.$$\begin{aligned} OSCC(module)=OSCC(\{|t_j | : G_j \in module\},\{d(G_i,G_j ) : G_j \in module\}) \end{aligned}$$
(2)
We denote the change in OSCC with and without the inclusion of a given node \(G_p\) to a set module by \(\triangle OSCC(module \cup \{G_p \})\) which is calculated in equation 3. \(\triangle OSCC\) outputs a value ranging from -2 to 2, with -2 being the greatest possible change in OSCC in the favorable direction, that is from 1 to -1.$$\begin{aligned} \triangle OSCC(module \cup \{G_p \})=OSCC(module \cup \{G_p \})- OSCC(module) \end{aligned}$$
(3)
Biological influence quantificationIn this part, we explain how GS [22] quantifies biological influence at the gene-level through the incorporation of system-level network information. Then, we present our approach to derive such measures in the next section. A gene score is defined as a combination of two scores representing the decay of DE and the sphere of influence detected in neighboring genes that are selected on the basis of a variable radius R. The sphere of influence score indicates that a gene influences its neighbors such that their expression intensities are correlated. The decay of the DE score indicates how the dysregulation of a disease-relevant gene is propagated to its neighbors in a decreasing pattern whereby the level of dysregulation is inversely proportional to the distance from the given gene. Since the extent to which a gene influences its surrounding neighbors with respect to both scores is unknown, all possible values for R are considered. The pseudocode for GS is given in Algorithm 2.Algorithm 2Given a gene \(G_i\), for each possible value for R, all neighboring genes having a minimum distance from \(G_i\) less than or equal to R are selected for the module centered at \(G_i\). The combined score is then calculated for the candidate modules, and the optimal value for R is chosen as the one with the highest statistical significance.Proposed methodThe choice to base our implementation on ACO is mainly due to the problem being a local search [23]. The main modification applied to ACO was the introduction of a limited capacity for movement assigned to each ant to automatically terminate iterations. This capacity diminishes as the ant moves to nodes further from the center gene and nodes that unfavorably impact the module’s score. Another important modification to ACO is that although the search is carried out over a specified number of iterations, each iteration represents an independent search that yields a candidate result. In other words, there is no exchange of information between different iterations. This is because the aim is not for ants to converge to similar paths but to spread and explore different paths outwards from the designated center node. Moreover, since each gene’s score is a combination of both scores for the decay of DE and the sphere of influence, we optimize the scores for the decay of DE rather than optimize both scores. This is because the module of neighboring genes that exhibit the highest correlation of gene expression intensities with those of the center gene are expected to be relevant if they first show that they are influenced by the center gene in a disease-relevant context through the decay effect. The score for the sphere of influence is calculated for the module resulting from each iteration, and the optimal module is selected as the one having the combined score with the highest statistical significance, similarly to GS.Given N samples \(S_1, S_2, \cdots , S_j, S_{j+1}, \cdots , S_N\) having N class labels as well as a graph of \(V’\) nodes and E edges represented as an adjacency matrix M, we define a microarray dataset X of V genes \(G_1, G_2, \cdots , G_i, G_{i+1}, \cdots , G_V\) and N samples. The input to the algorithm are M, X and its class labels, a seed for reproducibility (different than the seed gene), as well as several configuration parameters. The pseudocode is given in Algorithm 3.Algorithm 3An built-in preprocessing step, as depicted in Fig. 1, filters dataset X and M to only include genes present in both X and M, calculates a Spearman rank correlation matrix \(\rho\) and \(\rho ^k\) from dataset X as well as a user-specified number K of random permutations of genes in X, calculates a distance matrix d from matrix M, conducts a DEA over dataset X using true class labels as well as a user-specified number of random permutations K of class labels in X, and designates the resulting moderated t-statistic \(t_i\) and \(t_i^k\) as the observed and permuted scores, respectively, for each vertex \(G_i\).The algorithm is repeated a maximum of \(\min {(V,V’)}\) times, each time given a specified gene \(G_i\) and number of iterations n, and outputs n modules centered at \(G_i\) with each module being assigned a combined p value adjusted for n iterations similarly to \(p_i^{GS}\) except using Benjamini-Hochberg (BH) adjustment for a less stringent correction than the Bonferroni adjustment. A gene \(G_i\) is skipped if it is detected as being an isolated node (i.e. if \(G_i\) has no edge connected to another gene \(G_j\) where \(i \ne j\)).As in the classic ACO algorithm [23], a designated number of ants A are given a starting point. In this case, the starting point is the designated center gene \(G_i\). Figure 3 depicts the start of an iteration in the algorithm for a gene \(G_i\) as the starting point in a network whereby t indicates the score of each vertex.Fig. 3An example of two ants searching for gene modules in the proposed Ant Colony Optimization approach. The networks from left to right depict consecutive iterations with gene \(G_i\) as the starting point. Below every network, a table shows the position, capacity, and moving ability of ants. At each step, an ant moves to an adjacent gene, and updates the pheromone value on the traversed edge based on the favorability of the move. Visited nodes, colored in pink, constitute the module centered around \(G_i\). Possible adjacent nodes to visit are colored in green nodes, while the rest of the nodes in the network are colored in grey. Note that given the proposed algorithm constraints, ants \(a_1\) and \(a_2\) become unable to move in the middle and the right networks, respectivelyIn our method, at the start of every iteration, the ants are placed at the starting point. In the first iteration, the ants move randomly. Although iterations in this implementation are independent, the iteration marked as first is still randomized and considered as a viable candidate solution. This iteration could also be used as a reference in future analyses that might investigate the topology of resulting modules. This is because modules in the first iteration are mainly shaped by capacity rather than pheromone, both of which are discussed below. Nevertheless, since investigating how the topology of resulting modules is formed is not the focus of the study, the random iteration is not investigated further and is treated like any other iteration.At each move, an ant a picks a possible node \(G_p\) directly neighboring (i.e., one edge away from) its current position \(G_j\) and assign a value to the explored edge as a representation of pheromone. For example, Fig. 3 depicts a snapshot mid-iteration where ant \(a_1\), positioned on \({G_j}’\) is no longer able to move and ant \(a_2\) positioned on \({G_j}\) is assessing its possible movements after local pheromone updates are performed for the previous movement. The pheromone update value reflects the favorability of the move with respect to the objective function and is calculated using Eq. 4 and 5.$$\begin{aligned}{} & {} EL(module(a) \cup \{G_p\})=\frac{d_{normalized} (G_i,G_p )}{4} (2+ \triangle OSCC(module(a) \cup \{G_p\})) \end{aligned}$$
(4)
$$\begin{aligned}{} & {} update(Pher_{jp},a)=Pher_{jp}+(1-EL(module(a) \cup \{G_p\})) \end{aligned}$$
(5)
The nodes currently marked as visited by ant a constitute the set module(a) centered at \(G_i\). The values for pheromones present on each edge are recorded in a pheromone matrix Pher that resets at every iteration to prevent information exchange between iterations. Nevertheless, pheromones transfer information between ants within the same iteration. An ant deposits pheromone mid-iteration as it crosses an edge with a local pheromone update similarly to the ACO system [23]. Rather than decreasing pheromones locally, ants exchange decision-quality information locally. In this way, diversification of trajectories explored by ants between iterations is encouraged with no offline information guiding ants across iterations. This is because the goal is to produce many candidate modules that are qualitatively different before choosing the best one. Moreover, another difference is that all ants are capable of depositing pheromone, not just the best one. This modification is put in place to encourage diversification of trajectories taken by the ants within a given iteration. In other words, within an iteration, the ants are encouraged to take different paths which are combined to result in a module of varying topology. The pheromone map is initialized to have all values equal to 1.0 at the start of every iteration. Pheromone deposited on the same edge accumulates, and no evaporation is incorporated since the pheromone map resets every iteration. The energy lost or required by ant a to make a specific move from \(G_j\) to \(G_p\) is represented by \(EL(module(a) \cup \{G_p\})\). This value is calculated using a row-wise normalized version \(d_{normalized}\) of the distance matrix d. EL returns a value in the interval [0,1], and the favorability of the move is expressed as \(1-EL\).For each iteration except the first, the ants move probabilistically using Eq. 6 to calculate the probability of making a certain move from \(G_j\) to \(G_p\). This is done similarly to Ant systems rather than an ACO system in order to avoid increasing elitism in the probability function [23].$$\begin{aligned} P_{jp} (a)= \frac{(Pher_{jp} )^\alpha (2-EL(module(a) \cup \{G_p\}))^\beta }{\sum _{\{G_q:G_q\in possibilities(a)\}}(Pher_{jq} )^\alpha (2-EL(module(a) \cup \{G_q\}))^\beta } \end{aligned}$$
(6)
The parameters \(\alpha\) and \(\beta\) are user-specified and reflect the weight given to pheromone and attractiveness of the move, respectively. In other words, in a given iteration, \(\alpha\) reflects how similarly the ants will move, and \(\beta\) reflects how much importance an ant will give to more favorable moves rather than moves similar to other ants. As \(\alpha\) is increased, the ants are expected to produce modules that increasingly converge to a single path. Increasing \(\beta\) will allow the ants to explore different paths and decrease the constraints over the topology of the network. If the pattern of decay can be observed over numerous paths going away from the center (seed) gene, the ants are more likely to explore them. Moreover, the more ants are available, the more different favorable paths are likely to be explored. This could also be achieved by increasing the number of iterations to produce more candidate modules, but this could be more time-intensive than increasing the number of ants. Moreover, increasing the number of ants would impose less restrictions over how many different paths can be explored.The probabilities are calculated for each possible movement. Possible movements are defined as nodes that are one edge away from the current position of an ant a and are represented by possibilities(a). Nodes that have already been visited by a during the current iteration, nodes that would diminish the capacity of a, Capacity(a), below 0 if visited, as well as nodes that keep a at the same distance from or bring a closer to \(G_i\) (i.e. \(d(G_i,G_p)\le d(G_i,G_j)\)) are excluded from the set possibilities(a). This helps short-list possible movements as well as guide ants to spread outwards from the center.The main difference between ACO and our implementation is that a capacity measure is introduced. Each ant is specified a starting capacity that diminishes on each move the ant makes to a node \(G_p\) according to Eq. 7.$$\begin{aligned} update(Capacity(a))=Capacity(a)-EL(module(a) \cup \{G_p\}) \end{aligned}$$
(7)
When an ant’s capacity reaches 0 or it has no more possible movements to consider (i.e. \(possibilities(a)= \phi\)), the ant stops moving. An iteration ends when all ants are no longer able to move. The resulting module centered around \(G_i\) for this iteration are all nodes visited by at least one ant and is denoted as module. Using Eqs. 8–11, module is scored and recorded at each iteration. \(p_i^{fisher}\) is extracted similarly to GS. Figure 3 also depicts the end of a single iteration n whereby both ants are unable to move, and the module produced includes all nodes visited by any of the two ants. Significance is then assessed, but not depicted, using \(\rho ^k\) and \(t^k\).$$\begin{aligned}{} & {} C_i (module)=\sum _{\{G_j:G_j\in module\}}|\rho _{ij} | \end{aligned}$$
(8)
$$\begin{aligned}{} & {} p_i^S (module)=\frac{1}{K} |\{k\in [1,K] \cap N^*: C_i^k (module)>C_i (module)\}| \end{aligned}$$
(9)
$$\begin{aligned}{} & {} p_i^D (module)=\frac{1}{K} |\{k\in [1,K] \cap N^*:OSCC_i^k (module)<OSCC_i (module)\}| \end{aligned}$$
(10)
$$\begin{aligned}{} & {} \chi ^2= -2(\ln {p_i^S (module)}+\ln {p_i^D (module)}) \end{aligned}$$
(11)
To further offset the added computational time of the proposed method as compared to GS, for each gene \(G_i\), the n iterations are parallelized, as depicted in Fig. 1, using the R package parallel’s mclapply [25]. Therefore, parallelization is only supported for Mac users. No parallelization is implemented between genes. We also use the order statistic correlation coefficient (OSCC) [24] which is more time-efficient (\(O(n \log { n})\)) than Kendall’s tau-b (\(O(n^2)\)), that is used in GS, in addition to numerous distance penalties and search restrictions, as described above, to constrain the search-space.ExperimentationDatasetsTo test our approach, we rely on the R package GSEABenchmarkeR [26] of Bioconductor [27]. This package includes curated expression datasets that are related to various human diseases. It was developed to facilitate the assessment and comparison of enrichment analysis methods. We consider a total of four datasets that have different case to control proportions and sample sizes. The characteristics of these datasets are shown in Table 2 which covers the corresponding GEO dataset ID, disease, total number of samples, total number of control samples, total number of mapped genes in the expression dataset, number of common genes between the expression dataset and the KEGG network, the diameter of the resulting largest component of the KEGG network, and the total number of isolated genes which are not part of the connected component. Three of them are related to neurodegenerative diseases, namely Parkinson’s (\(\text {GSE20291}\)) [28], Alzheimer’s (\(\text {GSE5281}\)) [29], and Huntington’s (\(\text {GSE8762}\)) [30] disease. The latter allows testing the stability of our approach with respect to different dataset characteristics. All three neurodegeneration datasets are already preprocessed by removing outlier arrays, applying a log transform, applying RMA normalization [31] from the affy [32] R package, resolving duplicate probe to Entrez ID mappings by keeping the probe with the most significant limma [1] moderated t-statistic, and filtering out genes that could not be mapped to any KEGG [33] pathway [26, 34, 35]. The fourth dataset is the p53 mutation dataset from the NCI-60 cell lines originally available through the GSEA Broad Institute website [4, 36]. This dataset is used to ensure the method performs comparably better than other tested methods on a dataset unrelated to neurodegenerative diseases. The dataset is imported through the R package GSAR [37]. The p53 mutation dataset is also already preprocessed by quantile normalizing and log-transforming probe intensities, discarding probes without Entrez ID mappings, and resolving duplicate probe to Entrez ID mappings by keeping the probe with the largest absolute t-statistic between cases and controls [37].Table 2 Microarray gene expression datasets considered in this study along with their basic characteristicsGene-gene interaction networkAn initial step in our approach consists of generating the gene-gene interaction network that corresponds to the input gene expression dataset. Accordingly, we use the R package KEGGREST [38] of Bioconductor [27] to get the set of unweighted and undirected KEGG [33] pathways. Next, we combine those individual pathways into one network, using a graph union operation, and we extract the largest connected component using the Python package NetworkX [39]. Then, this component is filtered to only cover that genes assayed in the expression dataset.Experimental setupTo assess our proposed method, each dataset is used for the proposed heuristic, GS [22], and LEAN [3] using the respective KEGG [33] network. limma [1] is also run as the baseline tool. The parameter values used for the proposed heuristic are shown in Table 3. They consist of the number of ants per iteration, the number of iterations per gene, the starting capacity of each ant, the number of cores for parallelization, alpha and beta parameters, in addition to the number of random permutations. Several values for the number of ants are tested (results are not included), and the value of 40 is chosen through trial-and-error. It is generally observed that increasing the number of ants improved results although no value above 40 is tried. Hence, it could be useful to further investigate the impact of this parameter, and others, on the algorithm’s behavior in future studies. The starting capacity is always 1.0, the number of iterations n is always 5, the number of cores used for the parallelization of iterations is always 5, and \(\alpha\) and \(\beta\) were fixed to 0.6 and 1.2, respectively. In line with the formulation of ant colony proposed here which encourages diversification of paths taken by ants rather than their convergence within a given iteration, we set a higher value for hyperparamter \(\beta\) (1.2) as compared to \(\alpha\) (0.6) so that, within an iteration, ants are influenced more by the favorability of their own route rather than that of the other ants. After several experimentations, we found that these parameter values gave the best results. The random seed is set to 1, 2, or 3 in each experiment, respectively, for reproducibility. For GS, the random seed is set to 121, 122, and 123 in each experiment, respectively. limma and LEAN are both employed using default configurations and random seed being set to 1, 2, or 3 in each experiment, respectively, for LEAN. The number of random permutations used for non-parametric statistical significance assessment within the tools is set to 100. The proposed method is observed to be stable for these datasets across the experiments. The methods are also only run for a single experiment on the p53 mutation dataset using the same corresponding random seeds as those of Experiment 3. The statistical significance cutoff is set to 0.05 for all experiments.Table 3 Parameter settings used for the proposed heuristic methodAn enrichment analysis was then conducted by testing whether genes of a given KEGG pathway exhibit p values that are lower than the background genes represented by all other genes. A total of 353 KEGG pathways were extracted using the R package KEGGREST [38]. In order to infer the statistical significance of our results, the Wilcoxon rank-sum tests were used along with BH adjustment to correct for multiple testing. Accordingly, we are able to examine whether the underlying distributions and findings would still hold when the data is rearranged and shuffled. The significance and ranking of disease-relevant pathways are then compared between tested methods for each dataset. Disease-relevant pathways were selected according to KEGG pathways listed as relevant under the KEGG [33] entry for each disease. This analysis constitutes a single experiment. This process was repeated ten times for each neurodegeneration dataset with different seed values to check the relative stability of the tested methods since all except limma include one or more random components. Therefore, a total of ten experiments per neurodegeneration dataset were conducted as compared to a single experiment for the p53 mutation dataset. Note that for the enrichment analysis, only the GSE5281_VCX subset was used to represent Alzheimer’s disease since it is the one with the most samples (33) out of the three subsets considered.Finally, concordance was assessed between pairs of 3 different subsets of the Alzheimer’s dataset (\(\text {GSE5281}\)) [28]. Each subset consists of samples taken from different brain regions relevant to this disease. In this case, the visual cortex subset (GSE5281_VCX), the hippocampus subset (GSE5281_HIP), and the entorhinal cortex subset (GSE5281_EC), as divided in the GSEABenchmarkeR [26] package, were considered. As in GS [22], Spearman’s \(\rho\) was used to assess concordance of gene-level p values between different dataset pairs for each of the tested methods.

Hot Topics

Related Articles