PEMPS: a phylogenetic software tool to model the evolution of metabolic pathways | BMC Bioinformatics

PEMPS is a Linux command line-executable Python program that can simulate the evolution of an input metabolic pathway over an input phylogeny and report the changes in model parameters along the lineages of the phylogenetic tree over evolutionary time. The user provides commands in a text file designating the underlying biology, including a metabolic model, the reaction flux(es) on which selection operates and the relative weights in the calculation of selective coefficients, a phylogeny as a Newick string for the species (a standard format which gives the topology and branch lengths of the species phylogenetic tree), the ploidy of the genome, population size(s) of all branches, any non-genetic parameters to be excluded from mutation, and the number of simulations to run. A future release could enable lineage-specific shifts in the optimal flux for a pathway. The user is given control over the network dynamics, as they determine which model parameters are held constant and on which reaction fluxes selection acts. A different decision about the target of selective pressure results in new evolutionary constraints on different constellations of model parameters in the network, which is likely to affect the changes observed in parameters over phylogenetic time. A flowchart of the overall program function is presented in Fig. 1.Fig. 1The program requires a Newick string from the user-specified file for which the branch simulation proceeds recursively. The tree must be rooted [18]. Rooted trees can be obtained from the NCBI taxonomy [19], TimeTree [20], rooted using gene tree/species tree reconciliation with software like SoftParsMap [21] or NOTUNG [22], or by software like DendroPy [18] using techniques like mid-point rooting when species relationships are unknown [18]. PEMPS requires branch lengths as substitutions per site or in millions of years to determine the number of generations to simulate for each branch. The input metabolic model is prepared with the GUI version of COPASI, and the SE version, interfaced through Python, can be used to calculate steady-state flux [23, 24]. BasiCO is a Python module which enables easy modifications and simulations (including steady-state flux) with COPASI [25]. Most but not all models’ parameters can be accessed through BasiCO; if this package cannot interface with a given model, PEMPS will instead use regex to rewrite the.cps file and run COPASI with subprocess commands. As the fitness calculation in the mutation-selection process depends on steady-state flux (where selection is based upon this criterion), the program is only compatible with models that can reach steady state with their original parameters. Detection of binding constants for mutation and Haldane’s constraint depends on informative parameter naming, which some models do not provide (the label could be parameter__12 as opposed to KmATP). In such cases, the user would have to offer specifications in the commands file.The program conducts a forward-time simulation with discrete generations using a mutation-selection model [26]. In each generation, each parameter has a small probability (Poisson random variable, lambda = 0.003) of mutating, and if mutation occurs, the parameters are updated before the recalculation of steady-state flux. To model the tendency towards slightly deleterious change, mutational effects are drawn from a normal distribution centered at − 1% for all parameters except binding constants (1%), because poorer binding corresponds to a larger Km value. The assumption that mutations tend towards slight functional degradation is consistent with results from mutation accumulation experiments [13] and parameterizations of the distribution of fitness effects in protein coding genes [15, 27].The user can hold any global or kinetic parameters to their starting values by prohibiting their mutation, which could be useful, for example, to incorporate an assumption of constantly replenishing supply of the network’s input(s). Most metabolic models explored during development of the program are constructed this way, and without any specification, the network inputs remain constant throughout the simulation. To implement Haldane’s constraint, the program collects each reaction’s maximum velocity, equilibrium constant, and binding constants, provided that they are informatively named (which tends to be the case) and that there are equal numbers of substrates and products, so that pairings can be established. After a mutation to either a forward or reverse binding constant, the reverse reaction velocity is calculated, and the counterpart parameter is updated accordingly. An analogous procedure would be applied to the catalytic constant and its reverse, although most models are not specified with reverse kcat values.To model selective pressure on the steady-state flux of a subset of the network’s reactions, population fitness is calculated from proportionalized reaction flux (x) with the logistic function given in Eq. 1 [2].$$Fitness = \frac{1}{{1 + e^{10(0.5 – x)} }}$$
(1)
A logistic curve is suitable because the upper asymptote represents a diminishing marginal return that corresponds to the biomolecular reality that although a cell cannot survive with zero flux, increasing production beyond a certain threshold results in no further fitness advantage. Some contexts, however, where there is effectively no ceiling to the benefit of increasing cellular product would be best modeled with a linear function, as might be the case for the generation of materials required for cell division during the exponential growth phase in bacteria. Because multiple reaction fluxes may be under selection, and these values may lie on different scales, the fluxes are proportionalized based on initial (pre-mutation) flux values such that the starting fitness score for each flux is 1. The user can override this default by passing a command to change the multiplier in the following formula: proportionalized_flux = start_flux/(start_flux * multiplier). Fitness scores are calculated for each of these proportionalized flux values and adjusted such that a minimum fitness of 0 corresponds to no enzyme flux, a fitness of 1 corresponds to the starting flux, and the asymptote represents the production threshold beyond which no further fitness advantage is gained. The justification for this is that the wild type is fit and evolution under stabilizing selection is modeled. Alternatives that involve positive selection are enabled with the multiplier. The population fitness is then the weighted geometric mean of these normalized scores. This function can be modified in the code by a user. The population fitness values from before and after a mutation are used to calculate the probability of fixation with Kimura’s formula (Eq. 2) [28] as presented by Otto and Whitlock [29].$$Pfixation = \frac{1 – {\text{e}}^{-2{\text{cN}}_{e}\text{sp}}}{1 – {\text{e}}^{-2{\text{cN}}_{e}\text{s}}}$$
(2)
The selection coefficient s is the ratio of the new population fitness (after mutation) to the old fitness − 1. The initial frequency parameter p is usually calculated by 1/(cNe) where c is the ploidy and Ne is the effective population size, but to speed up simulation, this parameter is set to 0.5 [30]. This gives neutral mutations a 50% chance of fixing, accelerating the neutral walk across sequence space. Before the simulation runs through the tree, the “root” population undergoes mutation and selection until its fitness and most (80%) of its reaction fluxes reach values that remain stable over many generations. This state of equilibrium is determined by calculating the slope of the line of best fit and coefficient of variation (values indicating “flatness” of the line) of the select parameters over 100-fixation windows. When the values of five consecutive windows are within certain thresholds of zero for fitness and the majority of fluxes, equilibration continues for as many generations as had transpired by that point, and then branching simulation commences. Because equilibration occurs before branching, the phylogeny is not relevant in this step, which means that if the starting population size, ploidy, and fluxes under selection are the same, one equilibrated population can act as the starting point for any other tree the user wants to simulate with the same starting population parameters and metabolic model. The user has the option to import an equilibrated set of parameters into the root node population and can either equilibrate further or start the phylogenetic simulation.Branch lengths as either millions of years or substitutions per site are required, and the number of generations to be simulated for each branch depends on the length and user-specified measurement type. To convert branch lengths to number of generations to simulate, the program determines at what number of generations during equilibration an average number of 7 mutational proposals are introduced to both enzyme concentration and enzyme functional parameters across the model reactions. This threshold is based on the following assumptions: (1) the two-percent rule which states that branching avian and mammalian lineages tend to differ from each other by 1% every 1 million years [31]; (2) that 50% of neutral changes fix; (3) a bias towards deleterious change; (4) that the coding sequence for a typical enzyme contains ~ 250 amino acids subject to the mutational effect distribution for enzyme parameters and ~ 250 sites that can affect gene expression leading to changes in protein concentration. This latter category would include mutations in promoter sites that influence transcription factor binding [32], which affect the quantity of transcripts and ultimately proteins produced, thus causing changes in the concentration of enzyme ([E]). In models parameterizing [E], PEMPS probabilistically proposes mutations to [E] and other enzyme parameters in a 1:1 ratio. If the input branch lengths are measured in millions of years, this threshold is simply multiplied by branch length to calculate simulated generations per branch; if measured in substitutions per site, a factor of 100 is introduced to convert branch lengths to percentages.Lineages (edges) are formed by connecting branch simulations when tracing a path directly from the equilibrated root node through all internal nodes to an external node, with bifurcating branches each starting their separate simulations with the same set of inherited parameters. Larger branch lengths result in a higher number of generations along the lineage, affording more opportunity to fix new mutations, which creates an expectation of greater change along longer lineages. Population sizes are also determined by the user, who can choose to set a uniform value across the entire tree or enter the population size individually for each branch. A new equilibrium may not be reached after branching to a new population size by the next bifurcation, but this corresponds to real evolution. Output generated by the program includes the changes in parameters for all mutations, fixed and unfixed, in tabular format, graphs of parameters for lineage and branch, and heatmaps of percent differences from starting values for lineages and branches (a more detailed description of usage and output can be found in the program docstrings https://github.com/nmccloskey/PEMPS). When multiple simulations are run (the user’s decision), tables of average percent differences for each parameter for lineages and branches are produced with corresponding heatmaps.

Hot Topics

Related Articles