TopoDoE: a design of experiment strategy for selection and refinement in ensembles of executable gene regulatory networks | BMC Bioinformatics

Average number of differences between GRNsTo measure the number of differences between candidate GRNs topologies, we considered GRNs as directed graphs where nodes were the genes and edges were the $\theta$ interaction values. We computed for each pair of GRNs a and b the number $d_{a,b}$ of different $\theta$ values between all pairs of genes i and j :$$\begin{aligned} d_{a,b}, = \sum _{i = 1}^{G} \sum _{j = 1}^{G} \mathbbm {1}_{\iota (\theta _{i,j}^a) \ne \iota (\theta _{i,j}^b)} \end{aligned}$$
(1)
with$$\begin{aligned} \iota (\theta ) = {\left\{ \begin{array}{ll} 1, &{} \text {if } \theta > 0 \\ -~1, &{} \text {if } \theta < 0 \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)
where G is the number of genes in the GRNs and $\theta _{i,j}^x$ is the interaction value between genes i and j in GRN x. All $d_{a,b}$ values were finally averaged to obtain the mean number of pairwise differences.Mechanistic model of gene expressionSimulations were run using a mechanistic model of gene expression described in [18] and based on the two-state model. Briefly, a gene i is described by a promoter $E_i$ which can be in states on or off and randomly switches between those states at rates $k_{on,i}$ and $k_{off,i}$ respectively. When the promoter is active (on state), mRNA molecules ($M_i$) are synthesized at rate $s_0,i$. At any time, proteins ($P_i$) are produced at rate $s_1,i$ from mRNA molecules, mRNAs are degraded at rate $d_0,i$ and proteins are degraded at rate $d_1,i$. The following equations summarize the model:$$\begin{aligned} {\left\{ \begin{array}{ll} E(t): 0 \xrightarrow {k_{on}} 1, 1 \xrightarrow {k_{off}} 0 \\ M'(t) = s_0 E(t) – d_0 M(t) \\ P'(t) = s_1 M(t) – d_1 P(t) \end{array}\right. } \end{aligned}$$
(3)
Here, the first line of the model describes a discrete, Markov random process. The second and third lines are ordinary differential equations describing the evolution of mRNA (M) and protein (P) levels.Interactions between a set of S stimuli and G genes in a GRN are encoded by interaction parameters $\theta$ and by letting rates $k_{on,i}$ and $k_{off,i}$ be functions of proteins $P = (P_1,\ldots , P_G)$ and stimuli $Q = (Q_1,\ldots , Q_S)$ levels as described in equation 4 (see Fig. 5).$$\begin{aligned} k_{on,i}(P, Q) = \frac{k_{on\_min,i} + k_{on\_max,i} \beta _{kon,i} \Phi _i(P,Q)}{1 + \beta _i \Phi _i(P,Q)} \end{aligned}$$
(4)
with$$\begin{aligned} \Phi _i(P,Q) = \prod _{s=1}^{S} \frac{1 + e^{\theta _Q,i,s} \left( \frac{Q_s}{H_{Q,s}}\right) ^\gamma }{1 + \left( \frac{Q_s}{H_{Q,s}}\right) ^\gamma } \prod _{j=1}^{G} \frac{1 + e^{\theta _P,i,j} \left( \frac{P_j}{H_{P,j}}\right) ^\gamma }{1 + \left( \frac{P_j}{H_{P,j}}\right) ^\gamma } \end{aligned}$$
(5)
In equation 5, $H_{Q,s}$ and $H_{P,j}$ are interaction thresholds for stimuli and proteins respectively, defining the minimal stimulus (or protein) quantity required to have a significant effect on a target gene. The Hill exponent parameter $\gamma$ is set to 4 in all cases. Similarly, the interaction function for the rate $k_{off}$ is given by equation 4 when replacing occurrences of index $_{kon}$ by $_{koff}$. Equation 5 is a modified version of those introduced in [17, 18] to account for multiple stimuli and with the added stimulus threshold parameter $H_{Q,s}$.For each gene, parameters $\beta _{kon,i}$ and $\beta _{koff,i}$ modify the gene’s basal expression level to account for the constant influence of genes outside of the modelled GRN. $\beta$ values are estimated from experimental single-cell mRNA distributions but their correct identification is challenging and the algorithm used to that end is described in section Simulation balancing.Fig. 6Executable GRN model. An example of a model of 2 genes A and B with a stimulus S represented by a yellow thunderbolt. Gene B regulates genes A, shown here by a non-null $\theta _{B,A}$ value, and is itself under the regulation of the stimulus. Gene A has an auto-activation loop: $\theta _{A,A}$ defines a self regulationKnock-Out perturbation implementationGene KOs were implemented in the simulation model by setting all $\theta$ interaction values between the perturbed gene and its neighbors to 0. $\beta _{Kon}$ and $\beta _{Koff}$ values were also set to 0 for that gene. During simulation, the probability of promoter activation (E being in state on) was forced to 0 so that the gene would not even be transcribed at the basal level.During balancing and simulation, reference mRNA and protein counts were systematically set to 0 for the knocked-out gene so that the simulation would start completely devoid of such molecules.Variance IndicesPer-gene variances in the gene-to-gene and stimulus-to-gene interactions in a set of GRNs (all sharing the same set of genes and stimuli) were computed using a variance index. First, all interaction values were categorized into activation (if the interaction value was greater than 0), inhibition (if the interaction value was lesser than 0) and no-interaction (if the interaction values was equal to 0). Those values were consequently replaced by 1, $-$ 1 and 0 respectively using the $\iota (\theta )$ function defined in equation 2.The Ancestors Variance Index (AVI) only considers interactions between a given gene and its parents (i.e. the genes regulating that given gene), while the Descendants Variance Index (DVI) only considers interactions with its children (i.e. genes regulated by that gene). Those two indices make it easy to identify if a gene’s interactions vary mostly because of the interactions with genes upstream or downstream. In particular, a gene KO is expected to have an effect on downstream genes.For each gene i in a collection of GRNs, the indices were defined as:$$\begin{aligned} AVI_i = \sum _{g=1}^G Var(P_g) \end{aligned}$$
(6)
$$\begin{aligned} DVI_i = \sum _{g=1}^G Var(C_g) \end{aligned}$$
(7)
where $P_g$ is the vectors of interaction values between gene g and its parents and $C_g$ the vector of interactions values with its children. G is the number of genes in the GRNs.Measure of distance between multivariate distributionsDistances between multivariate distributions (simulated or experimental data) were measured using the Kantorovich distance [21] (also referred to as Wassertein or EMD distance). Because of the high number of variables (i.e. genes), the number of sample points (i.e. cells) was however too low to correctly estimate the multivariate Kantorovich distance. We thus devised a modified distance which computes the sum of Kantorovich distances on marginals (i.e. one variable at the time), making it practically usable.This distance, named $Kantorovich_{1D}$, has the following form:$$\begin{aligned} Kantorovich_{1D}\big (D^{(1)}, D^{(2)}\big ) = \sum _{g=1}^G W_2\left( D^{(1)}_g, D^{(2)}_g\right) \end{aligned}$$
(8)
where $D^{(1)}$ and $D^{(2)}$ are 2 multivariate distributions (both with the same G variables) and $W_2$ is the regular Kantorovich distance. $D^{(1)}_g$ and $D^{(2)}_g$ refer to the vector of values in $D^{(1)}$ and $D^{(2)}$ for variable g.Simulation balancingBefore any simulation could be executed, it was essential to correctly balance it, i.e. constant hyper-parameters had to be chosen such that the initial state produced by the simulation was a desired stable state. Here, parameters $\beta _{Kon}$ and $\beta _{Koff}$ act as adjustment variables which can force genes into high or low expression regimes by increasing or decreasing the value of the interaction function $\phi$.Finding the correct $\beta$ values is a non trivial task since 2 parameters ($\beta _{Kon}$ and $\beta _{Koff}$) need to be fitted for each gene of the G genes in the GRN. In our case, $49 \times 2 = 98$ parameters needed to be fitted per GRN. Such high dimensional optimization problems suffer from the curse of dimensionality and either converge to a solution in very long times or don’t allow a solution to be found at all.Fortunately, each gene could be considered independent from the other since the goal of the balancing process was to find $\beta$ values such that the expression level of all genes remained totally unchanged. In this case, all mRNA and protein distributions (apart from that of the gene we are trying to balance) can be considered constant through time, thus transforming a $G \times 2$ dimensional optimization problem into G 2 dimensional problems.Resolution of these problems was however made difficult by the stochastic nature of the simulation outputs. To that end, we adapted a simulated annealing algorithm to the noisy cost function case as described in [34, 35]. We designed a cost function taking as input a tuple of $\beta _{Kon}$ and $\beta _{Koff}$ values, which executes the simulation of a single gene with those $\beta$ values for 20 h and returns the Kantorovich distance between the simulated data (at t=20 h) and the initial data (at t=0 h). The simulated annealing method is detailed in Additional file 1.Simulation initializationAfter balancing, simulations were initialized by setting mRNA counts for each gene from the distribution of single-cell RT-qPCR data which was used in the GRN inference task. Only data at the initial time point was used here.Measure of information gained after perturbationSimulations of perturbations on the set of candidate GRNs predicted effects on varying numbers of genes and significant gene level variations were observed for different proportions of the 364 GRNs. To determine which perturbation carried the most information, we computed the entropy of the proportion of GRNs with significant expression variation for each of the 49 genes. As an example, FNIP1 had significant expression variation for genes SNX22 (in 88 out of the 364 GRNs), SLC25A37 (in 45 GRNs) and ENSGALG00010025565 (in 102 GRNs). We thus computed the entropy using equation 9, where p was the vector $(\frac{88}{364}, \frac{45}{364}, \frac{102}{364}, 0,\ldots , 0)$ (with 46 trailing zeros for the 46 genes with no significant variation) encoding the proportion of GRNs with expression variation.$$\begin{aligned} -\sum p_k \times log(p_k) \end{aligned}$$
(9)
Statistical testsStatistical analyses were performed in Python 3.10 [36]. Difference in expression levels across conditions were tested using a one-sided T-test implemented by the ttest_ind function of the scipy package [37]. p-values were corrected using the multipletests function of the statsmodels package [38] using the Benjamini-Hochberg algorithm and default family-wise error rate of 0.05. Entropy values were computed using the entropy function from the scipy package. Box plots were produced with the plotly package [39].Cell cultureFertlized SPAFAS white leghorn chicken eggs provided by INRAE (Tours, France) were incubated in our facilities. Embryos were euthanized by decapitation in accordance with the DIRECTIVE 2010/63/EU OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 22 September 2010 on the protection of animals used for scientific purposes, a procedure approved by our local animal ethics committee. After embryo sacrifice, bone marrow cells were harvested by perfusion from tibias of 19-days old embryos [40], seeded into culture and maintained in self-renewal state in LM1 medium [20].CRISPR plasmids constructionA guide RNA against FNIP1’s sequence (ENSGALG00000017462) was designed using the CRISPOR design tool [41] to target the exon number 5. Oligonucleotides were purchased from Eurogentec (Table 3). The guide was cloned after hU6 promoter into BbsI-digested pCRISPR-P2A-tRNA vector [42]. The efficiency of our CRISPR vector in T2EC cells was confirmed by analyzing mutations after sequencing.Table 3 Oligonucleotides sequences used for CRISPR plasmid constructionCells transfectionAfter 12 days in culture, $30 \times 10^5$ cells were resuspended in 100µL of transfection medium (Cell Line Nucleofector Kit V—Amaxa) and transfected with 6,5µg of pCRISPR-P2A-tRNA empty vector or pcrFNIP#9 vector using the T16 K652 program. 500µL of RPMI (RPMI 1640 Medium no phenol red—Gibco) were added to the cell solution for a recovery step of 8 min. Then cells were transferred at $1,25.10^6$ cells/ml in LM1 medium without penicillin and streptomycin and grown in standard culture conditions.Single cells sorting24 H after transfection, cells were harvested and resuspended in LM1 medium. Sorting was performed at room temperature using BD FACS Aria 1 flow cytometer. Living GFP-expressing cells were sorted in 96 wells U-shape culture plates containing 50µL of regular LM1. Non-transfected cells were also sorted to be used as a negative control. Plates were then placed back in incubator at 37 °C, with 5% CO2 (Fig. 3).Identification of FNIP1 KO clones by sequencing30 clones were selected 7 days post-sorting and half of the culture was collected for DNA/RNA extraction with Quick-DNA/RNA™ Microprep Plus Kit (Zymo) according to the manufacturer’s protocol. The amplified DNA fragments (with FNIP1-S1/FNIP1-R3 primers (Table 3)) were cloned into the pCR™4-TOPO® TA vector (TOPO™ TA Cloning™ Kit for Sequencing without competent cells—Invitrogen). We selected a clone presenting a frame shift leading to an early stop codon on both alleles for subsequent transcriptomics analysis.Single-cell RT-qPCR analysisIndividual cells from clones transfected with the pcrFNIP#9 vector (FNIP1 KO cells) or the empty vector clones (wild type cells) were sorted into 96-well plates using BD FACS Aria 1 flow cytometer. All the manipulations related to the high-throughput scRT-qPCR experiments in microfluidics were performed according to the protocol recommended by the Fluidigm company (PN 68000088 K1, p.157–172). All steps from single-cell isolation, gene selection, data generation by scRT-qPCR are described in details in [19].

TopoDoE: a design of experiment strategy for selection and refinement in ensembles of executable gene regulatory networks | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis