Tensor product algorithms for inference of contact network from epidemiological data

Forward problem: \({\varepsilon }\)-SIS epidemic on network

We consider the \(\varepsilon\)-SIS model of the contact process, which is a variation of the classical susceptible-infected-susceptible (SIS) model that allows every node to self-infect with rate \(\varepsilon .\) This process was originally proposed by Hill et al. to describe the spread of emotions in a social network [36]. Van Mieghem et al. studied analytical properties of the model for fully connected networks [37, 38]. Zhang et al. extended this study to arbitrary networks and found conditions under which the equilibrium distribution can be accurately approximated by a scaled SIS process, gaining useful insights into the vulnerability of the population [39].

The classical SIS model has an absorbing state in which all nodes are susceptible (i.e. the network is fully healthy), but due to spontaneous self-infections the \(\varepsilon\)-SIS model has no absorbing state, hence the epidemic lasts forever. This property allows us to observe the epidemic for a sufficiently long time and eventually collect a dataset large enough to ensure the required contrast for the Bayesian optimisation.

We consider an \(\varepsilon\)-SIS epidemic on an unweighted simple network \(\mathcal {G}= (\mathcal {V},\mathcal {E}),\) which is a set of nodes (representing people, or agents) \(\mathcal {V}= \{1,2,\ldots ,N\}\) and links (or edges, representing contacts between agents) \(\mathcal {E}= \{(m,n):\, m\in \mathcal {V}, n\in \mathcal {V}, \, m\ne n\}.\) We additionally assume that the contacts are bidirectional, i.e. \((m,n) \in \mathcal {E}\,\Leftrightarrow \, (n,m)\in \mathcal {E},\) which allows us to introduce a symmetric adjacency relation \(m\sim n \,\Leftrightarrow \, (m,n)\in \mathcal {E}\) for the connections.

Each node can be in one of two states, \(x_n \in \mathbb {X}= \{\text {susceptible}, \text {infected}\} = \{0,1\},\) for \(n\in \mathcal {V}.\) The state of the whole network is therefore a vector \(\textrm{x}= \begin{pmatrix}x_1&x_2&\ldots&x_N \end{pmatrix}^T \in \mathbb {X}^N.\) We consider the system dynamics as a continuous-time Markov jump process on the state space \(\Omega =\mathbb {X}^N.\) The following two types of transitions (or reactions), infection and recovery, occur independently and at random according to Poisson processes with the following rates$$\begin{aligned} p_{\textrm{x}\rightarrow \textrm{y}} = {\left\{ \begin{array}{ll} p_{\textrm{x}\rightarrow \textrm{y}}^\text {(inf)}, & \text {if}\, \exists n\in \mathcal {V}:\, \textrm{y}=\textrm{x}+\textrm{e}_n; \\ p_{\textrm{x}\rightarrow \textrm{y}}^\text {(rec)}, & \text {if}\, \exists n\in \mathcal {V}:\, \textrm{y}=\textrm{x}-\textrm{e}_n; \\ 0, & \text {otherwise,} \end{array}\right. } \end{aligned}$$
(1)
where \(\textrm{e}_n\in \mathbb {R}^N\) is the n-th unit vector. For simplicity, we assume that the recovery rates \(p_{\textrm{x}\rightarrow \textrm{y}}^\text {(rec)}=\gamma\) are the same for all nodes of the network. In the classical SIS model, the infection rate for a susceptible node \(x_n=0\) is proportional to the number of its infected neighbours, \(I_n(\textrm{x}) = \left| \{ m\in \mathcal {V}: m\sim n, x_m=1 \} \right| ,\) and the per-contact rate \(\beta ,\) which we also take to be the same across the whole network. In the \(\varepsilon\)-SIS model, an additional infection rate \(\varepsilon\) is introduced to describe possible infection through contacts with the external, off-the-network, environment. Hence, the infection rate is \(p_{\textrm{x}\rightarrow \textrm{y}}^\text {(inf)}=I_n(\textrm{x})\beta +\varepsilon .\) Examples of the Markov transition graph are shown in Fig. 2.

The stochastic properties of the system are described by the probabilities of network states \(p(\textrm{x},t) = {\textsf{P}}(\text {system is in state { x} at time { t}}).\) The system dynamics is governed by a system of ordinary differential equations (ODEs), known as the Markovian master equation [3, 7], the Chapman equation or the forward Kolmogorov equation:$$\begin{aligned} p'(\textrm{x},t) = \sum _{\textrm{y}\in \mathbb {X}^N} \left( p_{\textrm{y}\rightarrow \textrm{x}} \cdot p(\textrm{y},t) - p_{\textrm{x}\rightarrow \textrm{y}}\cdot p(\textrm{x},t) \right) , \qquad \textrm{x}\in \mathbb {X}^N, \end{aligned}$$
(2)
subject to the initial conditions \(p(\textrm{x}_0,0) = 1\) for the initial state \(\textrm{x}=\textrm{x}_0\) and \(p(\textrm{x},0) = 0\) otherwise. The number of ODEs scales as \(2^N,\) making traditional numerical solvers struggle for even moderate values of N.

Fig. 2 Markov transitions between network states: a \({\varepsilon }\)-SIS epidemic on a chain of \(N=3\) people; b \({\varepsilon }\)-SIS epidemic in a fully connected network of \(N=3\) people. On the graph, green arrows denote the recovery process with rate \(\gamma ,\) and red arrows with a circled number k denote the infection process with rate \(k\beta +\varepsilon\)

Inverse problem: Bayesian inference of the network

Our goal is to infer the most probable contact network \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) from the observed data \(\mathcal {X}= \{t_k,\textrm{x}(t_k)\}_{k=0}^K.\) According to the Bayes theorem [40], \({\textsf{P}}(\mathcal {G}| \mathcal {X}) = \frac{{\textsf{P}}(\mathcal {X}| \mathcal {G}) {\textsf{P}}(\mathcal {G})}{{\textsf{P}}(\mathcal {X})},\) where \({\textsf{P}}(\mathcal {G})\) is the prior probability distribution for the grid, \({\textsf{P}}(\mathcal {X}|\mathcal {G})\) is the likelihood of the observed data as a function of the grid \(\mathcal {G},\) \({\textsf{P}}(\mathcal {X})\) is the probability of the observed data, and \({\textsf{P}}(\mathcal {G}|\mathcal {X})\) is the posterior probability of the grid given the data.

The connectivity network \(\mathcal {G}\) can be described by its adjacency matrix \(G = \left[ g_{m,n}\right] _{m,n\in \mathcal {V}}\) with \(g_{m,n} = 1 \,\Leftrightarrow \, (m,n)\in \mathcal {E}\) and \(g_{m,n} = 0\) otherwise. Since \(\mathcal {G}\) is simple, G is a binary symmetric matrix, \(G=G^T,\) with zero diagonal, \(\mathop {\textrm{diag}}\nolimits (G)=0.\) It is easy to see that a grid \(\mathcal {G}\) can be described by \(\tfrac{1}{2} N(N-1)\) independent binary variables \(g_{m,n}\in \mathbb {B}=\{0,1\}\) with \(m,n\in \mathcal {V}\) and \(m>n,\) representing the states (on/off) of possible edges \((m,n)\in \mathcal {E}.\) Therefore, for a fixed number of nodes \(|\mathcal {V}|=N,\) the set of all possible networks \(\mathbb {G}= \{\mathcal {G}: ~g_{m,n}=g_{n,m}\in \mathbb {B}, ~m,n\in \mathcal {V}, ~m>n\}\) has cardinality \(|\mathbb {G}| = 2^{N(N-1)/2}.\) Note that \(\mathbb {G}\) is isomorphic to \(\mathbb {B}^{N(N-1)/2}\), the set of adjacency elements \(g_{m,n}\). The structure of this set is illustrated in Fig. 3.
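To make this parameterisation concrete, here is a minimal sketch (in Python, a language chosen here purely for illustration) that maps a symmetric adjacency matrix with zero diagonal to the vector of \(N(N-1)/2\) independent edge variables and back.

```python
import numpy as np

def adjacency_to_edges(G):
    """Extract the N(N-1)/2 independent edge variables g_{m,n}, m > n,
    from a symmetric binary adjacency matrix with zero diagonal."""
    N = G.shape[0]
    return np.array([G[m, n] for m in range(N) for n in range(m)], dtype=int)

def edges_to_adjacency(g, N):
    """Rebuild the symmetric adjacency matrix from the edge vector."""
    G = np.zeros((N, N), dtype=int)
    k = 0
    for m in range(N):
        for n in range(m):
            G[m, n] = G[n, m] = g[k]
            k += 1
    return G

# Example: a chain 1-2-3 (N = 3), one of the 2^{N(N-1)/2} = 8 possible networks.
G_chain = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]])
g = adjacency_to_edges(G_chain)                    # [1, 0, 1]
assert np.array_equal(edges_to_adjacency(g, 3), G_chain)
```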
Assuming that the nodes are known and no prior information about the edges \(\mathcal {E}\) is available, we take the uniform prior distribution, \({\textsf{P}}(\mathcal {G}) = 2^{-N(N-1)/2}.\) Although \({\textsf{P}}(\mathcal {X})\) is unknown, it does not affect the optimisation problem, since \(\max _{\mathcal {G}} {\textsf{P}}(\mathcal {G}|\mathcal {X}) = \frac{2^{- N(N-1)/2}}{{\textsf{P}}(\mathcal {X})} \max _{\mathcal {G}} {\textsf{P}}(\mathcal {X}|\mathcal {G}),\) hence \(\arg \max _{\mathcal {G}} {\textsf{P}}(\mathcal {G}|\mathcal {X}) = \arg \max _{\mathcal {G}} {\textsf{P}}(\mathcal {X}|\mathcal {G}).\)

Due to the Markovian property of the system dynamics, the likelihood can be expanded as follows,$$\begin{aligned} \begin{aligned} {\textsf{L}}(\mathcal {G})&= {\textsf{P}}(\mathcal {X}| \mathcal {G}) = {\textsf{P}}( X(t_1)=\textrm{x}_1, \cdots , X(t_K)=\textrm{x}_K | \mathcal {G}) \\&= {\textsf{P}}( X(t_1)=\textrm{x}_1 | X(t_0)=\textrm{x}_0, \mathcal {G}) \cdots {\textsf{P}}( X(t_K)=\textrm{x}_K | X(t_{K-1})=\textrm{x}_{K-1}, \mathcal {G}) \\&= \prod _{k=1}^K \underbrace{{\textsf{P}}( X(t_k)=\textrm{x}_k | X(t_{k-1})=\textrm{x}_{k-1}, \mathcal {G})}_{{\textsf{P}}(\textrm{x}_{k-1}\rightarrow \textrm{x}_k | \mathcal {G})}, \end{aligned} \end{aligned}$$
(3)
where \(X(t)\in \mathbb {X}^N\) are random variables describing the network states during its stochastic evolution, and the time sequence \(\{t_k\}_{k=0}^K\) is monotonically increasing. We see that the likelihood is a product of transition probabilities \({\textsf{P}}(\textrm{x}_{k-1}\rightarrow \textrm{x}_k | \mathcal {G}) = {\textsf{P}}( X(t_k)=\textrm{x}_k | X(t_{k-1})=\textrm{x}_{k-1}, \mathcal {G}),\) which are the probabilities for the system to evolve from the state \(\textrm{x}_{k-1}\) to \(\textrm{x}_k\) over the time period \([t_{k-1},t_k].\) The (black-box) Bayesian network inference therefore boils down to likelihood optimisation,$$\begin{aligned} \begin{aligned} \mathcal {G}_{\text {opt}}&= \arg \max _{\mathcal {G}\in \mathbb {G}} \log {\textsf{L}}(\mathcal {G}) = \arg \max _{\mathcal {G}\in \mathbb {G}} \sum _{k=1}^K \log {\textsf{P}}( \textrm{x}_{k-1} \rightarrow \textrm{x}_k | \mathcal {G}). \end{aligned} \end{aligned}$$
(4)
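The optimisation target (4) is simply a sum of log transition probabilities, one per observed interval. A hedged sketch of its evaluation, assuming a hypothetical forward solver transition_probability(x_prev, x_next, dt, G) (the interface is illustrative; the paper does not prescribe one):

```python
import numpy as np

def log_likelihood(data, G, transition_probability):
    """Evaluate log L(G) of (4) for observed data [(t_0, x_0), ..., (t_K, x_K)].

    `transition_probability` is a user-supplied forward solver returning
    P(x_{k-1} -> x_k | G) over an interval of length dt (e.g. the SSA or the
    tensor product solver discussed below)."""
    logL = 0.0
    for (t_prev, x_prev), (t_next, x_next) in zip(data[:-1], data[1:]):
        p = transition_probability(x_prev, x_next, t_next - t_prev, G)
        if p <= 0.0:
            return -np.inf   # an unresolved transition flattens the likelihood landscape
        logL += np.log(p)
    return logL
```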
Fig. 3 The set of all possible networks with \(N=3\) nodes is a binary hypercube in dimension \(\tfrac{1}{2} N(N-1)=3\)

To compute a single log-likelihood \(\log {\textsf{L}}(\mathcal {G})\) in (4), we need to solve K forward problems, i.e. estimate the probability of arriving at the state \(\textrm{x}_k\) from the state \(\textrm{x}_{k-1}\) over the time period \(t\in [t_{k-1},t_k]\) for \(k=1,\ldots ,K.\) To find the optimal network \(\mathcal {G}_{\text {opt}}\), we may need to compute a large number of log-likelihoods for different networks \(\mathcal {G},\) hence the efficiency of the forward solver is crucial to make the optimisation procedure feasible. This rules out the possibility of solving (2) directly, due to the curse of dimensionality.

Stochastic simulation algorithms for forward problem

Traditionally, probabilities \(p(\textrm{x},t)\) are estimated using the Stochastic Simulation Algorithm (SSA) [8], or more efficient (e.g. multilevel) Monte Carlo simulations of realisations of the model [41,42,43,44]. Essentially, these methods sample \(N_{\textrm{SSA}}\) random walks through the state space \(\Omega =\mathbb {X}^N\) starting at the initial state \(\textrm{x}_{k-1},\) count the number \(n_{\textrm{SSA}}\) of trajectories that end up in the state \(\textrm{x}_k\) by the time \(t_k,\) and estimate the target probability as a frequency,$$\begin{aligned} {\textsf{P}}( \textrm{x}_{k-1} \rightarrow \textrm{x}_k | \mathcal {G}) \approx {\tilde{{\textsf{P}}}}( \textrm{x}_{k-1} \rightarrow \textrm{x}_k | \mathcal {G}) = \frac{n_{\textrm{SSA}}}{N_{\textrm{SSA}}}. \end{aligned}$$
(5)
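A minimal Gillespie-type SSA sketch estimating one transition probability of the \(\varepsilon\)-SIS model as in (5); the function name and interface are illustrative assumptions, not the paper's code:

```python
import numpy as np

def ssa_transition_probability(x0, x1, dt, G, beta, gamma, eps,
                               n_ssa=10_000, rng=None):
    """Estimate P(x0 -> x1 over time dt | G) by the frequency estimator (5)."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(x0)
    hits = 0
    for _ in range(n_ssa):
        x, t = np.array(x0, dtype=int), 0.0
        while True:
            infected_neighbours = G @ x
            # per-node rates: gamma for infected nodes, eps + beta*I_n for susceptible ones
            rates = np.where(x == 1, gamma, eps + beta * infected_neighbours)
            total = rates.sum()
            t += rng.exponential(1.0 / total)
            if t > dt:
                break
            n = rng.choice(N, p=rates / total)   # which node reacts next
            x[n] = 1 - x[n]                      # toggle its state (infect or recover)
        hits += int(np.array_equal(x, x1))
    return hits / n_ssa
```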
The cost of sampling each trajectory does not grow exponentially with N, which makes these methods free from the curse of dimensionality. However, the convergence of these algorithms is not particularly fast, with typical estimates$$\begin{aligned} \textrm{err} = \left| {\textsf{P}}( \textrm{x}_{k-1} \rightarrow \textrm{x}_k | \mathcal {G}) - {\tilde{{\textsf{P}}}}( \textrm{x}_{k-1} \rightarrow \textrm{x}_k | \mathcal {G}) \right| \leqslant c N_{\textrm{SSA}}^{-\delta }, \end{aligned}$$with \(0.5\leqslant \delta \leqslant 1\) depending on the particular method. This is usually sufficient to estimate large probabilities and the main statistics (such as the mean and variance) of the process. However, if the probability \(p = {\textsf{P}}( \textrm{x}_{k-1} \rightarrow \textrm{x}_k | \mathcal {G})\) is small, then to ensure the desired relative precision \(| p - {\tilde{p}} | \leqslant \epsilon p,\) the number of samples should satisfy \(c N_{\textrm{SSA}}^{-\delta } \leqslant \epsilon p,\) or \(N_{\textrm{SSA}}\geqslant (\epsilon p / c)^{-1/\delta }.\) In practice this means that estimating rare events with probabilities \(p\lesssim 10^{-6}\) with these algorithms can be prohibitively expensive.

If the computational budget of \(N_{\textrm{SSA}}\) trajectories is exhausted and none of them arrived at \(\textrm{x}_k\), then \(n_{\textrm{SSA}}=0\) and \({\tilde{p}}=0,\) i.e. the event is not resolved by the algorithm and is considered impossible. If this happens for at least one transition from state \(\textrm{x}_{k-1}\) to \(\textrm{x}_k\) for some \(k=1,\ldots ,K,\) then the whole likelihood \({\textsf{L}}(\mathcal {G})=0\) for the given network \(\mathcal {G}.\) In practical computations this issue can occur for most networks \(\mathcal {G}\) other than the ‘ground truth’ network \(\mathcal {G}_\star\) and its close neighbours. The limited computational budget for the forward problem therefore leads to a flattening of the high-dimensional landscape of the likelihoods, wiping out the structural information that should be used to navigate the optimisation algorithm towards the solution of (4). This motivates the development of more accurate methods for the forward problem, such as the tensor product approach that we discuss in the following subsection.

Tensor product algorithms for forward problem

To find the probabilities composing the likelihood (4), we can solve the system of ODEs (2) for the probabilities of the network states. This system is commonly known as the chemical master equation (CME) and consists of \(|\mathbb {X}^N|=2^N\) equations and unknowns, hence traditional solvers suffer from the curse of dimensionality. To mitigate this problem, various approaches have been used, including sparse grids [45], adaptive finite state projections [46,47,48], radial basis functions [49], neural networks [16, 17], and tensor product approximations, such as the canonical polyadic (CP) format [50,51,52] and, more recently, the tensor train (TT) format [18, 53,54,55,56,57,58,59,60]. Here we briefly describe the tensor train approach used in our recent paper [18].

First, we note that among the \(2^N\) reaction rates \(p_{\textrm{y}\rightarrow \textrm{x}}\) in (2) only 2N are non-zero, according to (1):$$\begin{aligned} \begin{aligned} p'(\textrm{x},t)&= \sum _{n=1}^N \left( p^\text {(inf)}_{(\textrm{x}-\textrm{e}_n)\rightarrow \textrm{x}} p(\textrm{x}-\textrm{e}_n,t) + p^\text {(rec)}_{(\textrm{x}+\textrm{e}_n)\rightarrow \textrm{x}} p(\textrm{x}+\textrm{e}_n,t) \right. \\ &\left.
\qquad \qquad \qquad - \left( p^\text {(inf)}_{\textrm{x}\rightarrow (\textrm{x}+\textrm{e}_n)}+ p^\text {(rec)}_{\textrm{x}\rightarrow (\textrm{x}-\textrm{e}_n)}\right) p(\textrm{x},t) \right) , \end{aligned} \end{aligned}$$
(6)
for \(\textrm{x}\in \mathbb {X}^N.\) Let us now uncover the tensor product structure of the matrix of this CME. For this we assume that the \(2^N\) network states \(\textrm{x}\in \mathbb {X}^N\) are ordered lexicographically, i.e. the state \(\textrm{x}=(x_1,\cdots ,x_N)^T\) has index$$\begin{aligned} {\overline{\textrm{x}}} = \overline{x_1x_2\ldots x_N} = 2^{N-1} x_1 + 2^{N-2} x_2 + \cdots + 2^0 x_N, \end{aligned}$$which corresponds to how binary numbers are written in big-endian notation, e.g. \({\overline{000}} = 0,\) \({\overline{001}} = 1,\) \({\overline{010}} = 2,\) \({\overline{100}} = 4,\) \({\overline{101}} = 5,\) \({\overline{111}} = 7,\) see also Fig. 4. When the probability distribution function (p.d.f.) vector \(\textbf{p}(t)=\begin{bmatrix}p(\textrm{x},t)\end{bmatrix}_{\textrm{x}\in \mathbb {X}^N}\) is composed, we place the probability \(p(\textrm{x},t)\) in position \({\overline{\textrm{x}}}\), assuming a 0-based indexing scheme, i.e. vector indices enumerated from 0 up to \(2^N-1.\) If a 1-based indexing scheme is used, we place \(p(\textrm{x},t)\) in position \({\overline{\textrm{x}}}+1.\)

Fig. 4 The tensor product structure of the recovery transitions \([ p^\text {(rec)}_{\textrm{x}\rightarrow (\textrm{x}-\textrm{e}_n)} ]_{\textrm{x}\in \mathbb {X}^N}\) is illustrated for a population of \(N=3\) people. Recovery takes place on individual nodes and hence does not depend on the contact network. In each panel, highlighted states \(\textrm{x}\) are those where \(p^\text {(rec)}_{\textrm{x}\rightarrow (\textrm{x}-\textrm{e}_n)} = \gamma\), indicating that person n is infected and can recover; this is also shown by green arrows. Non-highlighted states correspond to \(p^\text {(rec)}_{\textrm{x}\rightarrow (\textrm{x}-\textrm{e}_n)} = 0\). a \(n=1\), b \(n=2\), c \(n=3\)

Using the indicator function$$\begin{aligned} \textbf{1}_{\text {condition }} = {\left\{ \begin{array}{ll} 1, & \text {if condition is true} \\ 0, & \text {if condition is false}, \end{array}\right. } \end{aligned}$$we can write \(p^\text {(rec)}_{\textrm{x}\rightarrow (\textrm{x}-\textrm{e}_n)} = \gamma \textbf{1}_{x_n=1},\) and \(p^\text {(inf)}_{\textrm{x}\rightarrow (\textrm{x}+\textrm{e}_n)} = (\varepsilon + \beta \sum _{m\sim n}\textbf{1}_{x_m=1})\textbf{1}_{x_n=0}.\) Collecting these reaction rates in vectors of size \(2^N\), and using the big-endian lexicographic ordering explained above, we obtain the tensor product decomposition$$\begin{aligned} \begin{aligned} \begin{bmatrix} p^\text {(rec)}_{\textrm{x}\rightarrow (\textrm{x}-\textrm{e}_n)} \end{bmatrix}_{\textrm{x}\in \mathbb {X}^N}&= \gamma \vec {e}\otimes \cdots \otimes \vec {e}\otimes \vec {i}\otimes \vec {e}\otimes \cdots \otimes \vec {e}, \\ \end{aligned} \end{aligned}$$
(7)
where \(\vec {i}=\begin{pmatrix}0&1\end{pmatrix}^T\) appears in position n and \(\vec {e}=\begin{pmatrix}1&1\end{pmatrix}^T\) appears elsewhere. The tensor product structure is illustrated for \(N=3\) in Fig. 4. For example, the full vector of recovery transitions corresponding to recovery of person \(n=1\) is$$\begin{aligned} \begin{aligned} \begin{bmatrix} p^\text {(rec)}_{\textrm{x}\rightarrow (\textrm{x}-\textrm{e}_1)} \end{bmatrix}_{\textrm{x}\in \mathbb {X}^N}&= \gamma \vec {i}\otimes \vec {e}\otimes \vec {e}\\&= \gamma \begin{pmatrix}0&1\end{pmatrix}^T \otimes \begin{pmatrix}1&1\end{pmatrix}^T \otimes \begin{pmatrix}1&1\end{pmatrix}^T \\&= \gamma \begin{pmatrix}0&0&0&0&1&1&1&1\end{pmatrix}^T. \end{aligned} \end{aligned}$$
(8)
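Equation (8) can be checked directly with Kronecker products; a minimal numerical sketch for \(N=3\) (0-based positions follow the big-endian index \({\overline{\textrm{x}}}\)):

```python
import numpy as np
from functools import reduce

e = np.array([1, 1])   # \vec{e}
i = np.array([0, 1])   # \vec{i}: node is infected
gamma = 1.0
N = 3

def kron_all(vectors):
    """Kronecker product of length-2 vectors, big-endian ordering of states."""
    return reduce(np.kron, vectors)

for n in range(N):      # recovery rate vector for person n+1, cf. Fig. 4
    factors = [i if k == n else e for k in range(N)]
    print(n + 1, gamma * kron_all(factors))
# person 1: [0 0 0 0 1 1 1 1] -> states 100, 101, 110, 111 (indices 4..7), as in (8)
```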
We can see that this transition is possible from states \(\textrm{x}=\begin{pmatrix}x_1&x_2&x_3\end{pmatrix}\) with \(x_1=1,\) which are located in positions \({\overline{100}} = 4,\) \({\overline{101}} = 5,\) \({\overline{110}} = 6,\) \({\overline{111}} = 7\) of the vector (assuming 0-based indexing), in agreement with Eq. (8) and Fig. 4a. Similarly,$$\begin{aligned} \begin{aligned} \begin{bmatrix} p^\text {(inf)}_{\textrm{x}\rightarrow (\textrm{x}+\textrm{e}_n)} \end{bmatrix}_{\textrm{x}\in \mathbb {X}^N}&= \varepsilon \vec {e}\otimes \cdots \otimes \vec {e}\otimes \vec {s}\otimes \vec {e}\otimes \cdots \otimes \vec {e}\\ &\quad + \beta \sum _{m\sim n} \vec {e}\otimes \cdots \otimes \vec {e}\otimes \vec {s}\otimes \vec {e}\otimes \cdots \otimes \vec {e}\otimes \vec {i}\otimes \vec {e}\otimes \cdots \otimes \vec {e}, \end{aligned} \end{aligned}$$
(9)
where \(\vec {s}=\begin{pmatrix}1&0\end{pmatrix}^T\) appears in position n, \(\vec {i}=\begin{pmatrix}0&1\end{pmatrix}^T\) appears in position m (one term for each neighbour \(m\sim n\)), and \(\vec {e}=\begin{pmatrix}1&1\end{pmatrix}^T\) appears elsewhere. To complete the expansion of the right-hand side of (6), we need to express the probability of the shifted state, \(p(\textrm{x}-\textrm{e}_n,t),\) as a sum over probabilities \(p(\textrm{y},t)\) as follows$$\begin{aligned} \begin{aligned} p(\textrm{x}-\textrm{e}_n,t)&= \sum _{\textrm{y}\in \mathbb {X}^N} {1}_{x_1=y_1} \cdots {1}_{x_{n-1}=y_{n-1}} \cdot {1}_{x_n-1=y_n} \cdot {1}_{x_{n+1}=y_{n+1}} \cdots {1}_{x_N=y_N} \cdot p(\textrm{y},t), \\ \begin{bmatrix} p(\textrm{x}-\textrm{e}_n,t) \end{bmatrix}_{\textrm{x}\in \mathbb {X}^N}&= \underbrace{\left( \textrm{Id}\otimes \cdots \otimes \textrm{Id}\otimes J^T \otimes \textrm{Id}\otimes \cdots \otimes \textrm{Id}\right) }_{\textbf{J}_n^T} \textbf{p}(t), \end{aligned} \end{aligned}$$
(10)
where the shift matrix \(J^T=\left( {\begin{matrix} 0& 0\\ 1 & 0\end{matrix}}\right)\) appears in position n,  and identity matrices \(\textrm{Id}=\left( {\begin{matrix} 1 & 0\\ 0& 1 \end{matrix}}\right)\) appear elsewhere. Similarly, \(\begin{bmatrix} p(\textrm{x}+\textrm{e}_n,t) \end{bmatrix}_{\textrm{x}\in \mathbb {X}^N} = \textbf{J}_n \textbf{p}\) with \(\textbf{J}_n = \textrm{Id}\otimes \cdots \otimes \textrm{Id}\otimes J \otimes \textrm{Id}\otimes \cdots \otimes \textrm{Id}.\) Combining the above, we collect all equations of (6) in a vectorised CME$$\begin{aligned} \textbf{p}'(t) = \textbf{A}\textbf{p}(t), \qquad \textbf{p}(0)=\textbf{p}_0, \end{aligned}$$
(11)
where the \(2^N\times 2^N\) matrix \(\textbf{A}\) admits the following tensor product form:$$\begin{aligned} \begin{aligned} \textbf{A}&= \gamma \sum _{n=1}^N \textrm{Id}\otimes \cdots \otimes \textrm{Id}\otimes (J-\textrm{Id}) {\hat{I}} \otimes \textrm{Id}\otimes \cdots \otimes \textrm{Id}\\ &\quad + \varepsilon \sum _{n=1}^N \textrm{Id}\otimes \cdots \otimes \textrm{Id}\otimes (J^T-\textrm{Id}) {\hat{S}} \otimes \textrm{Id}\otimes \cdots \otimes \textrm{Id}\\ &\quad + \beta \sum _{n=1}^N \sum _{m\sim n} \textrm{Id}\otimes \cdots \otimes \textrm{Id}\otimes (J^T-\textrm{Id}) {\hat{S}} \otimes \textrm{Id}\otimes \cdots \otimes \textrm{Id}\otimes {\hat{I}} \otimes \textrm{Id}\otimes \cdots \otimes \textrm{Id}, \end{aligned} \end{aligned}$$
(12)
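A hedged reference sketch that assembles the dense CME matrix \(\textbf{A}\) from the sum of Kronecker products in (12) for a small network, using the \(2\times 2\) blocks \({\hat{I}}=\mathop {\textrm{diag}}(0,1)\) and \({\hat{S}}=\mathop {\textrm{diag}}(1,0)\) spelled out just below. Forming the full \(2^N\times 2^N\) matrix is only feasible for small N; the tensor product machinery of [18] never forms it explicitly.

```python
import numpy as np
from functools import reduce

Id   = np.eye(2)
J    = np.array([[0., 1.], [0., 0.]])   # J, so that J^T is the shift matrix in (10)
Ihat = np.diag([0., 1.])                # \hat{I} = diag(0, 1)
Shat = np.diag([1., 0.])                # \hat{S} = diag(1, 0)

def kron_chain(blocks):
    return reduce(np.kron, blocks)

def cme_matrix(G, beta, gamma, eps):
    """Dense CME matrix A of (12) for adjacency matrix G (small-N reference sketch)."""
    N = G.shape[0]
    A = np.zeros((2**N, 2**N))
    for n in range(N):
        # recovery of node n
        A += gamma * kron_chain([(J - Id) @ Ihat if k == n else Id for k in range(N)])
        # self-infection of node n
        A += eps * kron_chain([(J.T - Id) @ Shat if k == n else Id for k in range(N)])
        # infection of node n by each infected neighbour m
        for m in np.flatnonzero(G[n]):
            blocks = [Id] * N
            blocks[n] = (J.T - Id) @ Shat
            blocks[m] = Ihat
            A += beta * kron_chain(blocks)
    return A

# sanity check on a chain of N = 3: columns of a CME generator sum to zero
G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A = cme_matrix(G, beta=1.0, gamma=0.5, eps=0.01)
assert np.allclose(A.sum(axis=0), 0.0)
```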
Here, \({\hat{I}} = \mathop {\textrm{diag}}\nolimits (\vec {i}) = \left( {\begin{matrix} 0& 0\\ 0& 1 \end{matrix}}\right)\) and \({\hat{S}} = \mathop {\textrm{diag}}\nolimits (\vec {s}) = \left( {\begin{matrix} 1 & 0\\ 0& 0\end{matrix}}\right) .\) This so-called canonical polyadic (CP) [61,62,63] form represents the CME matrix \(\textbf{A}\) as a sum of \((2N+|\mathcal {E}|)\) elementary tensors, each of which is a direct product of N small \(2\times 2\) matrices, each acting on a single node of the network only. Hence, the storage for \(\textbf{A}\) reduces from \(\mathcal {O}(2^N \langle k \rangle )\) down to \(\mathcal {O}((2N+\langle k \rangle )N)\) elements, where \(\langle k \rangle = |\mathcal {E}|/|\mathcal {V}|\) denotes the average degree of \(\mathcal {G}.\) The curse of dimensionality for the matrix is therefore removed.

To remove the exponential complexity in solving (11), we need to achieve a similar compression for the p.d.f. \(\textbf{p}(t)=[p(\textrm{x},t)]_{\textrm{x}\in \mathbb {X}^N},\) for which we employ the tensor train (TT) format [64]:$$\begin{aligned} \textbf{p}\approx {\tilde{\textbf{p}}} = \sum _{\alpha _0,\ldots ,\alpha _N=1}^{r_0,\ldots ,r_N} \textrm{p}^{(1)}_{\alpha _0,\alpha _1} \otimes \cdots \otimes \textrm{p}^{(n)}_{\alpha _{n-1},\alpha _n} \otimes \cdots \otimes \textrm{p}^{(N)}_{\alpha _{N-1},\alpha _N}. \end{aligned}$$
(13)
Here, the \(r_{n-1} \times 2 \times r_n\) factors \(\textrm{p}^{(n)} = \left[ \textrm{p}^{(n)}_{\alpha _{n-1},\alpha _n}(x_n) \right] ,\) \(n=1,\ldots ,N,\) are called TT cores, and the ranges of the summation indices \(r_0,\ldots ,r_N\) are called TT ranks. Each core \(\textrm{p}^{(n)}\) contains information related to node n in the network, and the summation indices \(\alpha _{n-1},\alpha _n\) of core \(\textrm{p}^{(n)}\) link it to cores \(\textrm{p}^{(n-1)}\) and \(\textrm{p}^{(n+1)}.\) The matrix-vector multiplication can be performed fully in tensor product format. Using recently proposed algorithms [55, 56], the linear system of ODEs (11) can be solved fully in the TT format, avoiding the curse of dimensionality, as explained in detail in [18].

Ordering of network nodes for faster forward problem solving

The TT decomposition of probability functions exhibits low ranks when distant (with respect to their position in the state vector) variables are weakly correlated; see e.g. [65] for a rigorous analysis of this for the multivariate normal probability density function. Numerical approaches to ordering the variables in such a way include a greedy complexity optimisation over a reduced space of permutations [66, 67], using gradients to compute a Fisher-type information matrix and its eigenvalue decomposition to sort or rotate the variables [68], and sorting the variables according to the Fiedler vector of the network [69]. Since in our case the variables are discrete, we adopt the latter approach.

We consider the Laplacian matrix of the network \(\mathcal {G},\) defined as follows$$\begin{aligned} L = \mathop {\textrm{diag}}\nolimits ( G \textrm{e}) - G, \end{aligned}$$
(14)
where \(G\in \mathbb {B}^{N\times N}\) is the adjacency matrix of \(\mathcal {G},\) and \(\textrm{e}= \begin{pmatrix} 1&1&\cdots&1 \end{pmatrix}^T \in \mathbb {B}^N.\) We are particularly interested in the Laplacian spectrum of \(\mathcal {G},\) i.e. solutions to the eigenvalue problem \(L \textrm{u}= \lambda \textrm{u}.\) Since \(G=G^T\) we also have \(L=L^T,\) hence the spectrum is real, \(\lambda \in \mathbb {R}.\) We also note that \(L=\left[ \ell _{m,n}\right] _{m,n\in \mathcal {V}}\) is diagonally dominant, since \(\ell _{m,m}= \sum _{n\in \mathcal {V}} g_{m,n} \geqslant 0,\) and \(\ell _{m,n} = -g_{m,n} \leqslant 0\) for \(m\ne n,\) hence \(|\ell _{m,m}| = \sum _{n\ne m} |\ell _{m,n}|.\) Since \(\mathop {\textrm{diag}}\nolimits (L)\geqslant 0\) and L is symmetric and diagonally dominant, the Laplacian is positive semi-definite, hence its eigenvalues are nonnegative, \(\lambda _{N-1} \geqslant \lambda _{N-2} \geqslant \ldots \geqslant \lambda _1 \geqslant \lambda _0 \geqslant 0.\) It is easy to see that \(L\textrm{e}= 0,\) hence \(\lambda _0=0\) is the lowest eigenvalue with the corresponding eigenvector \(\textrm{u}_0=\textrm{e}.\)

The second eigenvalue, which we denote \(\lambda _1,\) is called the algebraic connectivity of \(\mathcal {G}\), or the Fiedler value. It has been known since [70] that \(\lambda _1=0\) if and only if \(\mathcal {G}\) is disconnected. This is a particularly easy scenario for epidemiological modelling. If \(\mathcal {G}\) consists of two disjoint networks, \(\mathcal {G}_1\) and \(\mathcal {G}_2,\) then nodes from \(\mathcal {G}_1\) and \(\mathcal {G}_2\) cannot affect each other. The random variables associated with those nodes are therefore independent, i.e. if \(X_1\in \mathcal {G}_1\) and \(X_2\in \mathcal {G}_2\) then \({\textsf{P}}(X_1, X_2) = {\textsf{P}}(X_1) {\textsf{P}}(X_2).\) This means that if we order the variables \(x_1,\ldots ,x_N\) in such a way that the first block \(x_1,\ldots ,x_m\in \mathcal {G}_1\) and the second block \(x_{m+1},\ldots ,x_N \in \mathcal {G}_2\) do not overlap, then the TT format (13) for the p.d.f. \(p(x_1,\ldots ,x_N)\) will have TT rank \(r_m=1\) for the connection separating \(\mathcal {G}_1\) and \(\mathcal {G}_2\).

These geometric properties of the network can be estimated from the eigenvector \(\textrm{v}= \textrm{u}_1,\) also known as the Fiedler vector. In the pioneering paper [71] it was related to finding an optimal cut in the network \(\mathcal {G}.\) This was generalised to reducing the envelope (or the bandwidth) of the adjacency matrix G in [72], and later to finding orderings of variables for which the TT decomposition (13), and the related MPS and DMRG representations in quantum physics [73], have lower TT ranks [69]. The Fiedler vector can be computed by minimising the Rayleigh quotient, \(\textrm{v}= \arg \min _{\textrm{v}\perp \textrm{e}} \textrm{v}^T L \textrm{v}/ \Vert \textrm{v}\Vert ^2,\) also known as the Courant minimax principle, orthogonally to \(\textrm{u}_0=\textrm{e}.\) Following [72], we use the Fiedler vector to define a one-dimensional embedding of the graph into a linear chain. Let \(\sigma \in \mathfrak {S}_N\) be the permutation of the set of nodes \(\mathcal {V}\) such that \(\textrm{v}_\sigma\) is sorted, i.e. \(\textrm{v}_{\sigma _1}\geqslant \textrm{v}_{\sigma _2}\geqslant \cdots \geqslant \textrm{v}_{\sigma _N}\) (the ascending order can be used equivalently). Following [69], the same permutation of variables also reduces the TT ranks of the TT decomposition (13).
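A minimal sketch of this reordering step using a dense symmetric eigensolver; for larger networks one would typically switch to a sparse eigensolver such as scipy.sparse.linalg.eigsh (an assumption for illustration, not a prescription of the paper).

```python
import numpy as np

def fiedler_permutation(G):
    """Node ordering obtained by sorting the Fiedler vector of the graph Laplacian (14)."""
    L = np.diag(G.sum(axis=1)) - G      # L = diag(G e) - G
    lam, U = np.linalg.eigh(L)          # symmetric eigensolver, ascending eigenvalues
    v = U[:, 1]                         # Fiedler vector u_1 (eigenvalue lambda_1)
    return np.argsort(v)                # permutation sigma sorting v

# Two triangles {0,2,4} and {1,3,5} joined by a single bridge (4,1),
# deliberately labelled in an interleaved way.
G = np.zeros((6, 6), dtype=int)
for a, b in [(0, 2), (2, 4), (0, 4), (1, 3), (3, 5), (1, 5), (4, 1)]:
    G[a, b] = G[b, a] = 1
sigma = fiedler_permutation(G)
# sigma places each triangle in a contiguous block (up to the arbitrary sign of the
# Fiedler vector), which keeps the TT ranks of (13) low across the cut between the blocks.
print(sigma)
```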
In particular, this permutation groups together variables corresponding to independent subnetworks. Hence, we compute this permutation and adopt its order of variables before solving the forward problem (11) using tensor product algorithms.

Algorithms for Bayesian inverse problem

Since a full grid search is infeasible for even moderate networks, we adopt the Metropolis-Hastings Markov Chain Monte Carlo (MCMC) method to approach the maximum of \({\textsf{L}}(\mathcal {G})\). The method is depicted in Algorithm 1. We need to choose a proposal distribution \(q({\hat{\mathcal {G}}} | \mathcal {G})\) which is tractable for sampling a new state of the network \({\hat{\mathcal {G}}}\), given the current network \(\mathcal {G}\). In each iteration, given the current network \(\mathcal {G}_i\), we sample a new proposal \({\hat{\mathcal {G}}}\) and accept or reject it with probability based on the Metropolis-Hastings ratio, forming a Markov chain of network configurations \(\mathcal {G}_0,\mathcal {G}_1,\ldots\) After the Markov chain is computed, we return the sample of this chain with the maximal likelihood.

Algorithm 1 MCMC algorithm for the likelihood maximisation

This algorithm is known to converge to the true distribution \(\frac{1}{Z}{\textsf{L}}(\mathcal {G})\) (where Z is the normalising constant) under mild assumptions [74]. We implement two proposal distributions.

1.

Choose one link in \(\mathcal {G}_i\) uniformly at random and toggle its state (on/off). Since there are \(N(N-1)/2\) possible links to toggle, \(q({\hat{\mathcal {G}}} | \mathcal {G}) = \frac{2}{N(N-1)}\) independently of \({\hat{\mathcal {G}}}\) and \(\mathcal {G}\), and hence cancels in the Metropolis-Hastings ratio. We will call Algorithm 1 with this proposal “MCMC-R”, since it samples the links with Replacement.

2.

Every \(N(N-1)/2\) iterations (i.e. when \(\mod (i,N(N-1)/2)=0\)), sample a random permutation vector \(\sigma \in \mathfrak {S}_{N(N-1)/2}\) of the set \(\{1,\ldots ,N(N-1)/2\},\) and in the next \(N(N-1)/2\) iterations toggle links in the order prescribed by \(\sigma\). This is still a valid MCMC algorithm with a constant proposal distribution, but now with respect to \(\sigma\), which corresponds to a collection of networks in consecutive update steps, \((\mathcal {G}_{i+1}, \ldots , \mathcal {G}_{i+N(N-1)/2})\). In our numerical experiments reported in Sect. “Results” this algorithm sometimes increased the likelihood faster in terms of individual link changes (and hence computing time), and resulted in a better grid reconstruction. In each block of \(N(N-1)/2\) iterations, this algorithm proposes link changes without replacement [75]. For this reason, we will call this algorithm “MCMC-noR” (for “no Replacement”). A short sketch of both proposal schemes is given after this list.

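A hedged sketch of the two proposal schemes described in the list above; link indices enumerate the \(N(N-1)/2\) possible edges, and a proposal simply flips one edge of the current adjacency matrix.

```python
import numpy as np

def toggle_link(G, m, n):
    """Return a copy of the adjacency matrix G with the state of link (m, n) flipped."""
    Gp = G.copy()
    Gp[m, n] = Gp[n, m] = 1 - Gp[m, n]
    return Gp

def propose_with_replacement(G, rng):
    """MCMC-R proposal: toggle one of the N(N-1)/2 links, chosen uniformly at random."""
    N = G.shape[0]
    m, n = rng.choice(N, size=2, replace=False)
    return toggle_link(G, m, n)

def link_schedule_without_replacement(N, rng):
    """MCMC-noR schedule: a random permutation of all links, toggled one per iteration."""
    links = [(m, n) for m in range(N) for n in range(m)]
    return [links[j] for j in rng.permutation(len(links))]

# usage sketch
rng = np.random.default_rng(0)
G = np.zeros((5, 5), dtype=int)
G_new = propose_with_replacement(G, rng)
schedule = link_schedule_without_replacement(5, rng)   # 10 links in random order
```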
Cancelling the constant proposal probability, and using log-likelihoods to avoid numerical over- and under-flow errors, we can rewrite the Metropolis-Hastings ratio as \(h({\hat{\mathcal {G}}}, \mathcal {G}_i) = e^{\ln {\textsf{L}}({\hat{\mathcal {G}}}) - \ln {\textsf{L}}(\mathcal {G}_i)}.\) We can further modify this formula to gain more control over the performance of MCMC. Specifically, we introduce the temperature, or tempering parameter, \(\tau\) as follows:$$\begin{aligned} h({\hat{\mathcal {G}}}, \mathcal {G}_i) = \exp \left( -\frac{\ln {\textsf{L}}(\mathcal {G}_i) - \ln {\textsf{L}}({\hat{\mathcal {G}}})}{\tau }\right) . \end{aligned}$$
(15)
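One accept/reject step with the tempered ratio (15), written in terms of log-likelihoods to avoid numerical under- and overflow; the log_likelihood and propose routines are assumed to follow the sketches above.

```python
import numpy as np

def mh_step(G_current, logL_current, propose, log_likelihood, tau=1.0, rng=None):
    """One Metropolis-Hastings step with the tempered acceptance ratio (15)."""
    rng = np.random.default_rng() if rng is None else rng
    G_new = propose(G_current, rng)
    logL_new = log_likelihood(G_new)
    # accept with probability min(1, exp((ln L(G_new) - ln L(G_current)) / tau))
    if np.log(rng.uniform()) < (logL_new - logL_current) / tau:
        return G_new, logL_new
    return G_current, logL_current
```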
We note that formula (15) resembles the Boltzmann distribution, also known as the Gibbs distribution, which is used in statistical and theoretical physics and describes the probability of observing particles in the ground and excited energy states. Similarly, the parameter \(\tau\) controls how often MCMC accepts a network with a worse likelihood than the current one, which in turn affects the convergence of the algorithm and its ability to get out of local optima. If a new grid \({\hat{\mathcal {G}}}\) has a better likelihood than the current grid \(\mathcal {G}_i,\) it is always accepted. Otherwise, it is accepted only with probability \(p={\textsf{L}}({\hat{\mathcal {G}}})/{\textsf{L}}(\mathcal {G}_i)<1\) for \(\tau =1\). For \(\tau >1,\) this probability becomes \(p^{1/\tau } > p,\) which increases the probability that the new grid is accepted, encouraging MCMC to escape a local maximum.

Choosing initial guess for optimisation

A good initial guess \(\mathcal {G}_0\) for the contact network can significantly improve the computational efficiency, both by reducing the number of steps required by the optimisation algorithm to converge towards the optimum \(\mathcal {G}_{\text {opt}},\) and by simplifying the forward problems, hence reducing the computational time required to evaluate each likelihood \({\textsf{L}}(\mathcal {G})\) in (4). Here we present a simple algorithm to generate an initial guess using the given nodal time series data \(\mathcal {X}= \{t_k,\textrm{x}_k\}_{k=0}^K.\) By comparing each state \(\textrm{x}_k\) against the previous one \(\textrm{x}_{k-1}\) for \(k=1,\ldots ,K\) node-by-node, we observe events of two possible types: infected nodes becoming susceptible (recovery), and susceptible nodes becoming infected (infection). Recoveries are single-node events and provide no information on the network connectivity. In contrast, infections are two-node events that occur when a susceptible node is connected to an infected node. Therefore, any node m that was or became infected during \(t\in [t_{k-1},t_k]\) could have infected any connected susceptible node n that became infected during the same time interval.

Thus, we compute the connectivity scores \(h_{m,n}\) for all \(m,n\in \mathcal {V}\) as follows$$\begin{aligned} h_{m,n} = \sum _{k=1}^K h_{m,n}^{(k)}, \quad \text {with}\quad h_{m,n}^{(k)} = {\left\{ \begin{array}{ll} \tfrac{1}{|\mathcal {I}_k|}, & \text {if}\, x_{n,k-1}=0, \, x_{n,k}=1, \,\text {and}\, m\in \mathcal {I}_k; \\ 0, & \text {otherwise}, \end{array}\right. } \end{aligned}$$
(16)
where \(\mathcal {I}_k = \{ m\in \mathcal {V}: x_{m,k-1}=1 \,\text {or}\, x_{m,k}=1 \}\) is the set of all nodes infected at the beginning or by the end of the interval \(t\in [t_{k-1},t_k].\) The higher the acquired score \(h_{m,n},\) the stronger the evidence that \(m\sim n\) in the contact network. Hence, once the scores are calculated, we can sample an initial guess network \(\mathcal {G}_0\) randomly, with the probability of each link proportional to its score. Alternatively, for a more deterministic approach, we can discretise the distribution and set \(m\sim n\) in \(\mathcal {G}_0\) for all links (m, n) for which the score exceeds the average, \(h_{m,n} \geqslant \frac{2}{N(N-1)} \sum _{i>j} h_{i,j}.\)
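A minimal sketch of the connectivity scores (16) and the deterministic thresholding described above; X is assumed to be a \((K+1)\times N\) array of observed binary states, one row per observation time.

```python
import numpy as np

def connectivity_scores(X):
    """Scores h_{m,n} of (16) from observed states X[k, n] in {0, 1}."""
    K1, N = X.shape
    h = np.zeros((N, N))
    for k in range(1, K1):
        infected = np.flatnonzero((X[k - 1] == 1) | (X[k] == 1))       # set I_k
        newly_infected = np.flatnonzero((X[k - 1] == 0) & (X[k] == 1))
        if len(infected) == 0:
            continue
        for n in newly_infected:
            h[infected, n] += 1.0 / len(infected)
    return h

def initial_guess(X):
    """Deterministic initial network: keep links whose score is at least the average."""
    h = connectivity_scores(X)
    N = h.shape[0]
    il = np.tril_indices(N, k=-1)            # pairs (m, n) with m > n
    threshold = h[il].mean()                 # (2 / (N(N-1))) * sum_{i>j} h_{i,j}
    G0 = np.zeros((N, N), dtype=int)
    G0[il] = (h[il] >= threshold).astype(int)
    return G0 | G0.T
```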
