Architecture

This paper proposes a stage-sensitive prediction model for infectious disease. As shown in Fig. 1, the model consists of two main modules: data augmentation and prediction. In the data augmentation module, the population time series are first segmented by a breakpoint detection algorithm. The staged population data are then fused with features from the compartmental model and with features describing the prevention and control policies. Importantly, a self-attention mechanism is used to quantify how the transmission capability of the disease varies across stages. In the prediction module, a Bidirectional Gated Recurrent Unit (Bi-GRU) network predicts the number of people likely to be infected over the long term.

In Fig. 1, \(X=\{X_t\}\) represents the population data, that is, the numbers of uninfected, infected, recovered, and dead individuals, where \(0<t<T\) and \(X_t\) is the population data on the t-th day. The time points at which the number of infections changes sharply are detected and used to divide the time series X into stages; these time points are called breakpoints. Correspondingly, the population features of all stages are denoted as \(S=\{S_p\}\), where \(0<p<P\) and \(S_p\) is the population data in stage p. The features from the SIRF model in all stages are denoted as \(Y=\{Y_p\}\). The GI feature \(G=\{G_p\}\) describes the disease prevention and control measures taken by the government in each stage.

To better capture the patterns of virus transmission, the features Y from the SIRF model are fused with the features G reflecting governmental control measures. We employ a self-attention mechanism to learn the transmission relationships among the stages, so that trainable weights are assigned to the time series of the different stages.
These weights help the model reflect the dynamic changes and impact of disease transmission more accurately, thereby improving the prediction.

Finally, a Bi-GRU network learns temporal dependencies from the augmented time series in both the forward and backward directions and predicts the number of infections at a given future time point.

Transmission stage division

For a non-smooth time series, breakpoint detection identifies the positions of abrupt changes in the data. On the infected-population curve, the segment between two consecutive breakpoints is relatively stable, so breakpoints can be used to divide an outbreak into stages.

Because COVID-19 pandemic data are complex and non-stationary, we adopt an adaptive breakpoint detection method that determines the optimal breakpoints. Given a time series \(X=\{X_t\}\) of the infected population, we denote the sub-series between consecutive breakpoints \(t_b\) and \(t_{b+1}\) as \(X_{t_b:t_{b+1}}\). The score of a breakpoint \(t_b\) is \(\tau_b=t_b/T\in (0,1]\), and the set of breakpoint scores is denoted as \(\tau =\{\tau _1,\tau _2,\ldots \}\). The idea is to build a contrast function, \(V\left( \tau ,X\right) =\sum _{b=0}^{P}C(X_{t_b:t_{b+1}})\), that is, the sum of the loss function \(C(\cdot )\) over all segments (under the usual convention \(t_0=0\) and \(t_{P+1}=T\)). The loss function measures the goodness of fit on a segment. To obtain the most accurate segmentation, the loss of each segment should be minimized, which amounts to minimizing the contrast function. The breakpoints are obtained by solving \(\underset{\tau }{\min }\,V(\tau ,X)+pen(\tau )\), where the number of breakpoints P follows from the minimizing \(\tau\), and \(pen\left( \tau \right)\) is a penalty on \(\tau\). The penalty balances the contrast function \(V\left( \tau ,X\right)\) against the complexity of the segmentation: if the penalty is small, the optimization tends to use more segments to reduce V; conversely, if the penalty is large, it tends to use fewer segments.

The mean-shift model provides the simplest and most widely used loss function in breakpoint detection. It assumes that the signal follows a Gaussian distribution with fixed variance and a mean that shifts at the breakpoints. The resulting loss is also known as the quadratic error loss, \(C(X_{t_b:t_{b+1}})=\sum \limits _{t=t_b}^{t_{b+1}}\left\| X_t-\overline{X}_{t_b:t_{b+1}}\right\| _2^2\), where \(\overline{X}_{t_b:t_{b+1}}\) is the empirical mean of the sub-signal \(X_{t_b:t_{b+1}}\).

Finding the optimal breakpoints requires a search method for the optimization. The \(l_0\) penalty, also known as the linear penalty, is the most popular choice. It is defined as \(pen_{l_0}(\tau ):=\beta |\tau |\), where \(\beta >0\) is the smoothing parameter. Intuitively, the smoothing parameter controls the trade-off between complexity and goodness of fit: a low value of \(\beta\) favors segmentations with many breakpoints, while a high value of \(\beta\) discards most of the change points.

The Pruned Exact Linear Time (PELT) algorithm is used to find the exact solution for the penalty term \(pen=pen_{l_0}\). This method considers each sample in order and decides, based on an explicit pruning rule, whether to discard it from the set of potential breakpoints.

Stage weight learning

Once the transmission stages have been identified, a self-attention mechanism can assign a weight to each stage, giving a more accurate representation of its importance and impact [30]. With self-attention, the model learns the correlations and interactions between stages on its own and allocates the weights accordingly.

Self-attention is a variant of the traditional attention mechanism that captures the correlations of input sequences without additional knowledge. It adopts a query-key-value pattern.
Given the input time series of fused features \(Z=\{Z_p\}\) (see Fig. 1), the weight between the sequences of stages p and p+1 is computed as in Eq. (1), where \(0<p<P\) and \(Z_p\) denotes the fused feature sequence of the p-th stage.

$$\begin{aligned} Q_p&=Z_pW_Q \\ K_{p+1}&=Z_{p+1}W_K \\ V_{p+1}&=Z_{p+1}W_V \\ A&=\mathrm{softmax}(Q_pK_{p+1}) \\ H&=AV_{p+1} \end{aligned}$$
(1)
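As a rough illustration of Eq. (1), the sketch below computes the attention output for one pair of stages in plain Python. The helper names (`matmul`, `transpose`, `softmax_rows`, `stage_attention`), the toy dimensions, and the fixed weight values are assumptions made for this example and do not come from the paper. The product \(Q_pK_{p+1}\) is read here as \(Q_pK_{p+1}^{T}\) so that the shapes agree, as in standard attention, and no scaling factor is applied because Eq. (1) does not show one.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def stage_attention(z_p, z_next, w_q, w_k, w_v):
    """Eq. (1): queries from stage p, keys and values from stage p+1."""
    q = matmul(z_p, w_q)
    k = matmul(z_next, w_k)
    v = matmul(z_next, w_v)
    a = softmax_rows(matmul(q, transpose(k)))  # assumption: Q K^T
    h = matmul(a, v)
    return a, h

# Toy input (hypothetical): 2 time steps per stage, 3 fused features, projection size 2.
z_p = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
z_next = [[0.6, 0.5, 0.4], [0.3, 0.2, 0.1]]
w_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_k = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

A, H = stage_attention(z_p, z_next, w_q, w_k, w_v)
```

Each row of `A` sums to one, so each row of `H` is a weighted average of the value vectors of stage p+1. In a trained model the three weight matrices would be learned rather than fixed by hand.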
In Eq. (1), \(W_Q\), \(W_K\) and \(W_V\) are three trainable linear matrices, while \(Q_p\), \(K_{p+1}\) and \(V_{p+1}\) are the matrices of query vectors, key vectors, and value vectors, respectively. The attention weight matrix is denoted as A, and H is the time series matrix after the attention weights have been applied.

Feature fusion

The SIRF model, a variant of the traditional compartmental model, describes the interaction and evolution of the different populations during the spread of an infectious disease. Incorporating prevention and control measures into these interactions therefore affects the predicted spread of the disease. We suppose that S, \(S^*\), I, R, and F represent the susceptible population, the asymptomatic population, the confirmed cases, the recovered population, and the fatalities, respectively. The interaction and evolution of these populations can be represented as \(S\overset{\beta I}{\rightarrow }S^{*}\overset{\alpha _1}{\rightarrow }F\), \(S^{*}\overset{1-\alpha _1}{\rightarrow }I\overset{\gamma }{\rightarrow }R\), and \(I\overset{\alpha _2}{\rightarrow }F\), where \(\alpha _1\), \(\alpha _2\), \(\beta\), and \(\gamma\) are the asymptomatic fatality rate, the confirmed case fatality rate, the effective transmission rate, and the recovery rate, respectively.

The dynamics are governed by \(\frac{dS}{dP}=-N^{-1}\beta SI\), \(\frac{dI}{dP}=N^{-1}(1-\alpha _1)\beta SI-(\gamma +\alpha _2)I\), \(\frac{dR}{dP}=\gamma I\), and \(\frac{dF}{dP}=N^{-1}\alpha _1\beta SI+\alpha _2I\), where \(N=S+I+R+F\) is the total population.
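The four SIRF equations above can be explored numerically with a minimal sketch such as the following. It treats the derivative variable as time and integrates with a forward-Euler step, which is a solver chosen here for simplicity rather than one named in the paper. The function names, the step size, the parameter values, and the initial populations are all hypothetical.

```python
def sirf_derivatives(s, i, r, f, alpha1, alpha2, beta, gamma):
    """Right-hand sides of the four SIRF equations stated above."""
    n = s + i + r + f
    new_cases = beta * s * i / n          # N^{-1} * beta * S * I
    ds = -new_cases
    di = (1 - alpha1) * new_cases - (gamma + alpha2) * i
    dr = gamma * i
    df = alpha1 * new_cases + alpha2 * i
    return ds, di, dr, df

def simulate_sirf(state, params, steps, dt=0.1):
    """Forward-Euler integration (an illustrative choice of solver)."""
    trajectory = [state]
    for _ in range(steps):
        s, i, r, f = trajectory[-1]
        ds, di, dr, df = sirf_derivatives(s, i, r, f, **params)
        trajectory.append((s + dt * ds, i + dt * di, r + dt * dr, f + dt * df))
    return trajectory

# Hypothetical values, not taken from the paper.
params = {"alpha1": 0.01, "alpha2": 0.005, "beta": 0.3, "gamma": 0.1}
initial = (9990.0, 10.0, 0.0, 0.0)        # (S, I, R, F)
path = simulate_sirf(initial, params, steps=500)
final_s, final_i, final_r, final_f = path[-1]
```

Because the four derivatives sum to zero, the total population N stays constant along the trajectory, which gives a quick sanity check on any implementation of these equations.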
Thus, the features from the infectious disease model are denoted as \(SIRF(S,S^{*},F,I,R,\alpha _1,\alpha _2,\gamma ,\beta )\).

In this paper, we introduce 13 GI features into the prediction model to quantify the prevention and control measures: Close Schools, Close Workplaces, Cancel Gatherings, Restrict Gatherings, Close Traffic, Restrict Staying Home, Restrict Domestic Travel, Restrict International Travel, Public Information Campaigns, Testing Policies, Contact Tracing, Infection Detection, and Deflationary Index. For convenience, we denote them as \(G(P_1,P_2,\cdots ,P_{13})\). Finally, we fuse these two kinds of features into \(Z=(Y, G)\) (see Fig. 1).

Spread trend prediction

The Bi-GRU network is well suited to modeling time series. By extracting temporal features from both the period preceding and the period following a given time point, it can improve the prediction accuracy. As shown in Fig. 1, the Bi-GRU network used in the proposed model contains an input layer, a forward hidden layer, a backward hidden layer, and an output layer. At each time step, the input data are passed to both the forward and the backward hidden layer. The output layer combines the representations of the two hidden layers.

The Bi-GRU network receives the weighted time series \(WZ=\{W_p, Z_p\}\) as input, where \(0<p<P\) and \(\{W_p, Z_p\}\) is the input vector of the p-th stage. The feature extraction of the Bi-GRU network can be written as follows:

$$\begin{aligned} z_p&=\sigma (W_zW_pZ_p+U_zh_{p-1}+b_z) \\ r_p&=\sigma (W_rW_pZ_p+U_rh_{p-1}+b_r) \\ \widetilde{h}_p&=\tanh (W_hW_pZ_p+U_h(r_p\circ h_{p-1})+b_h) \\ h_p&=z_p\circ h_{p-1}+(1-z_p)\circ \widetilde{h}_p \end{aligned}$$
(2)
where \(z_p\) and \(r_p\) represent the update gate and the reset gate, respectively; \(\widetilde{h}_p\) represents the candidate hidden state; \(h_{p-1}\) and \(h_p\) represent the hidden states at stages \(p-1\) and p, respectively; \(W_z\), \(W_h\), \(W_r\), \(U_z\), \(U_h\) and \(U_r\) are trainable weights; b is a bias; and \(\sigma\) is the sigmoid function. The equations above show that a GRU can retain or forget previous temporal features through the learned gates \(z_p\) and \(r_p\), which improves the extraction of temporal dependencies from the time series. Running one GRU over the stages in forward order and another in reverse order gives the Bi-GRU network structure:

$$\begin{aligned} \mathbf{h}_p&=GRU(W_pZ_p,\mathbf{h}_{p-1}) \\ \overleftarrow{h}_p&=GRU(W_pZ_p,\overleftarrow{h}_{p+1}) \\ h_p&=f(W_{\mathbf{h}_p}\mathbf{h}_p+W_{\overleftarrow{h}_p}\overleftarrow{h}_p+b_p) \end{aligned}$$
(3)
where \(\mathbf{h}_p\) and \(\overleftarrow{h}_p\) are the states of the forward and backward hidden layers at stage p, respectively; \(W_{\mathbf{h}_p}\) and \(W_{\overleftarrow{h}_p}\) are the weights of the forward and backward hidden layers at stage p, respectively; and \(b_p\) is the bias of the hidden layer at stage p. The Bi-GRU network can thus extract long-term temporal dependencies, thereby improving the time series prediction.
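The gate updates of Eq. (2) and the two-directional pass of Eq. (3) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random initialization, the hidden size, and the toy inputs are hypothetical, each input vector stands for an already weighted stage input \(W_pZ_p\), the backward pass is taken to read the stages in reverse order, and the output combination \(f(\cdot )\) of Eq. (3) is replaced by a simple concatenation of the two states.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update following Eq. (2); x stands for the weighted input W_p Z_p."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev), p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b + c) for a, b, c in
            zip(matvec(p["Wh"], x), matvec(p["Uh"], gated), p["bh"])]
    return [zi * hi + (1 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]

def init_params(n_in, n_hid, rng):
    """Small random weights and zero biases; a hypothetical initialization."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for gate in "zrh":
        p["W" + gate] = mat(n_hid, n_in)
        p["U" + gate] = mat(n_hid, n_hid)
        p["b" + gate] = [0.0] * n_hid
    return p

def bi_gru(xs, fwd, bwd, n_hid):
    """Run a forward and a backward GRU and pair their states per stage."""
    h, forward = [0.0] * n_hid, []
    for x in xs:
        h = gru_step(x, h, fwd)
        forward.append(h)
    h, backward = [0.0] * n_hid, []
    for x in reversed(xs):
        h = gru_step(x, h, bwd)
        backward.append(h)
    backward.reverse()  # align backward states with stage order
    return [f + b for f, b in zip(forward, backward)]  # concatenation, not f(.)

rng = random.Random(0)
xs = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [0.2, 0.6]]  # 4 stages, 2 features
fwd, bwd = init_params(2, 3, rng), init_params(2, 3, rng)
states = bi_gru(xs, fwd, bwd, n_hid=3)
```

Each entry of `states` holds the forward state followed by the backward state for one stage, so the representation of a stage depends on both earlier and later stages. A practical model would learn the weights by backpropagation and add an output layer that maps these states to the predicted number of infections.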