An adaptive weight ensemble approach to forecast influenza activity in an irregular seasonality context

Influenza epidemics in Hong Kong

In Hong Kong, an indicator combining the weekly influenza-like-illness consultation rate and the proportion of respiratory specimens positive for influenza, denoted ILI+, serves as the gold standard measure of influenza activity22. From 1998 to 2019, 32 epidemics occurred: 20 in winter (November to April) and 12 in summer (May to October). Winter epidemics occurred in every year except 2013 and 2016. The start week of epidemics varied (Fig. 1), ranging from week 5 (29 Nov) to week 21 (21 Mar) for winter epidemics and from week 27 (7 May) to week 47 (18 Sep) for summer epidemics.

Fig. 1: Influenza activity in Hong Kong. A Influenza activity (ILI+) in Hong Kong from 1998 to 2019, showing the influenza seasons and the training and testing periods. Blue dotted vertical lines indicate the start of epidemics. B Influenza trends within a year in Hong Kong. The epidemic year is defined as running from November of the preceding year to October of the current year.

A 1-week delay exists in ILI+ reporting (i.e., at week t, ILI+ is only available up to week t−1). We mimic real-time analysis, accounting for this delay, to generate nowcasts and forecasts up to 8 weeks ahead. In 2009, the sentinel surveillance system was affected by the establishment of Special Designated Flu Clinics (not part of the sentinel network) in response to the H1N1 influenza pandemic. In our previous study, we estimated that these special clinics inflated ILI+ values threefold in 200923. We therefore exclude 2009 forecasts from all evaluations, because the data-generating process changed in that year from sentinel surveillance alone to a combination of sentinel surveillance and Special Designated Flu Clinics. When such changes in the data-generating process occur, our methods are not applicable to the affected periods.

Overview of nowcasting and forecasting of ILI+

Given the irregular seasonality of influenza activity in Hong Kong and the potential for influenza activity in summer, we generate nowcasts and forecasts of ILI+ throughout the year. This approach contrasts with season-based methods, which forecast influenza activity only from December to May in regions with regular winter seasonality and no summer influenza activity.

We first develop various statistical models (Appendix Section 1) to nowcast and forecast ILI+ up to 8 weeks ahead, incorporating epidemiological predictors such as past ILI+ values from the previous 14 weeks, week number, and month of the year to reflect seasonality, and meteorological data including weekly temperature, temperature range, absolute and relative humidity, rainfall, solar radiation, wind speed, and atmospheric pressure (Appendix Section 2). Predictions from these individual models are then combined into a single ensemble forecast. Eight models are tested: the Autoregressive Integrated Moving Average model (ARIMA), the Generalized Autoregressive Conditional Heteroskedasticity model (GARCH), the Random Forest model (RF)24, the Extreme Gradient Boosting model (XGB)25, the Long Short-Term Memory network model (LSTM)26, the Gated Recurrent Units network model (GRU)27, a Transformer-based time series framework (TSTPlus)28, and an ensemble of deep Convolutional Neural Network (CNN) models called InceptionTime Plus (InTimePlus)29. We use January 1998–October 2007 as the training period and November 2007–July 2019 as the test period.
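For illustration, a minimal sketch of how such a predictor matrix could be assembled from weekly data is shown below. The data frame layout and column names (e.g., ili_plus, mean_temp) are hypothetical and not the exact implementation used in this study.

```python
import pandas as pd

def build_features(df: pd.DataFrame, n_lags: int = 14) -> pd.DataFrame:
    """Assemble a predictor matrix for one forecast origin.

    Assumes `df` is a weekly data frame with a DatetimeIndex, an 'ili_plus'
    column, and already-selected meteorological columns (hypothetical names).
    """
    feats = pd.DataFrame(index=df.index)
    # Epidemiological predictors: ILI+ over the previous 14 weeks.
    for lag in range(1, n_lags + 1):
        feats[f"ili_plus_lag{lag}"] = df["ili_plus"].shift(lag)
    # Calendar predictors reflecting seasonality.
    feats["week_of_year"] = df.index.isocalendar().week.values
    feats["month"] = df.index.month
    # Meteorological predictors, available up to the current week t.
    met_cols = ["mean_temp", "temp_range", "abs_humidity", "rainfall", "solar_radiation"]
    for col in met_cols:
        feats[col] = df[col]
    return feats.dropna()
```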
To avoid collinearity among meteorological predictors and to determine the optimal lag, we compute the Pearson correlation between ILI+ and each predictor at varying lags during the training period, selecting 5 of the 8 meteorological predictors (see Methods). Epidemiological predictors are included in all models. Individual models are fitted on the selected predictors and the observed ILI+ data, and forecasts are generated with a rolling method: at each week t during the test period (Nov 2007–Jul 2019), we use meteorological data up to week t and ILI+ data up to week t−1 to make forecasts for weeks t to t+8. The statistical models (ARIMA and GARCH) are retrained weekly with the most recent available data. The other models (machine learning and deep learning) are retrained only in the first week of November each year, as weekly updates incur higher training costs with no performance improvement (Fig. S1).

We then generate ensemble forecasts from the individual models using a simple averaging of the top two models and a model blending method that weights models based on past performance18,30,31. We also introduce a time-adaptive decay weighting scheme, in which performance on more recent data contributes more heavily to the estimation of each model's weight. For comparison with the ensemble and individual models, we consider a baseline "constant" null model, in which ILI+ from weeks t to t+8 remains the same as ILI+ at week t−1.

Prediction intervals

Most machine learning and deep learning approaches do not provide prediction intervals, and our trials with Monte Carlo Dropout (MCDropout) show that MCDropout may generate worse point forecasts than those obtained without it (Fig. S2). To address this limitation, we extend a previous approach that uses a normal distribution with the point forecast as the mean and a standard deviation (SD) calculated from residuals over a rolling 20-week window32. Instead of fixing a 20-week window, we use the training data to determine the optimal window length for each prediction horizon and model (see Methods); a sketch of this construction is given after Fig. 2. To ensure fair comparisons, we use the same method to generate prediction intervals for the baseline and statistical models.

Evaluation metrics

We evaluate and compare the performance of individual and ensemble models during the test period (Nov 2007–Jul 2019). We primarily use the root mean square error (RMSE), symmetric mean absolute percentage error (SMAPE), and weighted interval score (WIS) to compare models, and we also report the mean absolute error (MAE) and mean absolute percentage error (MAPE) (Appendix Section 3).

Performance of individual models in nowcasting and forecasting

Eight individual models are used to forecast ILI+ up to an 8-week horizon. Most of these models broadly capture ILI+ dynamics (Figs. S3 and S4). Across all nine horizons (0 to 8 weeks ahead), all models outperform the baseline constant model, reducing RMSE by 23%–29%, SMAPE by 17%–22%, and WIS by 25%–31% (Fig. 2). The improvements of the individual models become more apparent as the prediction horizon lengthens: all models outperform the baseline, reducing RMSE by 22%–31% for 4-week-ahead forecasts and by 33%–37% for 8-week-ahead forecasts (Table S1).

Fig. 2: Performance comparison of individual and ensemble models over the testing period by prediction horizon. A–E show RMSE, MAE, WIS, SMAPE, and MAPE, respectively. F shows the numerical values of each metric relative to the baseline. Models: ARIMA Autoregressive Integrated Moving Average model, GARCH Generalized Autoregressive Conditional Heteroskedasticity model, RF Random Forest, XGB Extreme Gradient Boosting, InTimePlus InceptionTime Plus model, LSTM Long Short-Term Memory network, GRU Gated Recurrent Units network, TSTPlus Transformer-based framework for multivariate time series representation learning, SAE Simple Average Ensemble model, NBE Normal Blending Ensemble model, AWAE Adaptive Weighted Average Ensemble model, AWBE Adaptive Weighted Blending Ensemble model.
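As described above, prediction intervals for models without native uncertainty estimates are obtained from a normal distribution centred on the point forecast, with the SD estimated from recent forecast residuals over a rolling window whose length is tuned per model and horizon. The following is a minimal sketch of that construction under these assumptions; the function and variable names are illustrative rather than the study's exact code.

```python
import numpy as np
from scipy import stats

def normal_prediction_interval(point_forecast: float,
                               recent_residuals: np.ndarray,
                               window: int,
                               level: float = 0.90):
    """Prediction interval from a normal distribution centred on the point
    forecast, with SD estimated from residuals over a rolling window.

    `window` is the horizon- and model-specific length tuned on the training
    data (20 weeks in the original fixed-window approach).
    """
    resid = recent_residuals[-window:]      # keep only the most recent residuals
    sd = np.std(resid, ddof=1)              # sample SD of the residuals
    z = stats.norm.ppf(0.5 + level / 2)     # ~1.645 for a 90% interval
    return point_forecast - z * sd, point_forecast + z * sd
```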
We then assess model performance during distinct epidemic phases (Fig. 3). The best-performing model varies across phases, with LSTM, InTimePlus, and RF performing best in the growth, plateau, and decline phases, respectively, reducing RMSE by 24%, 31%, and 26% compared with the baseline model. We also evaluate model performance by season (winter or summer) (Figs. 3 and S5) and find that the best-performing model in one season may not perform as well in the other: the RF model performs best in winter but worst among all individual models in summer, whereas the InTimePlus model performs best in summer but worst in winter. Since the relative skill of different types of models depends on the epidemic phase and the season, an ensemble is well-positioned to improve overall performance.

Fig. 3: Trajectories of the forecasts of the ensemble models. Red, yellow, and blue indicate the point forecasts at 0, 4, and 8 weeks ahead, with 90% prediction intervals shown as shaded areas of the corresponding colors.

Performance of ensemble models in nowcasting and forecasting

We first examine two ensemble approaches: the simple average ensemble (SAE) and the normal blending ensemble (NBE). For the SAE model, at each week t we select the best-performing individual models based on the RMSE calculated using data up to week t−1 and take the unweighted mean of their forecasts; two models are used, a number chosen by comparing the RMSE of ensembles built from different numbers of top models during the training period (Figs. S6 and S7). For the NBE model, we fit a LASSO regression of the observed ILI+ up to week t−1 on the corresponding out-of-sample predictions from all individual models, and use the regression coefficients as model weights for the nowcast and forecast periods. The weights of individual models are therefore not constrained to sum to one and may be negative. To further improve the forecasts, we apply an exponential time decay when averaging or blending, assigning the highest weight to the most recent week (t−1); these variants are named the Adaptive Weighted Average Ensemble (AWAE) and the Adaptive Weighted Blending Ensemble (AWBE) models, respectively. Based on the training data, we set the decay rate to 0.384 in the AWAE and AWBE models, so that the weight decays to 0.01 by 12 weeks before the week in which the forecast is made (Figs. S8 and S9); a minimal sketch of this weighting appears below.

Compared with the baseline constant model, the SAE and NBE models reduce RMSE by 27%–29%, SMAPE by 19%–21%, and WIS by 28%–29%, comparable to the best individual models and without a clear advantage. Performance improves further when decaying weights based on recent performance are used: the AWAE and AWBE ensemble models reduce RMSE by 39% and 52%, SMAPE by 35% and 43%, and WIS by 42% and 53%, respectively, demonstrating that adaptive weighting can further improve forecasting performance.
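To illustrate the adaptive weighting used in AWAE and AWBE, the sketch below computes exponential decay weights with the reported decay rate and uses them to form a recency-weighted RMSE for scoring models. This is an illustrative sketch, not the exact implementation; the function and variable names are hypothetical.

```python
import numpy as np

def decay_weights(n_weeks: int, rate: float = 0.384) -> np.ndarray:
    """Exponential decay weights for the `n_weeks` most recent weeks.

    Index 0 corresponds to the most recent week (t-1); with rate 0.384 the
    raw weight exp(-rate * 12) is about 0.01 at 12 weeks before the forecast week.
    """
    ages = np.arange(n_weeks)          # 0 = week t-1, 1 = week t-2, ...
    w = np.exp(-rate * ages)
    return w / w.sum()                 # normalise so the weights sum to 1

def weighted_rmse(errors: np.ndarray, rate: float = 0.384) -> float:
    """Recency-weighted RMSE of a model's past forecast errors
    (most recent error first), used to rank or weight models adaptively."""
    w = decay_weights(len(errors), rate)
    return float(np.sqrt(np.sum(w * errors ** 2)))
```

In the blending variant, analogous weights could serve as observation weights in the LASSO fit (for example via the sample_weight argument of scikit-learn's Lasso.fit in recent versions), so that recent weeks dominate the coefficient estimates.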
Evaluating performance by forecast horizon, the improvements of AWAE and AWBE are more apparent at longer prediction horizons (Tables S1 and S2, Figs. 2 and 4). In addition, the AWAE and AWBE ensembles consistently improve forecast accuracy across epidemic phases and in both summer and winter, compared with the SAE and NBE models and with all individual models (Fig. 3).

Fig. 4: Model performance by epidemic phases and seasons. All metrics are relative to the baseline model and include RMSE, SMAPE, MAE, WIS, and MAPE. A Model performance for distinct epidemic phases, with black dashed lines representing the best individual models. B Different stages of the Hong Kong influenza data distinguished by color, with the gray dashed line indicating the outbreak threshold. C Model performance in the winter and summer seasons.

Model performance in predicting the occurrence of epidemics

Although our framework is designed and optimized to generate 0–8-week-ahead forecasts, it can also provide predictions of epidemic occurrence, peak timing, and peak magnitude (Appendix Section 3.3). We follow the evaluation criteria of a previous study in Hong Kong13 (details in the Methods section). For predicting the occurrence of outbreaks within the next eight weeks, the accuracy, sensitivity (TPR), specificity (SPC), positive predictive value (PPV), and negative predictive value (NPV) of the AWAE and AWBE models are higher than those of the baseline and individual models, with all five indicators equal to 0.9. Compared with previous predictions for Hong Kong17, our AWAE and AWBE models demonstrate better performance in terms of NPV and comparable performance on all other indicators (Fig. S10). Regarding peak timing (Fig. S11), our individual models perform better than the previous work17 for 0-week-ahead forecasts under the stricter criteria and for 0–4-week-ahead forecasts under the looser criteria, with comparable accuracy at other horizons; the ensemble models do not always outperform the individual models. Regarding peak magnitude, the AWBE model demonstrates better accuracy than the previous work17 for 1–6-week-ahead forecasts under the stricter criteria and similar performance for 0-week-ahead forecasts (Appendix Section 3.3).

Feature importance

The feature importance map illustrates the contribution of different predictors across models and time horizons, consistent with the forecasting accuracy results and with intuitive model interpretation (Fig. 5). First, the ILI+ value at week t−1 is highly important for all models, as expected, since it carries the most up-to-date information about the current epidemic state. Second, different models use the predictors differently. For the statistical models (ARIMA and GARCH), apart from the ILI+ value at t−1, the meteorological predictors and the ILI+ values at t−2 and earlier are equally and weakly important, with little change across horizons. For the machine learning models (RF and XGB), predictors become more important at longer prediction horizons. The deep learning models exhibit feature importance maps that differ from the other approaches: for LSTM and GRU, epidemiological predictors are more important than meteorological predictors, whereas for InTimePlus and TSTPlus only past ILI+ values are important in nowcasts, with all other predictors becoming more important at longer horizons.
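Permutation importance, used to summarise feature relevance for the TSTPlus and InTimePlus models (Fig. 5), can be sketched in a model-agnostic way as follows. The predict function and data arrays are placeholders assumed for illustration, not the study's code.

```python
import numpy as np

def permutation_importance(predict, X: np.ndarray, y: np.ndarray,
                           n_repeats: int = 10, seed: int = 0) -> np.ndarray:
    """Permutation importance: the increase in RMSE when one feature column
    is randomly shuffled, averaged over `n_repeats` shuffles.

    `predict` is any fitted model's prediction function mapping an
    (n_samples, n_features) array to predicted ILI+ values.
    """
    rng = np.random.default_rng(seed)

    def rmse(a, b):
        return np.sqrt(np.mean((a - b) ** 2))

    baseline = rmse(predict(X), y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])   # break the link between feature j and the target
            scores.append(rmse(predict(Xp), y) - baseline)
        importances[j] = np.mean(scores)
    return importances
```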
We also examine differences in feature importance between the summer and winter seasons. The importance of most features remains consistent, except for 'Minimum of temperature': for the RF, XGB, InTimePlus, and TSTPlus models, this feature is slightly more important for long-horizon winter predictions than for summer predictions (Fig. S12). These differences in how the various models learn from features and lags further support the use of ensemble models and help explain their improvements over individual models.

Fig. 5: Feature importance in nowcasts and in 4-week- and 8-week-ahead forecasts. Importance is measured by average regression coefficients for the ARIMA and GARCH models, average feature importance for the RF and XGB models, average saliency maps for the LSTM and GRU models, and average permutation importance for the TSTPlus and InTimePlus models. Note that the numerical results of the different methods are not directly comparable.

We conduct an ablation test to compare prediction performance for models with and without meteorological predictors (Fig. S13). Overall, the inclusion of meteorological variables does not have a consistent impact on performance; the impact depends on the model, the evaluation metric, and the evaluation period (Appendix Section 2.3).

Required amount of training data and associated computational cost

To determine the amount of training data required, we vary the training period from 1 to 9 years, retrain the models, generate forecasts, and compute the RMSE (Fig. S14). RMSE decreases as the training period lengthens, particularly for the deep learning models. Overall, 7 years of data are needed for performance comparable to models using all available data (relative RMSE < 20%). With only one year of training data, all individual models perform worse than the baseline (Fig. S15).

We also evaluate the computational cost of each model, measured as the training time required to predict a single week's ILI+. The running time of an ensemble model is the sum of the running times of all individual models plus that of the ensemble step. RF and XGB have low training costs regardless of the amount of training data (<1 sec), whereas TSTPlus and InTimePlus have high training costs, increasing from around 5–6 sec with 1 year of training data to around 170–200 sec under the rolling training approach. Overall, one run of training takes at most 196 sec for the most complex deep learning model when using the rolling method. For the ensemble models, all individual models must be run, which takes around 430 sec for 1 year of training data; however, the individual models can be run in parallel, in which case the ensemble models require no more than 200 sec.

Model performance after the COVID-19 pandemic

During 2020–2023, Hong Kong implemented substantial public health and social measures to suppress COVID-19 outbreaks, resulting in zero influenza activity and no epidemics in these three years33,34. Considering the potential impact of co-circulating SARS-CoV-2 on influenza transmission dynamics, we test our approaches in the post-COVID-19 era, using March 2023 to January 2024 as an additional testing period.

Overall, all models except ARIMA and GRU perform better than the baseline model during 2023–2024, although slightly worse than in the pre-COVID era (Table S2, Fig. S16). The 6 of 8 models that improve on the baseline reduce RMSE by 5–17% (Table S3). Ensemble models, except SAE, further enhance predictions, with NBE, AWAE, and AWBE reducing RMSE by 24%, 27%, and 39%, respectively, compared with the baseline (Table S3, Fig. 6).
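All relative skill figures above are reductions with respect to the constant baseline, which carries the last observed ILI+ (week t−1) forward across the whole horizon. A minimal sketch of this comparison, with placeholder names, might look as follows.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def relative_rmse_reduction(y_true: np.ndarray,
                            model_pred: np.ndarray,
                            last_observed: np.ndarray) -> float:
    """Percentage RMSE reduction of a model relative to the constant
    baseline, which carries forward ILI+ at week t-1 for every horizon.

    `last_observed[i]` is the ILI+ value at week t-1 for target i.
    """
    baseline_rmse = rmse(y_true, last_observed)   # constant null model
    model_rmse = rmse(y_true, model_pred)
    return 100.0 * (1.0 - model_rmse / baseline_rmse)
```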
Fig. 6: Forecast trajectories of the adaptive ensemble models during the post-COVID period. Red, yellow, blue, and purple indicate the point forecasts at 0, 2, 4, and 8 weeks ahead, with 90% prediction intervals shown as shaded areas of the corresponding colors. A, B show the results of the Adaptive Weighted Average Ensemble (AWAE) and the Adaptive Weighted Blending Ensemble (AWBE) models, respectively.

