Leveraging Synthetic and Genetic Data to Improve Epidemic Forecasting
Authors: Dave Osthus, Alexander C. Murph, Emma E. Goldberg, Lauren J. Beesley, William M. Fischer, Nidhi K. Parikh, Lauren A. Castro
Date: 2026-03-25
Paper ID: arxiv:2603.24474
Summary
This paper investigates how to improve forecasts of emerging infectious diseases when historical data are scarce, by drawing on synthetic epidemiological data and pathogen genetic information. The authors trained deep learning models on various combinations of real and synthetic data, with and without genetic inputs, to forecast COVID-19 cases across US states. Models trained with synthetic data, and models that incorporated genetic variant information, were more accurate than models trained on historical real data alone. All proposed models outperformed a persistence baseline, and several outperformed the COVIDHub-4_week_ensemble. The authors present the approach as a strategy for next-pandemic preparedness.
Key Contributions
- Demonstrated that deep learning models trained with synthetic data achieve better forecast accuracy for COVID-19 than those trained on real data alone.
- Showed that models incorporating genetic variant information forecast more accurately than models without it.
- Developed models that consistently outperformed a baseline persistence model and often surpassed the performance of the COVIDHub-4_week_ensemble for short-term COVID-19 forecasting.
- Provided a blueprint for using underutilized data sources (synthetic and genetic) to improve forecasting preparedness for future pandemics.
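The persistence baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the error metric (mean absolute error), the 4-step horizon, the function names, and all numbers are assumptions made for the example.

```python
from statistics import mean


def persistence_forecast(history, horizon):
    """Repeat the last observed value for every step of the horizon."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon


def mean_absolute_error(forecast, observed):
    """Average absolute difference between forecast and observation."""
    return mean(abs(f - o) for f, o in zip(forecast, observed))


def relative_mae(model_forecast, baseline_forecast, observed):
    """Ratio of model error to baseline error; below 1 means the model won.

    Undefined (division by zero) if the baseline happens to be exact.
    """
    return (mean_absolute_error(model_forecast, observed)
            / mean_absolute_error(baseline_forecast, observed))


# Made-up weekly case counts, for illustration only.
history = [120, 150, 190, 240]
observed = [300, 360, 410, 450]
baseline = persistence_forecast(history, horizon=4)  # [240, 240, 240, 240]
model = [290, 350, 420, 470]  # hypothetical model output, not a real result

print(relative_mae(model, baseline, observed))
```

A ratio below 1 on a given window says only that the model beat persistence there; a comparison like the one in the paper would aggregate over many dates and locations.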
Limitations
The study focuses specifically on COVID-19, and the generalizability of the synthetic data generation process and genetic integration methods to novel, uncharacterized pathogens requires further validation.
Open Questions & Future Work
- simulator-fidelity-for-deep-learning-forecasting
- incorporating-day-of-week-effects-in-synthetic-data
- extending-context-window-for-seasonality
- joint-geographic-forecasting-with-deep-learning
- joint-forecasting-of-multiple-outcomes
- retraining-and-fine-tuning-strategies
- bridging-real-time-reporting-delay-gap
- improving-forecast-calibration-and-coverage
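For the calibration and coverage item, one common check is the empirical coverage of prediction intervals: the fraction of observations that fall inside the stated interval. The sketch below assumes interval forecasts are available as lower and upper bounds; the function name and the numbers are hypothetical, and this summary does not say how the paper measures coverage.

```python
def empirical_coverage(lower, upper, observed):
    """Fraction of observations that fall inside [lower, upper]."""
    if not observed:
        raise ValueError("observed must not be empty")
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lower, upper, observed))
    return hits / len(observed)


# Made-up interval forecasts and outcomes, for illustration only.
lower = [10, 20, 30, 40]
upper = [20, 30, 40, 50]
observed = [15, 35, 35, 45]

print(empirical_coverage(lower, upper, observed))
```

A well-calibrated nominal 95% interval should cover close to 95% of outcomes over many forecasts; coverage well below the nominal level indicates overconfident intervals.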
Key Concepts
- Synthetic Data Augmentation: The use of artificially generated data, rather than purely real-world observations, to augment training sets for machine learning models, particularly in low-data regimes like emerging epidemics.
- Genetic Information Integration in Forecasting: Incorporating pathogen genetic sequencing data alongside epidemiological time series to enhance the accuracy and lead time of infectious disease forecasts.
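Both concepts can be illustrated with a small sketch. It assumes a deterministic discrete-time SIR model as a stand-in simulator and per-variant sequence counts as the genetic input. Neither is taken from the paper: this summary does not describe the authors' simulator or how genetic data are encoded, so every function name and parameter range below is hypothetical.

```python
import random


def simulate_sir_incidence(beta, gamma, population, initial_infected, steps):
    """Deterministic discrete-time SIR; returns new infections per step."""
    s = population - initial_infected
    i = initial_infected
    incidence = []
    for _ in range(steps):
        new_infections = min(s, beta * s * i / population)
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        incidence.append(new_infections)
    return incidence


def make_windows(series, context, horizon):
    """Slice a series into (context, target) training pairs."""
    return [
        (series[t - context:t], series[t:t + horizon])
        for t in range(context, len(series) - horizon + 1)
    ]


def build_training_set(real_series, n_synthetic, context, horizon, seed=0):
    """Pool windows from one real series with windows from simulated curves."""
    rng = random.Random(seed)
    examples = [(c, y, "real")
                for c, y in make_windows(real_series, context, horizon)]
    for _ in range(n_synthetic):
        # Parameter ranges are arbitrary illustrative choices.
        curve = simulate_sir_incidence(
            beta=rng.uniform(0.2, 0.6),
            gamma=rng.uniform(0.1, 0.2),
            population=1_000_000,
            initial_infected=rng.randint(1, 100),
            steps=60,
        )
        examples += [(c, y, "synthetic")
                     for c, y in make_windows(curve, context, horizon)]
    return examples


def variant_shares(sequence_counts):
    """Convert per-variant sequence counts for one period into proportions."""
    total = sum(sequence_counts.values())
    if total == 0:
        return {name: 0.0 for name in sequence_counts}
    return {name: count / total for name, count in sequence_counts.items()}


# Made-up inputs, for illustration only.
real_series = [5, 8, 13, 21, 34, 55, 80, 110, 140, 160, 170, 165]
training = build_training_set(real_series, n_synthetic=3, context=8, horizon=4)
print(len(training))
print(variant_shares({"variant_a": 30, "variant_b": 10}))
```

With a short real series, almost all training windows come from the simulator, which is the low-data situation the augmentation is meant to address. The variant proportions could be supplied as extra input channels alongside the case-count context.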
Datasets
- COVID-19 cases (US states)
Metadata & Links
- url: https://arxiv.org/abs/2603.24474
- paper_id: 2603.24474
- paper_source: arxiv
- domain: time-series
- tags: time-series, forecasting, dataset, evaluation, anomaly-detection
- architectures:
- datasets: COVID-19 cases (US states)
- skill: TimeSeriesSkill
- created_at: 2026-03-26T06:26:52Z