
MINING SARS-COV-2 PHYLOGENETIC TREES TO ESTIMATE CIRCULATING INFECTIONS AND PATTERNS OF MIGRATION
Abstract
The SARS-CoV-2 pandemic led to the formation of very large databases of viral genomic data. These databases contain information on the transmission dynamics, emergence and evolution of SARS-CoV-2. However, extracting this information from sequences is difficult, as most methods of analyzing viral genomes were developed for smaller data sets. Therefore, my objective was to develop new, fast estimators of the number of infections (I) and the rate of migration based on simple features of SARS-CoV-2 phylogenies.
I simulated pathogen evolution using a susceptible-exposed-infectious-recovered (SEIR) model of pathogen spread and reconstructed evolution using CoVizu. For simulations of I, I varied the total number of infections at the time the final sample was obtained. For simulations of migration rates, I simulated independent groups of infections and varied the rates of movement between these groups. I then extracted summary statistics from the simulation output and developed general linear models (GLMs) and Markov models to predict I and migration rates, respectively. I evaluated the models using validation data and real SARS-CoV-2 data.
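As an illustration of the kind of SEIR simulation described above, the following is a minimal sketch of a stochastic SEIR epidemic using the Gillespie algorithm. It is not the simulation pipeline used in this work: the function name `simulate_seir`, its parameters (`beta`, `sigma`, `gamma`, `target_infectious`) and all numeric values are assumptions chosen for illustration. The run stops once the number of infectious individuals reaches a chosen target, loosely mirroring the idea of varying the number of infections at the time of the final sample.

```python
import random


def simulate_seir(n, beta, sigma, gamma, e0=0, i0=1,
                  target_infectious=None, max_time=365.0, seed=None):
    """Simulate a stochastic SEIR epidemic with the Gillespie algorithm.

    All names and defaults are hypothetical, for illustration only.
    n: population size; beta: transmission rate; sigma: rate E -> I;
    gamma: rate I -> R. Returns a list of (time, S, E, I, R) tuples.
    """
    rng = random.Random(seed)
    s, e, i, r = n - e0 - i0, e0, i0, 0
    t = 0.0
    history = [(t, s, e, i, r)]
    while t < max_time and (e + i) > 0:
        # Stop once the target number of infectious individuals is reached.
        if target_infectious is not None and i >= target_infectious:
            break
        rate_inf = beta * s * i / n   # S -> E
        rate_prog = sigma * e         # E -> I
        rate_rec = gamma * i          # I -> R
        total = rate_inf + rate_prog + rate_rec
        if total <= 0:
            break
        # Waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total)
        u = rng.random() * total
        if u < rate_inf:
            s, e = s - 1, e + 1
        elif u < rate_inf + rate_prog or i == 0:
            # "i == 0" guards against a floating-point edge case in which
            # a recovery would be chosen with nobody infectious.
            e, i = e - 1, i + 1
        else:
            i, r = i - 1, r + 1
        history.append((t, s, e, i, r))
    return history


if __name__ == "__main__":
    # Illustrative parameter values only.
    run = simulate_seir(n=10_000, beta=0.5, sigma=0.2, gamma=0.1,
                        target_infectious=200, seed=42)
    t_end, s_end, e_end, i_end, r_end = run[-1]
    print(f"stopped at t={t_end:.1f} with I={i_end}, E={e_end}, R={r_end}")
```

A sketch like this yields only compartment counts over time. Producing phylogenies would additionally require tracking who infected whom and sampling lineages, which is outside the scope of this example.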
The GLMs formulated to predict I showed significant promise, especially when there were fewer than 1 million infections. The Markov models developed to predict migration rates were less successful. However, the simulation pipeline formulated to test the Markov models may be used for further development of efficient methods to estimate migration rates.
This research will help inform public health officials on SARS-CoV-2 spread between countries and emerging variants that may become variants of concern. Additionally, the algorithms are flexible and, with new training, may be applied to future outbreaks of novel viral pathogens.