Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article


Master of Science


Pathology and Laboratory Medicine


Poon, Art


The SARS-CoV-2 pandemic led to the formation of very large databases of genomic viral data. These databases contain information on transmission dynamics, emergence and evolution of SARS-CoV-2. However, extracting this information from sequences is difficult, as most methods of analyzing viral genomes were developed for smaller data sets. Therefore, my objective was to develop new fast estimators of the number of infections (I) and the rate of migration based on simple features of SARS-CoV-2 phylogenies.

I simulated pathogen evolution using a susceptible-exposed-infectious-recovered (SEIR) model of pathogen spread and reconstructed the evolutionary history using CoVizu. For simulations of I, I varied the total number of infections when a final sample was obtained. For simulations of migration rates, I simulated independent groups of infections and varied the rates of movement between these groups. I then extracted summary statistics from the simulation output and developed general linear models (GLMs) and Markov models to predict I and migration rates, respectively. I evaluated the models using validation data and real SARS-CoV-2 data.
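As a rough illustration of the SEIR compartments mentioned above, the following minimal sketch integrates a deterministic SEIR model with simple Euler steps. It is not the simulation pipeline used in this work: the function name `simulate_seir`, the parameter values (`beta`, `sigma`, `gamma`), the population size and the step size are all hypothetical choices made only for illustration, and the sketch does not generate sequences or phylogenies.

```python
def simulate_seir(n=1_000_000, beta=0.3, sigma=0.2, gamma=0.1,
                  i0=10, days=365, dt=0.1):
    """Euler integration of a deterministic SEIR model.

    All defaults are illustrative assumptions, not values from the thesis.
    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    s, e, i, r = float(n - i0), 0.0, float(i0), 0.0
    steps_per_day = round(1 / dt)
    trajectory = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt   # S -> E (transmission)
            new_infectious = sigma * e * dt       # E -> I (end of latency)
            new_recovered = gamma * i * dt        # I -> R (recovery)
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        trajectory.append((s, e, i, r))
    return trajectory


if __name__ == "__main__":
    traj = simulate_seir()
    peak_day = max(range(len(traj)), key=lambda d: traj[d][2])
    print(f"Peak infectious count {traj[peak_day][2]:.0f} on day {peak_day}")
```

In this sketch the number of infections I at a chosen sampling day is simply the third element of that day's tuple. A stochastic, individual-level simulation would be needed to produce the transmission histories from which trees are built.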

The GLMs formulated to predict I showed significant promise, especially when there were fewer than 1 million infections. The Markov models developed to predict migration rates were less successful. However, the simulation pipeline formulated to test the Markov models may be used for further development of efficient methods to estimate migration rates.

This research will help inform public health officials on SARS-CoV-2 spread between countries and emerging variants that may become variants of concern. Additionally, the algorithms are flexible and, with new training, may be applied to future outbreaks of novel viral pathogens.

Summary for Lay Audience

Covid-19 has led to unprecedented production and sharing of viral genetic data sets. These data sets are so large that existing data analysis tools are no longer practical. As a result, scientists have developed novel approaches to use the data to show how viruses are evolving, illustrated using trees. However, new techniques are still needed to extract information about the spread and evolution of viruses from these trees.

In this body of work, I used computer simulations of virus spread to create models which could estimate the number of people infected with Covid-19 and the movement of the virus between countries. These models used the trees created by the existing software CoVizu to make estimates. Information given by my models will be essential in the continued monitoring of the spread of Covid-19 and of any known or new viruses in the future.