Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Pathology and Laboratory Medicine


Art Poon


Collecting genetic sequences from infectious pathogens allows for the construction of molecular clusters: groups of cases with genetically similar pathogen populations. These clusters can imply outbreaks, but the methods used to create them often require a threshold to define clusters (e.g., 99% average pairwise sequence identity among cases). This project demonstrates a framework for observing how cluster-based outbreak detection responds to threshold selection, using a range of thresholds, three sets of North American HIV-1 sequence data, and two methods of defining clusters. This is done through cross-validation of predictive models, with performance measured by the loss of Akaike's information criterion, a metric that indicates the benefit of predictive variables at a given threshold. I compare the thresholds that maximize this loss across clustering methods and data sets, analyzing the optimum threshold for clustering at each location.
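The threshold selection idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the models used in the thesis: it assumes that clusters have already been built at each candidate threshold, that a null model predicts cluster growth from cluster size alone, and that an alternative model replaces size with a hypothetical informative weight (for example, one that favours recent cases). Both are one-parameter Poisson models with closed-form fits, and the function names and toy numbers are invented for the example.

```python
import math

def poisson_loglik(counts, means):
    """Poisson log-likelihood of observed counts given expected counts."""
    total = 0.0
    for y, mu in zip(counts, means):
        total += -mu - math.lgamma(y + 1)
        if y > 0:  # y * log(mu) is zero when y == 0
            total += y * math.log(mu)
    return total

def aic(loglik, n_params):
    """Akaike's information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * loglik

def proportional_fit(exposure, growth):
    """Fitted means for growth_i ~ Poisson(rate * exposure_i).
    The maximum-likelihood rate is sum(growth) / sum(exposure)."""
    rate = sum(growth) / sum(exposure)
    return [rate * x for x in exposure]

def aic_loss(clusters):
    """AIC(null) - AIC(alternative) for the clusters built at one threshold.
    clusters: list of (size, weight, growth) tuples, one per cluster.
    'weight' is a hypothetical predictor assumed for this sketch."""
    sizes = [c[0] for c in clusters]
    weights = [c[1] for c in clusters]
    growth = [c[2] for c in clusters]
    null_aic = aic(poisson_loglik(growth, proportional_fit(sizes, growth)), 1)
    alt_aic = aic(poisson_loglik(growth, proportional_fit(weights, growth)), 1)
    return null_aic - alt_aic

def best_threshold(clusters_by_threshold):
    """Return the threshold with the largest AIC loss, plus all losses."""
    losses = {t: aic_loss(c) for t, c in clusters_by_threshold.items()}
    return max(losses, key=losses.get), losses

# Made-up cluster tables keyed by distance threshold (illustration only).
toy = {
    0.005: [(2, 1.5, 1), (3, 1.0, 0), (2, 2.0, 2)],
    0.015: [(4, 3.5, 3), (5, 1.0, 0), (3, 2.5, 2)],
    0.030: [(12, 7.0, 5)],  # one giant cluster: both models fit identically
}
best, losses = best_threshold(toy)
print(best, losses)
```

In this toy input the loosest threshold merges every case into a single cluster, so the added predictor cannot improve on the null model and the AIC loss falls to zero. The intermediate threshold gives the largest loss and would be the one selected.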

Summary for Lay Audience

If a fast-evolving virus has little time to mutate in one host before being transmitted to the next, many hosts end up sharing genetically similar viruses. This can be evidence of an outbreak, and such evidence is vital for public health authorities, especially when similar viral sequences are collected from new patients (indicating that the outbreak is growing). Such methods have been used particularly widely for Human Immunodeficiency Virus (HIV), the causative agent of AIDS. However, it is difficult to establish how genetically similar these viruses need to be before a group of cases is labelled an outbreak, and arbitrary thresholds of similarity are often used for this task. A poor choice of threshold can lead to overestimation or underestimation of the outbreak. Furthermore, it may make the predictive models that estimate how the outbreak will grow ineffective. This work presents a statistical method that chooses such a threshold based on how accurately it predicts the growth of outbreaks. Three data sets of HIV genetic sequences, each collected in North America, are used as examples. We applied two of these sequence-based outbreak detection methods and found that the ideal threshold of similarity for predicting outbreak growth differs between locations and methods.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.