Electronic Thesis and Dissertation Repository

An outcome-based statistical framework to select and optimize molecular clustering methods for infectious diseases

Connor Chato, The University of Western Ontario

Abstract

Collecting genetic sequences from infectious pathogens allows for the construction of molecular clusters: groups of cases with genetically similar pathogen populations. These clusters imply outbreaks, but the methods used to create them often require a threshold to define clusters (e.g., 99% average pairwise sequence identity among cases). This project demonstrates a framework for observing how cluster-based outbreak detection responds to threshold selection, using a range of thresholds, three sets of North American HIV-1 sequence data and two methods of defining clusters. This is done through cross-validation of predictive models, measuring performance by the loss of Akaike's information criterion, a metric that indicates the benefit of predictive variables at a given threshold. I compare the thresholds that maximize this loss across clustering methods and data sets, analyzing the optimum thresholds for clustering at each location.
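
To illustrate why the choice of threshold matters, the following is a minimal sketch of one common way to form threshold-based clusters: link any two cases whose pairwise distance is at or below a cutoff, then take the connected components. It is not the clustering method evaluated in this thesis, which the abstract does not specify. The function names, the simple mismatch-fraction distance and the toy sequences are all assumptions made for illustration. A 99% identity threshold would correspond to a distance cutoff of 0.01 under this distance.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_threshold(seqs, threshold):
    """Link pairs with distance <= threshold and return the connected components.

    `seqs` maps a case label to its aligned sequence (hypothetical input format).
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent pointers to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if mismatch_fraction(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Toy aligned sequences, invented for this example only.
toy = {
    "case1": "ACGTACGTAC",
    "case2": "ACGTACGTAT",
    "case3": "ACGTACGAAT",
    "case4": "TTGTCCGAGG",
}

for t in (0.05, 0.15, 0.55):
    print(t, cluster_by_threshold(toy, t))
```

Running the loop shows the same cases splitting into singletons at a strict cutoff and merging into larger clusters as the cutoff is relaxed, which is the sensitivity to threshold selection that the framework is designed to measure.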