
An outcome-based statistical framework to select and optimize molecular clustering methods for infectious diseases
Abstract
Collecting genetic sequences from infectious pathogens allows for the construction of molecular clusters: groups of cases with genetically similar pathogen populations. Such clusters can indicate outbreaks, but the methods used to build them often require a threshold to define cluster membership (e.g., 99\% average pairwise sequence identity among cases). This project demonstrates a framework for examining how cluster-based outbreak detection responds to threshold selection, using a range of thresholds, three sets of North American HIV-1 sequence data, and two methods of defining clusters. Predictive models are cross-validated, and performance is measured as the loss of Akaike's information criterion (AIC), a metric that indicates the benefit of the predictive variables at a given threshold. I compare the thresholds that maximize this loss across clustering methods and data sets, and analyze the optimal threshold for clustering at each location.
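As a rough illustration of the selection step, the sketch below picks the threshold with the largest AIC loss. It assumes that "loss" means the AIC of a baseline model without the predictive variables minus the AIC of the model that includes them; this sign convention and the baseline model are assumptions, not details stated here. The function names and the toy log-likelihood values are hypothetical and are not results from the HIV-1 data sets.

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def aic_loss(baseline_fit, predictive_fit):
    """AIC of the baseline model minus AIC of the predictive model.

    Each fit is a (log_likelihood, n_params) pair. A positive value means
    the predictive variables lowered the AIC (assumed sign convention).
    """
    return aic(*baseline_fit) - aic(*predictive_fit)


def best_threshold(fits_by_threshold):
    """Return the threshold with the largest AIC loss, plus all losses."""
    losses = {
        threshold: aic_loss(baseline, predictive)
        for threshold, (baseline, predictive) in fits_by_threshold.items()
    }
    return max(losses, key=losses.get), losses


# Made-up fits for three hypothetical distance thresholds:
# threshold -> ((baseline logL, k), (predictive logL, k))
toy_fits = {
    0.005: ((-120.0, 1), (-119.5, 2)),
    0.010: ((-118.0, 1), (-110.0, 2)),
    0.015: ((-117.0, 1), (-114.0, 2)),
}

best, losses = best_threshold(toy_fits)
print(best, losses)
```

In practice each pair of fits would come from models trained and validated on clusters built at that threshold; the selection step itself stays this simple.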