
Outbreak Detection From Virus Genetic Sequence Variation By Community Detection
Abstract
Detecting risk groups in transmission networks can be difficult when a virus has a high transmission rate. We hypothesize that this problem can be resolved by community detection methods. Community detection is a clustering method based on edge density, which can break a connected component into multiple smaller clusters. We develop a framework to find more informative clusters of virus sequences by applying community detection methods to transmission networks of HIV-1 sequences from Beijing and Tennessee, and to a global dataset of SARS-CoV-2 sequences. We label the sequences with the most recent sample collection dates as “new cases” and the remaining sequences as “known cases”. We then measure the difference in the Akaike information criterion (AIC) between two Poisson regression models. Using this framework, we determine that the HIV-1 dataset from Beijing favors a higher distance threshold than the one from Tennessee, and that in the SARS-CoV-2 transmission network, some pairs of countries (e.g., England and Portugal) are associated more strongly than expected by chance.
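To make the two ingredients above concrete, the following is a minimal Python sketch under stated assumptions, not the implementation used in this work. It shows (1) how a pairwise distance threshold determines the clusters of a network, using plain connected components as a stand-in for a community detection method, and (2) how an AIC difference between two Poisson models can be computed. The abstract does not specify the predictors of the two models, so the sketch assumes a null model with one shared expected count per cluster and an alternative model whose expected count is proportional to cluster size. All sequence names, distances, and counts are invented toy values, and the function names `threshold_clusters` and `poisson_aic` are hypothetical.

```python
import math
from collections import defaultdict


def threshold_clusters(distances, threshold):
    """Return connected components of the graph whose edges are the
    sequence pairs with genetic distance <= threshold."""
    nodes = set()
    adj = defaultdict(set)
    for (a, b), d in distances.items():
        nodes.update((a, b))
        if d <= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


def poisson_aic(counts, means, n_params):
    """AIC = 2k - 2 log L for independent Poisson counts with the given means."""
    loglik = sum(
        y * math.log(mu) - mu - math.lgamma(y + 1)
        for y, mu in zip(counts, means)
    )
    return 2 * n_params - 2 * loglik


# Toy pairwise distances between known cases (invented values).
distances = {
    ("A", "B"): 0.01, ("B", "C"): 0.02, ("A", "C"): 0.03,
    ("C", "D"): 0.04, ("D", "E"): 0.01,
}
tight = threshold_clusters(distances, 0.015)  # a low threshold splits the network
loose = threshold_clusters(distances, 0.045)  # a high threshold merges it

# Toy counts of new cases attached to each cluster at the low threshold
# (assumed values, listed in the same order as `tight`).
sizes = [len(c) for c in tight]
new_cases = [3, 0, 2]

# Null model (assumption): every cluster has the same expected count.
shared_mean = sum(new_cases) / len(new_cases)
aic_null = poisson_aic(new_cases, [shared_mean] * len(new_cases), n_params=1)

# Alternative model (assumption): expected count proportional to cluster size.
rate = sum(new_cases) / sum(sizes)
aic_size = poisson_aic(new_cases, [rate * s for s in sizes], n_params=1)

delta_aic = aic_null - aic_size  # positive values favor the size-based model
print(len(tight), len(loose), delta_aic)
```

Repeating the last step over a range of thresholds and comparing the resulting AIC differences is one way to ask which threshold yields the most informative clusters for a given dataset.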