Thesis Format
Monograph
Degree
Master of Science
Program
Pathology and Laboratory Medicine
Abstract
Detecting risk groups in transmission networks can be difficult when a virus has a high transmission rate. We hypothesize that this problem can be resolved by community detection methods. Community detection is a clustering method based on edge density, which can break a connected component into multiple smaller clusters. This project develops a framework to find more informative clusters of virus sequences by applying community detection methods to transmission networks of HIV-1 sequences from Beijing and Tennessee, and to a global dataset of SARS-CoV-2 sequences. We label the sequences with the most recent sample collection date as “new cases” and the remaining sequences as “known cases”. We then measure the difference in Akaike information criterion (AIC) between two Poisson regression models. Using this framework, we determine that the HIV-1 dataset from Beijing favors a higher distance threshold than the dataset from Tennessee, and that in the SARS-CoV-2 transmission network, some pairs of countries (e.g., England and Portugal) are more strongly associated than expected by chance.
Summary for Lay Audience
Identifying risk groups among infections can be difficult in the study of virus epidemiology. A transmission network is a graph-based method for describing the relationships among infections: two sequences are considered connected if their difference (e.g., pairwise genetic distance) falls below a given threshold. A transmission network can be partitioned into several connected components, or clusters. A connected component is a subgraph in which every node, representing an infection, can be reached from every other node through a path of edges. Previous research on transmission networks has focused on HIV-1 due to its rapid evolution. The method can also be applied to other viruses, such as SARS-CoV-2. However, because SARS-CoV-2 spreads rapidly, component-based clustering is not able to detect informative clusters among a large number of infections separated by a small number of mutations. We hypothesize that this problem can be resolved by community detection methods. Community detection is another clustering method based on edge density, such that infections within a community share more edges with each other than with infections in other communities. This project develops a framework to find more informative clusters of virus sequences by applying community detection methods to networks built from pairwise distances in three datasets: HIV-1 sequence data from Beijing, HIV-1 sequence data from Tennessee, and global SARS-CoV-2 sequence data. We observe a higher optimal threshold with community detection methods, which allows us to include more cases in the model than with connected component-based clustering. Using this framework, we determine that the HIV-1 dataset from Beijing favors a higher distance threshold than the dataset from Tennessee. In the SARS-CoV-2 transmission network, some pairs of countries (e.g., England and Portugal) are more strongly associated than expected by chance.
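The network construction described above can be illustrated with a minimal Python sketch. It is not the thesis implementation. The toy sequences, the function names, and the use of a simple proportion of mismatched sites as the distance are all assumptions made for illustration; the summary does not specify the genetic distance measure. The sketch covers only thresholding and connected components, not community detection.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences.

    Assumption: a stand-in for whatever genetic distance the real analysis uses.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def build_network(seqs, threshold):
    """Connect two sequences when their distance falls below the threshold."""
    adj = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if mismatch_distance(seqs[u], seqs[v]) < threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def connected_components(adj):
    """Return the connected components of the network as a list of sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy alignment, not real data.
seqs = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACGTACGTTA",
    "D": "TTTTGGGGCC",
}

loose = connected_components(build_network(seqs, threshold=0.15))
strict = connected_components(build_network(seqs, threshold=0.05))
print(sorted(sorted(c) for c in loose))
print(len(strict))
```

In this toy example, A and C are not directly connected at the looser threshold but still fall into one component through B. This is the chaining behaviour that lets components grow large when many infections differ by only a few mutations, and it is the situation in which splitting a component by edge density becomes useful.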
Recommended Citation
Liu, Mo, "Outbreak Detection From Virus Genetic Sequence Variation By Community Detection" (2022). Electronic Thesis and Dissertation Repository. 8781.
https://ir.lib.uwo.ca/etd/8781
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.