Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Pathology and Laboratory Medicine

Abstract

Detecting risk groups in transmission networks can be difficult when a virus has a high transmission rate. We hypothesize that community detection methods can resolve this problem. Community detection is a clustering method based on edge density, which can break a connected component into multiple smaller clusters. This project develops a framework to find more informative clusters of virus sequences by applying community detection methods to transmission networks of HIV-1 sequences from Beijing and Tennessee, and a global dataset of SARS-CoV-2 sequences. We label the sequences with the most recent sample collection date as “new cases” and the remaining sequences as “known cases”. We then measure the difference in Akaike information criterion (AIC) between two Poisson regression models. Using this framework, we determine that the HIV-1 dataset from Beijing favors a higher distance threshold than the Tennessee dataset, and that in the SARS-CoV-2 transmission network, some pairs of countries (e.g., England and Portugal) are more strongly associated than expected by chance.
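
For reference, the AIC comparison mentioned above can be written out using the standard definitions. The abstract does not say which two Poisson models are compared or in which direction the difference is taken, so the labels A and B and the sign convention below are assumptions made only for illustration.

```latex
% Poisson log-likelihood for observed counts y_i with fitted means \mu_i
\ln \hat{L} = \sum_{i} \left( y_i \ln \mu_i - \mu_i - \ln y_i! \right)

% AIC for a model with k estimated parameters
\mathrm{AIC} = 2k - 2 \ln \hat{L}

% Difference between two candidate models (labels A and B are hypothetical)
\Delta\mathrm{AIC} = \mathrm{AIC}_{A} - \mathrm{AIC}_{B}
```

Under this convention, a negative value indicates that model A is preferred over model B after penalizing for the number of parameters.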

Summary for Lay Audience

Identifying risk groups among infections can be difficult in the study of virus epidemiology. A transmission network is a graph-based method to describe the relations among infections by considering pairs of sequences to be connected if their difference (e.g., genetic pairwise difference) falls below a given threshold. A transmission network can be partitioned into several connected components or clusters. A connected component in a network is a subgraph in which every node, representing an infection, can be reached from every other node through a path of edges. Previous research in transmission networks has focused on HIV-1 due to its rapid evolution. This method can also be applied to other viruses, such as SARS-CoV-2. However, due to the rapid transmission rate of SARS-CoV-2, component-based clustering cannot detect informative clusters among a large number of infections that differ by only a small number of mutations. We hypothesize that this problem can be resolved by community detection methods. Community detection is another clustering method based on edge density, such that infections within a community share many edges while few edges run between communities. This project develops a framework to find more informative clusters of virus sequences by applying community detection methods to the network given by pairwise distances from three different datasets: Beijing and Tennessee HIV-1 sequence data and global SARS-CoV-2 sequence data. We observe a higher optimal threshold with community detection methods, which allows us to include more cases in the model than with connected component-based clustering. Using this framework, we determine that the HIV-1 dataset from Beijing favors a higher distance threshold than the Tennessee dataset. In the SARS-CoV-2 transmission network, some pairs of countries (e.g., England and Portugal) are more strongly associated than expected by chance.
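
The contrast between component-based clustering and community detection described above can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the six sequence labels, the distances, the threshold of 0.015, and the choice of a naive greedy modularity-merging routine as the community detection method. The summary does not state which community detection algorithm the thesis uses, so this should not be read as the author's implementation.

```python
from itertools import combinations

def threshold_graph(distances, threshold):
    """Connect two sequences when their pairwise distance falls below threshold."""
    adj = {}
    for (a, b), d in distances.items():
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if d < threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def connected_components(adj):
    """Group nodes that can reach each other through any path of edges."""
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def greedy_modularity(adj):
    """Naive agglomerative clustering: repeatedly merge the pair of
    communities with the largest positive modularity gain."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    comms = [{n} for n in sorted(adj)]
    if m == 0:
        return comms
    while True:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(comms)), 2):
            e_ij = sum(len(adj[n] & comms[j]) for n in comms[i])
            if e_ij == 0:
                continue  # only communities joined by an edge can merge
            d_i = sum(len(adj[n]) for n in comms[i])
            d_j = sum(len(adj[n]) for n in comms[j])
            gain = e_ij / m - d_i * d_j / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            return comms
        i, j = best_pair
        comms[i] |= comms[j]
        del comms[j]

# Hypothetical distances: two tight groups joined by one borderline pair (C, D).
distances = {pair: 0.05 for pair in combinations("ABCDEF", 2)}
for group in ("ABC", "DEF"):
    for pair in combinations(group, 2):
        distances[pair] = 0.010
distances[("C", "D")] = 0.014

adj = threshold_graph(distances, threshold=0.015)
components = connected_components(adj)
communities = greedy_modularity(adj)
print(len(components), "component(s);", len(communities), "communities")
```

In this toy network the single link between C and D is enough to fuse all six sequences into one connected component, whereas the edge-density criterion still separates the two tightly linked groups.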

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
