Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article


Doctor of Philosophy


Microbiology and Immunology


Poon, Art FY


Viruses are formidable pathogens that represent the majority of biological entities on our planet, and their genomes are a source of interesting enigmas. Virus genomes are often rich in overlapping reading frames (OvRFs), portions of the genome where the same nucleotide sequence encodes more than one protein. OvRFs are hypothesized to be used by viruses to encode proteins more compactly and to regulate transcription. In addition, OvRFs might be a source of gene novelty, facilitating the creation of new open reading frames (ORFs) within the transcriptional context of existing ones.

To characterize the distribution of OvRFs in viruses, I analyzed 12,609 reference genomes from the NCBI virus database and discovered that, while the number of OvRFs increases with genome length, the overlapping regions tend to be shorter in longer genomes. I also demonstrated that different frameshifts have distinct patterns in OvRFs. For example, +2 frameshifts are predominantly found in dsDNA viruses, whereas +0 frameshifts in RNA viruses tend to involve longer overlaps, which may increase the selective burden on the same nucleotide positions within codons. Further, I retrieved n = 8,586 protein-coding sequences from n = 1,224 reference genomes, and used an alignment-free method to cluster these sequences within virus families. I used these clusters to develop a new network-based representation of the distribution of OvRFs, which provides a means of visualizing and analyzing these genome features for each virus family. I also used these networks to generate a high-level visualization of how overlapping genes are distributed among virus genomes in the same family.
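As an illustration of what an overlap and its frameshift mean in practice, the following minimal Python sketch computes the overlap length and relative frame of two ORFs on the same strand. The `ORF` type, the `overlap` function, the 0-based half-open coordinates, and the labeling of the frameshift as the start offset modulo 3 are all assumptions made for this example; they are not taken from the thesis, whose own frameshift definition (including overlaps on opposite strands) may differ.

```python
from typing import NamedTuple, Optional, Tuple


class ORF(NamedTuple):
    """Hypothetical ORF record; coordinates are 0-based, half-open [start, end)."""
    name: str
    start: int
    end: int


def overlap(a: ORF, b: ORF) -> Optional[Tuple[int, str]]:
    """Return (overlap length in nt, frameshift of b relative to a), or None.

    Assumes both ORFs lie on the same strand. The frameshift label is the
    offset between the two start positions modulo 3 (an assumed convention).
    """
    left = max(a.start, b.start)
    right = min(a.end, b.end)
    if left >= right:
        return None  # the two ORFs do not share any nucleotide
    shift = (b.start - a.start) % 3
    return right - left, f"+{shift}"


# Toy coordinates, not real genome annotations
orf_a = ORF("A", 0, 300)
orf_b = ORF("B", 250, 400)
result = overlap(orf_a, orf_b)
```

Applying such a function to every pair of annotated ORFs in a genome would give the per-genome counts and overlap lengths that a survey of this kind summarizes.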

Evolution in overlapping genes is complicated because a single nucleotide substitution can have different effects in each reading frame that covers it. To unravel the effects of OvRFs on virus evolution, I developed HexSE, a simulation model of nucleotide sequence evolution along a phylogeny. In HexSE, I implemented a customized data structure to efficiently track the substitution rates at every nucleotide site. These rates are determined by the stationary nucleotide frequencies, transition bias, and the distribution of selection biases (dN and dS) in the respective reading frames. Next, I compared HexSE simulations under varying settings to an alignment of actual hepatitis B virus (HBV) genomes, which revealed consistent drops in synonymous substitution rates (dS) in association with overlapping regions of an ORF.
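To make the idea of a per-site rate concrete, here is a minimal Python sketch of how such a rate could be assembled from the three ingredients named above. It is not HexSE's implementation or API. The function `substitution_rate`, the multiplicative form, the parameter names (`pi`, `kappa`, `omegas`, `mu`), and the choice to apply `kappa` as a multiplier on transitions are all assumptions made for illustration. The caller is assumed to supply one selection factor per reading frame covering the site.

```python
from math import prod
from typing import Dict, Sequence

PURINES = {"A", "G"}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (x in PURINES) == (y in PURINES)


def substitution_rate(
    x: str,
    y: str,
    pi: Dict[str, float],      # stationary nucleotide frequencies (assumed input)
    kappa: float,              # transition bias, applied here to transitions
    omegas: Sequence[float],   # one selection factor per overlapping frame
    mu: float = 1.0,           # global rate scaling (hypothetical parameter)
) -> float:
    """Rate of substituting nucleotide x with y at one site (illustrative only)."""
    if x == y:
        raise ValueError("x and y must differ")
    rate = mu * pi[y]
    if is_transition(x, y):
        rate *= kappa
    # A site in an overlap is constrained by every frame that covers it.
    return rate * prod(omegas)


# Toy values: uniform frequencies, a site covered by two reading frames
pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
rate_ag = substitution_rate("A", "G", pi, kappa=2.0, omegas=[0.5, 0.2])
```

Under this sketch, a site covered by two constrained frames receives the product of both selection factors, which shows why substitutions in overlapping regions can be slower than in single-frame regions.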

This thesis explores the cryptic information contained in viral genomes to help explain the evolutionary processes that shape them. In particular, understanding the impact of OvRFs on the evolution of virus genomes will provide us with crucial pieces of a significant puzzle: understanding the origin of new genes in virus genomes, and thereby virus diversity.

Summary for Lay Audience

This research delves into the intriguing world of viruses, which are highly diverse pathogens with genomes that hold clues to the origin of life. Many viral genomes have a type of gene arrangement known as overlapping reading frames (OvRFs), where the same sequence encodes multiple proteins. OvRFs are thought to be used by viruses to increase the amount of information contained in smaller genomes, regulate transcription, and contribute to the creation of new genes.

By analyzing thousands of viral genomes, I found that OvRFs in longer genomes tend to be shorter in length, and that different types of viruses exhibit distinct patterns in OvRFs, with specific frameshift preferences. Additionally, I developed a unique network-based approach to visualize and analyze the OvRF distribution within virus families. Notably, the presence of OvRFs is correlated with network-based statistics for some virus families such as Coronaviridae, Rhabdoviridae, and Papillomaviridae. To explore the evolutionary impact of OvRFs, I also developed a simulation model called HexSE. By simulating nucleotide sequence evolution in hepatitis B virus, I discovered consistent drops in synonymous substitution rates within overlapping gene regions.

This project aims to decipher the cryptic information contained within viral genomes to shed light on the evolutionary processes shaping them. Understanding the role of OvRFs in virus genomes provides valuable insights into the origin of new genes and the diversity of viruses. Overall, this research contributes to our understanding of virus genomes and their significance in the larger context of life’s origin.

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.