
Evolution of overlapping reading frames in virus genomes
Abstract
Viruses are formidable pathogens that represent the majority of biological entities on our planet, and their genomes are a source of interesting enigmas. One feature in which virus genomes are usually rich is the presence of overlapping reading frames (OvRFs): portions of the genome where the same nucleotide sequence encodes more than one protein. OvRFs are hypothesized to be used by viruses to encode proteins more compactly and to regulate transcription. In addition, OvRFs might be a source of gene novelty, facilitating the creation of new open reading frames (ORFs) within the transcriptional context of existing ones.
To characterize the distribution of OvRFs in viruses, I analyzed 12,609 reference genomes from the NCBI virus database and discovered that, while the number of OvRFs increases with genome length, the overlapping regions tend to be shorter in longer genomes. I also demonstrated that different frameshifts have distinct patterns in OvRFs. For example, +2 frameshifts are predominantly found in dsDNA viruses, whereas +0 frameshifts in RNA viruses tend to involve longer overlaps, which may increase the selective burden on the same nucleotide positions within codons. Further, I retrieved n = 8,586 protein-coding sequences from n = 1,224 reference genomes, and used an alignment-free method to cluster these sequences within virus families. I used these clusters to develop a new network-based representation of the distribution of OvRFs, which provides a means of visualizing and analyzing these genome features for each virus family. I also used these networks to generate a high-level visualization of how overlapping genes are distributed among virus genomes in the same family.
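As a minimal sketch of how an overlap between two annotated ORFs could be measured and assigned a frameshift label, consider the function below. It is an illustration under stated assumptions, not the pipeline used in this thesis: the function name `overlap_and_shift` is hypothetical, coordinates are assumed to be 0-based and half-open, and both ORFs are assumed to lie on the same strand, so only the +0, +1, and +2 labels are produced.

```python
def overlap_and_shift(orf_a, orf_b):
    """Return (overlap_length, frameshift_label) for two same-strand ORFs.

    Each ORF is a (start, end) tuple in 0-based, half-open coordinates.
    The frameshift is the offset of the downstream ORF's reading frame
    relative to the upstream ORF's frame, modulo 3 (an assumed convention).
    """
    # Order the ORFs so that `a` starts first (upstream).
    (a_start, a_end), (b_start, b_end) = sorted([orf_a, orf_b])

    # Number of nucleotides shared by both ORFs.
    overlap = max(0, min(a_end, b_end) - b_start)
    if overlap == 0:
        return 0, None

    shift = (b_start - a_start) % 3
    return overlap, f"+{shift}"


# Hypothetical coordinates, chosen only for illustration.
print(overlap_and_shift((0, 30), (20, 50)))   # partial overlap
print(overlap_and_shift((0, 90), (30, 60)))   # nested, same frame
print(overlap_and_shift((0, 30), (40, 70)))   # no overlap
```

Applying a function like this to every pair of coding sequences in a genome would yield the per-genome counts and lengths of overlaps; opposite-strand overlaps would need an additional convention that is not shown here.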
Evolution in overlapping genes is complicated because a single nucleotide substitution can have a different effect in each reading frame that covers it. To unravel the effects of OvRFs on virus evolution, I developed HexSE, a simulation model of nucleotide sequence evolution along a phylogeny. In HexSE, I implemented a customized data structure to efficiently track the substitution rates at every nucleotide site. These rates are determined by the stationary nucleotide frequencies, transition bias, and the distribution of selection biases (dN and dS) in the respective reading frames. Next, I compared HexSE simulations under varying settings to an alignment of actual hepatitis B virus (HBV) genomes, which revealed consistent drops in synonymous substitution rates (dS) in association with overlapping regions of an ORF.
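To make the idea of a site-specific rate under multiple reading frames concrete, the following sketch computes a relative rate for one substitution from the nucleotide frequencies, a transition bias, and one selection parameter per reading frame. It is a simplified illustration and not the HexSE implementation: the function `site_rate`, the single dN/dS value per ORF (`omega`), the multiplicative combination across frames, and the treatment of stop codons as ordinary nonsynonymous changes are all assumptions made here for clarity. ORFs are assumed to be on the forward strand in 0-based, half-open coordinates.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def site_rate(seq, pos, new_nt, orfs, pi, kappa, omega):
    """Relative rate of substituting seq[pos] with new_nt.

    orfs  : list of (start, end) forward-strand ORFs covering whole codons
    pi    : dict of stationary nucleotide frequencies
    kappa : transition/transversion rate ratio
    omega : dict mapping each ORF to an assumed dN/dS value
    """
    old_nt = seq[pos]
    if new_nt == old_nt:
        return 0.0

    # Mutational component: target frequency and transition bias.
    rate = pi[new_nt] * (kappa if (old_nt, new_nt) in TRANSITIONS else 1.0)

    # Selective component: one factor per reading frame covering the site.
    for orf in orfs:
        start, end = orf
        if not start <= pos < end:
            continue
        codon_start = start + 3 * ((pos - start) // 3)
        offset = pos - codon_start
        codon = seq[codon_start:codon_start + 3]
        mutant = codon[:offset] + new_nt + codon[offset + 1:]
        if CODON_TABLE[mutant] != CODON_TABLE[codon]:
            rate *= omega[orf]  # nonsynonymous in this frame
    return rate


# Hypothetical example: position 5 is a third codon position in the first
# ORF (GCT -> GCC, synonymous) and a second codon position in the second
# ORF (CTT -> CCT, nonsynonymous).
seq = "ATGGCTTTA"
orf1, orf2 = (0, 9), (1, 7)
pi = {nt: 0.25 for nt in "ACGT"}
omega = {orf1: 0.5, orf2: 0.2}

single = site_rate(seq, 5, "C", [orf1], pi, 2.0, omega)
double = site_rate(seq, 5, "C", [orf1, orf2], pi, 2.0, omega)
print(single, double)
```

In this toy case the substitution is synonymous in one frame but nonsynonymous in the other, so the rate is lower when both frames are considered. This is the kind of constraint that could plausibly reduce synonymous substitution rates in overlapping regions.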
This thesis explores the cryptic information contained in viral genomes to help explain the evolutionary processes that shape them. In particular, understanding the impact of OvRFs on the evolution of virus genomes will provide us with crucial pieces of a significant puzzle: understanding the origin of new genes in virus genomes, and thereby virus diversity.