Electronic Thesis and Dissertation Repository


Doctor of Philosophy


Computer Science


Dr. Lucian Ilie


The ability to obtain the genetic code of any species has caused a revolution in biological sciences. Current technologies are capable of sequencing short pieces of DNA with very high quality. These short pieces of DNA are used to determine the sequence of bases in the genome of any species. This information is key to understanding many aspects of how life functions.

The accuracy of sequencing is extremely important since the differences between individuals of the same species are caused by very few changes. All sequencing technologies make errors, and it is usually best to correct these errors before the data are used in downstream applications. I present RACER, an error correction program that aims to correct substitution sequencing errors.

There are many substitution error correction programs available for DNA sequencing technologies, so it is important for biologists to know which program is best suited to their sequencing technology. To address this issue, I present a comprehensive survey of substitution error correction programs for DNA sequencing data. I also present two programs for evaluating the performance of error correction programs.

Since the currently dominant sequencing platform can only obtain small pieces of DNA, software is needed to assemble these pieces and determine the full sequence of the sampled genome. Current genome assembly programs are not capable of assembling the entire genome of most species due to the repetitive nature of genomes and the uneven coverage of the sampled genome. I present a genome assembly program called SAGE2 that improves upon the current state of the art.