Electronic Thesis and Dissertation Repository

Degree

Doctor of Philosophy

Program

Computer Science

Supervisor

Lucian Ilie

2nd Supervisor

NA

3rd Supervisor

NA

Abstract

The DNA Next Generation Sequencing (NGS) technologies produce data at a low cost, enabling their application to many ambitious fields such as cancer research, disease control, personalized medicine etc. However, even after a decade of research, the modern aligners and assemblers are far from providing efficient and error free genome alignments and assemblies respectively. This is due to the inherent nature of the genome alignment and assembly problem, which involves many complexities. Many algorithms to address this problem have been proposed over the years, but there still is a huge scope for improvement in this research space.

Many new genome alignment algorithms are proposed over time and one of the key differentiator among these algorithms is the efficiency of the genome alignment process. I present a new algorithm for efficiently finding Maximal Exact Matches (MEMs) between two genomes: E-MEM (Efficient computation of maximal exact matches for very large genomes). Computing MEMs is one of the most time consuming step during the alignment process. E-MEM can be used to find MEMs which are used as seeds in genome aligner to increase its efficiency. The E-MEM program is the most efficient algorithm as of today for computing MEMs and it surpasses all competition by large margins.

There are many genome assembly algorithms available for use, but none produces perfect genome assemblies. It is important that assemblies produced by these algorithms are evaluated accurately and efficiently.This is necessary to make the right choice of the genome assembler to be used for all the downstream research and analysis. A fast genome assembly evaluator is a key factor when a new genome assembler is developed, to quickly evaluate the outcome of the algorithm. I present a fast and efficient genome assembly evaluator called LASER (Large genome ASsembly EvaluatoR), which is based on a leading genome assembly evaluator QUAST, but significantly more efficient both in terms of memory and run time.

The NGS technologies limit the potential of genome assembly algorithms because of short read lengths and nonuniform coverage. Recently, third generation sequencing technologies have been proposed which promise very long reads and a uniform coverage. However, this technology comes with its own drawback of high error rate of 10 - 15% consisting mostly of indels. The long read sequencing data is useful only after error correction obtained using self read alignment (or read overlapping) techniques. I propose a new self read alignment algorithm for Pacific Biosciences sequencing data: HISEA (Hierarchical SEed Aligner), which has very high sensitivity and precision as compared to other state-of-the-art aligners. HISEA is also integrated into Canu assembly pipeline. Canu+HISEA produces better assemblies than Canu with its default aligner MHAP, at a much lower coverage.

Share

COinS