Electronic Thesis and Dissertation Repository


Doctor of Philosophy


Computer Science


Dr. Lucian Ilie

2nd Supervisor

Dr. Roberto Solis-Oba

Joint Supervisor


The enormous number of short reads produced by next-generation sequencing (NGS) techniques such as Roche/454, Illumina/Solexa and SOLiD sequencing has opened the possibility of de novo genome assembly. Some de novo genome assemblers (e.g., Edena and SGA) use an overlap graph approach to assemble a genome, while others (e.g., ABySS and SOAPdenovo) use a de Bruijn graph approach. Currently, the approaches based on the de Bruijn graph are the most successful, yet they are still far from being able to assemble entire genomic sequences. We developed a new overlap-graph-based genome assembler called Paired-End Genome ASsembly Using Short-sequences (PEGASUS) for paired-end short reads produced by NGS techniques. PEGASUS uses a minimum cost network flow approach to predict the copy counts of the input reads more precisely than other algorithms. With the help of accurate copy counts and mate-pair support, PEGASUS can accurately unscramble the paths in the overlap graph that correspond to DNA sequences. PEGASUS exhibits performance comparable to, and in many cases better than, that of the leading genome assemblers.