Improvements on Seeding Based Protein Sequence Similarity Search

Weiming Li, The University of Western Ontario

Degree

Doctor of Philosophy

Program

Computer Science

Supervisor

Dr. Bin Ma

2nd Supervisor

Dr. Kaizhong Zhang (Joint Supervisor)

Abstract

The primary goal of bioinformatics is to increase understanding of the biology of organisms. Computational, statistical, and mathematical theories and techniques have been developed to address the formal and practical problems that arise in pursuing this goal. For the past three decades, the primary application of bioinformatics has been biological data analysis. DNA or protein sequence similarity search is perhaps the most common, yet vitally important, task in analyzing biological data.

Sequence similarity search is the process of finding optimal sequence alignments. At the theoretical level, the problem is complex. At the application level, searching a biological database for similar sequences has become one of the most basic tasks today. Because of the size of such databases, using the traditional quadratic-time solutions is a challenge. Seeding (or filtration) based approaches, which trade sensitivity for speed, are a popular choice among the available alternatives. A seeding based approach usually has two main phases: the first is referred to as hit generation, and the second as hit extension.
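The two phases can be illustrated with a minimal sketch. It is not the method developed in the thesis: it assumes exact, contiguous k-mer seeds, a uniform match/mismatch score in place of a protein substitution matrix, and a simple ungapped extension that stops once the running score falls more than x_drop below the best score seen. All function names and parameters here are hypothetical.

```python
from collections import defaultdict


def index_kmers(seq, k):
    """Map every contiguous k-mer of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def generate_hits(query, target, k):
    """Phase 1 (hit generation): positions (q, t) where the same k-mer occurs."""
    index = index_kmers(target, k)
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            hits.append((q, t))
    return hits


def _extend_one_way(query, target, q, t, step, match, mismatch, x_drop):
    """Walk along the diagonal from (q, t); return (best score, length at best)."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= t < len(target):
        score += match if query[q] == target[t] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score dropped too far below the best: give up
        q += step
        t += step
    return best, best_len


def extend_hit(query, target, q, t, k, match=1, mismatch=-1, x_drop=2):
    """Phase 2 (hit extension): ungapped extension of a seed in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    The uniform match/mismatch scoring is an illustrative assumption.
    """
    right, right_len = _extend_one_way(
        query, target, q + k, t + k, 1, match, mismatch, x_drop)
    left, left_len = _extend_one_way(
        query, target, q - 1, t - 1, -1, match, mismatch, x_drop)
    return k * match + left + right, q - left_len, q + k + right_len


if __name__ == "__main__":
    query, target = "MKVLAAGIW", "TTMKVLAAGLWTT"
    for q, t in generate_hits(query, target, k=3):
        print((q, t), extend_hit(query, target, q, t, k=3))
```

Hit generation here costs time roughly linear in the sequence lengths plus the number of hits, and only the hits are passed on to the more expensive extension step. A similarity that contains no exact k-mer match is never examined, which is the sensitivity given up in exchange for speed.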

In this thesis, two improvements to seeding based protein sequence similarity search are presented. First, for the hit generation, a new seeding idea, namely spaced k-mer neighbors, is presented, together with effective algorithms for finding a good set of spaced k-mer neighbors. Second, for the hit generation, a new method, namely HexFilter, is proposed to reduce the number of hit extensions while achieving better selectivity. We present HexFilters with optimized configurations.

Recommended Citation

Li, Weiming, "Improvements on Seeding Based Protein Sequence Similarity Search" (2012). Electronic Thesis and Dissertation Repository. 988.
https://ir.lib.uwo.ca/etd/988

Included in

Computer Sciences Commons
