Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science


Ilie, Lucian


Similarity search is one of the most important problem in bioinformatics, with application in read mapping, homology search, oligonucleotide design, etc. Similarity search is time and memory intensive, hence heuristic methods using multiple spaced seeds are commonly employed. A spaced seed is a string of 1 and *, where 1 represents a match position and * represent don't care position. Seeds are used to discover regions with identity, thus, it is imperative to design seeds of high sensitivity, so as to maximize the number of hits.

We present SpEED2, a software program to generate multiple spaced seeds of high sensitivity. It uses a novel seed optimization approach and it outperforms all the leading programs used for designing multiple spaced seeds like Iedera, AcoSeeD, and rasbhari. Our algorithm will benefit several software that is dependent on good quality seeds for its operation like PatternHunter for similarity search, SHRiMP and BFAST for read mapping, bestPrimer for designing primers, and many more.

Summary for Lay Audience

Multiple spaced seed is a set of one or more spaced seeds and they are used to find similar regions between two biological sequences. A spaced seed is a string of 1 and *, where 1 represents a match position and * represent don't care position. Two sequences are similar if they are highly identical. When a spaced seed is arranged with the two biological sequences to identify regions of similarity, a hit occurs only if both the sequences are identical in all the match positions. However, we are not interested in the don't care positions. A measurement metric called sensitivity is used to differentiate good seeds from bad ones. Sensitivity is the probability of detecting similar regions between sequences, good multiple spaced seeds have high sensitivity whereas bad ones will have low sensitivity.

We have developed a software program named SpEED2 that generates multiple spaced seeds of high sensitivity. In this dissertation, we first look into previous related works, then we explain our algorithm and finally compare our results with other leading software concerned with generating multiple spaced seeds and show our software program to be better than the others.