Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science


Kaizhong Zhang


Sequence local alignment is to find two subsequences from the input two sequences respectively, which can produce the highest similarity degree among all other pairs of subsequences. The Smith-Waterman algorithm is one of the most important technique in sequence local alignment, especially in computational molecular biology. This algorithm can guarantee that the optimal local alignment can be found with respect to the distance or similarity metric. However, the optimal solution obtained by Smith-Waterman is not biologically meaningful, since it may contain small pieces of irrelevant segments, but as long as they are not strong enough, the algorithm still take them in account as part of the optimal solution.

In this thesis, we propose a new algorithm firstly focusing on finding segment with highest similarity degree rather than similarity score, and save all efficient optimal solutions for any subsequence pairs ending at specific positions for later use. Thereby, there is no need to run the dynamic programming over again if different normalization functions are applied for different applications. Also, several similarity score normalization methods will be introduced and discussed in this thesis. At last, a phylogenetic tree which represents the evolutionary relationships over 20 species is built, based on the data retrieved from our algorithm.

Summary for Lay Audience

Sequence comparison tools are widely used in many areas, such as the studies regarding DNA or protein sequences. The target is to find the most similar regions between any two sequences. Usually, an optimal similar region should consist of identical parts and some dissimilar fragments. Many existing applications may only focus on including identical regions as many as possible; however, the dissimilar fragments also need to be considered to measure the similarity of two subsequences. Our research provides a new approach to find the consistently similar region of input sequences, based on a normalized similarity metric.

A similarity metric will be applied to measure the similarity for each element pair from input sequences, and a score will be given. Typically, people measure the similarity of two subsequences by summing up the score of each element pair, and the solution should have the highest similarity score. However, besides the similarity score, our method also considered the segment scale to measure the similarity. A high score segment that contains a large poor fragment may not be ideal in our method. The solution found by our method is meaningful because there will not be any relatively large dissimilar fragments that exist in our solution.