Date of Award

2006

Degree Type

Thesis

Degree Name

Master of Science

Program

Computer Science

Supervisor

Lila Kari

Second Advisor

Dr. Kathleen Hill

Abstract

In this thesis we define a new complexity measure for biological (DNA) sequences. Our research was guided by the hypothesis that genomic sequences have some structure that makes them different from random sequences. We thus aimed to find a syntactical complexity measure that is minimized (in absolute value) by random strings, and to apply it to biological sequences. The complexity measure that we found to satisfy this and several other constraints is the n-subword complexity of a word w, defined as the number of different subwords of length n present in w. In addition, we defined a new measure related to the n-subword complexity, called the n-subword deficit, which measures the difference between a word w and a random string r of the same length. Theoretical results and computational experiments indicated that the 6-subword deficit has the potential to bring out the highest variability between gene sequences. We therefore focused on this measure and used it to try to answer the following questions: 1) Is there a difference between the n-subword complexity of a given gene and the n-subword complexity of random strings of the same length? 2) Using the n-subword complexity, can we find a difference between genetic sequences from different species based on the type of the species? 3) Can we compare the structure and the "complexity" of the same gene for different species, using the n-subword complexity measure? 4) Is there a difference between the n-subword complexity of coding and non-coding sequences? Our results give some answers to these questions. We found, for example, that the 6-subword deficit of human genes is on average 10% higher than that of random strings of the same length. In addition, graphing the 6-subword deficit of genes from different species produced results that are compatible with the expected "hierarchy", with human genes at the top of the graph and bacterial genes at the bottom.
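
As an illustration of the definition above, the n-subword complexity of a word can be computed by collecting its distinct length-n substrings. The following is a minimal sketch, not the thesis's implementation. The abstract does not give the formula for the n-subword deficit, so `subword_deficit` below is a hypothetical reading: it assumes the deficit is the mean n-subword complexity of uniformly random strings of the same length minus that of w. The function names, the `ACGT` alphabet, and the `trials` and `seed` parameters are all assumptions made for this example.

```python
import random


def subword_complexity(w: str, n: int) -> int:
    """Count the distinct subwords (contiguous substrings) of length n in w."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A word shorter than n has no subwords of length n, so the set is empty.
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


def subword_deficit(w: str, n: int, alphabet: str = "ACGT",
                    trials: int = 100, seed: int = 0) -> float:
    """Hypothetical deficit: random-baseline complexity minus complexity of w.

    Assumption: the baseline is the mean n-subword complexity of `trials`
    uniformly random strings over `alphabet` with the same length as w.
    The thesis may define the deficit differently.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    total = 0
    for _ in range(trials):
        r = "".join(rng.choice(alphabet) for _ in range(len(w)))
        total += subword_complexity(r, n)
    return total / trials - subword_complexity(w, n)
```

For example, `subword_complexity("ACGTACGT", 2)` counts the subwords AC, CG, GT and TA. Under the assumed definition, a highly repetitive sequence has a large positive deficit, while a sequence that looks random has a deficit near zero.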
