Date of Award

2006

Degree Type

Thesis

Degree Name

Master of Science

Program

Computer Science

Supervisor

Lila Kari

Second Advisor

Dr. Kathleen Hill

Abstract

In this thesis we define a new complexity measure for biological (DNA) sequences. Our research was guided by the hypothesis that genomic sequences have some structure that makes them different from random sequences. We therefore sought a syntactic complexity measure that is minimized (in absolute value) by random strings, and applied it to biological sequences. The measure we found to satisfy this and several other constraints is the n-subword complexity of a word w, defined as the number of different subwords of length n present in w. In addition, we defined a related measure, the n-subword deficit, which measures the difference between a word w and a random string r of the same length. Theoretical results and computational experiments indicated that the 6-subword deficit has the potential to bring out the highest variability between gene sequences. We therefore focused on this measure and used it to try to answer the following questions: 1) Is there a difference between the n-subword complexity of a given gene and the n-subword complexity of random strings of the same length? 2) Using the n-subword complexity, can we find a difference between genetic sequences from different species based on the type of the species? 3) Can we compare the structure and the "complexity" of the same gene for different species, using the n-subword complexity measure? 4) Is there a difference between the n-subword complexity of coding and non-coding sequences? Our results give some answers to these questions. We found, for example, that the 6-subword deficit of human genes is on average 10% higher than that of random strings of the same length. In addition, graphing the 6-subword deficit of genes from different species produced results that are compatible with the expected "hierarchy", with human genes at the top of the graph and bacteria genes at the bottom.
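
The definition of n-subword complexity given in the abstract can be illustrated with a short Python sketch. The function names, the four-letter alphabet, and the sampling procedure are assumptions made for illustration and are not taken from the thesis. In particular, the abstract does not state the formula for the n-subword deficit; the second function takes it, hypothetically, to be the average complexity of random strings of the same length minus the complexity of w, estimated by sampling.

```python
import random


def subword_complexity(w: str, n: int) -> int:
    """Count the distinct contiguous subwords of length n that occur in w."""
    if n <= 0 or n > len(w):
        return 0
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


def estimated_deficit(w: str, n: int, trials: int = 100,
                      alphabet: str = "ACGT", seed: int = 0) -> float:
    """Hypothetical deficit: mean complexity of random strings minus that of w.

    ASSUMPTION: the abstract gives no formula for the deficit; this is one
    plausible reading, approximated with uniformly random strings.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total = 0
    for _ in range(trials):
        r = "".join(rng.choice(alphabet) for _ in range(len(w)))
        total += subword_complexity(r, n)
    return total / trials - subword_complexity(w, n)


if __name__ == "__main__":
    gene = "ACGTACGTACGTTTGACA"  # made-up sequence, not real data
    print(subword_complexity(gene, 2))
    print(estimated_deficit(gene, 2))
```

Under this reading, a highly repetitive sequence has a large positive deficit because it contains far fewer distinct subwords than a random string of the same length.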

Recommended Citation

Anbeer, Hebatallah, "Complexity Measures of Biological Sequences" (2006). Digitized Theses. 4914.
https://ir.lib.uwo.ca/digitizedtheses/4914

Complexity Measures of Biological Sequences