## Date of Award

2006

## Degree Type

Thesis

## Degree Name

Master of Science

## Program

Computer Science

## Supervisor

Lila Kari

## Second Advisor

Dr. Kathleen Hill

## Abstract

In this thesis we define a new complexity measure for biological (DNA) sequences. Our research was guided by the hypothesis that genomic sequences have some structure that makes them different from random sequences. We thus aimed to find a syntactical complexity measure that is minimized (in absolute value) by random strings, and to apply it to biological sequences. The measure that we found to satisfy this and several other constraints is the n-subword complexity of a word w, defined as the number of different subwords of length n present in w. In addition, we defined a related measure, the n-subword deficit, which measures the difference between a word w and a random string r of the same length. Theoretical results and computational experiments indicated that the 6-subword deficit has the potential to bring out the highest variability between gene sequences. We therefore focused on this measure and used it to try to answer the following questions: 1) Is there a difference between the n-subword complexity of a given gene and the n-subword complexity of random strings of the same length? 2) Using the n-subword complexity, can we find a difference between genetic sequences from different species based on the type of the species? 3) Can we compare the structure and the "complexity" of the same gene for different species, using the n-subword complexity measure? 4) Is there a difference between the n-subword complexity of coding and non-coding sequences? Our results give some answers to these questions. We found, for example, that the 6-subword deficit of human genes is on average 10% higher than that of random strings of the same length. In addition, graphing the 6-subword deficit of genes from different species produced results that are compatible with the expected "hierarchy", with human genes at the top of the graph and bacterial genes at the bottom.
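The definition of n-subword complexity given in the abstract can be illustrated with a minimal Python sketch. The function names, the four-letter alphabet `ACGT`, the uniform random model, and the toy sequence are assumptions made for illustration only. The abstract does not state the exact formula for the n-subword deficit, so the sketch computes only the complexity of a word and an average complexity over random strings of the same length, which is one possible baseline for such a comparison and not necessarily the one used in the thesis.

```python
import random


def subword_complexity(w: str, n: int) -> int:
    """Count the distinct subwords (contiguous substrings) of length n in w."""
    if n <= 0 or n > len(w):
        return 0
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


def mean_random_complexity(length: int, n: int, trials: int = 100,
                           alphabet: str = "ACGT", seed: int = 0) -> float:
    """Average n-subword complexity over uniformly random strings.

    Hypothetical helper: the uniform model and the number of trials are
    assumptions, not details taken from the thesis.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        r = "".join(rng.choice(alphabet) for _ in range(length))
        total += subword_complexity(r, n)
    return total / trials


if __name__ == "__main__":
    gene = "ACGTACGTTTGACCAGGT"  # toy sequence, not real data
    n = 2
    print("complexity of toy sequence:", subword_complexity(gene, n))
    print("mean complexity of random strings:",
          mean_random_complexity(len(gene), n))
```

For example, the word `ACGTACGT` contains the length-2 subwords `AC`, `CG`, `GT`, and `TA`, so its 2-subword complexity is 4. Over a four-letter alphabet the n-subword complexity of a word of length L is bounded above by min(4^n, L - n + 1).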

## Recommended Citation

Anbeer, Hebatallah, "Complexity Measures of Biological Sequences" (2006). *Digitized Theses*. 4914.

https://ir.lib.uwo.ca/digitizedtheses/4914