Date of Award
Master of Science
String complexity has many definitions: Kolmogorov complexity ; Lempel-Ziv complexity  ; Linguistic complexity , Subword complexity  etc. In this thesis we will consider the n-subword complexity studied in  and . The n-subword complexity Pw(n) of a genomic sequence w was defined in  as the number of distinct factors (subwords) of length n that occur in w. In  a new measure called the n-subword deficit was defined as the difference between the number of subwords of length n of a genomic sequence w and of a random genomic sequence of the same length. This definition was applied to short sequences (2000 base pairs). In this thesis, we will expand this definition to be applied, in addition to short sequences, also to very long sequences (from 100 base pairs to 200,000 base pairs). The aim of our work is to answer the following questions: 1. Do biological sequences show an n-subword deficit, and is their n- subword deficit length dependent? 2. Is the n-subword deficit gene specific? 3. Is the n-subword deficit genome specific? Our results indicate that the answers to questions 1 — 3 appears to be Yes, No, and No respectively. Moreover, it was found that the insects Apis mellifera and Drosophila melanogaster have genomes with the lowest maximal n-subword deficit value among other genomes in all experiments that have been conducted.
Shihadeh, Noman, "n-Subword Complexity Measure of DNA Sequences" (2009). Digitized Theses. 3841.