Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article

Degree

Doctor of Philosophy

Program

Computer Science

Supervisor

Kari, Lila

Affiliation

The University of Waterloo

Abstract

In the field of bioinformatics, taxonomic classification is the scientific practice of identifying, naming, and grouping organisms based on their similarities and differences. The problem of taxonomic classification is of immense importance, considering that nearly 86% of existing species on Earth and 91% of marine species remain unclassified. Given the magnitude of the datasets, there is a need for an approach and software tool that scales to large datasets and supports rapid sequence comparison and analysis. We propose ML-DSP, a stand-alone alignment-free software tool that uses Machine Learning and Digital Signal Processing to classify genomic sequences. ML-DSP uses numerical representations to map genomic sequences to discrete numerical series (genomic signals), the Discrete Fourier Transform (DFT) to obtain magnitude spectra from the genomic signals, the Pearson Correlation Coefficient (PCC) as a dissimilarity measure to compute pairwise distances between the magnitude spectra of any two genomic signals, and supervised machine learning to classify new sequences and predict their labels. We first test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of > 97%. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy.
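The four stages named above (numerical representation, DFT magnitude spectrum, PCC-based distance, supervised classification) can be illustrated with a minimal Python sketch. It is not the ML-DSP implementation. The nucleotide mapping `NUC_TO_INT`, the zero-padding or truncation to the median length, the normalization `(1 - r) / 2`, and the 1-nearest-neighbour classifier are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import numpy as np

# Assumed nucleotide-to-number mapping, for illustration only.
NUC_TO_INT = {"T": 0.0, "C": 1.0, "A": 2.0, "G": 3.0}


def genomic_signal(seq, length):
    """Map a DNA string to a numeric series of exactly `length` samples.

    Shorter sequences are zero-padded and longer ones truncated (an assumed
    length-normalization scheme). Unknown symbols are mapped to 0.0.
    """
    values = [NUC_TO_INT.get(base, 0.0) for base in seq.upper()[:length]]
    signal = np.zeros(length)
    signal[:len(values)] = values
    return signal


def magnitude_spectrum(signal):
    """Magnitudes of the Discrete Fourier Transform of a genomic signal."""
    return np.abs(np.fft.fft(signal))


def pcc_distance(x, y):
    """Turn the Pearson correlation r into a dissimilarity in [0, 1]."""
    r = np.corrcoef(x, y)[0, 1]
    return (1.0 - r) / 2.0


def distance_matrix(seqs):
    """Pairwise PCC distances between the magnitude spectra of `seqs`."""
    length = int(np.median([len(s) for s in seqs]))
    spectra = [magnitude_spectrum(genomic_signal(s, length)) for s in seqs]
    n = len(spectra)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = pcc_distance(spectra[i], spectra[j])
    return dist


def classify(train_seqs, train_labels, query_seq):
    """Label a query with the label of its nearest training spectrum.

    A 1-nearest-neighbour rule stands in here for the supervised
    classifiers; it is an assumption, not the thesis's choice.
    """
    length = int(np.median([len(s) for s in train_seqs]))
    query = magnitude_spectrum(genomic_signal(query_seq, length))
    dists = [
        pcc_distance(query, magnitude_spectrum(genomic_signal(s, length)))
        for s in train_seqs
    ]
    return train_labels[int(np.argmin(dists))]
```

Under these assumptions, `classify(train_seqs, train_labels, query_seq)` returns the label of the closest training sequence. Note that a constant signal has an undefined Pearson correlation, so real inputs would need a guard for that case.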
Second, we propose another tool, MLDSP-GUI, whose additional features include a user-friendly Graphical User Interface, Chaos Game Representation (CGR) to numerically represent DNA sequences, Euclidean and Manhattan distances as additional distance measures, phylogenetic tree output, oligomer frequency information to study the under- and over-representation of any particular sub-sequence in a selected sequence, and inter-cluster distance analysis, among others. We test MLDSP-GUI by classifying 7881 complete genomes of the Flavivirus genus into species with 100% classification accuracy. Third, we provide a proof of principle that MLDSP-GUI is able to classify newly discovered organisms by classifying the novel COVID-19 virus.
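For readers unfamiliar with Chaos Game Representation, the following Python sketch shows the basic construction: each nucleotide moves the current point halfway toward that nucleotide's corner of the unit square, and counting points on a 2^k by 2^k grid yields k-mer frequencies. The corner assignment in `CORNERS`, the starting point (0.5, 0.5), and the function names are assumptions for illustration. This is not the MLDSP-GUI code.

```python
import numpy as np

# Assumed corner assignment on the unit square, as (x, y).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of `seq` as an (n, 2) array of points.

    Symbols other than A, C, G, T are skipped (an assumed policy).
    """
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return np.array(points).reshape(-1, 2)


def cgr_matrix(seq, k):
    """Count CGR points on a 2^k x 2^k grid.

    After k steps, the grid cell of a point is determined by the last k
    bases, so each cell count is the frequency of one k-mer. The first
    k - 1 points are skipped because they do not yet encode a full k-mer.
    """
    size = 2 ** k
    grid = np.zeros((size, size), dtype=int)
    for x, y in cgr_points(seq)[k - 1:]:
        col = min(int(x * size), size - 1)  # clamp guards against x == 1.0
        row = min(int(y * size), size - 1)
        grid[row, col] += 1
    return grid
```

The resulting matrix can be flattened into a numeric vector, giving one way to compare sequences with the Euclidean or Manhattan distances mentioned above.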

Summary for Lay Audience

Sequence classification is the scientific practice of identifying, naming, and grouping organisms based on their differences and similarities. Considering that most existing species (nearly 86% of species on Earth and 91% of marine species) remain unclassified, the problem of sequence classification is of immense importance. Due to the magnitude of the datasets, sequence comparison and analysis for the purpose of classification remains challenging. Sequence (dis)similarity analysis has many possible applications, including taxonomic classification (classifying organisms on the basis of shared characteristics), virus-subtype classification (assigning viral sequences to their subtypes), disease classification (classifying human genomic sequences on the basis of disease type), and human haplogroup classification (assigning human mitochondrial genomes to haplogroups on the basis of maternal lineage). There is a need for an approach and software tool that scales to large datasets and provides accurate classifications within a short time. We propose a machine learning-based methodology, ML-DSP, that is effective in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity. We also propose MLDSP-GUI, an extension of ML-DSP with multiple additional valuable features. Lastly, we show the applicability of our approach to taxonomic classification and virus-subtype classification, and provide a proof of principle that our approach is able to classify newly discovered organisms by classifying the previously unclassified novel coronavirus (COVID-19 virus) sequences.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
