Electronic Thesis and Dissertation Repository

Degree

Master of Science

Program

Applied Mathematics

Supervisor

L. M. Wahl

Abstract

Metagenomic sequencing techniques have made it possible to determine the composition of the bacterial microbiota of the human body. Clustering algorithms have been used to search for core microbiota types in the vagina, but results have been inconsistent, possibly due to methodological differences. We performed an extensive comparison of six commonly used clustering algorithms and four distance metrics, using clinical data from 777 vaginal samples across five studies, and 36,000 synthetic datasets based on these clinical data. We found that centroid-based clustering algorithms (K-means and Partitioning Around Medoids), with Euclidean or Manhattan distance metrics, performed well. They were best at correctly clustering and determining the number of clusters in synthetic datasets and were also top performers for predicting vaginal pH and bacterial vaginosis by clustering clinical data. Hierarchical clustering algorithms, particularly neighbour joining and average linkage, performed less well, failing unequivocally on many datasets.
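As a rough illustration of the kind of comparison described above, the following sketch clusters a handful of samples around medoids under both Manhattan and Euclidean distance. It is not the thesis code: the abundance values are invented for illustration, the function names are hypothetical, and the medoids are found by exhaustive search (which minimises the same total-distance objective that Partitioning Around Medoids targets, but is not the PAM build-and-swap heuristic and is only feasible for very small inputs).

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences between two abundance profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two abundance profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def medoid_clusters(samples, k, dist):
    """Choose the k medoids with the lowest total distance (brute force).

    Hypothetical helper: it optimises the k-medoids objective exactly,
    so it is suitable for toy inputs only.
    """
    n = len(samples)
    # Pairwise distance matrix under the chosen metric.
    d = [[dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    best_cost, best_medoids = float("inf"), None
    for medoids in combinations(range(n), k):
        # Each sample contributes its distance to the nearest medoid.
        cost = sum(min(d[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Label each sample with the position of its nearest medoid.
    labels = [min(range(k), key=lambda c: d[i][best_medoids[c]])
              for i in range(n)]
    return best_medoids, labels, best_cost


# Invented relative abundances of three taxa in six samples (rows sum to 1).
samples = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.60, 0.30],
    [0.05, 0.55, 0.40],
    [0.10, 0.50, 0.40],
]

# Run the same clustering under each distance metric and compare labels.
results = {}
for name, dist in (("manhattan", manhattan), ("euclidean", euclidean)):
    medoids, labels, cost = medoid_clusters(samples, k=2, dist=dist)
    results[name] = labels
    print(name, "medoids:", medoids, "labels:", labels)
```

On real data the number of clusters is not known in advance, so a comparison of this kind would also need a rule for choosing k; that step is omitted here.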
