A Framework for Characterising Performance in Multi-Class Classification Problems with Applications in Cancer Single Cell RNA Sequencing
Master of Science
Many real-world problems require multi-class classifiers to properly identify all classes in a dataset. Evaluating the performance of such classifiers means taking several parameters into account. I created a framework that can be used to drill into the differences between algorithms in specific scenarios and to compare multiple classifiers more effectively, allowing researchers to better identify the strengths and weaknesses of particular classifiers. Single-cell RNA sequencing (scRNA-seq) allows cancer researchers to define complex cell types (i.e. classes) in the tumour micro-environment (TME). Using eight datasets, I assessed the performance of 26 methods from different perspectives, such as the ability to identify under-represented or imbalanced classes, or to identify distinct but related subgroups that have not been seen before within a population. This study can be used to select the best methods for multi-class classification of complex datasets, such as scRNA-seq TME datasets, and provides avenues for future work.
Summary for Lay Audience
Supervised learning is the process of teaching an algorithm how to predict a result given a set of observations. The simplest case is a binary classifier, which can only choose between two different results. Most real-world problems, however, have more than two possible results, and multi-class classifiers, which can predict more than two results, are better suited to them. Unfortunately, with many possible results it is hard to gauge how well a classifier predicts the correct result or solves a problem. I have created a set of guidelines that can be used to test these classifiers more completely and to better understand how they perform relative to each other. This allows users to compare classifiers and choose the one that best solves a particular problem, or to identify shortcomings in a newly developed classifier in order to improve how the problem is solved. Tumour micro-environments (TME) contain a variety of cell types that can affect cancer progression, making accurate identification of cell types important. Single-cell RNA sequencing (scRNA-seq) measures the gene expression profiles of individual cells, yet analysis of scRNA-seq data involves manual cell type identification, which can lead to inaccurate predictions and irreproducible results. Automated algorithms exist, but they have mainly been tested on normal tissues. To understand how they perform on the TME, I evaluated 26 automated cell-type labelling methods using eight cancer datasets. I found that algorithms that learn from individual cells within a sample perform better than those that use cell clusters for prediction. Additionally, the cell-based methods are better able to identify malignant cells in the TME, while cluster-based algorithms perform better on non-malignant cell types than on malignant ones. My study provides guidelines for selecting a cell type identification method.
Christensen, Erik R., "A Framework for Characterising Performance in Multi-Class Classification Problems with Applications in Cancer Single Cell RNA Sequencing" (2021). Electronic Thesis and Dissertation Repository. 8278.
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.