Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science


Shooshtari, Parisa


In recent years, technological developments have enabled comprehensive transcriptional profiling of thousands of single cells in a single experiment. However, there is still much to be gained from integrating datasets from different donors, studies, and technological platforms. A major challenge in this regard is the technical variability introduced by handling different batches, known as batch effects, which can obscure biological variation. Various studies have focused on assessing batch effects within a dataset, seeking to establish reliable criteria for selecting a batch effect removal method. However, these methods do not always perform reliably. This study provides a comprehensive review of both batch effect removal and assessment methods and introduces a novel method for assessing batch effect removal. The proposed method was evaluated by comparing it to four other batch effect assessment methods on eleven test datasets. It consistently outperformed the other methods, passing all challenges, while each of the other methods failed at least one test. The proposed method was also applied to three integrated biological datasets to evaluate its performance on real-world data. There it showed the highest correlation with the expert’s assessment of the datasets, indicating that it accurately identified batch effects in the data.

Summary for Lay Audience

In recent years, scientists have been able to study the genetic activity of individual cells in unprecedented detail. However, when combining data from different studies, researchers often encounter a problem known as "batch effects." These are differences in the way that samples were processed or analyzed, and they can obscure real biological differences between cells. In this thesis, we review existing methods for identifying and removing batch effects and propose a new method for assessing how well batch effect removal techniques work. To test our new method, we applied it to twelve different synthetic datasets and compared its performance to four other assessment methods. Our method consistently outperformed the others, passing every test, while each of the other methods failed at least one. We also applied our method to three real-world datasets and found that its assessments agreed most closely with an expert's evaluation of the data. Overall, our research provides a promising way to judge whether batch effects have been adequately addressed in large-scale studies of genetic activity. By improving our ability to combine data from different sources, we can gain a more comprehensive understanding of how genes are regulated in different cell types and under different conditions. This could ultimately lead to new insights into the causes of diseases and the development of more effective treatments.