Electronic Thesis and Dissertation Repository

Thesis Format



Doctor of Philosophy


Statistics and Actuarial Sciences


De Souza, Camila


The Expectation-Maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates in problems involving missing data or latent variables. It can be applied to problems in which the data are evidently incomplete, such as truncated distributions and censored or grouped observations, and also to problems in which the missingness is not natural or evident, such as mixed-effects models, mixture models, log-linear models, and latent variable models. In Chapter 2 of this thesis, we apply the EM algorithm to grouped data, a problem in which the incompleteness of the data is evident. Data confidentiality is of great importance to many companies and organizations. For this reason, they may prefer not to release exact data but instead to grant researchers access to approximate data. For example, rather than providing the exact measurements of their clients, they may provide only grouped data, that is, the number of clients falling in each of a set of non-overlapping measurement intervals. The challenge is to estimate the mean and variance structure of the hidden ungrouped data from the observed grouped data. To tackle this problem, this work considers the exact observed-data likelihood and applies the EM and Monte Carlo EM (MCEM) algorithms for the cases where the hidden data follow a univariate, bivariate, or multivariate normal distribution. The well-known Galton data and simulated datasets are used to evaluate the statistical properties of the proposed EM and MCEM algorithms. In Chapters 3, 4, and 5, we apply the EM algorithm to a case in which the missingness of the data is not evident: we use mixture models and latent variables to propose a novel model-based clustering approach for single-cell RNA sequencing data. In biology, cells can be distinguished by their phenotype, such as size and shape, or at the molecular level, based on their genome, epigenome, and transcriptome.
In this thesis, we focus on the transcriptome, which includes all RNA transcripts in a given cell population, indicating the genes being expressed at a certain time. We consider single-cell RNA sequencing data and develop a novel model-based clustering method to group cells based on their transcriptome profiles. The proposed clustering approach takes into account the large proportion of zeros present in the data, which can be either true biological zeros or technological noise. The assumed model for clustering is a mixture of either zero-inflated Poisson or zero-inflated negative binomial distributions, and inference is conducted via the EM algorithm. The performance of the proposed methodology is evaluated via simulation studies and analyses of published real datasets.
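
As a rough illustration of the kind of EM update that zero-inflated models involve, the sketch below fits a single zero-inflated Poisson distribution to a list of counts. It is a simplified building block under stated assumptions, not the mixture model or the clustering procedure developed in this thesis: there is one gene, one component, and no cluster labels. The function name `em_zip`, the parameter names `pi` (probability of a structural zero) and `lam` (Poisson mean), and the starting values are all hypothetical choices made for this example.

```python
import math


def em_zip(counts, pi=0.5, lam=1.0, max_iter=1000, tol=1e-10):
    """EM for one zero-inflated Poisson distribution (illustrative sketch).

    Assumed model: with probability pi an observation is a structural zero,
    otherwise it is drawn from Poisson(lam). The latent variable is the
    indicator of a structural zero, which is hidden for every observed zero.
    """
    n = len(counts)
    total = sum(counts)                          # sum of all observed counts
    n_zero = sum(1 for y in counts if y == 0)    # number of observed zeros

    for _ in range(max_iter):
        # E-step: posterior probability that an observed zero is structural.
        # Positive counts cannot be structural zeros, so they contribute 0.
        tau = pi / (pi + (1.0 - pi) * math.exp(-lam))
        expected_structural = n_zero * tau

        # M-step: closed-form updates given the expected indicators.
        new_pi = expected_structural / n
        new_lam = total / (n - expected_structural)

        converged = abs(new_pi - pi) + abs(new_lam - lam) < tol
        pi, lam = new_pi, new_lam
        if converged:
            break

    return pi, lam
```

Calling `em_zip(counts)` on a list of non-negative integers returns the estimated pair `(pi, lam)`. A mixture of such distributions would add a second layer of latent variables, the cluster memberships, with their own posterior probabilities in the E-step.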

Summary for Lay Audience

The Expectation-Maximization (EM) algorithm has many applications in statistics for estimation purposes. In this thesis, we study the application of the EM algorithm from two perspectives: situations in which the incompleteness of the data is evident, and situations in which it is not. In Chapter 2, we consider a case in which the missingness is evident, namely grouped data: we know the intervals of the data and the frequency in each interval, but the exact raw data are not available. Assuming that the data follow a normal distribution, we estimate its mean and variance by applying the EM algorithm. We consider the cases of univariate, bivariate, and multivariate normal grouped data. We evaluate the performance of the proposed EM framework with simulated data and a publicly available dataset. In Chapters 3, 4, and 5, we study another application of the EM algorithm, in which the incompleteness of the data is not evident, as in mixture models. We consider finite mixtures of zero-inflated models to cluster cells based on their gene expression profiles, applying the EM algorithm to estimate the model parameters. Our proposed clustering approach accounts for the large proportion of zeros in the data, which can be either true biological zeros or technological noise. Simulation studies are conducted to evaluate the performance of our proposed method under different controlled scenarios. We also analyze publicly available biological datasets as examples of applications.
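
The grouped-data problem of Chapter 2 can be illustrated in the univariate case with a minimal sketch. It assumes the hidden observations are normal and that only the interval edges and the count in each interval are observed. The E-step replaces each hidden value by the first and second moments of a normal distribution truncated to its interval, and the M-step recomputes the mean and standard deviation from those expected moments. All function and parameter names here (`em_grouped_normal`, `edges`, `counts`) are hypothetical, and the sketch does not guard against numerical underflow in far-tail intervals. It is not the thesis implementation and does not cover the bivariate, multivariate, or MCEM cases.

```python
import math


def norm_pdf(z):
    # Standard normal density; defined as 0 at +/- infinity.
    return 0.0 if math.isinf(z) else math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def z_times_pdf(z):
    # z * pdf(z), using its limit 0 at +/- infinity to avoid inf * 0.
    return 0.0 if math.isinf(z) else z * norm_pdf(z)


def truncated_moments(a, b, mu, sigma):
    """Return E[X] and E[X^2] for X ~ N(mu, sigma^2) restricted to (a, b)."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    mass = norm_cdf(beta) - norm_cdf(alpha)
    shift = (norm_pdf(alpha) - norm_pdf(beta)) / mass
    mean = mu + sigma * shift
    var = sigma ** 2 * (1.0 + (z_times_pdf(alpha) - z_times_pdf(beta)) / mass - shift ** 2)
    return mean, var + mean ** 2


def em_grouped_normal(edges, counts, mu, sigma, max_iter=500, tol=1e-10):
    """EM for normal data observed only as counts per interval (sketch).

    edges  : increasing interval boundaries, may start/end with -inf/+inf
    counts : number of observations in each of the len(edges) - 1 intervals
    mu, sigma : starting values (assumed to be supplied by the caller)
    """
    total = sum(counts)
    for _ in range(max_iter):
        # E-step: expected sufficient statistics sum(x) and sum(x^2).
        s1 = s2 = 0.0
        for a, b, n in zip(edges[:-1], edges[1:], counts):
            if n == 0:
                continue
            m1, m2 = truncated_moments(a, b, mu, sigma)
            s1 += n * m1
            s2 += n * m2

        # M-step: normal maximum likelihood estimates from expected statistics.
        new_mu = s1 / total
        new_sigma = math.sqrt(s2 / total - new_mu ** 2)

        converged = abs(new_mu - mu) + abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break

    return mu, sigma
```

A call such as `em_grouped_normal(edges, counts, mu0, sigma0)` returns the estimated mean and standard deviation. Starting values can be taken, for instance, from the interval midpoints weighted by the counts.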