
Integrative Data Analysis to Uncover Mechanisms of Autoimmune Diseases
Abstract
Autoimmune diseases are complex diseases that arise from the interplay of multiple biological factors, including genetic, epigenetic, and environmental factors. Genome-wide association studies (GWAS) have discovered thousands of genetic risk variants for autoimmune diseases; however, the mechanisms by which these variants act remain unclear. Specifically, these variants mostly affect gene regulatory elements, which influence cell function indirectly by regulating the activity of genes. Open chromatin sites are one class of gene regulatory elements: regions of the DNA where the chromatin is open, allowing regulatory proteins known as transcription factors to bind. The interplay between transcription factors and the DNA sequence of their binding sites regulates the activity level of the associated genes. Disruption of these processes by genetic variation or environmental factors can lead to the emergence of autoimmune diseases.
Uncovering the biological mechanisms of autoimmune diseases requires integrating multiple biological data types to address the factors involved in their pathogenesis. In this project, I developed computational methods for integrating such data types to shed light on the mechanisms of autoimmune diseases. Specifically, I developed statistical frameworks for identifying the effects of autoimmune disease risk variants on gene regulatory elements, including open chromatin sites and transcription factors. In addition, computational models ranging from classical correlation-based methods to state-of-the-art deep learning models were used to identify the links between disease-affected regulatory elements and the genes they dysregulate. External data sources that reveal links between regulatory sites and genes, including Hi-C and eQTL data, were incorporated into the models to confirm the predicted regulatory links.
My first method applies Fisher’s exact test to GWAS data for nine autoimmune diseases and to the binding patterns of transcription factors in various tissues and cell types. This method identified the transcription factors significantly affected by the genetic risk variants of these autoimmune diseases, and the immune cell types in which these effects occur. The second method is a comprehensive data integration pipeline that incorporates GWAS data, sequence models of transcription factors, single-cell open chromatin data, and external data sources (Hi-C and eQTL) for confirmation. This method identifies the open chromatin sites and transcription factor binding patterns of cell populations, including rare ones, from the single-cell open chromatin data. GWAS data are then integrated with these results to identify the disease-affected regulatory elements in each cell population. Finally, using a regression-based algorithm and the external data sources, the disease-affected regulatory elements are linked to the genes they regulate. The third method employs a deep learning model based on the attention mechanism to improve the accuracy of identifying the links between disease-affected regulatory sites and genes. My methods are mainly used for identifying the genetic and epigenetic factors of immune diseases in immune cell types. However, they are general and can be applied to uncover the mechanisms of other common complex diseases using the corresponding GWAS data and chromatin accessibility data from relevant cell types.
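As an illustration of the kind of enrichment test the first method relies on, the following is a minimal sketch of a one-sided Fisher’s exact test asking whether disease risk variants overlap the binding sites of a single transcription factor more often than background variants do. The function name, the layout of the 2x2 table, and the counts are illustrative assumptions, not the implementation or data used in this work. The p-value is computed directly from the hypergeometric distribution using only the Python standard library.

```python
from math import comb


def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for enrichment.

    Assumed (hypothetical) 2x2 table layout:
                          in TF sites   outside TF sites
        risk variants          a               b
        background variants    c               d

    Returns P(X >= a), where X is hypergeometric with the table's
    margins held fixed.
    """
    n_risk = a + b            # total risk variants (row margin)
    n_in_sites = a + c        # total variants inside TF sites (column margin)
    n_total = a + b + c + d   # all variants

    # Upper tail: every table at least as extreme as the observed one.
    k_max = min(n_risk, n_in_sites)
    tail = sum(
        comb(n_in_sites, k) * comb(n_total - n_in_sites, n_risk - k)
        for k in range(a, k_max + 1)
    )
    return tail / comb(n_total, n_risk)


if __name__ == "__main__":
    # Made-up toy counts for one hypothetical TF in one cell type.
    p = fisher_enrichment_p(a=12, b=88, c=150, d=4850)
    print(f"enrichment p-value: {p:.3g}")
```

In practice such a test would be repeated for every transcription factor, cell type, and disease, so the resulting p-values would need a multiple-testing correction before any factor is called significant.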