Columbia University Libraries/Information Services
Abstract
Identifying disease risk genes is a central topic of human genetics. Cost-effective exome and whole-genome sequencing has enabled large-scale discovery of genetic variation. However, the statistical power to find new risk genes through rare genetic variation is fundamentally limited by sample size. As a result, we have an incomplete understanding of the genetic architecture and molecular etiology of most human conditions and diseases. In this thesis, I developed new computational methods that integrate functional genomics data sets, such as epigenomic profiles and single-cell transcriptomics, to improve power for identifying genetic risk and to gain more insight into the etiology of developmental disorders. The overall hypothesis is that disease risk genes contributing to developmental disorders are bottleneck genes in normal development and are subject to precise transcriptional regulation that maintains spatiotemporally specific expression during development. In this thesis I describe two major research projects. The first, Episcore, predicts haploinsufficient genes from a large, integrated set of epigenomic profiles from multiple tissues and cell lines using supervised machine learning. The second, A-risk, predicts the plausibility that a gene is a risk gene for autism spectrum disorder based on single-cell RNA-seq data collected from human fetal midbrain and prefrontal cortex. Both methods were shown to improve gene discovery in analyses of de novo mutations in developmental disorders. Overall, my thesis represents an effort to integrate functional genomics data through machine learning to facilitate both discovery and interpretation in genetic studies of human diseases. I believe that such integrative analysis can help us better understand genetic variants and disease etiology.