Computational methods for the analysis of next generation sequencing data

Abstract

Next generation sequencing (NGS) technology has recently emerged as a powerful approach and has transformed biomedical research on an unprecedented scale. NGS is expected to replace traditional hybridization-based microarray technology because of its affordable cost and high digital resolution. Although NGS has significantly extended the ability to study the human genome and to better understand the biology of genomes, it has also required profound changes in data analysis. There is a substantial need for computational methods that allow convenient analysis of these overwhelmingly high-throughput data sets and that address the growing number of compelling biological questions now approachable with NGS technology. This dissertation focuses on the development of computational methods for NGS data analysis. First, two methods are developed and implemented for detecting variants in individual or pooled DNA sequencing data. SNVer formulates variant calling as a hypothesis testing problem and employs a binomial-binomial model to test the significance of the observed allele frequency while taking sequencing error into account. SNVerGUI is a GUI-based desktop tool built upon the SNVer model to serve the main users of NGS data, such as biologists, geneticists and clinicians, who often lack programming expertise. Second, a singleton-collapsing strategy is explored for associating rare variants in a DNA sequencing study. Specifically, a gene-based genome-wide scan based on singleton collapsing is performed on a whole genome sequencing data set, suggesting that collapsing singletons may boost signals in association studies of rare variants in sequencing studies. Third, two approaches are proposed to address the 3’UTR switching problem. PolyASeeker is a novel bioinformatics pipeline for identifying polyadenylation cleavage sites from RNA sequencing data, which helps to enhance the knowledge of alternative polyadenylation mechanisms and their roles in gene regulation. A change-point model based on a likelihood ratio test is also proposed to address the same problem in the analysis of RNA sequencing data. To date, this is the first method for detecting 3’UTR switching without relying on any prior knowledge of polyadenylation cleavage sites.
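As an illustration of the hypothesis-testing view of variant calling mentioned above, the following is a minimal sketch of a one-level binomial test for a single sample. It is not the SNVer binomial-binomial model, which the abstract does not specify in detail. The function names, the default error rate of 0.01 and the significance level of 0.05 are assumptions chosen for this example only. Under the null hypothesis of no variant, every non-reference read is treated as a sequencing error, so the count of non-reference reads is modeled as Binomial(depth, error_rate).

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def call_variant(alt_reads: int, depth: int,
                 error_rate: float = 0.01, alpha: float = 0.05):
    """Simplified single-sample test (hypothetical helper, not SNVer itself).

    H0: the site is not variant, so non-reference reads arise only from
    sequencing error at rate `error_rate` (an assumed value).
    Returns the p-value and whether H0 is rejected at level `alpha`.
    """
    p_value = binomial_upper_tail(alt_reads, depth, error_rate)
    return p_value, p_value < alpha


if __name__ == "__main__":
    # Illustrative inputs only: 10 non-reference reads out of 40.
    p_value, is_variant = call_variant(alt_reads=10, depth=40)
    print(f"p-value = {p_value:.3g}, variant called: {is_variant}")
```

A genome-wide application would also need a multiple-testing correction, and a pooled design would need a second level in the model to account for the unknown number of variant alleles in the pool; both are outside the scope of this sketch.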
