thesis

Evaluation of statistical methods, modeling, and multiple testing in RNA-seq studies

Abstract

Recent next-generation sequencing methods provide a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Because of this feature of RNA sequencing (RNA-seq) data, appropriate statistical inference methods are required. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA-seq data, its appropriateness for genetic studies has not been exhaustively evaluated. Additionally, adjusting for covariates that have an unknown relationship with the expression of a gene has not been extensively evaluated in RNA-seq studies using the NB framework. Finally, the dependence structures in RNA-seq data may violate the assumptions of some multiple testing correction methods. In this dissertation, we suggest an alternative regression method, evaluate the effect of covariates, and compare various multiple testing correction methods. We conduct simulation studies and apply these methods to a real data set. First, we suggest Firth’s logistic regression for detecting differentially expressed genes in RNA-seq data. We also recommend a data-adaptive method that estimates a recalibrated distribution of test statistics. Firth’s logistic regression exhibits an appropriately controlled Type-I error rate when the data-adaptive method is used and shows power comparable to NB regression in simulation studies. Next, we evaluate the effect of disease-associated covariates where the relationship between the covariate and gene expression is unknown. Although the power of both NB and Firth’s logistic regression decreases as disease-associated covariates are added to the model, Type-I error rates are well controlled in Firth’s logistic regression if the relationship between a covariate and disease is not strong. Finally, we compare multiple testing correction methods that control the family-wise error rate or the false discovery rate.
The evaluation reveals that an understanding of study designs, RNA-seq data, and the consequences of applying specific regression and multiple testing correction methods is very important for controlling family-wise error rates or false discovery rates. We believe our statistical investigations will enrich gene expression studies and influence related statistical methods.
