University of North Carolina at Chapel Hill Graduate School
Abstract
Precision medicine and genomics data offer opportunities for better decision making in public health. In this dissertation, we develop three methodological components for precision medicine and the analysis of genomics data. First, we develop a nonparametric regression method for interval-censored data: Interval Censored Recursive Forests (ICRF), an iterative random forest survival estimator. ICRF addresses the splitting bias problem that tree-based methods face with censored data by using consistent splitting rules and a recursion technique. The estimator is uniformly consistent and shows high prediction accuracy in simulations and data analyses. Second, we develop an estimator of the optimal dynamic treatment regime (DTR) for survival outcomes with dependent censoring. When the goal is to maximize the survival time or the survival probability of cancer patients who undergo multiple rounds of chemotherapy, finding the optimal dynamic treatment regime is complicated by incomplete survival information. Some patients drop out or experience failure before completing all of the preplanned treatment stages, so the number of treatment stages differs across patients. To address this issue, we generalize the Q-learning approach and the random survival forest framework. The new method also overcomes two limitations of existing methods: the assumption of independent censoring and the reliance on a strong modeling structure for the failure time. We show consistency of the value of the estimator and illustrate the performance of the method through simulations and analyses of leukemia patient data and national mortality data. Third, we develop a method that measures gene-gene associations after adjusting for dropout events in single-cell RNA sequencing (scRNA-seq) data.
We posit a bivariate zero-inflated negative binomial (BZINB) model, estimate the dropout probability, and measure the underlying correlation after controlling for dropout effects. The gene-gene association measured in this way can serve as a building block for gene set testing methods. The BZINB model has a straightforward latent variable interpretation and is estimated using the EM algorithm.
Doctor of Philosophy
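The latent variable reading of a zero-inflated bivariate count model can be illustrated with a small simulation. The sketch below is not the dissertation's estimator and does not use the EM algorithm. It assumes one common construction (gamma-Poisson counts that share a gamma component, plus a four-category dropout pattern), and all function names, parameter names, and parameter values are hypothetical choices made for illustration. It compares the correlation computed on all observed pairs with the correlation among pairs that experienced no dropout, which is known here only because the data are simulated.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_pair(rng, a0, a1, a2, b1, b2, pattern_probs):
    """Simulate one (x1, x2, pattern) triple from an illustrative zero-inflated model.

    Assumed construction (hypothetical parameterization):
      r0, r1, r2 are independent gamma variables with shapes a0, a1, a2;
      x1 ~ Poisson(b1 * (r0 + r1)) and x2 ~ Poisson(b2 * (r0 + r2));
      pattern 0 = no dropout, 1 = gene 1 dropped, 2 = gene 2 dropped, 3 = both dropped.
    """
    r0 = rng.gammavariate(a0, 1.0)  # shared component, the source of correlation
    r1 = rng.gammavariate(a1, 1.0)
    r2 = rng.gammavariate(a2, 1.0)
    x1 = poisson(rng, b1 * (r0 + r1))
    x2 = poisson(rng, b2 * (r0 + r2))
    pattern = rng.choices(range(4), weights=pattern_probs)[0]
    if pattern in (1, 3):
        x1 = 0  # dropout replaces the latent count with zero
    if pattern in (2, 3):
        x2 = 0
    return x1, x2, pattern


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)
draws = [
    simulate_pair(rng, 2.0, 1.0, 1.0, 2.0, 2.0, (0.6, 0.2, 0.2, 0.0))
    for _ in range(20000)
]

# Correlation on all observed pairs, ignoring dropout.
naive_corr = pearson([d[0] for d in draws], [d[1] for d in draws])

# Correlation among pairs with no dropout (pattern is observable only in simulation).
kept = [d for d in draws if d[2] == 0]
clean_corr = pearson([d[0] for d in kept], [d[1] for d in kept])

print(f"naive correlation: {naive_corr:.3f}")
print(f"no-dropout correlation: {clean_corr:.3f}")
```

Under these assumed parameter values, the no-dropout pairs have a theoretical correlation of 8/18, while the correlation computed on all pairs is attenuated by the excess zeros. In real scRNA-seq data the dropout pattern is unobserved, which is why a model-based estimate of the underlying correlation is needed.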