Design and Analysis of Statistical Learning Algorithms which Control False Discoveries

Abstract

In this thesis, general theoretical tools are constructed which can be applied to develop ma- chine learning algorithms which are consistent, with fast convergence and which minimize the generalization error by asymptotically controlling the rate of false discoveries (FDR) of features, especially for high dimensional datasets. Even though the main inspiration of this work comes from biological applications, where the data is extremely high dimensional and often hard to obtain, the developed methods are applicable to any general statistical learning problem. In this work, the various machine learning tasks like hypothesis testing, classification, regression, etc are formulated as risk minimization algorithms. This allows such learning tasks to be viewed as optimization problems, which can be solved using first order optimization techniques in case of large data scenarios, while one could use faster converging second order techniques for small to moderately sized data sets. Further, such a formulation allows us to estimate the first order convergence rates of an empirical risk estimator for any arbitrary learning problem, using techniques from large deviation theory. In many scientific applications, robust discovery of factors affecting an outcome or a phe- notype, is more important than the accuracy of predictions. Hence, it is essential to find an appropriate approach to regularize an under-determined estimation problem and thereby control the generalization error. In this work, the use of local probability of false discovery is explored as such a regularization parameter, which forces the optimized solution towards functions with a lower probability to be a false discovery. Again, techniques from large devi- ation theory and the Gibbs principle allow the derivation of an appropriately regularized cost function. These two theoretical results are then used to develop concrete applications. First, the problem of multi-classification is analyzed, which classifies a sample from an arbitrary proba- bility measure into a finite number of categories, based on a given training data set. A general risk functional is derived, which can be used to learn Bayes optimal classifiers controlling the false discovery rate. Secondly, the problem of model selection in the regression context is considered, aiming to select a subset of given regressors which explains most of the observed variation i.e. perform ANOVA. Again, using techniques mentioned above, a risk function is derived which when optimized, controls the rate of false discoveries. This technique is shown to outperform the popular LASSO algorithm, which can be proven to not control the FDR, but only the FWER. Finally, the problem of inferring under-sampled and partially observed non-negative dis- crete random variables is addressed, which has applications to analyzing RNA sequencing data. By assuming infinite divisibility of the underlying random variable, its characterization as being a discrete Compound Poisson Measure (DCP), is derived. This allows construction of a non-parametric Bayesian model of DCPs with a Pitman-Yor Mixture process prior, which is shown to allow for consistent inference under Kullback-Liebler and Renyi divergences even in the under-sampled regime

    Similar works