Supervised Classification Using Sparse Fisher's LDA
It is well known that in a supervised classification setting when the number
of features is smaller than the number of observations, Fisher's linear
discriminant rule is asymptotically Bayes. However, there are numerous modern
applications where classification is needed in the high-dimensional setting.
Naive implementation of Fisher's rule in this case fails to provide good
results because the sample covariance matrix is singular. Moreover, by
constructing a classifier that relies on all features the interpretation of the
results is challenging. Our goal is to provide robust classification that
relies only on a small subset of important features and accounts for the
underlying correlation structure. We apply a lasso-type penalty to the
discriminant vector to ensure sparsity of the solution and use a shrinkage type
estimator for the covariance matrix. The resulting optimization problem is
solved using an iterative coordinate ascent algorithm. Furthermore, we analyze
the effect of nonconvexity on the sparsity level of the solution and highlight
the difference between the penalized and the constrained versions of the
problem. The simulation results show that the proposed method performs
favorably in comparison to alternatives. The method is used to classify
leukemia patients based on DNA methylation features.
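The lasso-penalised discriminant problem described above admits a simple coordinate-wise update with soft-thresholding. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: the shrinkage weight `gamma`, the penalty `lam`, and the pooled-covariance construction are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_lda_direction(X0, X1, lam=0.1, gamma=0.1, n_iter=100):
    """Coordinate-descent sketch for a lasso-penalised discriminant vector.

    Minimises 0.5 * b' S b - d' b + lam * ||b||_1, where d is the class-mean
    difference and S a covariance estimate shrunk toward the identity
    (hypothetical setup, for illustration only).
    """
    d = X1.mean(axis=0) - X0.mean(axis=0)
    # Pooled covariance from within-class centred data
    Xc = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
    S = np.cov(Xc, rowvar=False)
    p = S.shape[0]
    S = (1 - gamma) * S + gamma * np.eye(p)   # shrinkage keeps S nonsingular
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j, then soft-threshold
            r = d[j] - S[j] @ b + S[j, j] * b[j]
            b[j] = soft_threshold(r, lam) / S[j, j]
    return b
```

With a separation on only the first feature, the penalty zeroes out most of the remaining coordinates, which is exactly the interpretability argument made in the abstract.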
Integrative Model-based clustering of microarray methylation and expression data
In many fields, researchers are interested in large and complex biological
processes. Two important examples are gene expression and DNA methylation in
genetics. One key problem is to identify aberrant patterns of these processes
and discover biologically distinct groups. In this article we develop a
model-based method for clustering such data. The basis of our method involves
the construction of a likelihood for any given partition of the subjects. We
introduce cluster specific latent indicators that, along with some standard
assumptions, impose a specific mixture distribution on each cluster. Estimation
is carried out using the EM algorithm. The methods extend naturally to multiple
data types of a similar nature, which leads to an integrated analysis over
multiple data platforms, resulting in higher discriminating power.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/11-AOAS533
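The E- and M-steps behind a model-based clustering of this kind can be sketched with a minimal one-dimensional Gaussian mixture, a toy stand-in for the cluster-specific mixture likelihood described above; the quantile initialisation and the function name are illustrative assumptions, not the paper's method.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=100):
    """Minimal EM for a 1-D Gaussian mixture (illustration of the
    partition-likelihood idea above; not the authors' integrative model)."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means
    sig2 = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each point
        dens = (pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2))
                / np.sqrt(2 * np.pi * sig2))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted parameter updates
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sig2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, sig2
```

Extending this to multiple data platforms amounts to multiplying per-platform likelihood contributions inside the E-step, which is the source of the higher discriminating power the abstract mentions.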
Empirical supremum rejection sampling
Rejection sampling thins out samples from a candidate density from which it is easy to simulate, to obtain samples from a more awkward target density. A prerequisite is knowledge of the finite supremum of the ratio of the target and candidate densities. This severely restricts application of the method because it can be difficult to calculate the supremum. We use theoretical arguments and numerical work to show that a practically perfect sample may be obtained by replacing the exact supremum with the maximum obtained from simulated candidates. We also provide diagnostics for failure of the method caused by a bad choice of candidate distribution. The implication is that essentially no theoretical work is required to apply rejection sampling in many practical cases.
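The idea of replacing the exact supremum with an empirical maximum over simulated candidates can be sketched as follows; the pilot size `m_pilot` and the Beta(2,2)/uniform example in the usage note are assumptions chosen for illustration.

```python
import numpy as np

def empirical_rejection_sample(target_pdf, cand_sample, cand_pdf, n,
                               m_pilot=10000, seed=0):
    """Rejection sampling with the supremum of target/candidate replaced by
    the empirical maximum over m_pilot simulated candidates (a sketch of the
    scheme described above)."""
    rng = np.random.default_rng(seed)
    # Pilot run: estimate the supremum of the density ratio empirically
    pilot = cand_sample(rng, m_pilot)
    m_hat = np.max(target_pdf(pilot) / cand_pdf(pilot))
    out = []
    while len(out) < n:
        x = cand_sample(rng, n)
        u = rng.uniform(size=n)
        accept = u < target_pdf(x) / (m_hat * cand_pdf(x))
        out.extend(x[accept])
    return np.array(out[:n])
```

For example, with a Beta(2,2) target `6*x*(1-x)` and a Uniform(0,1) candidate, the true supremum is 1.5 at x = 0.5; the pilot maximum lands very close to it, so the accepted draws are practically indistinguishable from exact Beta(2,2) samples.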
Exoplanet atmosphere evolution: emulation with random forests
Atmospheric mass-loss is known to play a leading role in sculpting the
demographics of small, close-in exoplanets. Understanding the impact of such
mass-loss driven evolution requires modelling large populations of planets to
compare with the observed exoplanet distributions. As the quality of planet
observations increases, so should the accuracy of the models used to understand
them. However, to date, only simple semi-analytic models have been used in such
comparisons since modelling populations of planets with high accuracy demands a
high computational cost. To address this, we turn to machine learning. We
implement random forests trained on atmospheric evolution models, including XUV
photoevaporation, to predict a given planet's final radius and atmospheric
mass. This evolution emulator is found to have an RMS fractional radius error
of about 1 per cent relative to the original models and is substantially faster to evaluate.
As a test case, we use the emulator to infer the initial properties of
Kepler-36b and c, confirming that their architecture is consistent with
atmospheric mass loss. Our new approach opens the door to highly sophisticated
models of atmospheric evolution being used in demographic analysis, which will
yield further insight into planet formation and evolution.

Comment: 5 pages, 3 figures. Submitted to MNRAS Letters
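The emulation workflow above (train once on an expensive physical model, then predict cheaply) can be sketched with scikit-learn; the `toy_evolution_model` below is a hypothetical stand-in for a real atmospheric-evolution code, and all parameter ranges are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def toy_evolution_model(theta):
    """Hypothetical stand-in for an expensive atmospheric-evolution code:
    maps (initial radius, XUV-irradiation proxy) to a final radius."""
    r0, f = theta[:, 0], theta[:, 1]
    return r0 * np.exp(-0.3 * f) + 0.5

rng = np.random.default_rng(0)
X_train = rng.uniform([1.0, 0.0], [4.0, 3.0], size=(2000, 2))
y_train = toy_evolution_model(X_train)

# Train the emulator once; afterwards predictions bypass the expensive model.
emulator = RandomForestRegressor(n_estimators=100, random_state=0)
emulator.fit(X_train, y_train)

# Measure the RMS fractional error of the emulator on held-out inputs
X_test = rng.uniform([1.0, 0.0], [4.0, 3.0], size=(500, 2))
y_true = toy_evolution_model(X_test)
rms_frac_err = np.sqrt(np.mean(
    ((emulator.predict(X_test) - y_true) / y_true) ** 2))
```

In a demographic analysis, `emulator.predict` would then be called inside the population-synthesis loop, which is where the speed-up over re-running the full evolution model pays off.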
Closed form GLM cumulants and GLMM fitting with a SQUAR-EM-LA2 algorithm
We find closed form expressions for the standardized cumulants of generalized linear models. This reduces the complexity of their calculation from O(p^6) to O(p^2) operations, which allows efficient construction of second-order saddlepoint approximations to the pdf of sufficient statistics. We adapt the result to obtain a closed form expression for the second-order Laplace approximation for a GLMM likelihood. Using this approximation, we develop a computationally highly efficient accelerated EM procedure, SQUAR-EM-LA2. The procedure is illustrated by fitting a GLMM to a well-known data set. Extensive simulations show the phenomenal performance of the approach. Matlab software is provided for implementing the proposed algorithm.
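The SQUAREM-style acceleration underlying a procedure like SQUAR-EM-LA2 applies to any fixed-point map, such as an EM update. The sketch below is a generic squared-extrapolation step demonstrated on the toy fixed point x = cos(x), not the authors' Matlab code; the step-length rule is one standard SQUAREM variant, assumed here for illustration.

```python
import numpy as np

def squarem(fixed_point_map, theta0, tol=1e-10, max_iter=100):
    """Squared-extrapolation (SQUAREM-style) acceleration of a fixed-point
    map F, e.g. an EM update. Sketch only, under the assumptions above."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        theta1 = fixed_point_map(theta)       # one F step
        theta2 = fixed_point_map(theta1)      # two F steps
        r = theta1 - theta                    # first difference
        v = (theta2 - theta1) - r             # second difference
        if np.linalg.norm(v) < tol:
            return theta2
        alpha = -np.linalg.norm(r) / np.linalg.norm(v)  # step length
        theta_new = theta - 2.0 * alpha * r + alpha ** 2 * v
        theta = fixed_point_map(theta_new)    # stabilising F application
        if np.linalg.norm(theta - theta_new) < tol:
            return theta
    return theta
```

On x = cos(x) this reaches the fixed point 0.739085... in a handful of iterations, whereas plain iteration converges only linearly; the same extrapolation is what accelerates a slowly converging EM sequence.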