Supervised Classification Using Sparse Fisher's LDA
It is well known that in a supervised classification setting when the number
of features is smaller than the number of observations, Fisher's linear
discriminant rule is asymptotically Bayes. However, there are numerous modern
applications where classification is needed in the high-dimensional setting.
Naive implementation of Fisher's rule in this case fails to provide good
results because the sample covariance matrix is singular. Moreover, by
constructing a classifier that relies on all features the interpretation of the
results is challenging. Our goal is to provide robust classification that
relies only on a small subset of important features and accounts for the
underlying correlation structure. We apply a lasso-type penalty to the
discriminant vector to ensure sparsity of the solution and use a shrinkage type
estimator for the covariance matrix. The resulting optimization problem is
solved using an iterative coordinate ascent algorithm. Furthermore, we analyze
the effect of nonconvexity on the sparsity level of the solution and highlight
the difference between the penalized and the constrained versions of the
problem. The simulation results show that the proposed method performs
favorably in comparison to alternatives. The method is used to classify
leukemia patients based on DNA methylation features.
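The lasso-penalised discriminant problem described above admits a simple coordinate-wise update with soft-thresholding. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: the shrinkage weight `gamma`, the penalty `lam`, and the pooled-covariance construction are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_lda_direction(X0, X1, lam=0.1, gamma=0.1, n_iter=100):
    """Coordinate-descent sketch for a lasso-penalised discriminant vector.

    Minimises 0.5 * b' S b - d' b + lam * ||b||_1, where d is the class-mean
    difference and S a covariance estimate shrunk toward the identity
    (hypothetical setup, for illustration only).
    """
    d = X1.mean(axis=0) - X0.mean(axis=0)
    # Pooled covariance from within-class centred data
    Xc = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
    S = np.cov(Xc, rowvar=False)
    p = S.shape[0]
    S = (1 - gamma) * S + gamma * np.eye(p)   # shrinkage keeps S nonsingular
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j, then soft-threshold
            r = d[j] - S[j] @ b + S[j, j] * b[j]
            b[j] = soft_threshold(r, lam) / S[j, j]
    return b
```

With a separation on only the first feature, the penalty zeroes out most of the remaining coordinates, which is exactly the interpretability argument made in the abstract.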
Integrative Model-based clustering of microarray methylation and expression data
In many fields, researchers are interested in large and complex biological
processes. Two important examples are gene expression and DNA methylation in
genetics. One key problem is to identify aberrant patterns of these processes
and discover biologically distinct groups. In this article we develop a
model-based method for clustering such data. The basis of our method involves
the construction of a likelihood for any given partition of the subjects. We
introduce cluster specific latent indicators that, along with some standard
assumptions, impose a specific mixture distribution on each cluster. Estimation
is carried out using the EM algorithm. The methods extend naturally to multiple
data types of a similar nature, which leads to an integrated analysis over
multiple data platforms, resulting in higher discriminating power.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/11-AOAS533
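The E- and M-steps behind a model-based clustering of this kind can be sketched with a minimal one-dimensional Gaussian mixture, a toy stand-in for the cluster-specific mixture likelihood described above; the quantile initialisation and the function name are illustrative assumptions, not the paper's method.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=100):
    """Minimal EM for a 1-D Gaussian mixture (illustration of the
    partition-likelihood idea above; not the authors' integrative model)."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means
    sig2 = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each point
        dens = (pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2))
                / np.sqrt(2 * np.pi * sig2))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted parameter updates
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sig2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, sig2
```

Extending this to multiple data platforms amounts to multiplying per-platform likelihood contributions inside the E-step, which is the source of the higher discriminating power the abstract mentions.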
Empirical supremum rejection sampling
Rejection sampling thins out samples from a candidate density from which it is easy to simulate, to obtain samples from a more awkward target density. A prerequisite is knowledge of the finite supremum of the ratio of the target and candidate densities. This severely restricts application of the method because it can be difficult to calculate the supremum. We use theoretical arguments and numerical work to show that a practically perfect sample may be obtained by replacing the exact supremum with the maximum obtained from simulated candidates. We also provide diagnostics for failure of the method caused by a bad choice of candidate distribution. The implication is that essentially no theoretical work is required to apply rejection sampling in many practical cases.
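The idea of replacing the exact supremum with an empirical maximum over simulated candidates can be sketched as follows; the pilot size `m_pilot` and the Beta(2,2)/uniform example in the usage note are assumptions chosen for illustration.

```python
import numpy as np

def empirical_rejection_sample(target_pdf, cand_sample, cand_pdf, n,
                               m_pilot=10000, seed=0):
    """Rejection sampling with the supremum of target/candidate replaced by
    the empirical maximum over m_pilot simulated candidates (a sketch of the
    scheme described above)."""
    rng = np.random.default_rng(seed)
    # Pilot run: estimate the supremum of the density ratio empirically
    pilot = cand_sample(rng, m_pilot)
    m_hat = np.max(target_pdf(pilot) / cand_pdf(pilot))
    out = []
    while len(out) < n:
        x = cand_sample(rng, n)
        u = rng.uniform(size=n)
        accept = u < target_pdf(x) / (m_hat * cand_pdf(x))
        out.extend(x[accept])
    return np.array(out[:n])
```

For example, with a Beta(2,2) target `6*x*(1-x)` and a Uniform(0,1) candidate, the true supremum is 1.5 at x = 0.5; the pilot maximum lands very close to it, so the accepted draws are practically indistinguishable from exact Beta(2,2) samples.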
Exoplanet atmosphere evolution: emulation with random forests
Atmospheric mass-loss is known to play a leading role in sculpting the
demographics of small, close-in exoplanets. Understanding the impact of such
mass-loss driven evolution requires modelling large populations of planets to
compare with the observed exoplanet distributions. As the quality of planet
observations increases, so should the accuracy of the models used to understand
them. However, to date, only simple semi-analytic models have been used in such
comparisons since modelling populations of planets with high accuracy demands a
high computational cost. To address this, we turn to machine learning. We
implement random forests trained on atmospheric evolution models, including XUV
photoevaporation, to predict a given planet's final radius and atmospheric
mass. This evolution emulator is found to have an RMS fractional radius error
of about 1 per cent relative to the original models and is substantially faster to evaluate.
As a test case, we use the emulator to infer the initial properties of
Kepler-36b and c, confirming that their architecture is consistent with
atmospheric mass loss. Our new approach opens the door to highly sophisticated
models of atmospheric evolution being used in demographic analysis, which will
yield further insight into planet formation and evolution.

Comment: 5 pages, 3 figures. Submitted to MNRAS Letters
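The emulation workflow above (train once on an expensive physical model, then predict cheaply) can be sketched with scikit-learn; the `toy_evolution_model` below is a hypothetical stand-in for a real atmospheric-evolution code, and all parameter ranges are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def toy_evolution_model(theta):
    """Hypothetical stand-in for an expensive atmospheric-evolution code:
    maps (initial radius, XUV-irradiation proxy) to a final radius."""
    r0, f = theta[:, 0], theta[:, 1]
    return r0 * np.exp(-0.3 * f) + 0.5

rng = np.random.default_rng(0)
X_train = rng.uniform([1.0, 0.0], [4.0, 3.0], size=(2000, 2))
y_train = toy_evolution_model(X_train)

# Train the emulator once; afterwards predictions bypass the expensive model.
emulator = RandomForestRegressor(n_estimators=100, random_state=0)
emulator.fit(X_train, y_train)

# Measure the RMS fractional error of the emulator on held-out inputs
X_test = rng.uniform([1.0, 0.0], [4.0, 3.0], size=(500, 2))
y_true = toy_evolution_model(X_test)
rms_frac_err = np.sqrt(np.mean(
    ((emulator.predict(X_test) - y_true) / y_true) ** 2))
```

In a demographic analysis, `emulator.predict` would then be called inside the population-synthesis loop, which is where the speed-up over re-running the full evolution model pays off.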
Closed form GLM cumulants and GLMM fitting with a SQUAR-EM-LA2 algorithm
We find closed form expressions for the standardized cumulants of generalized linear models. This reduces the complexity of their calculation from O(p^6) to O(p^2) operations, which allows efficient construction of second-order saddlepoint approximations to the pdf of sufficient statistics. We adapt the result to obtain a closed form expression for the second-order Laplace approximation for a GLMM likelihood. Using this approximation, we develop a computationally highly efficient accelerated EM procedure, SQUAR-EM-LA2. The procedure is illustrated by fitting a GLMM to a well-known data set. Extensive simulations show the phenomenal performance of the approach. Matlab software is provided for implementing the proposed algorithm.
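The SQUAREM-style acceleration underlying a procedure like SQUAR-EM-LA2 applies to any fixed-point map, such as an EM update. The sketch below is a generic squared-extrapolation step demonstrated on the toy fixed point x = cos(x), not the authors' Matlab code; the step-length rule is one standard SQUAREM variant, assumed here for illustration.

```python
import numpy as np

def squarem(fixed_point_map, theta0, tol=1e-10, max_iter=100):
    """Squared-extrapolation (SQUAREM-style) acceleration of a fixed-point
    map F, e.g. an EM update. Sketch only, under the assumptions above."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        theta1 = fixed_point_map(theta)       # one F step
        theta2 = fixed_point_map(theta1)      # two F steps
        r = theta1 - theta                    # first difference
        v = (theta2 - theta1) - r             # second difference
        if np.linalg.norm(v) < tol:
            return theta2
        alpha = -np.linalg.norm(r) / np.linalg.norm(v)  # step length
        theta_new = theta - 2.0 * alpha * r + alpha ** 2 * v
        theta = fixed_point_map(theta_new)    # stabilising F application
        if np.linalg.norm(theta - theta_new) < tol:
            return theta
    return theta
```

On x = cos(x) this reaches the fixed point 0.739085... in a handful of iterations, whereas plain iteration converges only linearly; the same extrapolation is what accelerates a slowly converging EM sequence.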