Correcting for selection bias via cross-validation in the classification of microarray data
There is increasing interest in the use of diagnostic rules based on
microarray data. These rules are formed by considering the expression levels of
thousands of genes in tissue samples taken on patients of known classification
with respect to a number of classes, representing, say, disease status or
treatment strategy. As the final versions of these rules are usually based on a
small subset of the available genes, there is a selection bias that has to be
corrected for in the estimation of the associated error rates. We consider the
problem using cross-validation. In particular, we present explicit formulae
that are useful in explaining the layers of validation that have to be
performed in order to avoid improperly cross-validated estimates.
Comment: Published at http://dx.doi.org/10.1214/193940307000000284 in the IMS
Collections (http://www.imstat.org/publications/imscollections.htm) by the
Institute of Mathematical Statistics (http://www.imstat.org)
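The selection bias described in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the nearest-centroid rule, the function names, and the data sizes are assumptions chosen for illustration. The data are pure noise with labels unrelated to the features, so an unbiased error estimate should sit near 50%. Selecting genes once on all samples before cross-validating tends to give an optimistic estimate, while repeating the selection inside each fold does not.

```python
import random

def top_genes(X, y, idx, k):
    """Rank genes by absolute difference in class means, using only samples in idx."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [X[i][g] for i in idx if y[i] == 0]
        b = [X[i][g] for i in idx if y[i] == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(n_genes), key=lambda g: -scores[g])[:k]

def centroid_predict(X, y, train, genes, x_new):
    """Nearest-centroid rule restricted to the selected genes."""
    dists = []
    for c in (0, 1):
        rows = [X[i] for i in train if y[i] == c]
        cen = [sum(r[g] for r in rows) / len(rows) for g in genes]
        dists.append(sum((x_new[g] - m) ** 2 for g, m in zip(genes, cen)))
    return 0 if dists[0] <= dists[1] else 1

def cv_error(X, y, k, folds, select_inside):
    """Cross-validated error; select_inside=True redoes gene selection in every fold."""
    n = len(X)
    all_idx = list(range(n))
    fixed = top_genes(X, y, all_idx, k)  # sees the test samples too: the biased route
    errors = 0
    for f in range(folds):
        test = all_idx[f::folds]
        train = [i for i in all_idx if i not in test]
        genes = top_genes(X, y, train, k) if select_inside else fixed
        errors += sum(centroid_predict(X, y, train, genes, X[i]) != y[i] for i in test)
    return errors / n

rng = random.Random(0)
n, p = 40, 500                      # hypothetical: 40 tissue samples, 500 genes
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]       # labels independent of the features
biased = cv_error(X, y, k=10, folds=5, select_inside=False)
honest = cv_error(X, y, k=10, folds=5, select_inside=True)
print(f"selection outside CV: {biased:.2f}; selection inside CV: {honest:.2f}")
```

The only difference between the two estimates is whether the held-out samples were allowed to influence which genes were chosen.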
Deep Gaussian Mixture Models
Deep learning is a hierarchical inference method formed by subsequent
multiple layers of learning able to more efficiently describe complex
relationships. In this work, Deep Gaussian Mixture Models are introduced and
discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers
of latent variables, where, at each layer, the variables follow a mixture of
Gaussian distributions. Thus, the deep mixture model consists of a set of
nested mixtures of linear models, which globally provide a nonlinear model able
to describe the data in a very flexible way. In order to avoid
overparameterized solutions, dimension reduction by factor models can be
applied at each layer of the architecture thus resulting in deep mixtures of
factor analysers.
Comment: 19 pages, 4 figures
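As a rough illustration of the nested structure described in this abstract, the sketch below draws samples from a univariate two-layer model in which each layer applies a randomly chosen linear-Gaussian component to the latent variable from the layer beneath it. All parameter values and names are hypothetical, and the sketch covers sampling only, not the fitting procedure or the factor-analytic dimension reduction.

```python
import random

# Hypothetical parameters: each component is (weight, intercept, slope, noise sd).
LAYER1 = [(0.5, -3.0, 1.0, 0.3), (0.5, 3.0, 0.5, 0.3)]   # maps z1 -> y
LAYER2 = [(0.7, -1.0, 0.8, 0.2), (0.3, 2.0, 1.5, 0.2)]   # maps z2 -> z1

def pick(rng, layer):
    """Draw a component index according to the layer's mixing weights."""
    return rng.choices(range(len(layer)), weights=[c[0] for c in layer])[0]

def sample_dgmm(rng):
    """Return one observation and the component path that generated it."""
    z = rng.gauss(0.0, 1.0)          # deepest latent variable, standard normal
    path = []
    for layer in (LAYER2, LAYER1):   # propagate from the deepest layer to the data
        j = pick(rng, layer)
        _, intercept, slope, sd = layer[j]
        z = intercept + slope * z + rng.gauss(0.0, sd)   # linear model within component j
        path.append(j)
    return z, tuple(path)

rng = random.Random(1)
draws = [sample_dgmm(rng) for _ in range(2000)]
paths = {p for _, p in draws}
print(len(paths), "distinct component paths")
```

Because every path through the layers is a composition of linear-Gaussian maps, each path yields a Gaussian, so with two components per layer the marginal distribution of the observations is a four-component Gaussian mixture whose components share parameters across paths.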
EMMIXcskew: an R Package for the Fitting of a Mixture of Canonical Fundamental Skew t-Distributions
This paper presents an R package EMMIXcskew for the fitting of the canonical
fundamental skew t-distribution (CFUST) and finite mixtures of this
distribution (FM-CFUST) via maximum likelihood (ML). The CFUST distribution
provides a flexible family of models to handle non-normal data, with parameters
for capturing skewness and heavy-tails in the data. It formally encompasses the
normal, t, and skew-normal distributions as special and/or limiting cases. A
few other versions of the skew t-distributions are also nested within the CFUST
distribution. In this paper, an Expectation-Maximization (EM) algorithm is
described for computing the ML estimates of the parameters of the FM-CFUST
model, and different strategies for initializing the algorithm are discussed
and illustrated. The methodology is implemented in the EMMIXcskew package, and
examples are presented using two real datasets. The EMMIXcskew package contains
functions to fit the FM-CFUST model, including procedures for generating
different initial values. Additional features include random sample generation
and contour visualization in 2D and 3D.
Linear Mixed Models with Marginally Symmetric Nonparametric Random Effects
Linear mixed models (LMMs) are used as an important tool in the data analysis
of repeated measures and longitudinal studies. The most common form of LMMs
utilizes a normal distribution to model the random effects. Such assumptions can
often lead to misspecification errors when the random effects are not normal.
One approach to remedy the misspecification errors is to utilize a point-mass
distribution to model the random effects; this is known as the nonparametric
maximum likelihood-fitted (NPML) model. The NPML model is flexible but requires
a large number of parameters to characterize the random-effects distribution.
It is often natural to assume that the random-effects distribution be at least
marginally symmetric. The marginally symmetric NPML (MSNPML) random-effects
model is introduced, which assumes a marginally symmetric point-mass
distribution for the random effects. Under the symmetry assumption, the MSNPML
model utilizes half the number of parameters to characterize the same number of
point masses as the NPML model; thus the model confers an advantage in economy
and parsimony. An EM-type algorithm is presented for the maximum likelihood
(ML) estimation of LMMs with MSNPML random effects; the algorithm is shown to
monotonically increase the log-likelihood and is proven to be convergent to a
stationary point of the log-likelihood function in the case of convergence.
Furthermore, it is shown that the ML estimator is consistent and asymptotically
normal under certain conditions, and the estimation of quantities such as the
random-effects covariance matrix and individual a posteriori expectations is
demonstrated.
EMMIX-uskew: An R Package for Fitting Mixtures of Multivariate Skew t-distributions via the EM Algorithm
This paper describes an algorithm for fitting finite mixtures of unrestricted
Multivariate Skew t (FM-uMST) distributions. The package EMMIX-uskew implements
a closed-form expectation-maximization (EM) algorithm for computing the maximum
likelihood (ML) estimates of the parameters for the (unrestricted) FM-MST model
in R. EMMIX-uskew also supports visualization of fitted contours in two and
three dimensions, and random sample generation from a specified FM-uMST
distribution.
Finite mixtures of skew t-distributions have proven to be useful in modelling
heterogeneous data with asymmetric and heavy tail behaviour, for example,
datasets from flow cytometry. In recent years, various versions of mixtures
with multivariate skew t (MST) distributions have been proposed. However, these
models adopted some restricted characterizations of the component MST
distributions so that the E-step of the EM algorithm can be evaluated in closed
form. This paper focuses on mixtures with unrestricted MST components, and
describes an iterative algorithm for the computation of the ML estimates of its
model parameters.
The usefulness of the proposed algorithm is demonstrated in three
applications to real data sets. The first example illustrates the use of the
main function fmmst in the package by fitting an MST distribution to a bivariate
unimodal flow cytometric sample. The second example fits a mixture of MST
distributions to the Australian Institute of Sport (AIS) data, and demonstrates
that EMMIX-uskew can provide better clustering results than mixtures with
restricted MST components. In the third example, EMMIX-uskew is applied to
classify cells in a trivariate flow cytometric dataset. Comparisons with other
available methods suggest that the EMMIX-uskew result achieved a lower
misclassification rate with respect to the labels given by benchmark gating
analysis.
Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization--Minimization Algorithm Approach
Support vector machines (SVMs) are an important tool in modern data analysis.
Traditionally, support vector machines have been fitted via quadratic
programming, either using purpose-built or off-the-shelf algorithms. We present
an alternative approach to SVM fitting via the majorization--minimization (MM)
paradigm. Algorithms that are derived via MM algorithm constructions can be
shown to monotonically decrease their objectives at each iteration, as well as
be globally convergent to stationary points. We demonstrate the construction of
iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm,
for SVM risk minimization problems involving the hinge, least-square,
squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net
penalizations. Successful implementations of our algorithms are presented via
some numerical examples.
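One possible IRLS scheme of this kind, for the hinge loss with a 2-norm penalty, can be sketched as follows. It is a sketch under stated assumptions and not necessarily the authors' exact construction: it writes the hinge loss as (u + |u|)/2 with u = 1 - y x'b, smooths |u| to sqrt(u^2 + eps), and majorizes the square root by a quadratic, so that each iteration reduces to a weighted ridge-type linear system. The values of `lam` and `eps`, the penalized intercept, and all function names are illustrative choices.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def objective(Z, beta, lam, eps):
    """Smoothed hinge risk plus 2-norm penalty; eps > 0 smooths the kink."""
    total = 0.0
    for z in Z:
        u = 1.0 - sum(a * b for a, b in zip(z, beta))
        total += 0.5 * (u + math.sqrt(u * u + eps))
    return total + lam * sum(b * b for b in beta)

def irls_svm(X, y, lam=1.0, eps=1e-6, iters=50):
    """Return coefficients and the objective value after every iteration."""
    Z = [[yi * v for v in xi] for xi, yi in zip(X, y)]   # rows y_i * x_i
    d = len(Z[0])
    beta = [0.0] * d
    history = [objective(Z, beta, lam, eps)]
    for _ in range(iters):
        # Weights from the current margins define the quadratic majorizer.
        w = [math.sqrt((1.0 - sum(a * b for a, b in zip(z, beta))) ** 2 + eps) for z in Z]
        A = [[sum(z[r] * z[c] / wi for z, wi in zip(Z, w)) + (4.0 * lam if r == c else 0.0)
              for c in range(d)] for r in range(d)]
        rhs = [sum(z[r] * (1.0 + 1.0 / wi) for z, wi in zip(Z, w)) for r in range(d)]
        beta = solve(A, rhs)   # minimizer of the majorizer: a weighted least-squares step
        history.append(objective(Z, beta, lam, eps))
    return beta, history

rng = random.Random(0)
X, y = [], []
for i in range(60):                  # synthetic two-class data with an intercept column
    label = 1 if i % 2 == 0 else -1
    X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0), 1.0])
    y.append(label)
beta, history = irls_svm(X, y)
accuracy = sum((sum(a * b for a, b in zip(x, beta)) > 0) == (yi > 0)
               for x, yi in zip(X, y)) / len(y)
print(f"objective {history[0]:.3f} -> {history[-1]:.3f}, training accuracy {accuracy:.2f}")
```

Because each update minimizes a function that lies above the smoothed objective and touches it at the current iterate, the recorded objective values should never increase, which is the monotonicity property the abstract refers to.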