7,593 research outputs found

    Correcting for selection bias via cross-validation in the classification of microarray data

    Full text link
    There is increasing interest in the use of diagnostic rules based on microarray data. These rules are formed by considering the expression levels of thousands of genes in tissue samples taken on patients of known classification with respect to a number of classes, representing, say, disease status or treatment strategy. As the final versions of these rules are usually based on a small subset of the available genes, there is a selection bias that has to be corrected for in the estimation of the associated error rates. We consider the problem using cross-validation. In particular, we present explicit formulae that are useful in explaining the layers of validation that have to be performed in order to avoid improperly cross-validated estimates.Comment: Published in at http://dx.doi.org/10.1214/193940307000000284 the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org

    Deep Gaussian Mixture Models

    Get PDF
    Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture thus resulting in deep mixtures of factor analysers.Comment: 19 pages, 4 figure

    EMMIXcskew: an R Package for the Fitting of a Mixture of Canonical Fundamental Skew t-Distributions

    Get PDF
    This paper presents an R package EMMIXcskew for the fitting of the canonical fundamental skew t-distribution (CFUST) and finite mixtures of this distribution (FM-CFUST) via maximum likelihood (ML). The CFUST distribution provides a flexible family of models to handle non-normal data, with parameters for capturing skewness and heavy-tails in the data. It formally encompasses the normal, t, and skew-normal distributions as special and/or limiting cases. A few other versions of the skew t-distributions are also nested within the CFUST distribution. In this paper, an Expectation-Maximization (EM) algorithm is described for computing the ML estimates of the parameters of the FM-CFUST model, and different strategies for initializing the algorithm are discussed and illustrated. The methodology is implemented in the EMMIXcskew package, and examples are presented using two real datasets. The EMMIXcskew package contains functions to fit the FM-CFUST model, including procedures for generating different initial values. Additional features include random sample generation and contour visualization in 2D and 3D

    Linear Mixed Models with Marginally Symmetric Nonparametric Random Effects

    Full text link
    Linear mixed models (LMMs) are used as an important tool in the data analysis of repeated measures and longitudinal studies. The most common form of LMMs utilize a normal distribution to model the random effects. Such assumptions can often lead to misspecification errors when the random effects are not normal. One approach to remedy the misspecification errors is to utilize a point-mass distribution to model the random effects; this is known as the nonparametric maximum likelihood-fitted (NPML) model. The NPML model is flexible but requires a large number of parameters to characterize the random-effects distribution. It is often natural to assume that the random-effects distribution be at least marginally symmetric. The marginally symmetric NPML (MSNPML) random-effects model is introduced, which assumes a marginally symmetric point-mass distribution for the random effects. Under the symmetry assumption, the MSNPML model utilizes half the number of parameters to characterize the same number of point masses as the NPML model; thus the model confers an advantage in economy and parsimony. An EM-type algorithm is presented for the maximum likelihood (ML) estimation of LMMs with MSNPML random effects; the algorithm is shown to monotonically increase the log-likelihood and is proven to be convergent to a stationary point of the log-likelihood function in the case of convergence. Furthermore, it is shown that the ML estimator is consistent and asymptotically normal under certain conditions, and the estimation of quantities such as the random-effects covariance matrix and individual a posteriori expectations is demonstrated

    EMMIX-uskew: An R Package for Fitting Mixtures of Multivariate Skew t-distributions via the EM Algorithm

    Get PDF
    This paper describes an algorithm for fitting finite mixtures of unrestricted Multivariate Skew t (FM-uMST) distributions. The package EMMIX-uskew implements a closed-form expectation-maximization (EM) algorithm for computing the maximum likelihood (ML) estimates of the parameters for the (unrestricted) FM-MST model in R. EMMIX-uskew also supports visualization of fitted contours in two and three dimensions, and random sample generation from a specified FM-uMST distribution. Finite mixtures of skew t-distributions have proven to be useful in modelling heterogeneous data with asymmetric and heavy tail behaviour, for example, datasets from flow cytometry. In recent years, various versions of mixtures with multivariate skew t (MST) distributions have been proposed. However, these models adopted some restricted characterizations of the component MST distributions so that the E-step of the EM algorithm can be evaluated in closed form. This paper focuses on mixtures with unrestricted MST components, and describes an iterative algorithm for the computation of the ML estimates of its model parameters. The usefulness of the proposed algorithm is demonstrated in three applications to real data sets. The first example illustrates the use of the main function fmmst in the package by fitting a MST distribution to a bivariate unimodal flow cytometric sample. The second example fits a mixture of MST distributions to the Australian Institute of Sport (AIS) data, and demonstrate that EMMIX-uskew can provide better clustering results than mixtures with restricted MST components. In the third example, EMMIX-uskew is applied to classify cells in a trivariate flow cytometric dataset. Comparisons with other available methods suggests that the EMMIX-uskew result achieved a lower misclassification rate with respect to the labels given by benchmark gating analysis

    Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization--Minimization Algorithm Approach

    Full text link
    Support vector machines (SVMs) are an important tool in modern data analysis. Traditionally, support vector machines have been fitted via quadratic programming, either using purpose-built or off-the-shelf algorithms. We present an alternative approach to SVM fitting via the majorization--minimization (MM) paradigm. Algorithms that are derived via MM algorithm constructions can be shown to monotonically decrease their objectives at each iteration, as well as be globally convergent to stationary points. We demonstrate the construction of iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm, for SVM risk minimization problems involving the hinge, least-square, squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net penalizations. Successful implementations of our algorithms are presented via some numerical examples
    • …
    corecore