
    A robust and sparse K-means clustering algorithm

    In many situations where the interest lies in identifying clusters, one might expect that not all available variables carry information about these groups. Furthermore, data quality (e.g. outliers or missing entries) might present a serious and sometimes hard-to-assess problem for large and complex datasets. In this paper we show that a small proportion of atypical observations can have serious adverse effects on the solutions found by the sparse clustering algorithm of Witten and Tibshirani (2010). We propose a robustification of their sparse K-means algorithm based on the trimmed K-means algorithm of Cuesta-Albertos et al. (1997). Our proposal is also able to handle datasets with missing values. We illustrate the use of our method on microarray data for cancer patients, where we are able to identify strong biological clusters with a much reduced number of genes. Our simulation studies show that, when there are outliers in the data, our robust sparse K-means algorithm performs better than competing methods both in terms of feature selection and in terms of the identified clusters. The robust sparse K-means algorithm is implemented in the R package RSKC, which is publicly available from the CRAN repository.
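
    The trimming step at the heart of the robustification can be illustrated with a plain trimmed K-means in base R: after each assignment step, the proportion alpha of observations farthest from their nearest centre is set aside before the centres are recomputed. This is only a minimal sketch of the trimming idea, not the RSKC implementation mentioned above (which also performs the lasso-type feature weighting of sparse K-means); the toy data, alpha = 0.05, and the helper name trimmed_kmeans are assumptions.

        ## Minimal sketch of trimmed K-means: trim the alpha farthest points
        ## before updating centres, so gross outliers cannot drag the centres.
        set.seed(1)
        x <- rbind(matrix(rnorm(100, 0), ncol = 2),
                   matrix(rnorm(100, 4), ncol = 2),
                   matrix(runif(10, -10, 14), ncol = 2))  # two clusters + 5 gross outliers
        trimmed_kmeans <- function(x, k, alpha = 0.1, iter = 20) {
          centres <- x[sample(nrow(x), k), , drop = FALSE]
          for (i in seq_len(iter)) {
            d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
            cl <- max.col(-d2)                          # nearest centre per point
            dmin <- d2[cbind(seq_len(nrow(x)), cl)]     # squared distance to it
            keep <- dmin <= quantile(dmin, 1 - alpha)   # trim the alpha farthest points
            for (j in seq_len(k)) {
              pts <- x[keep & cl == j, , drop = FALSE]
              if (nrow(pts) > 0) centres[j, ] <- colMeans(pts)
            }
          }
          list(cluster = cl, centres = centres, trimmed = which(!keep))
        }
        fit <- trimmed_kmeans(x, k = 2, alpha = 0.05)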

    An exact algorithm for weighted-mean trimmed regions in any dimension

    Trimmed regions are a powerful tool of multivariate data analysis. They describe a probability distribution in Euclidean d-space regarding location, dispersion, and shape, and they order multivariate data with respect to their centrality. Dyckerhoff and Mosler (201x) have introduced the class of weighted-mean trimmed regions, which possess attractive properties regarding continuity, subadditivity, and monotonicity. We present an exact algorithm to compute the weighted-mean trimmed regions of a given data cloud in arbitrary dimension d. These trimmed regions are convex polytopes in R^d. To calculate them, the algorithm builds on methods from computational geometry. A characterization of a region's facets is used, and information about the adjacency of the facets is extracted from the data. A key problem consists in ordering the facets. It is solved by the introduction of a tree-based order. The algorithm has been programmed in C++ and is available as an R package.
    Keywords: central regions, data depth, multivariate data analysis, convex polytope, computational geometry, algorithm, C++, R
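
    For intuition, a brute-force approximation of one prominent special case of this class (the zonoid trimmed region, whose support point in a given direction is the mean of the m = n*alpha points with the largest projections onto that direction) can be computed in dimension two by sampling directions and taking the convex hull of the resulting support points. This is only an approximate sketch, not the exact tree-based facet algorithm of the paper; the toy data, alpha = 0.2, and the number of sampled directions are assumptions.

        ## Approximate a zonoid trimmed region in d = 2 by sampling directions.
        set.seed(2)
        x <- matrix(rnorm(100), ncol = 2)    # n = 50 bivariate points
        n <- nrow(x); alpha <- 0.2; m <- round(n * alpha)
        theta <- seq(0, 2 * pi, length.out = 360)
        support_pts <- t(sapply(theta, function(a) {
          u <- c(cos(a), sin(a))
          top <- order(x %*% u, decreasing = TRUE)[1:m]  # m largest projections on u
          colMeans(x[top, , drop = FALSE])               # support point in direction u
        }))
        region <- support_pts[chull(support_pts), ]      # vertices of the approximate region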

    Testing the Capital Asset Pricing Model Efficiently Under Elliptical Symmetry: A Semiparametric Approach

    We develop new tests of the capital asset pricing model (CAPM) that take account of, and are valid under, the assumption that the distribution generating returns is elliptically symmetric; this assumption is necessary and sufficient for the validity of the CAPM. Our test is based on semiparametric efficient estimation procedures for a seemingly unrelated regression model where the multivariate error density is elliptically symmetric but otherwise unrestricted. The elliptical symmetry assumption allows us to avoid the curse-of-dimensionality problem that typically arises in multivariate semiparametric estimation procedures, because the multivariate elliptically symmetric density function can be written as a function of a scalar transformation of the observed multivariate data. The elliptically symmetric family includes a number of thick-tailed distributions and so is potentially relevant in financial applications. Our estimated betas are lower than the OLS estimates, and our parameter estimates are much less consistent with the CAPM restrictions than the corresponding OLS estimates.
    Keywords: adaptive estimation, capital asset pricing model, elliptical symmetry, semiparametric efficiency
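
    As a point of reference, the OLS benchmark against which the semiparametric estimates are compared amounts to time-series regressions of each asset's excess return on the market excess return, with the CAPM restriction that every intercept (alpha) equals zero. The sketch below runs such regressions on simulated data; it is not the semiparametric efficient procedure of the paper, and the sample size, true betas, and t-distributed (thick-tailed) errors are assumptions.

        ## OLS time-series regressions: CAPM implies the intercepts are zero.
        set.seed(3)
        t_obs <- 240                                    # e.g. 20 years of monthly data
        mkt <- rnorm(t_obs, mean = 0.005, sd = 0.04)    # market excess return
        beta_true <- c(0.8, 1.0, 1.2)
        ret <- sapply(beta_true, function(b) b * mkt + 0.02 * rt(t_obs, df = 5))
        fits <- lapply(seq_len(ncol(ret)), function(j) lm(ret[, j] ~ mkt))
        t(sapply(fits, coef))   # column 1: estimated alphas, column 2: estimated betas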

    Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering

    The two main topics of this paper are the introduction of the "optimally tuned robust improper maximum likelihood estimator" (OTRIMLE) for robust clustering based on the multivariate Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE to maximum likelihood in Gaussian mixtures with and without a noise component, mixtures of t-distributions, and the TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant density for modelling outliers and noise. This density can be chosen optimally so that the non-noise part of the data looks as close to a Gaussian mixture as possible. Some deviation from Gaussianity can be traded in for lowering the estimated noise proportion. Covariance matrix constraints and computation of the OTRIMLE are also treated. In the simulation study, all methods are confronted with setups in which their model assumptions are not exactly fulfilled. To evaluate the experiments in a standardized way via misclassification rates, a new model-based definition of "true clusters" is introduced that deviates from the usual identification of mixture components with clusters. In the study, every method turns out to be superior for one or more setups, but the OTRIMLE achieves the most satisfactory overall performance. The methods are also applied to two real datasets, one without and one with known "true" clusters.
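
    The improper-likelihood idea can be sketched with a small EM-type loop in which a Gaussian mixture is complemented by a pseudo-component of constant (improper) density delta, so that outliers are absorbed by this noise component instead of distorting the Gaussian parameter estimates. The sketch below is one-dimensional and uses a fixed delta; it is not the OTRIMLE, which chooses delta optimally and imposes covariance constraints, and all tuning values and the toy data are assumptions.

        ## EM-type sketch: Gaussian mixture plus an improper constant noise density.
        set.seed(4)
        y <- c(rnorm(100, 0, 1), rnorm(100, 6, 1), runif(20, -20, 25))  # 2 clusters + noise
        G <- 2; delta <- 0.005
        mu <- quantile(y, c(0.25, 0.75)); sg <- rep(sd(y), G)
        pr <- rep(1 / (G + 1), G + 1)            # pr[G + 1] is the noise proportion
        for (it in 1:50) {
          dens <- cbind(sapply(1:G, function(j) pr[j] * dnorm(y, mu[j], sg[j])),
                        pr[G + 1] * delta)       # last column: improper noise component
          tau <- dens / rowSums(dens)            # posterior component probabilities
          pr <- colMeans(tau)
          for (j in 1:G) {
            mu[j] <- sum(tau[, j] * y) / sum(tau[, j])
            sg[j] <- sqrt(sum(tau[, j] * (y - mu[j])^2) / sum(tau[, j]))
          }
        }
        cluster <- max.col(tau)                  # label G + 1 marks points assigned to noise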