A robust and sparse K-means clustering algorithm
In many situations where the interest lies in identifying clusters one might
expect that not all available variables carry information about these groups.
Furthermore, data quality (e.g. outliers or missing entries) might present a
serious and sometimes hard-to-assess problem for large and complex datasets. In
this paper we show that a small proportion of atypical observations might have
serious adverse effects on the solutions found by the sparse clustering
algorithm of Witten and Tibshirani (2010). We propose a robustification of
their sparse K-means algorithm based on the trimmed K-means algorithm of
Cuesta-Albertos et al. (1997). Our proposal can also handle datasets with
missing values. We illustrate the use of our method on microarray data for
cancer patients where we are able to identify strong biological clusters with a
much reduced number of genes. Our simulation studies show that, when there are
outliers in the data, our robust sparse K-means algorithm performs better than
other competing methods both in terms of the selection of features and also the
identified clusters. This robust sparse K-means algorithm is implemented in the
R package RSKC, which is publicly available from the CRAN repository.
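The trimming idea underlying this robustification can be sketched in a few lines: alternate Lloyd-style K-means updates, but discard the proportion alpha of points farthest from their assigned centre before recomputing centres. This is an illustrative sketch only; the actual RSKC algorithm additionally carries the lasso-type feature-weight update of sparse K-means, which is omitted here.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, seed=0):
    # Illustrative trimmed K-means (Cuesta-Albertos-style trimming step only).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    keep = int(np.ceil((1 - alpha) * len(X)))
    for _ in range(n_iter):
        # Squared distances of every point to every centre.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        best = d.min(1)
        # Keep only the (1 - alpha) fraction of points closest to their centre.
        kept = np.argsort(best)[:keep]
        for j in range(k):
            members = kept[labels[kept] == j]
            if len(members):
                centers[j] = X[members].mean(0)
    return centers, labels, kept
```

The untrimmed indices (`kept`) are returned alongside the fit, so flagged outliers can be inspected afterwards.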
An exact algorithm for weighted-mean trimmed regions in any dimension
Trimmed regions are a powerful tool of multivariate data analysis. They describe a probability distribution in Euclidean d-space regarding location, dispersion, and shape, and they order multivariate data with respect to their centrality. Dyckerhoff and Mosler (201x) have introduced the class of weighted-mean trimmed regions, which possess attractive properties regarding continuity, subadditivity, and monotonicity. We present an exact algorithm to compute the weighted-mean trimmed regions of a given data cloud in arbitrary dimension d. These trimmed regions are convex polytopes in Rd. To calculate them, the algorithm builds on methods from computational geometry. A characterization of a region's facets is used, and information about the adjacency of the facets is extracted from the data. A key problem consists in ordering the facets. It is solved by the introduction of a tree-based order. The algorithm has been programmed in C++ and is available as an R package.
Keywords: central regions, data depth, multivariate data analysis, convex polytope, computational geometry, algorithm, C++, R
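As a rough illustration of what such a region is (not the exact facet-based algorithm of the paper): for the zonoid-trimmed region, one member of the weighted-mean class, with trimming level alpha = m/n the region is the convex hull of the averages of all m-point subsets of the data. The sketch below approximates it by sampling subsets; all names are illustrative.

```python
import numpy as np
from scipy.spatial import ConvexHull

def sampled_zonoid_region(X, m, n_samples=500, seed=0):
    # For alpha = m/n, the zonoid-trimmed region is the convex hull of the
    # averages of all m-point subsets of X; here we sample subsets instead
    # of enumerating them, so this is a Monte Carlo approximation only.
    rng = np.random.default_rng(seed)
    n = len(X)
    pts = np.array([X[rng.choice(n, size=m, replace=False)].mean(0)
                    for _ in range(n_samples)])
    return ConvexHull(pts)
```

For the four corners of the unit square, the m = 2 region is the diamond spanned by the edge midpoints, and the regions shrink toward the mean as m grows, matching the monotonicity property mentioned in the abstract.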
Testing the Capital Asset Pricing Model Efficiently Under Elliptical Symmetry: A Semiparametric Approach
We develop new tests of the capital asset pricing model (CAPM) that take account of, and are valid under, the assumption that the distribution generating returns is elliptically symmetric; this assumption is necessary and sufficient for the validity of the CAPM. Our test is based on semiparametric efficient estimation procedures for a seemingly unrelated regression model where the multivariate error density is elliptically symmetric, but otherwise unrestricted. The elliptical symmetry assumption allows us to avert the curse of dimensionality that typically arises in multivariate semiparametric estimation, because the multivariate elliptically symmetric density function can be written as a function of a scalar transformation of the observed multivariate data. The elliptically symmetric family includes a number of thick-tailed distributions and so is potentially relevant in financial applications. Our estimated betas are lower than the OLS estimates, and our parameter estimates are much less consistent with the CAPM restrictions than the corresponding OLS estimates.
Keywords: Adaptive estimation, capital asset pricing model, elliptical symmetry, semiparametric efficiency
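The OLS benchmark that the semiparametric estimates are compared against is the standard time-series CAPM regression of excess asset returns on excess market returns. A minimal sketch with synthetic data (function name and inputs are illustrative, not from the paper):

```python
import numpy as np

def capm_ols(excess_asset, excess_market):
    # Time-series CAPM regression: r_i - r_f = alpha + beta * (r_m - r_f) + eps.
    # Returns the OLS estimates (alpha, beta).
    X = np.column_stack([np.ones_like(excess_market), excess_market])
    coef, *_ = np.linalg.lstsq(X, excess_asset, rcond=None)
    return coef[0], coef[1]
```

Under the CAPM restriction the intercept alpha should be zero; the paper's semiparametric procedure replaces this Gaussian-implicit OLS step with an efficient estimator under an unrestricted elliptically symmetric error density.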
Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering
The two main topics of this paper are the introduction of the "optimally
tuned improper maximum likelihood estimator" (OTRIMLE) for robust clustering
based on the multivariate Gaussian model for clusters, and a comprehensive
simulation study comparing the OTRIMLE to Maximum Likelihood in Gaussian
mixtures with and without noise component, mixtures of t-distributions, and the
TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant
density for modelling outliers and noise. This can be chosen optimally so that
the non-noise part of the data looks as close to a Gaussian mixture as
possible. Some deviation from Gaussianity can be traded in for lowering the
estimated noise proportion. Covariance matrix constraints and computation of
the OTRIMLE are also treated. In the simulation study, all methods are
confronted with setups in which their model assumptions are not exactly
fulfilled, and in order to evaluate the experiments in a standardized way by
misclassification rates, a new model-based definition of "true clusters" is
introduced that deviates from the usual identification of mixture components
with clusters. In the study, every method turns out to be superior for one or
more setups, but the OTRIMLE achieves the most satisfactory overall
performance. The methods are also applied to two real datasets, one without and
one with known "true" clusters.
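The improper-constant-density idea can be illustrated with a one-shot classification rule: given fitted Gaussian components, a point is assigned to the component with the highest weighted density unless the improper constant level delta beats all of them, in which case it is labelled noise. This is only a sketch of the noise mechanism, not the OTRIMLE fitting procedure itself; all names are illustrative.

```python
import numpy as np

def classify_with_improper_noise(X, means, covs, weights, delta):
    # Assign each row of X to its most likely Gaussian component, or to the
    # noise class (-1) when the improper constant density `delta` exceeds
    # every weighted component density at that point.
    n, d = X.shape
    dens = np.empty((n, len(means)))
    for j, (mu, S, w) in enumerate(zip(means, covs, weights)):
        Sinv = np.linalg.inv(S)
        diff = X - mu
        mahal = np.einsum('ni,ij,nj->n', diff, Sinv, diff)
        norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(S) ** -0.5
        dens[:, j] = w * norm * np.exp(-0.5 * mahal)
    labels = dens.argmax(1)
    labels[dens.max(1) < delta] = -1
    return labels
```

In OTRIMLE, delta is not fixed in advance but chosen optimally so that the non-noise part of the data looks as close to a Gaussian mixture as possible.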