A geometric approach to archetypal analysis and non-negative matrix factorization
Archetypal analysis and non-negative matrix factorization (NMF) are staples
in a statistician's toolbox for dimension reduction and exploratory data
analysis. We describe a geometric approach to both NMF and archetypal analysis
by interpreting both problems as finding extreme points of the data cloud. We
also develop and analyze an efficient approach to finding extreme points in
high dimensions. For modern massive datasets that are too large to fit on a
single machine and must be stored in a distributed setting, our approach makes
only a small number of passes over the data. In fact, it is possible to obtain
the NMF or perform archetypal analysis with just two passes over the data.
Comment: 36 pages, 13 figures
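The idea of treating archetypes as extreme points of the data cloud can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are not given in the abstract. It relies only on a standard fact: the maximizer (or minimizer) of a generic linear functional over a finite point set is a vertex of its convex hull. The function name `extreme_point_candidates` and the parameters `n_directions` and `seed` are hypothetical, chosen for illustration.

```python
import numpy as np

def extreme_point_candidates(X, n_directions=200, seed=0):
    """Return sorted indices of rows of X that are extreme along random directions.

    Assumption: X is an (n_samples, n_features) real-valued array.
    This is an illustrative sketch, not the method of the paper above.
    """
    rng = np.random.default_rng(seed)
    # Draw random Gaussian directions, one per column.
    G = rng.standard_normal((X.shape[1], n_directions))
    # Project every observation onto every direction (touches each row once).
    P = X @ G
    # The argmax and argmin along each direction are vertices of the convex hull.
    idx = np.concatenate([P.argmax(axis=0), P.argmin(axis=0)])
    return np.unique(idx)
```

On a toy dataset whose convex hull is the unit square, with the four corners stored in the first four rows and all other points strictly inside, this sketch should return only the corner indices. Because the projection is a single matrix product over the rows, it could be computed blockwise over distributed data, though how the paper organizes its passes is not stated here.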
Probabilistic Archetypal Analysis
Archetypal analysis represents a set of observations as convex combinations
of pure patterns, or archetypes. The original geometric formulation of finding
archetypes by approximating the convex hull of the observations assumes them to
be real valued. This, unfortunately, is not compatible with many practical
situations. In this paper we revisit archetypal analysis from the basic
principles, and propose a probabilistic framework that accommodates other
observation types such as integers, binary, and probability vectors. We
corroborate the proposed methodology with convincing real-world applications on
finding archetypal winter tourists based on binary survey data, archetypal
disaster-affected countries based on disaster count data, and document
archetypes based on term-frequency data. We also present an appropriate
visualization tool that better summarizes archetypal analysis solutions.
Comment: 24 pages; added literature review and visualizatio
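For reference, the original real-valued formulation that this abstract alludes to can be written as follows. This is the standard least-squares statement of archetypal analysis, not the probabilistic model proposed in the paper; the symbols $x_i$ (observations), $z_k$ (archetypes), $\alpha_{ik}$ and $\beta_{kj}$ (mixing weights) are notation assumed here for illustration.

```latex
\min_{\alpha,\,\beta}\; \sum_{i=1}^{n} \Bigl\| x_i - \sum_{k=1}^{K} \alpha_{ik} z_k \Bigr\|_2^2,
\qquad z_k = \sum_{j=1}^{n} \beta_{kj} x_j,
\\
\text{subject to}\quad \alpha_{ik} \ge 0,\; \sum_{k=1}^{K} \alpha_{ik} = 1,
\qquad \beta_{kj} \ge 0,\; \sum_{j=1}^{n} \beta_{kj} = 1 .
```

Each observation is approximated by a convex combination of the archetypes, and each archetype is itself a convex combination of observations, which is why the archetypes lie on or inside the convex hull of the data.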
SAGA: Sparse And Geometry-Aware non-negative matrix factorization through non-linear local embedding
This paper presents a new non-negative matrix factorization technique which (1) decomposes the original data on multiple latent factors that account for the geometrical structure of the manifold embedding the data; (2) provides an optimal representation with a controllable level of sparsity; and (3) has an overall linear complexity, so that large, high-dimensional datasets can be handled in tractable time. It operates by coding the data with respect to local neighbors with non-linear weights. This locality is obtained as a consequence of the simultaneous sparsity and convexity constraints. Our method is demonstrated over several experiments, including a feature extraction and classification task, where it achieves better performance than state-of-the-art factorization methods, with a shorter computation time.