1,503 research outputs found
Clustering and Latent Semantic Indexing Aspects of the Nonnegative Matrix Factorization
This paper provides a theoretical support for clustering aspect of the
nonnegative matrix factorization (NMF). By utilizing the Karush-Kuhn-Tucker
optimality conditions, we show that NMF objective is equivalent to graph
clustering objective, so clustering aspect of the NMF has a solid
justification. Different from previous approaches which usually discard the
nonnegativity constraints, our approach guarantees the stationary point being
used in deriving the equivalence is located on the feasible region in the
nonnegative orthant. Additionally, since clustering capability of a matrix
decomposition technique can sometimes imply its latent semantic indexing (LSI)
aspect, we will also evaluate LSI aspect of the NMF by showing its capability
in solving the synonymy and polysemy problems in synthetic datasets. And more
extensive evaluation will be conducted by comparing LSI performances of the NMF
and the singular value decomposition (SVD), the standard LSI method, using some
standard datasets.Comment: 28 pages, 5 figure
Data Clustering And Visualization Through Matrix Factorization
Clustering is traditionally an unsupervised task which is to find natural groupings or clusters in multidimensional data based on perceived similarities among the patterns. The purpose of clustering is to extract useful information
from unlabeled data.
In order to present the extracted useful knowledge obtained by clustering in a meaningful way, data visualization becomes a popular and growing area of research field. Visualization can provide a qualitative overview of large and complex data sets, which help us the desired insight in truly understanding the phenomena of interest in data.
The contribution of this dissertation is two-fold: Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for data clustering/co-clustering and Exemplar-based data Visualization (EV) through matrix factorization. Compared to traditional data mining models,
matrix-based methods are fast, easy to understand and implement, especially suitable to solve large-scale challenging problems in text mining, image grouping, medical diagnosis, and bioinformatics.
In this dissertation, we present two effective matrix-based solutions
in the new directions of data clustering and visualization.
First, in many practical learning domains,
there is a large supply of unlabeled data but limited labeled data, and in most cases it might
be expensive to generate large amounts of labeled data. Traditional clustering algorithms completely ignore these valuable labeled data and thus are inapplicable to these problems. Consequently, semi-supervised clustering, which can incorporate the domain knowledge to guide a clustering algorithm, has become a topic of significant recent interest.
Thus, we develop a Non-negative Matrix Factorization
(NMF) based framework to incorporate prior knowledge into data clustering. Moreover, with the fast growth of Internet and computational technologies in the past decade, many data mining applications have advanced swiftly from the simple clustering of one data type to the co-clustering of multiple data types, usually involving high heterogeneity. To this end, we extend SS-NMF to perform heterogeneous data co-clustering. From a theoretical perspective, SS-NMF for data clustering/co-clustering is mathematically rigorous. The convergence and correctness of our algorithms are proved.
In addition, we discuss the relationship between SS-NMF with other well-known clustering and co-clustering models.
Second, most of current clustering models only provide the centroids (e.g., mathematical means of the clusters)
without inferring the representative exemplars from real data, thus they are unable to better summarize or visualize the raw data.
A new method, Exemplar-based Visualization (EV), is proposed to cluster and visualize an extremely large-scale data.
Capitalizing on recent advances in matrix approximation and factorization, EV provides a means
to visualize large scale data with high accuracy (in
retaining neighbor relations), high efficiency (in computation), and
high flexibility (through the use of exemplars).
Empirically, we demonstrate the superior performance of our matrix-based data clustering and visualization models
through extensive experiments performed on the publicly available large scale data sets
Non-negative mixtures
This is the author's accepted pre-print of the article, first published as M. D. Plumbley, A. Cichocki and R. Bro. Non-negative mixtures. In P. Comon and C. Jutten (Ed), Handbook of Blind Source Separation: Independent Component Analysis and Applications. Chapter 13, pp. 515-547. Academic Press, Feb 2010. ISBN 978-0-12-374726-6 DOI: 10.1016/B978-0-12-374726-6.00018-7file: Proof:p\PlumbleyCichockiBro10-non-negative.pdf:PDF owner: markp timestamp: 2011.04.26file: Proof:p\PlumbleyCichockiBro10-non-negative.pdf:PDF owner: markp timestamp: 2011.04.2
Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience.
Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here, we describe a software toolbox-called seqNMF-with new methods for extracting informative, non-redundant, sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral datas. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs
BROCCOLI: overlapping and outlier-robust biclustering through proximal stochastic gradient descent
Matrix tri-factorization subject to binary constraints is a versatile and powerful framework for the simultaneous clustering of observations and features, also known as biclustering. Applications for biclustering encompass the clustering of high-dimensional data and explorative data mining, where the selection of the most important features is relevant. Unfortunately, due to the lack of suitable methods for the optimization subject to binary constraints, the powerful framework of biclustering is typically constrained to clusterings which partition the set of observations or features. As a result, overlap between clusters cannot be modelled and every item, even outliers in the data, have to be assigned to exactly one cluster. In this paper we propose Broccoli, an optimization scheme for matrix factorization subject to binary constraints, which is based on the theoretically well-founded optimization scheme of proximal stochastic gradient descent. Thereby, we do not impose any restrictions on the obtained clusters. Our experimental evaluation, performed on both synthetic and real-world data, and against 6 competitor algorithms, show reliable and competitive performance, even in presence of a high amount of noise in the data. Moreover, a qualitative analysis of the identified clusters shows that Broccoli may provide meaningful and interpretable clustering structures
Dictionary-based Tensor Canonical Polyadic Decomposition
To ensure interpretability of extracted sources in tensor decomposition, we
introduce in this paper a dictionary-based tensor canonical polyadic
decomposition which enforces one factor to belong exactly to a known
dictionary. A new formulation of sparse coding is proposed which enables high
dimensional tensors dictionary-based canonical polyadic decomposition. The
benefits of using a dictionary in tensor decomposition models are explored both
in terms of parameter identifiability and estimation accuracy. Performances of
the proposed algorithms are evaluated on the decomposition of simulated data
and the unmixing of hyperspectral images
- …