1,537 research outputs found
Recent Developments in Document Clustering
This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed
Probabilistic Clustering of Time-Evolving Distance Data
We present a novel probabilistic clustering model for objects that are
represented via pairwise distances and observed at different time points. The
proposed method utilizes the information given by adjacent time points to find
the underlying cluster structure and obtain a smooth cluster evolution. This
approach allows the number of objects and clusters to differ at every time
point, and no identification on the identities of the objects is needed.
Further, the model does not require the number of clusters being specified in
advance -- they are instead determined automatically using a Dirichlet process
prior. We validate our model on synthetic data showing that the proposed method
is more accurate than state-of-the-art clustering methods. Finally, we use our
dynamic clustering model to analyze and illustrate the evolution of brain
cancer patients over time
Data Clustering And Visualization Through Matrix Factorization
Clustering is traditionally an unsupervised task which is to find natural groupings or clusters in multidimensional data based on perceived similarities among the patterns. The purpose of clustering is to extract useful information
from unlabeled data.
In order to present the extracted useful knowledge obtained by clustering in a meaningful way, data visualization becomes a popular and growing area of research field. Visualization can provide a qualitative overview of large and complex data sets, which help us the desired insight in truly understanding the phenomena of interest in data.
The contribution of this dissertation is two-fold: Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for data clustering/co-clustering and Exemplar-based data Visualization (EV) through matrix factorization. Compared to traditional data mining models,
matrix-based methods are fast, easy to understand and implement, especially suitable to solve large-scale challenging problems in text mining, image grouping, medical diagnosis, and bioinformatics.
In this dissertation, we present two effective matrix-based solutions
in the new directions of data clustering and visualization.
First, in many practical learning domains,
there is a large supply of unlabeled data but limited labeled data, and in most cases it might
be expensive to generate large amounts of labeled data. Traditional clustering algorithms completely ignore these valuable labeled data and thus are inapplicable to these problems. Consequently, semi-supervised clustering, which can incorporate the domain knowledge to guide a clustering algorithm, has become a topic of significant recent interest.
Thus, we develop a Non-negative Matrix Factorization
(NMF) based framework to incorporate prior knowledge into data clustering. Moreover, with the fast growth of Internet and computational technologies in the past decade, many data mining applications have advanced swiftly from the simple clustering of one data type to the co-clustering of multiple data types, usually involving high heterogeneity. To this end, we extend SS-NMF to perform heterogeneous data co-clustering. From a theoretical perspective, SS-NMF for data clustering/co-clustering is mathematically rigorous. The convergence and correctness of our algorithms are proved.
In addition, we discuss the relationship between SS-NMF with other well-known clustering and co-clustering models.
Second, most of current clustering models only provide the centroids (e.g., mathematical means of the clusters)
without inferring the representative exemplars from real data, thus they are unable to better summarize or visualize the raw data.
A new method, Exemplar-based Visualization (EV), is proposed to cluster and visualize an extremely large-scale data.
Capitalizing on recent advances in matrix approximation and factorization, EV provides a means
to visualize large scale data with high accuracy (in
retaining neighbor relations), high efficiency (in computation), and
high flexibility (through the use of exemplars).
Empirically, we demonstrate the superior performance of our matrix-based data clustering and visualization models
through extensive experiments performed on the publicly available large scale data sets
Evidence Transfer for Improving Clustering Tasks Using External Categorical Evidence
In this paper we introduce evidence transfer for clustering, a deep learning
method that can incrementally manipulate the latent representations of an
autoencoder, according to external categorical evidence, in order to improve a
clustering outcome. By evidence transfer we define the process by which the
categorical outcome of an external, auxiliary task is exploited to improve a
primary task, in this case representation learning for clustering. Our proposed
method makes no assumptions regarding the categorical evidence presented, nor
the structure of the latent space. We compare our method, against the baseline
solution by performing k-means clustering before and after its deployment.
Experiments with three different kinds of evidence show that our method
effectively manipulates the latent representations when introduced with real
corresponding evidence, while remaining robust when presented with low quality
evidence
- …