Kernel Spectral Clustering and applications
In this chapter we review the main literature related to kernel spectral
clustering (KSC), an approach to clustering cast within a kernel-based
optimization setting. KSC represents a least-squares support vector machine
based formulation of spectral clustering described by a weighted kernel PCA
objective. Just as in the classifier case, the binary clustering model is
expressed by a hyperplane in a high dimensional space induced by a kernel. In
addition, the multi-way clustering can be obtained by combining a set of binary
decision functions via an Error Correcting Output Codes (ECOC) encoding scheme.
Because of its model-based nature, the KSC method encompasses three main steps:
training, validation, testing. In the validation stage model selection is
performed to obtain tuning parameters, like the number of clusters present in
the data. This is a major advantage compared to classical spectral clustering
where the determination of the clustering parameters is unclear and relies on
heuristics. Once a KSC model is trained on a small subset of the entire data,
it is able to generalize well to unseen test points. Beyond the basic
formulation, sparse KSC algorithms based on the Incomplete Cholesky
Decomposition (ICD) and on Group Lasso regularization are
reviewed. In that respect, we show how it is possible to handle large scale
data. Also, two possible ways to perform hierarchical clustering and a soft
clustering method are presented. Finally, real-world applications such as image
segmentation, power load time-series clustering, document clustering and big
data learning are considered.
Comment: chapter contribution to the book "Unsupervised Learning Algorithms"
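The train-then-generalize workflow described above can be illustrated with a deliberately simplified sketch: a kernel-PCA-style binary split learned on a small training subset, with unseen points assigned through an out-of-sample score. This is not the weighted-kernel-PCA KSC formulation itself; the RBF kernel, the toy two-blob data, and the power-iteration eigensolver are all illustrative choices.

```python
import math, random

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two points."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

random.seed(0)
# Two well-separated toy blobs standing in for a small training subset.
train = ([(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(15)]
         + [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(15)])
n = len(train)

# Kernel matrix with double centering (kernel-PCA style).
K = [[rbf(a, b) for b in train] for a in train]
row_mean = [sum(r) / n for r in K]
tot_mean = sum(row_mean) / n
Kc = [[K[i][j] - row_mean[i] - row_mean[j] + tot_mean for j in range(n)]
      for i in range(n)]

# Power iteration for the leading eigenvector of the centered kernel.
alpha = [random.gauss(0, 1) for _ in range(n)]
for _ in range(200):
    alpha = [sum(Kc[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(a * a for a in alpha))
    alpha = [a / norm for a in alpha]

def score(x):
    """Out-of-sample score: project a (possibly unseen) point onto the model."""
    return sum(alpha[i] * rbf(x, train[i]) for i in range(n))

# Binary cluster assignment via the sign of the score, on train and test points.
train_labels = [score(p) > 0 for p in train]
print(train_labels.count(True), train_labels.count(False))
print(score((0.1, -0.2)) > 0, score((5.2, 4.9)) > 0)  # opposite clusters
```

The sign of the out-of-sample score plays the role of the hyperplane-based binary decision function: points near the first blob and points near the second blob land on opposite sides.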
Anytime Hierarchical Clustering
We propose a new anytime hierarchical clustering method that iteratively
transforms an arbitrary initial hierarchy on the configuration of measurements
along a sequence of trees which, we prove, must terminate for a fixed data set
in a chain of nested partitions that satisfies a natural homogeneity requirement.
Each recursive step re-edits the tree so as to improve a local measure of
cluster homogeneity that is compatible with a number of commonly used (e.g.,
single, average, complete) linkage functions. As an alternative to the standard
batch algorithms, we present numerical evidence to suggest that appropriate
adaptations of this method can yield decentralized, scalable algorithms
suitable for distributed/parallel computation of clustering hierarchies and
online tracking of clustering trees applicable to large, dynamically changing
databases and anomaly detection.
Comment: 13 pages, 6 figures, 5 tables, in preparation for submission to a conference
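For contrast with the anytime scheme, the standard batch agglomerative procedure mentioned above (here with single linkage, one of the compatible linkage functions) builds its chain of nested partitions as follows; the one-dimensional data and the naive O(n^3) implementation are illustrative only.

```python
# Naive batch single-linkage agglomerative clustering on 1-D points.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]
hierarchy = [[list(c) for c in clusters]]  # chain of nested partitions

def linkage(a, b):
    # Single linkage: minimum pairwise distance between two clusters.
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 1:
    # Find and merge the closest pair of clusters.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    hierarchy.append([list(c) for c in clusters])

print([len(p) for p in hierarchy])  # partition sizes: [6, 5, 4, 3, 2, 1]
```

Each step coarsens the previous partition by exactly one merge, so the result is a chain of nested partitions from singletons down to one cluster; the anytime method instead re-edits an existing tree locally rather than rebuilding it from scratch.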
Mining Extremes through Fuzzy Clustering
Archetypes are extreme points that synthesize data representing "pure" individual types.
Archetypes are characterized by the most discriminating features of the data points, and are
useful in applications where one is interested in extremes rather than in commonalities.
Recent applications include talent analysis in sports and science, fraud detection,
profiling of users and products in recommendation systems, climate extremes, as well as
other machine learning applications.
The furthest-sum Archetypal Analysis (FS-AA) (Mørup and Hansen, 2012) and the
Fuzzy Clustering with Proportional Membership (FCPM) (Nascimento, 2005) propose
distinct models to find clusters with extreme prototypes. Even though the FCPM model
does not constrain its prototypes to lie in the convex hull of the data, it belongs to the framework
of data recovery from clustering (Mirkin, 2005), a powerful property for unsupervised
cluster analysis. The baseline version of FCPM, FCPM-0, provides central prototypes
whereas its smooth version, FCPM-2 provides extreme prototypes as AA archetypes.
The comparative study between FS-AA and FCPM algorithms conducted in this dissertation
covers the following aspects. First, the analysis of FS-AA on data recovery from
clustering using a collection of 100 data sets of diverse dimensionalities, generated with
a proper data generator (FCPM-DG), as well as 14 real-world data sets. Second, testing the
robustness of the clustering algorithms in the presence of outliers, including the peculiar
behaviour of FCPM-0 of removing the proper number of prototypes from the data. Third, a
collection of five popular fuzzy validation indices is explored for assessing the quality
of clustering results. Fourth, the algorithms undergo a study to evaluate how different
initializations affect their convergence as well as the quality of the clustering partitions.
The Iterative Anomalous Pattern (IAP) algorithm makes it possible to improve the convergence
of the FCPM algorithm as well as to fine-tune the level of resolution at which clustering
results are examined, which is an advantage over FS-AA. Proper visualization functionalities
for FS-AA and FCPM support the easy interpretation of the clustering results.
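FCPM and FS-AA themselves are more involved; as background on the alternating membership/prototype updates common to this family of methods, here is a minimal fuzzy c-means sketch (not FCPM, whose proportional-membership model differs), with made-up one-dimensional data and fuzziness exponent m = 2.

```python
import random

random.seed(1)
# Two made-up 1-D clusters around 0 and 4.
data = ([random.gauss(0, 0.2) for _ in range(20)]
        + [random.gauss(4, 0.2) for _ in range(20)])
c, m = 2, 2.0
centers = [0.5, 3.5]  # initial prototypes (arbitrary)

for _ in range(50):
    # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
    U = []
    for x in data:
        d = [abs(x - v) + 1e-12 for v in centers]  # avoid division by zero
        U.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(c))
                  for i in range(c)])
    # Prototype update: mean of the data weighted by u^m.
    centers = [sum(U[k][i] ** m * data[k] for k in range(len(data)))
               / sum(U[k][i] ** m for k in range(len(data)))
               for i in range(c)]

print([round(v, 1) for v in centers])  # prototypes near the two cluster means
```

FCM's prototypes settle near cluster centers; FCPM-0 behaves similarly, whereas FCPM-2 and FS-AA push prototypes toward the extremes of the data, which is the contrast the dissertation studies.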
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows weakly labeled data to be leveraged.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking, and promising avenues for research.
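The bag-level formulation can be made concrete with a tiny sketch under the standard MIL assumption (a bag is positive iff at least one of its instances is positive): the bags, the instance scorer, and the threshold below are all hypothetical.

```python
# A bag is a set of instances; only the bag carries a label.
bags = {
    "bag_a": [(0.1, 0.2), (0.3, 0.1)],   # all background instances
    "bag_b": [(0.2, 0.1), (4.8, 5.1)],   # contains one target-like instance
}

def instance_score(x):
    # Hypothetical instance-level scorer: closeness to a "target" at (5, 5).
    return -((x[0] - 5.0) ** 2 + (x[1] - 5.0) ** 2)

def bag_label(instances, threshold=-1.0):
    # Max-pooling over instance scores implements the "at least one
    # positive instance" semantics of the standard MIL assumption.
    return max(instance_score(x) for x in instances) > threshold

print({name: bag_label(inst) for name, inst in bags.items()})
```

Many MIL methods differ precisely in how they replace this max-pooling step (e.g., with learned aggregation over the bag), which is one of the problem characteristics the survey categorizes.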