75,268 research outputs found
Graph-RAT: Combining data sources in music recommendation systems
The complexity of music recommendation systems has increased rapidly in recent years, drawing upon different sources of information: content analysis, web-mining, social tagging, etc. Unfortunately, the tools to scientifically evaluate such integrated systems are not readily available; nor are the base algorithms available. This article describes Graph-RAT (Graph-based Relational Analysis Toolkit), an open source toolkit that provides a framework for developing and evaluating novel hybrid systems. While this toolkit is designed for music recommendation, it has applications outside its discipline as well. An experiment—indicative of the sort of procedure that can be configured using the toolkit—is provided to illustrate its usefulness
Methods of Hierarchical Clustering
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm.Comment: 21 pages, 2 figures, 1 table, 69 reference
Anytime Hierarchical Clustering
We propose a new anytime hierarchical clustering method that iteratively
transforms an arbitrary initial hierarchy on the configuration of measurements
along a sequence of trees we prove for a fixed data set must terminate in a
chain of nested partitions that satisfies a natural homogeneity requirement.
Each recursive step re-edits the tree so as to improve a local measure of
cluster homogeneity that is compatible with a number of commonly used (e.g.,
single, average, complete) linkage functions. As an alternative to the standard
batch algorithms, we present numerical evidence to suggest that appropriate
adaptations of this method can yield decentralized, scalable algorithms
suitable for distributed/parallel computation of clustering hierarchies and
online tracking of clustering trees applicable to large, dynamically changing
databases and anomaly detection.Comment: 13 pages, 6 figures, 5 tables, in preparation for submission to a
conferenc
Axiomatic Construction of Hierarchical Clustering in Asymmetric Networks
This paper considers networks where relationships between nodes are
represented by directed dissimilarities. The goal is to study methods for the
determination of hierarchical clusters, i.e., a family of nested partitions
indexed by a connectivity parameter, induced by the given dissimilarity
structures. Our construction of hierarchical clustering methods is based on
defining admissible methods to be those methods that abide by the axioms of
value - nodes in a network with two nodes are clustered together at the maximum
of the two dissimilarities between them - and transformation - when
dissimilarities are reduced, the network may become more clustered but not
less. Several admissible methods are constructed and two particular methods,
termed reciprocal and nonreciprocal clustering, are shown to provide upper and
lower bounds in the space of admissible methods. Alternative clustering
methodologies and axioms are further considered. Allowing the outcome of
hierarchical clustering to be asymmetric, so that it matches the asymmetry of
the original data, leads to the inception of quasi-clustering methods. The
existence of a unique quasi-clustering method is shown. Allowing clustering in
a two-node network to proceed at the minimum of the two dissimilarities
generates an alternative axiomatic construction. There is a unique clustering
method in this case too. The paper also develops algorithms for the computation
of hierarchical clusters using matrix powers on a min-max dioid algebra and
studies the stability of the methods proposed. We proved that most of the
methods introduced in this paper are such that similar networks yield similar
hierarchical clustering results. Algorithms are exemplified through their
application to networks describing internal migration within states of the
United States (U.S.) and the interrelation between sectors of the U.S. economy.Comment: This is a largely extended version of the previous conference
submission under the same title. The current version contains the material in
the previous version (published in ICASSP 2013) as well as material presented
at the Asilomar Conference on Signal, Systems, and Computers 2013, GlobalSIP
2013, and ICML 2014. Also, unpublished material is included in the current
versio
Exhaustive and Efficient Constraint Propagation: A Semi-Supervised Learning Perspective and Its Applications
This paper presents a novel pairwise constraint propagation approach by
decomposing the challenging constraint propagation problem into a set of
independent semi-supervised learning subproblems which can be solved in
quadratic time using label propagation based on k-nearest neighbor graphs.
Considering that this time cost is proportional to the number of all possible
pairwise constraints, our approach actually provides an efficient solution for
exhaustively propagating pairwise constraints throughout the entire dataset.
The resulting exhaustive set of propagated pairwise constraints are further
used to adjust the similarity matrix for constrained spectral clustering. Other
than the traditional constraint propagation on single-source data, our approach
is also extended to more challenging constraint propagation on multi-source
data where each pairwise constraint is defined over a pair of data points from
different sources. This multi-source constraint propagation has an important
application to cross-modal multimedia retrieval. Extensive results have shown
the superior performance of our approach.Comment: The short version of this paper appears as oral paper in ECCV 201
Representation Learning for Clustering: A Statistical Framework
We address the problem of communicating domain knowledge from a user to the
designer of a clustering algorithm. We propose a protocol in which the user
provides a clustering of a relatively small random sample of a data set. The
algorithm designer then uses that sample to come up with a data representation
under which -means clustering results in a clustering (of the full data set)
that is aligned with the user's clustering. We provide a formal statistical
model for analyzing the sample complexity of learning a clustering
representation with this paradigm. We then introduce a notion of capacity of a
class of possible representations, in the spirit of the VC-dimension, showing
that classes of representations that have finite such dimension can be
successfully learned with sample size error bounds, and end our discussion with
an analysis of that dimension for classes of representations induced by linear
embeddings.Comment: To be published in Proceedings of UAI 201
Distance Dependent Chinese Restaurant Processes
We develop the distance dependent Chinese restaurant process (CRP), a
flexible class of distributions over partitions that allows for
non-exchangeability. This class can be used to model many kinds of dependencies
between data in infinite clustering models, including dependencies across time
or space. We examine the properties of the distance dependent CRP, discuss its
connections to Bayesian nonparametric mixture models, and derive a Gibbs
sampler for both observed and mixture settings. We study its performance with
three text corpora. We show that relaxing the assumption of exchangeability
with distance dependent CRPs can provide a better fit to sequential data. We
also show its alternative formulation of the traditional CRP leads to a
faster-mixing Gibbs sampling algorithm than the one based on the original
formulation
Uncovering Group Level Insights with Accordant Clustering
Clustering is a widely-used data mining tool, which aims to discover
partitions of similar items in data. We introduce a new clustering paradigm,
\emph{accordant clustering}, which enables the discovery of (predefined) group
level insights. Unlike previous clustering paradigms that aim to understand
relationships amongst the individual members, the goal of accordant clustering
is to uncover insights at the group level through the analysis of their
members. Group level insight can often support a call to action that cannot be
informed through previous clustering techniques. We propose the first accordant
clustering algorithm, and prove that it finds near-optimal solutions when data
possesses inherent cluster structure. The insights revealed by accordant
clusterings enabled experts in the field of medicine to isolate successful
treatments for a neurodegenerative disease, and those in finance to discover
patterns of unnecessary spending.Comment: accepted to SDM 2017 (oral
- …