Methods of Hierarchical Clustering
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm. Comment: 21 pages, 2 figures, 1 table, 69 references
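As an illustration of the agglomerative scheme the abstract surveys, the following is a minimal sketch of naive single-linkage clustering. It is not the linear-time algorithm the survey describes, and the function name `single_linkage`, the choice of single linkage, and the stopping rule (a target number of clusters `k`) are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def single_linkage(points, k):
    """Naive agglomerative clustering (illustrative sketch only).

    Starts with one cluster per point and repeatedly merges the two
    clusters whose closest members are nearest, until k clusters remain.
    Returns the final clusters (lists of point indices) and the merge log.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters, merges
```

The recorded merge log is the information a dendrogram would display. The repeated pairwise scan keeps the sketch short but makes it far slower than the efficient implementations the survey discusses.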
Discriminative Link Prediction using Local Links, Node Features and Community Structure
A link prediction (LP) algorithm is given a graph, and has to rank, for each
node, other nodes that are candidates for new linkage. LP is strongly motivated
by social search and recommendation applications. LP techniques often focus on
global properties (graph conductance, hitting or commute times, Katz score) or
local properties (Adamic-Adar and many variations, or node feature vectors),
but rarely combine these signals. Furthermore, neither of these extremes
exploits link densities at the intermediate level of communities. In this paper
we describe a discriminative LP algorithm that exploits two new signals. First,
a co-clustering algorithm provides community level link density estimates,
which are used to qualify observed links with a surprise value. Second, links
in the immediate neighborhood of the link to be predicted are not interpreted
at face value, but through a local model of node feature similarities. These
signals are combined into a discriminative link predictor. We evaluate the new
predictor using five diverse data sets that are standard in the literature. We
report on significant accuracy boosts compared to standard LP methods
(including Adamic-Adar and random walk). Apart from the new predictor, another
contribution is a rigorous protocol for benchmarking and reporting LP
algorithms, which reveals the regions of strengths and weaknesses of all the
predictors studied here, and establishes the new proposal as the most robust. Comment: 10 pages, 5 figures
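For context, the Adamic-Adar baseline named in the abstract can be sketched as below. This is the standard local score (a sum of 1/log(degree) over common neighbors), not the paper's discriminative predictor. The adjacency-dictionary representation and the helper names `adamic_adar` and `rank_candidates` are assumptions made for the example.

```python
from math import log


def adamic_adar(adj, u, v):
    """Adamic-Adar score: sum of 1 / log(degree) over common neighbors.

    `adj` maps each node to the set of its neighbors (undirected graph).
    """
    # The degree guard is defensive: in a consistent undirected graph a
    # common neighbor of u and v always has degree >= 2.
    return sum(1.0 / log(len(adj[z]))
               for z in adj[u] & adj[v] if len(adj[z]) > 1)


def rank_candidates(adj, u):
    """Rank nodes not yet linked to u, highest score first (ties by name)."""
    candidates = set(adj) - adj[u] - {u}
    return sorted(candidates, key=lambda v: (-adamic_adar(adj, u, v), v))
```

For each node, the ranking step mirrors the task stated in the abstract: order the other nodes that are candidates for a new link.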
Local Variation as a Statistical Hypothesis Test
The goal of image oversegmentation is to divide an image into several pieces,
each of which should ideally be part of an object. One of the simplest and yet
most effective oversegmentation algorithms is known as local variation (LV)
(Felzenszwalb and Huttenlocher 2004). In this work, we study this algorithm and
show that algorithms similar to LV can be devised by applying different
statistical models and decisions, thus providing further theoretical
justification and a well-founded explanation for the unexpectedly high
performance of the LV approach. Some of these algorithms are based on
statistics of natural images and on a hypothesis testing decision; we denote
these algorithms probabilistic local variation (pLV). The best pLV algorithm,
which relies on censored estimation, presents state-of-the-art results while
keeping the same computational complexity as the LV algorithm.
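The original LV merge rule of Felzenszwalb and Huttenlocher (2004) can be sketched as follows; this is not the pLV variant proposed in the paper. Edges are visited in order of increasing weight, and two components are merged when the connecting edge is no heavier than either component's internal variation plus a size-dependent slack k/|C|. The function name `local_variation`, the `(weight, a, b)` edge format, and the generic graph input (rather than an image grid) are assumptions made for the example.

```python
def local_variation(num_nodes, edges, k):
    """Graph-based LV segmentation sketch using union-find.

    `edges` is an iterable of (weight, a, b) tuples; `k` controls the
    preference for larger components. Returns one root label per node.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes  # largest edge weight merged into a component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for w, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Merge only if the edge is light relative to both components.
        if w <= min(internal[ra] + k / size[ra], internal[rb] + k / size[rb]):
            parent[rb] = ra
            size[ra] += size[rb]
            internal[ra] = w  # edges are sorted, so w is the new maximum
    return [find(i) for i in range(num_nodes)]
```

With the edges sorted once, the union-find operations make the remaining work nearly linear in the number of edges.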
A Robust Clustering Method Using Compositional Data Restrictions: Studying Wood Properties in the Reforestation of Portugal
Classification of multivariate observations while preserving the data’s natural restriction is a challenge. Special properties such as identifiability, interpretability, and others need to be addressed to build a new approach. To avoid these complications, many transformation algorithms have been developed to use traditional models. In this context, the aim of this work is to propose a robust probabilistic distance algorithm to classify compositional data. Based on the probabilistic distance (PD) clustering approach, the proposal identifies clusters minimizing a joint distance function, JDF, which is part of a dissimilarity measure. This measure combines the PD clustering approach with the density of the Dirichlet distribution. This procedure allows us to create clusters, and define the number of clusters by accommodating the data’s natural compositional restriction. This work was motivated by the forestry area in the restoration context. The compositional dataset of the populations of Pinus nigra was analyzed via the proposed robust probabilistic distance clustering algorithm. The proposed method allows us to classify the new physical, chemical, and mechanical properties of P. nigra into clusters. The main results identify compositional clusters which provide support for wider areas’ recognition. In addition, the results can be used in decisions to spread sustainable forest management.
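The Dirichlet density mentioned in the abstract is defined on compositions, that is, vectors of positive parts summing to one. A minimal sketch of its log-density follows. How the paper combines this density with the PD clustering distance is not stated in the abstract, so only the density itself is shown. The function name `dirichlet_logpdf` and the input checks are assumptions made for the example.

```python
from math import lgamma, log


def dirichlet_logpdf(x, alpha):
    """Log-density of a Dirichlet(alpha) distribution at composition x.

    x must have strictly positive parts that sum to 1 (the compositional
    restriction); alpha holds the positive concentration parameters.
    """
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    if any(xi <= 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        raise ValueError("x must be a composition: positive parts summing to 1")
    # log of the normalizing constant Gamma(sum alpha) / prod Gamma(alpha_i)
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * log(xi) for a, xi in zip(alpha, x))
```

Working on the log scale avoids overflow in the gamma functions when the concentration parameters are large.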
A Short Survey on Data Clustering Algorithms
With rapidly increasing data, clustering algorithms are important tools for
data analytics in modern research. They have been successfully applied to a
wide range of domains; for instance, bioinformatics, speech recognition, and
financial analysis. Formally speaking, given a set of data instances, a
clustering algorithm is expected to divide the set of data instances into the
subsets which maximize the intra-subset similarity and inter-subset
dissimilarity, where a similarity measure is defined beforehand. In this work,
the state-of-the-art clustering algorithms are reviewed from design concept to
methodology. Different clustering paradigms are discussed. Advanced clustering
algorithms are also discussed. After that, the existing clustering evaluation
metrics are reviewed. A summary with future insights is provided at the end.
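The formal goal stated in the abstract, high similarity within subsets and high dissimilarity between them, can be checked for a given partition with a sketch like the one below. Euclidean distance as the dissimilarity measure and the function name `mean_intra_inter` are assumptions made for the example; the survey itself does not fix a particular measure.

```python
from itertools import combinations
from math import dist


def mean_intra_inter(points, labels):
    """Mean pairwise distance within clusters and between clusters.

    A good partition under the stated goal has a small intra-cluster mean
    and a large inter-cluster mean. Returns (intra_mean, inter_mean);
    a mean is NaN when no pair of that kind exists.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        (intra if labels[i] == labels[j] else inter).append(d)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)
```

Comparing the two means for alternative labelings of the same data gives a rough indication of which partition better matches the definition.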