Dissimilarity Clustering by Hierarchical Multi-Level Refinement
We introduce in this paper a new way of optimizing the natural extension of
the quantization error used in k-means clustering to dissimilarity data. The
proposed method is based on hierarchical clustering analysis combined with
multi-level heuristic refinement. The method is computationally efficient and
achieves better quantization errors than the
Comment: 20th European Symposium on Artificial Neural Networks, Computational
Intelligence and Machine Learning (ESANN 2012), Bruges, Belgium (2012)
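The abstract above does not spell out the objective. As a minimal sketch, assuming the extension in question is the usual pairwise form (for each cluster C, the sum of d(i, j) over all ordered pairs in C, divided by 2|C|), the following shows that this quantity coincides with the k-means quantization error when d is the squared Euclidean distance. The function names are illustrative and do not come from the paper.

```python
def dissimilarity_quantization_error(d, labels):
    """Sum over clusters C of (1 / (2|C|)) * sum_{i, j in C} d[i][j].

    d is a symmetric dissimilarity matrix (list of lists) and labels[i]
    is the cluster of item i. Assumed form, not quoted from the paper.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for members in clusters.values():
        pair_sum = sum(d[i][j] for i in members for j in members)
        total += pair_sum / (2 * len(members))
    return total


def kmeans_quantization_error(points, labels):
    """Classical k-means error: squared distance of each point to its centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[k] for p in members) / len(members) for k in range(dim)]
        total += sum(sum((p[k] - centroid[k]) ** 2 for k in range(dim)) for p in members)
    return total


def squared_euclidean_matrix(points):
    """Pairwise squared Euclidean distances, used here as the dissimilarity."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
d = squared_euclidean_matrix(points)
print(dissimilarity_quantization_error(d, labels))
print(kmeans_quantization_error(points, labels))
```

The pairwise form needs only the matrix d, which is why it can be evaluated on dissimilarity data where no centroid exists.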
Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs
We develop data structures for dynamic closest pair problems with arbitrary
distance functions, that do not necessarily come from any geometric structure
on the objects. Based on a technique previously used by the author for
Euclidean closest pairs, we show how to insert and delete objects from an
n-object set, maintaining the closest pair, in O(n log^2 n) time per update and
O(n) space. With quadratic space, we can instead use a quadtree-like structure
to achieve an optimal time bound, O(n) per update. We apply these data
structures to hierarchical clustering, greedy matching, and TSP heuristics, and
discuss other potential applications in machine learning, Groebner bases, and
local improvement algorithms for partition and placement problems. Experiments
show our new methods to be faster in practice than previously used heuristics.
Comment: 20 pages, 9 figures. A preliminary version of this paper appeared at
the 9th ACM-SIAM Symp. on Discrete Algorithms, San Francisco, 1998, pp.
619-628. For source code and experimental results, see
http://www.ics.uci.edu/~eppstein/projects/pairs
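To make the connection between closest pairs and hierarchical clustering concrete, here is a brute-force baseline, not the data structures from the paper: it recomputes the closest pair of clusters from scratch before every merge. The single-linkage cluster distance and all function names are assumptions chosen for illustration; a dynamic closest-pair structure would replace the `closest_pair` scan.

```python
def closest_pair(items, dist):
    """Return (distance, a, b) for the closest pair of positions a < b.

    Brute force: evaluates dist on every pair, for any distance function.
    """
    best = None
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            value = dist(items[a], items[b])
            if best is None or value < best[0]:
                best = (value, a, b)
    return best


def agglomerate(points, point_dist):
    """Agglomerative clustering by repeatedly merging the closest pair.

    Clusters are tuples of points. Single linkage is an assumption made
    for this sketch. Returns the merge history as (left, right, distance).
    """
    clusters = [(p,) for p in points]

    def cluster_dist(c1, c2):
        return min(point_dist(x, y) for x in c1 for y in c2)

    merges = []
    while len(clusters) > 1:
        value, a, b = closest_pair(clusters, cluster_dist)
        left, right = clusters[a], clusters[b]
        merges.append((left, right, value))
        # Delete the two merged clusters and insert their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(left + right)
    return merges


history = agglomerate([0.0, 1.0, 5.0, 6.5], lambda x, y: abs(x - y))
for left, right, value in history:
    print(left, right, value)
```

Each merge is one pair of deletions followed by one insertion, which is exactly the update pattern a dynamic closest-pair structure is meant to speed up.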
A new hierarchical clustering algorithm to identify non-overlapping like-minded communities
A network has a non-overlapping community structure if the nodes of the
network can be partitioned into disjoint sets such that each node in a set is
densely connected to other nodes inside the set and sparsely connected to the
nodes outside it. There are many metrics to validate the efficacy of such a
structure, such as clustering coefficient, betweenness, centrality, modularity
and like-mindedness. Many methods have been proposed to optimize some of these
metrics, but none of these works well on the recently introduced metric
like-mindedness. To solve this problem, we propose a behavioral property
based algorithm to identify communities that optimize the like-mindedness
metric and compare its performance on this metric with other behavioral data
based methodologies as well as community detection methods that rely only on
structural data. We execute these algorithms on real-life datasets of
Filmtipset and Twitter and show that our algorithm performs better than the
existing algorithms with respect to the like-mindedness metric.
Identifying Overlapping and Hierarchical Thematic Structures in Networks of Scholarly Papers: A Comparison of Three Approaches
We implemented three recently proposed approaches to the identification of
overlapping and hierarchical substructures in graphs and applied the
corresponding algorithms to a network of 492 information-science papers coupled
via their cited sources. The thematic substructures obtained and overlaps
produced by the three hierarchical cluster algorithms were compared to a
content-based categorisation, which we based on the interpretation of titles
and keywords. We defined sets of papers dealing with three topics located on
different levels of aggregation: h-index, webometrics, and bibliometrics. We
identified these topics with branches in the dendrograms produced by the three
cluster algorithms and compared the overlapping topics they detected with one
another and with the three pre-defined paper sets. We discuss the advantages
and drawbacks of applying the three approaches to paper networks in research
fields.
Comment: 18 pages, 9 figures
Steganographer Identification
Conventional steganalysis detects the presence of steganography within single
objects. In the real world, we may face a complex scenario in which one or
several of multiple users, called actors, are guilty of using steganography; this
is typically defined as the Steganographer Identification Problem (SIP). One might
use the conventional steganalysis algorithms to separate stego objects from
cover objects and then identify the guilty actors. However, the guilty actors
may be lost due to a number of false alarms. To deal with the SIP, most
state-of-the-art methods use unsupervised-learning-based approaches. In their
solutions, each actor holds multiple digital objects, from which a set of
feature vectors can be extracted. Well-defined distances between these
feature sets are computed to measure the similarity between the corresponding
actors. By applying clustering or outlier detection, the most suspicious
actor(s) will be judged as the steganographer(s). Though the SIP needs further
study, existing works identify the steganographer(s) well
when non-adaptive steganographic embedding is applied. In this chapter, we
will present foundational concepts and review advanced methodologies in SIP.
This chapter is self-contained and intended as a tutorial introducing the SIP
in the context of media steganography.
Comment: A tutorial with 30 pages
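The pipeline described in this abstract (feature sets per actor, a distance between feature sets, then outlier detection) can be sketched as follows. This is a toy illustration under stated assumptions: the distance between two feature sets is taken to be the Euclidean distance between their centroids, and the outlier score is an actor's mean distance to all other actors. Neither choice is claimed to be what the surveyed methods use, and the actor names and feature values are made up.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def set_distance(features_a, features_b):
    """Assumed set distance: Euclidean distance between the two centroids."""
    return math.dist(centroid(features_a), centroid(features_b))


def most_suspicious(actors):
    """Score each actor by its mean distance to the other actors.

    actors maps an actor name to its list of feature vectors.
    Returns (name of the highest-scoring actor, all scores).
    """
    scores = {}
    for name, feats in actors.items():
        others = [set_distance(feats, f) for other, f in actors.items() if other != name]
        scores[name] = sum(others) / len(others)
    return max(scores, key=scores.get), scores


# Hypothetical feature sets: two similar actors and one that stands apart.
actors = {
    "actor_a": [[0.0, 0.0], [0.2, 0.0]],
    "actor_b": [[0.1, 0.1], [0.0, 0.1]],
    "actor_c": [[5.0, 5.0], [5.2, 4.8]],
}
suspect, scores = most_suspicious(actors)
print(suspect)
```

Because the decision is made per actor rather than per object, a few misclassified cover objects shift a centroid only slightly, which is the motivation the abstract gives for working at the actor level.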
Delete or merge regressors for linear model selection
We consider a problem of linear model selection in the presence of both
continuous and categorical predictors. Feasible models consist of subsets of
numerical variables and partitions of levels of factors. A new algorithm called
delete or merge regressors (DMR) is presented which is a stepwise backward
procedure involving ranking the predictors according to squared t-statistics
and choosing the final model minimizing BIC. In the article we prove
consistency of DMR when the number of predictors tends to infinity with the
sample size and describe a simulation study using the accompanying R package. The
results indicate a significant advantage in time complexity and selection
accuracy of our algorithm over Lasso-based methods described in the literature.
Moreover, a version of DMR for generalized linear models is proposed.
A cost function for similarity-based hierarchical clustering
The development of algorithms for hierarchical clustering has been hampered
by a shortage of precise objective functions. To help address this situation,
we introduce a simple cost function on hierarchies over a set of points, given
pairwise similarities between those points. We show that this criterion behaves
sensibly in canonical instances and that it admits a top-down construction
procedure with a provably good approximation ratio.
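The abstract does not state the cost function itself. The sketch below assumes the form we recall from the paper: each pair of points contributes its similarity multiplied by the number of leaves under the pair's lowest common ancestor in the hierarchy, so that similar points are cheap only if they are separated low in the tree. That form should be checked against the paper; the tree encoding (nested tuples with integer leaves) and the function names are illustrative choices.

```python
def leaves(tree):
    """List the leaves of a hierarchy given as nested tuples."""
    if not isinstance(tree, tuple):
        return [tree]
    return [x for child in tree for x in leaves(child)]


def hierarchy_cost(tree, w):
    """Assumed cost: sum over pairs of w[{i, j}] * (leaves under their LCA).

    w maps frozenset({i, j}) to a similarity; missing pairs count as 0.
    """
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(c) for c in tree]
    size = sum(len(group) for group in child_leaves)
    cost = sum(hierarchy_cost(c, w) for c in tree)
    # Pairs whose members fall in different children are split at this
    # node, so this node is their lowest common ancestor.
    for a in range(len(child_leaves)):
        for b in range(a + 1, len(child_leaves)):
            for i in child_leaves[a]:
                for j in child_leaves[b]:
                    cost += w.get(frozenset((i, j)), 0.0) * size
    return cost


# Two tight pairs (0, 1) and (2, 3) joined by one weak link (1, 2).
w = {
    frozenset((0, 1)): 1.0,
    frozenset((2, 3)): 1.0,
    frozenset((1, 2)): 0.1,
}
good = ((0, 1), (2, 3))   # keeps the tight pairs together
bad = ((0, 2), (1, 3))    # splits both tight pairs at the root
print(hierarchy_cost(good, w), hierarchy_cost(bad, w))
```

On this instance the hierarchy that first cuts the weak link receives the lower cost, which matches the abstract's claim that the criterion behaves sensibly on canonical instances.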