Methods of Hierarchical Clustering
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm. Comment: 21 pages, 2 figures, 1 table, 69 references
Coresets for the Nearest-Neighbor Rule
Given a training set of labeled points, the nearest-neighbor rule
predicts the class of an unlabeled query point as the label of its closest
point in the set. To improve the time and space complexity of classification, a
natural question is how to reduce the training set without significantly
affecting the accuracy of the nearest-neighbor rule. Nearest-neighbor
condensation deals with finding a subset R of the training set P such that for
every point p in P, p's nearest neighbor in R has the same label as p. This
relates to the concept of coresets, which can be broadly defined as subsets of
the set, such that an exact result on the coreset corresponds to an approximate
result on the original set. However, the guarantees of a coreset hold for any
query point, and not only for the points of the training set.
This paper introduces the concept of coresets for nearest-neighbor
classification. We extend existing criteria used for condensation, and prove
sufficient conditions to correctly classify any query point when using these
subsets. Additionally, we prove that finding such subsets of minimum
cardinality is NP-hard, and propose quadratic-time approximation algorithms
with provable upper-bounds on the size of their selected subsets. Moreover, we
show how to improve one of these algorithms to have subquadratic runtime, being
the first of this kind for condensation.
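The condensation criterion described in this abstract can be illustrated with a small sketch. The code below is not one of the paper's algorithms: it uses the classic Hart-style heuristic (add any misclassified training point until none remain) as a stand-in, and the function names `nearest_label`, `is_consistent`, and `condense` are hypothetical. It assumes no two identical points carry different labels.

```python
from math import dist

def nearest_label(query, points, labels):
    """Nearest-neighbor rule: return the label of the point closest to query."""
    best = min(range(len(points)), key=lambda i: dist(query, points[i]))
    return labels[best]

def is_consistent(subset_idx, points, labels):
    """Check the condensation criterion: every training point's nearest
    neighbor inside the subset must carry the same label as the point."""
    sub_pts = [points[i] for i in subset_idx]
    sub_lbl = [labels[i] for i in subset_idx]
    return all(nearest_label(p, sub_pts, sub_lbl) == l
               for p, l in zip(points, labels))

def condense(points, labels):
    """Hart-style heuristic (an assumption, not the paper's method):
    repeatedly add any training point the current subset misclassifies."""
    subset = [0]
    changed = True
    while changed:
        changed = False
        for i, (p, l) in enumerate(zip(points, labels)):
            if i in subset:
                continue  # points already selected are never re-added
            sub_pts = [points[j] for j in subset]
            sub_lbl = [labels[j] for j in subset]
            if nearest_label(p, sub_pts, sub_lbl) != l:
                subset.append(i)
                changed = True
    return subset

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
subset = condense(points, labels)
print(subset, is_consistent(subset, points, labels))
```

Note that such a subset only guarantees correct labels for the training points themselves; the coreset notion in the abstract asks for guarantees on arbitrary query points, which this sketch does not provide.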
k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)
Perhaps the most straightforward classifier in the arsenal of machine
learning techniques is the Nearest Neighbour Classifier -- classification is
achieved by identifying the nearest neighbours to a query example and using
those neighbours to determine the class of the query. This approach to
classification is of particular importance because issues of poor run-time
performance are not such a problem these days with the computational power that
is available. This paper presents an overview of techniques for Nearest
Neighbour classification, focusing on mechanisms for assessing similarity
(distance), computational issues in identifying nearest neighbours and
mechanisms for reducing the dimension of the data.
This paper is the second edition of a paper previously published as a
technical report. Sections on similarity measures for time-series, retrieval
speed-up and intrinsic dimensionality have been added. An Appendix is included
providing access to Python code for the key methods. Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kNN
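As a minimal sketch of the idea summarized in this abstract (find the nearest neighbours of a query, then let them vote on its class), the following uses only the Python standard library. It is not the code from the paper's appendix; the function name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(query, examples, k=3):
    """Classify query by majority vote among its k nearest examples.

    examples is a list of (feature_vector, label) pairs; Euclidean
    distance is assumed as the similarity mechanism.
    """
    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

examples = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
            ((5, 5), "b"), ((5, 6), "b")]
print(knn_predict((0.2, 0.1), examples))
print(knn_predict((5, 5.4), examples))
```

This brute-force version scans every example for each query; the retrieval speed-up techniques the abstract mentions are aimed at avoiding that full scan.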
MRPR: a MapReduce solution for prototype reduction in big data classification
In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embrace the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim at representing original training data sets as a reduced number of instances. Their main purposes are to speed up the classification process and reduce the storage requirements and sensitivity to noise of the nearest neighbor rule. However, the standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms through a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss. We test the speed-up capabilities of our model with data sets of up to 5.7 million instances. The results show that this model is a suitable tool to enhance the performance of the nearest neighbor classifier with big data.
FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems
In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR2-BD. The key of our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model's predictive quality to be kept in a range determined by a user's threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a k-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values within about 1% of the baseline.
Affiliation: Basgall, María José. Universidad de Granada; España. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina
Affiliation: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina
Affiliation: Fernández, Alberto. Universidad de Granada; España
Evolution of networks
We review the recent fast progress in statistical physics of evolving
networks. Interest has focused mainly on the structural properties of random
complex networks in communications, biology, social sciences and economics. A
number of giant artificial networks of such a kind came into existence
recently. This opens a wide field for the study of their topology, evolution,
and complex processes occurring in them. Such networks possess a rich set of
scaling properties. A number of them are scale-free and show striking
resilience against random breakdowns. In spite of large sizes of these
networks, the distances between most of their vertices are short -- a feature
known as the "small-world" effect. We discuss how growing networks
self-organize into scale-free structures and the role of the mechanism of
preferential linking. We consider the topological and structural properties of
evolving networks, and percolation in these networks. We present a number of
models demonstrating the main features of evolving networks and discuss current
approaches for their simulation and analytical study. Applications of the
general results to particular networks in Nature are discussed. We demonstrate
the generic connections of the network growth processes with the general
problems of non-equilibrium physics, econophysics, evolutionary biology, etc. Comment: 67 pages, updated, revised, and extended version of review, submitted
to Adv. Phys.
Distributed Detection and Estimation in Wireless Sensor Networks
In this article we consider the problems of distributed detection and
estimation in wireless sensor networks. In the first part, we provide a general
framework aimed at showing how an efficient design of a sensor network requires a
joint organization of in-network processing and communication. Then, we recall
the basic features of the consensus algorithm, which is a basic tool to reach
globally optimal decisions through a distributed approach. The main part of the
paper starts by addressing the distributed estimation problem. We first show an
entirely decentralized approach, where observations and estimations are
performed without the intervention of a fusion center. Then, we consider the
case where the estimation is performed at a fusion center, showing how to
allocate quantization bits and transmit powers in the links between the nodes
and the fusion center, in order to accommodate the requirement on the maximum
estimation variance, under a constraint on the global transmit power. We extend
the approach to the detection problem. Also in this case, we consider the
distributed approach, where every node can achieve a globally optimal decision,
and the case where the decision is taken at a central node. In the latter case,
we show how to allocate coding bits and transmit power in order to maximize the
detection probability, under constraints on the false alarm rate and the global
transmit power. Then, we generalize consensus algorithms, illustrating a
distributed procedure that converges to the projection of the observation
vector onto a signal subspace. We then address the issue of energy consumption
in sensor networks, thus showing how to optimize the network topology in order
to minimize the energy necessary to achieve a global consensus. Finally, we
address the problem of matching the topology of the network to the graph
describing the statistical dependencies among the observed variables. Comment: 92 pages, 24 figures. To appear in E-Reference Signal Processing, R.
Chellappa and S. Theodoridis, Eds., Elsevier, 201
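The consensus algorithm recalled in this abstract can be illustrated with a minimal sketch of linear average consensus, in which every node repeatedly moves its value toward those of its neighbours and no fusion center is involved. This is a generic textbook form, not the article's specific formulation; the names `consensus_step` and `run_consensus`, the step size `eps`, and the four-node ring topology are assumptions.

```python
def consensus_step(values, neighbours, eps):
    """One synchronous update: each node i moves toward its neighbours,
    x_i <- x_i + eps * sum_j (x_j - x_i) over j in neighbours[i]."""
    return [x + eps * sum(values[j] - x for j in neighbours[i])
            for i, x in enumerate(values)]

def run_consensus(values, neighbours, eps=0.25, iterations=200):
    """Iterate the update. On a connected undirected graph with eps below
    1 / (maximum degree), all values approach the initial average."""
    for _ in range(iterations):
        values = consensus_step(values, neighbours, eps)
    return values

# Hypothetical ring of four sensor nodes; neighbours[i] lists node i's links.
neighbours = [[1, 3], [0, 2], [1, 3], [2, 0]]
readings = [1.0, 2.0, 3.0, 6.0]
result = run_consensus(readings, neighbours)
print(result)
```

Each node only ever uses values from its direct neighbours, which is the sense in which a globally meaningful quantity (here the network-wide average) is reached through purely local exchanges.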