Search CORE

3,055 research outputs found

Methods of Hierarchical Clustering

Author: Contreras Pedro
Murtagh Fionn
Publication venue
Publication date: 01/01/2011
Field of study

We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm.Comment: 21 pages, 2 figures, 1 table, 69 reference

arXiv.org e-Print Archive

Royal Holloway Research Online

Royal Holloway - Pure

Coresets for the Nearest-Neighbor Rule

Author: Flores-Velazco Alejandro
Mount David M.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual European Symposium on Algorithms (ESA 2020)
Publication date: 01/01/2020
Field of study

Given a training set

P

of labeled points, the nearest-neighbor rule predicts the class of an unlabeled query point as the label of its closest point in the set. To improve the time and space complexity of classification, a natural question is how to reduce the training set without significantly affecting the accuracy of the nearest-neighbor rule. Nearest-neighbor condensation deals with finding a subset

R \subseteq P

such that for every point

p \in P

p

's nearest-neighbor in

R

has the same label as

p

. This relates to the concept of coresets, which can be broadly defined as subsets of the set, such that an exact result on the coreset corresponds to an approximate result on the original set. However, the guarantees of a coreset hold for any query point, and not only for the points of the training set. This paper introduces the concept of coresets for nearest-neighbor classification. We extend existing criteria used for condensation, and prove sufficient conditions to correctly classify any query point when using these subsets. Additionally, we prove that finding such subsets of minimum cardinality is NP-hard, and propose quadratic-time approximation algorithms with provable upper-bounds on the size of their selected subsets. Moreover, we show how to improve one of these algorithms to have subquadratic runtime, being the first of this kind for condensation

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

Author: Cunningham Padraig
Delany Sarah Jane
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 29/04/2020
Field of study

Perhaps the most straightforward classifier in the arsenal or machine learning techniques is the Nearest Neighbour Classifier -- classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance is not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification focusing on; mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN

arXiv.org e-Print Archive

Arrow@TUDublin

MRPR: a MapReduce solution for prototype reduction in big data classification

Author: Alpaydin
Angiulli
Bacardit
Cano
Caruana
Chang
Chen
Cover
Daniel Peralta
Dean
Dean
Derrac
Derrac
Derrac
Francisco Herrera
García
García
García-Pedrajas
García-Pedrajas
Hart
He
Isaac Triguero
Jaume Bacardit
Kohonen
Lam
Marx
Minelli
Mollineda
Nanni
Neri
Palit
Price
Pyle
Sakr
Salvador García
Snir
Srinivasan
Sánchez
Sánchez
Triguero
Triguero
Triguero
White
Wilson
Wilson
Witten
Woniak
Zhao
Publication venue: 'Elsevier BV'
Publication date: 03/03/2014
Field of study

In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools in such data sets is not straightforward. Hence, a new class of scalable mining method that embraces the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim at representing original training data sets as a reduced number of instances. Their main purposes are to speed up the classification process and reduce the storage requirements and sensitivity to noise of the nearest neighbor rule. However, the standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms through a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied over big data classification problems without significant accuracy loss. We test the speeding up capabilities of our model with data sets up to 5.7 millions of instances. The results show that this model is a suitable tool to enhance the performance of the nearest neighbor classifier with big data

Nottingham ePrints

Nottingham eTheses

Crossref

Repository@Nottingham

Repositorio Institucional Universidad de Granada

FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems

Author: Basgall María José
Fernández Alberto
Naiouf Ricardo Marcelo
Publication venue: 'MDPI AG'
Publication date: 01/08/2021
Field of study

In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR2-BD. The key of our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model’s predictive quality to be kept in a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a k-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values around 1% of the baseline.Fil: Basgall, María José. Universidad de Granada; España. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; ArgentinaFil: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; ArgentinaFil: Fernández, Alberto. Universidad de Granada; Españ

CONICET Digital

Evolution of networks

Author: Carlos Riveros
Natashia Bol
Patrick Guiney
Re Mendes
Publication venue: 'Informa UK Limited'
Publication date: 07/09/2001
Field of study

We review the recent fast progress in statistical physics of evolving networks. Interest has focused mainly on the structural properties of random complex networks in communications, biology, social sciences and economics. A number of giant artificial networks of such a kind came into existence recently. This opens a wide field for the study of their topology, evolution, and complex processes occurring in them. Such networks possess a rich set of scaling properties. A number of them are scale-free and show striking resilience against random breakdowns. In spite of large sizes of these networks, the distances between most their vertices are short -- a feature known as the ``small-world'' effect. We discuss how growing networks self-organize into scale-free structures and the role of the mechanism of preferential linking. We consider the topological and structural properties of evolving networks, and percolation in these networks. We present a number of models demonstrating the main features of evolving networks and discuss current approaches for their simulation and analytical study. Applications of the general results to particular networks in Nature are discussed. We demonstrate the generic connections of the network growth processes with the general problems of non-equilibrium physics, econophysics, evolutionary biology, etc.Comment: 67 pages, updated, revised, and extended version of review, submitted to Adv. Phy

arXiv.org e-Print Archive

CiteSeerX

Crossref

Distributed Detection and Estimation in Wireless Sensor Networks

Author: Barbarossa Sergio
Di Lorenzo Paolo
Sardellitti Stefania
Publication venue
Publication date: 05/07/2013
Field of study

In this article we consider the problems of distributed detection and estimation in wireless sensor networks. In the first part, we provide a general framework aimed to show how an efficient design of a sensor network requires a joint organization of in-network processing and communication. Then, we recall the basic features of consensus algorithm, which is a basic tool to reach globally optimal decisions through a distributed approach. The main part of the paper starts addressing the distributed estimation problem. We show first an entirely decentralized approach, where observations and estimations are performed without the intervention of a fusion center. Then, we consider the case where the estimation is performed at a fusion center, showing how to allocate quantization bits and transmit powers in the links between the nodes and the fusion center, in order to accommodate the requirement on the maximum estimation variance, under a constraint on the global transmit power. We extend the approach to the detection problem. Also in this case, we consider the distributed approach, where every node can achieve a globally optimal decision, and the case where the decision is taken at a central node. In the latter case, we show how to allocate coding bits and transmit power in order to maximize the detection probability, under constraints on the false alarm rate and the global transmit power. Then, we generalize consensus algorithms illustrating a distributed procedure that converges to the projection of the observation vector onto a signal subspace. We then address the issue of energy consumption in sensor networks, thus showing how to optimize the network topology in order to minimize the energy necessary to achieve a global consensus. Finally, we address the problem of matching the topology of the network to the graph describing the statistical dependencies among the observed variables.Comment: 92 pages, 24 figures. To appear in E-Reference Signal Processing, R. Chellapa and S. Theodoridis, Eds., Elsevier, 201

arXiv.org e-Print Archive

Archivio della ricerca- Università di Roma La Sapienza