3,055 research outputs found

    Methods of Hierarchical Clustering

    Get PDF
    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm.Comment: 21 pages, 2 figures, 1 table, 69 reference

    Coresets for the Nearest-Neighbor Rule

    Get PDF
    Given a training set PP of labeled points, the nearest-neighbor rule predicts the class of an unlabeled query point as the label of its closest point in the set. To improve the time and space complexity of classification, a natural question is how to reduce the training set without significantly affecting the accuracy of the nearest-neighbor rule. Nearest-neighbor condensation deals with finding a subset R⊆PR \subseteq P such that for every point p∈Pp \in P, pp's nearest-neighbor in RR has the same label as pp. This relates to the concept of coresets, which can be broadly defined as subsets of the set, such that an exact result on the coreset corresponds to an approximate result on the original set. However, the guarantees of a coreset hold for any query point, and not only for the points of the training set. This paper introduces the concept of coresets for nearest-neighbor classification. We extend existing criteria used for condensation, and prove sufficient conditions to correctly classify any query point when using these subsets. Additionally, we prove that finding such subsets of minimum cardinality is NP-hard, and propose quadratic-time approximation algorithms with provable upper-bounds on the size of their selected subsets. Moreover, we show how to improve one of these algorithms to have subquadratic runtime, being the first of this kind for condensation

    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Get PDF
    Perhaps the most straightforward classifier in the arsenal or machine learning techniques is the Nearest Neighbour Classifier -- classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance is not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification focusing on; mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN

    MRPR: a MapReduce solution for prototype reduction in big data classification

    Get PDF
    In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools in such data sets is not straightforward. Hence, a new class of scalable mining method that embraces the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim at representing original training data sets as a reduced number of instances. Their main purposes are to speed up the classification process and reduce the storage requirements and sensitivity to noise of the nearest neighbor rule. However, the standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms through a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied over big data classification problems without significant accuracy loss. We test the speeding up capabilities of our model with data sets up to 5.7 millions of instances. The results show that this model is a suitable tool to enhance the performance of the nearest neighbor classifier with big data

    FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems

    Get PDF
    In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR2-BD. The key of our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model’s predictive quality to be kept in a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a k-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values around 1% of the baseline.Fil: Basgall, María José. Universidad de Granada; España. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; ArgentinaFil: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; ArgentinaFil: Fernández, Alberto. Universidad de Granada; Españ

    Evolution of networks

    Full text link
    We review the recent fast progress in statistical physics of evolving networks. Interest has focused mainly on the structural properties of random complex networks in communications, biology, social sciences and economics. A number of giant artificial networks of such a kind came into existence recently. This opens a wide field for the study of their topology, evolution, and complex processes occurring in them. Such networks possess a rich set of scaling properties. A number of them are scale-free and show striking resilience against random breakdowns. In spite of large sizes of these networks, the distances between most their vertices are short -- a feature known as the ``small-world'' effect. We discuss how growing networks self-organize into scale-free structures and the role of the mechanism of preferential linking. We consider the topological and structural properties of evolving networks, and percolation in these networks. We present a number of models demonstrating the main features of evolving networks and discuss current approaches for their simulation and analytical study. Applications of the general results to particular networks in Nature are discussed. We demonstrate the generic connections of the network growth processes with the general problems of non-equilibrium physics, econophysics, evolutionary biology, etc.Comment: 67 pages, updated, revised, and extended version of review, submitted to Adv. Phy

    Distributed Detection and Estimation in Wireless Sensor Networks

    Full text link
    In this article we consider the problems of distributed detection and estimation in wireless sensor networks. In the first part, we provide a general framework aimed to show how an efficient design of a sensor network requires a joint organization of in-network processing and communication. Then, we recall the basic features of consensus algorithm, which is a basic tool to reach globally optimal decisions through a distributed approach. The main part of the paper starts addressing the distributed estimation problem. We show first an entirely decentralized approach, where observations and estimations are performed without the intervention of a fusion center. Then, we consider the case where the estimation is performed at a fusion center, showing how to allocate quantization bits and transmit powers in the links between the nodes and the fusion center, in order to accommodate the requirement on the maximum estimation variance, under a constraint on the global transmit power. We extend the approach to the detection problem. Also in this case, we consider the distributed approach, where every node can achieve a globally optimal decision, and the case where the decision is taken at a central node. In the latter case, we show how to allocate coding bits and transmit power in order to maximize the detection probability, under constraints on the false alarm rate and the global transmit power. Then, we generalize consensus algorithms illustrating a distributed procedure that converges to the projection of the observation vector onto a signal subspace. We then address the issue of energy consumption in sensor networks, thus showing how to optimize the network topology in order to minimize the energy necessary to achieve a global consensus. Finally, we address the problem of matching the topology of the network to the graph describing the statistical dependencies among the observed variables.Comment: 92 pages, 24 figures. To appear in E-Reference Signal Processing, R. Chellapa and S. Theodoridis, Eds., Elsevier, 201
    • …
    corecore