11 research outputs found

    Methods of Hierarchical Clustering

    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. Comment: 21 pages, 2 figures, 1 table, 69 references

    Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement

    We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We also offer insights on the practical implications of the precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections. Comment: 17 pages, 45 citations, 2 figures
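The abstract above mentions random projections as a way to handle multidimensional data. The following is only a minimal sketch of that idea, not the paper's procedure: it assumes a single projection direction drawn uniformly at random and a simple min-max rescaling to [0, 1), so that the resulting scalars could later be grouped by their shared leading digits. The function names `random_projection_1d` and `rescale_to_unit_interval` are hypothetical.

```python
import random


def random_projection_1d(points, seed=0):
    """Project each vector onto one random direction (assumed uniform in [0, 1))."""
    rng = random.Random(seed)
    dim = len(points[0])
    # One random weight per coordinate; the choice of distribution is an assumption.
    direction = [rng.random() for _ in range(dim)]
    return [sum(p_i * r_i for p_i, r_i in zip(p, direction)) for p in points]


def rescale_to_unit_interval(values):
    """Map values to [0, 1) so that each has a well-defined digit expansion."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0] * len(values)
    # Divide by slightly more than the span so the maximum stays strictly below 1.
    return [(v - lo) / (span * (1 + 1e-9)) for v in values]


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [9.0, 8.0, 7.0]]
    scalars = rescale_to_unit_interval(random_projection_1d(data, seed=42))
    print(scalars)
```

A single pass over the data suffices for the projection, and a second pass for the rescaling, so the cost grows linearly with the number of points for a fixed dimension.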

    Algorithms for Hierarchical Clustering: An Overview, II

    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. This review adds to the earlier version, Murtagh and Contreras (2012).

    A Cybernetics Update for Competitive Deep Learning System

    A number of recent reports in the peer-reviewed literature have discussed irreproducibility of results in biomedical research. Some of these articles suggest that the inability of independent research laboratories to replicate published results has a negative impact on the development of, and confidence in, the biomedical research enterprise. To obtain more resilient data and more reproducible results, we present an adaptive and learning system reference architecture for a smart learning system interface. For deeper inspiration, we focus our attention on mammalian brain neurophysiology. From a neurophysiological point of view, the neuroscientist LeDoux finds two preferential amygdala pathways in the brain of the laboratory mouse. The low road is a pathway that transmits a signal from a stimulus to the thalamus, and then to the amygdala, which then activates a fast response in the body. The high road is activated simultaneously. This is a slower road which also includes the cortical parts of the brain, thus creating a conscious impression of what the stimulus is (to develop a rational mechanism of defense, for instance). To mimic this biological reality, our main idea is to use a new input node able to coherently bind known information to unknown information. Then, unknown "environmental noise" and/or local "signal input" information can be aggregated with known "system internal control status" information to provide a landscape of attractor points, from which either a fast or a slower, deeper system response can be computed. In this way, ideal cybernetics system interaction levels can be matched exactly to practical system modeling interaction styles, with no paradigmatic operational ambiguity and minimal information loss. The present paper is a relevant contribution to updating classic cybernetics towards a new General Theory of Systems, a post-Bertalanffy Systemics.

    Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding

    Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.

    Search and Retrieval in Massive Data Collections

    The main goal of this research is to produce a novel and efficient searching application by means of best match and proximity searching, with particular application to very large numeric and textual data stores. In today’s world a huge amount of information is produced. Almost every part of our society is touched by systems that collect, store and analyse data. As an example I mention the case of scientific instrumentation: new sensors capture massive amounts of information (e.g. new telescopes acquiring data from different regions of the spectrum). Descriptions of biological and chemical interactions also produce large and complex amounts of data. It is in this context that a big challenge for current analysis algorithms is presented. Many of the traditional methods for data analysis do not scale well to massive data sets or to very high dimensional spaces. In this work I introduce a novel (ultrametric) distance called Baire, based on the longest common prefix, and show how it can be used to produce clusters by grouping data in 'bins', taking linear or O(n) computational time. Furthermore, it follows that this distance can be strictly fitted to a hierarchy tree. This is a property that proves very useful for classifying, storing, accessing and retrieving information. I go further to apply this methodology to data from different scientific areas such as astronomy and chemistry to create groups or clusters. Additionally I apply this method to document sets for clustering and retrieval. In particular, I look into the new area of enterprise search to propose a new method to support scalable search and clustering.
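The longest-common-prefix idea in the abstract above can be illustrated with a short sketch. It assumes values in [0, 1) written in base 10, takes the distance between two values as 10 raised to minus the length of their shared leading digits (so the distance is 1 when the first digits differ), and groups values into bins keyed by a fixed-length digit prefix in a single pass. The helper names `digits`, `baire_distance` and `prefix_bins`, and the default precision of 6 digits, are assumptions made for illustration and are not taken from the thesis.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_DOWN


def digits(x, precision=6):
    """Return the first `precision` decimal digits of x in [0, 1) as a string."""
    d = Decimal(str(x))
    if not (Decimal(0) <= d < Decimal(1)):
        raise ValueError("x must lie in [0, 1)")
    # Truncate (never round) so that digit prefixes are stable.
    scaled = (d * (10 ** precision)).to_integral_value(rounding=ROUND_DOWN)
    return str(int(scaled)).zfill(precision)


def common_prefix_length(a, b):
    """Count how many leading characters two digit strings share."""
    k = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        k += 1
    return k


def baire_distance(x, y, precision=6):
    """Distance 10**-k, where k is the length of the longest common digit prefix."""
    k = common_prefix_length(digits(x, precision), digits(y, precision))
    return 10.0 ** -k


def prefix_bins(values, depth, precision=6):
    """Group values by their first `depth` digits in one pass over the data."""
    bins = defaultdict(list)
    for v in values:
        bins[digits(v, precision)[:depth]].append(v)
    return dict(bins)


if __name__ == "__main__":
    values = [0.1234, 0.1239, 0.1301, 0.4567]
    print(baire_distance(0.1234, 0.1239))
    print(prefix_bins(values, depth=1))
    print(prefix_bins(values, depth=2))
```

Increasing `depth` splits each bin into sub-bins, so the bins at successive depths nest inside one another and form a hierarchy; each level costs one pass over the data.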