Incremental Non-Greedy Clustering at Scale

Author: Monath Nicholas
Publication venue: ScholarWorks@UMass Amherst
Publication date: 18/03/2022

Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution put several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for any user-specified similarity function. Hierarchical clusterings are often desired as they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide for both fine-grained clusters as well as uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions. These methods often place requirements on the similarity functions used and/or empirically tend to over-merge clusters, which can lead to inaccurate clusterings. In this thesis, we present incremental and scalable methods for hierarchical clustering to empirically satisfy the above desiderata. Our work aims to represent uncertainty and meaningful alternative clusterings, to efficiently reconsider past decisions in the incremental case, and to use parallelism to scale to massive datasets. Our method, Grinch, handles incrementally arriving data in a non-greedy fashion, by reconsidering past decisions using tree structure re-arrangements (e.g., rotations and grafts) invoked in accordance with the user’s specified similarity function. To achieve scalability to massive datasets, our method, SCC, builds a hierarchical clustering in a level-wise bottom-up manner. Certain clustering decisions are made independently in parallel within each level, and a global similarity threshold schedule prevents greedy over-merging.
We show how SCC can be combined with the tree-structure re-arrangements in Grinch to form a mini-batch algorithm achieving both scalable and incremental performance. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which can better represent uncertainty in clustering by representing overlapping clusters. We introduce an efficient bottom-up method for DAG-structured clustering, Llama. For each of the proposed methods, we provide both a theoretical and empirical analysis. Empirically, our methods achieve state-of-the-art results on clustering benchmarks in both the batch and the incremental settings, including multiple point improvements in dendrogram purity and scalability to billions of points.
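The level-wise, threshold-scheduled merging described in this abstract can be illustrated with a small sketch. This is not the thesis's SCC implementation: the similarity function (inverse Euclidean distance), the average-linkage cluster similarity, the nearest-neighbour merge rule, and all function names are assumptions chosen for illustration. Each cluster's merge decision within a round depends only on the clusters at the start of that round, so the decisions could in principle be made in parallel.

```python
import math

def similarity(a, b):
    # Assumed similarity: inverse of Euclidean distance, in (0, 1].
    return 1.0 / (1.0 + math.dist(a, b))

def avg_linkage(points, c1, c2):
    # Assumed cluster-to-cluster similarity: mean pairwise similarity.
    total = sum(similarity(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def levelwise_clustering(points, thresholds):
    """Build one flat clustering per threshold, bottom-up.

    `thresholds` is a decreasing schedule of similarity thresholds.
    Returns a list of levels; level 0 holds singleton clusters.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    levels = [clusters]
    for tau in thresholds:
        parent = list(range(len(clusters)))

        def find(x):
            # Union-find lookup with path halving.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Each cluster independently picks its most similar neighbour
        # and links to it only if the similarity clears the threshold.
        for i in range(len(clusters)):
            best, best_sim = None, -1.0
            for j in range(len(clusters)):
                if i != j:
                    s = avg_linkage(points, clusters[i], clusters[j])
                    if s > best_sim:
                        best, best_sim = j, s
            if best is not None and best_sim >= tau:
                parent[find(i)] = find(best)

        # Connected components of the accepted links form the next level.
        groups = {}
        for i, c in enumerate(clusters):
            groups.setdefault(find(i), set()).update(c)
        clusters = [frozenset(g) for g in groups.values()]
        levels.append(clusters)
    return levels

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
levels = levelwise_clustering(points, thresholds=[0.5, 0.05])
```

With a high first threshold only the two tight pairs merge; the looser second threshold then joins them into a single root, so the schedule controls how aggressively each level merges.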

ScholarWorks@UMass Amherst

Incremental Hierarchical Clustering driven Automatic Annotations for Unifying IoT Streaming Data

Author: Balakrishna Sivadi
Núñez-Valdez Edward
Solanki Vijender Kumar
Thirumaran M
Publication venue: Universidad Internacional de La Rioja
Publication date: 28/03/2022

In the Internet of Things (IoT), Cyber-Physical Systems (CPS), and sensor technologies, a huge volume and variety of streaming sensor data is generated. The unification of streaming sensor data is a challenging problem. Moreover, the huge amount of raw data has exposed the insufficiency of manual and semi-automatic annotation and has led to increased research on automatic semantic annotation. However, many of the existing semantic annotation mechanisms require many joint conditions that could generate redundant processing of transitional results when annotating the sensor data using SPARQL queries. In this paper, we present an Incremental Hierarchical Clustering Driven Automatic Annotation for IoT Streaming Data (IHC-AA-IoTSD) using SPARQL to improve the annotation efficiency. The processes and corresponding algorithms of the incremental hierarchical clustering driven automatic annotation mechanism are presented in detail, including data classification, incremental hierarchical clustering, querying the extracted data, semantic data annotation, and semantic data integration. IHC-AA-IoTSD has been implemented and evaluated on three healthcare datasets and compared with leading approaches, namely Agent-based Text Labelling and Automatic Selection (ATLAS), Fuzzy-based Automatic Semantic Annotation Method (FBASAM), and an Ontology-based Semantic Annotation Approach (OBSAA), yielding encouraging results with Accuracy of 86.67%, Precision of 87.36%, Recall of 85.48%, and F-score of 85.92% at 100k triple data.

Re-UNIR

CORECLUSTER: A Degeneracy Based Graph Clustering Framework

Author: Giatsidis Christos
Malliaros Fragkiskos
Thilikos Dimitrios M.
Vazirgiannis Michalis
Publication venue: HAL CCSD
Publication date: 29/07/2014

Graph clustering or community detection constitutes an important task for investigating the internal structure of graphs, with a plethora of applications in several domains. Traditional tools for graph clustering, such as spectral methods, typically suffer from high time and space complexity. In this article, we present CoreCluster, an efficient graph clustering framework based on the concept of graph degeneracy, that can be used along with any known graph clustering algorithm. Our approach capitalizes on processing the graph in a hierarchical manner provided by its core expansion sequence, an ordered partition of the graph into different levels according to the k-core decomposition. Such a partition provides a way to process the graph in an incremental manner that preserves its clustering structure, while making the execution of the chosen clustering algorithm much faster due to the smaller size of the graph's partitions onto which the algorithm operates.
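As a rough illustration of the k-core decomposition this abstract relies on, the sketch below computes each node's core number by repeatedly peeling the node of minimum remaining degree, then groups nodes into levels by core number. It is a plain, unoptimized version of the standard peeling idea, not the CoreCluster framework itself; the function names and the highest-core-first ordering of levels are assumptions made for the example.

```python
from collections import defaultdict

def core_numbers(edges):
    """Return {node: core number} for an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        n = min(remaining, key=degree.__getitem__)
        # The core number never decreases along the peeling order.
        k = max(k, degree[n])
        core[n] = k
        remaining.remove(n)
        for m in adj[n]:
            if m in remaining:
                degree[m] -= 1
    return core

def core_levels(edges):
    """Group nodes by core number, densest level first (an assumed ordering)."""
    levels = defaultdict(set)
    for node, c in core_numbers(edges).items():
        levels[c].add(node)
    return [(k, levels[k]) for k in sorted(levels, reverse=True)]

# A triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
cores = core_numbers(edges)
levels = core_levels(edges)
```

In this toy graph the triangle forms the 2-core and the pendant node belongs only to the 1-core, so a clustering algorithm could be run on the small dense level first and extended outward level by level.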

HAL-Polytechnique

CLUSTERING LONG-DISTANCE RUNNERS BASED ON THEIR TECHNIQUE AT ONE SINGLE SPEED DOES NOT GENERALISE TO MULTIPLE SPEEDS

Author: Cazzola Dario
Chen Xi
Preatoni Ezio
Rivadulla Adrian R
Trewartha Grant
Publication venue: NMU Commons
Publication date: 12/07/2023

The aim of this study was to assess whether clustering runners based on their technique resulted in consistent group allocations across multiple speeds. Eighty-four runners (34 females) completed four 4-minute running stages at 10, 11, 12 and 13 km/h. For each stage, running technique was characterised using a set of continuous variables in the sagittal plane and discrete stride-based variables. An autoencoder neural network was used for dimensionality reduction and agglomerative hierarchical clustering was applied to identify groups of runners with a similar technique. Two clusters for each speed were selected and the clustering partitions at different incremental speeds were compared. Our results showed that partitions were inconsistent across speeds, and therefore clustering results at one single speed do not generalise to the range of speeds an athlete typically runs at. Single-speed clustering may be of limited use for driving the design of cluster-specific running training interventions, and different clustering approaches are needed to better capture runners’ technique at their typical speeds.

Northern Michigan University: The Commons

Web Document Clustering Using Document Index Graph

Author: A A Chaudhari
B F Momin
P J Kulkarni
Publication date: 06/03/2020

Document clustering is an important tool for many Information Retrieval (IR) tasks. The huge increase in the amount of information present on the web poses new challenges in clustering with regard to the underlying data model and the nature of the clustering algorithm. Document clustering techniques mostly rely on single-term analysis of the document data set. To achieve more accurate document clustering, more informative features such as phrases are important. Hence the first part of the paper presents a phrase-based model, the Document Index Graph (DIG), which allows incremental phrase-based encoding of documents and efficient phrase matching. It emphasizes the effectiveness of a phrase-based similarity measure over traditional single-term-based similarities. In the second part, a Document Index Graph based Clustering (DIGBC) algorithm is proposed to enhance the DIG model for incremental and soft clustering. This algorithm incrementally clusters documents based on the proposed cluster-document similarity measure. It allows assignment of a document to more than one cluster. The DIGBC algorithm is more efficient than existing clustering algorithms such as single pass, K-NN and Hierarchical Agglomerative Clustering (HAC).

CiteSeerX

Hierarchical Co-Clustering: Off-line and Incremental Approaches

Author: Ienco Dino
Meo Rosa
Pensa Ruggero Gaetano
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard. Moreover, with high-dimensional data the classic metrics fail to identify the real similarities between objects. The second challenge is the evolving nature of the observed phenomena, which causes the datasets to accumulate over time. In this paper we propose solutions to both problems. To tackle the high-dimensionality problem, we propose to apply a co-clustering approach on the dataset that stores the occurrence of features in the observed objects. Co-clustering computes a partition of objects and a partition of features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters in a hierarchical fashion, and it consists of two hierarchies: one on the objects and one on the features. The two hierarchies are coupled because the clusters at a certain level in one hierarchy are coupled with the clusters at the same level of the other hierarchy and form the co-clusters. Each cluster of one of the two hierarchies thus provides insights on the clusters of the other hierarchy. Another novelty of the proposed solution is that the number of clusters is possibly unlimited. Nevertheless, the produced hierarchies are still compact and therefore more readable because our method allows multiple splits of a cluster at the lower level. As regards the second challenge, the accumulating nature of the data makes the datasets intractably huge over time. In this case, an incremental solution relieves the issue because it partitions the problem. In this paper we introduce an incremental version of our algorithm of hierarchical co-clustering. It starts from an intermediate solution computed on the previous version of the data and it updates the co-clustering results considering only the added block of data.
This solution has the merit of speeding up the computation with respect to the original approach, which would recompute the result on the overall dataset. In addition, the incremental algorithm guarantees approximately the same answer as the original version, but at a much lower computational cost. We validate the incremental approach on several high-dimensional datasets and perform an accurate comparison with both the original version of our algorithm and the state-of-the-art competitors. The obtained results open the way to a novel usage of co-clustering algorithms in which it is advantageous to partition the data into several blocks and process them incrementally, thus "incorporating" data gradually into an on-going co-clustering solution.

Institutional Research Information System University of Turin