Scalable Hierarchical Clustering with Tree Grafting
We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical
clustering with general linkage functions that compute arbitrary similarity
between two point sets. The key components of Grinch are its rotate and graft
subroutines that efficiently reconfigure the hierarchy as new points arrive,
supporting discovery of clusters with complex structure. Grinch is motivated by
a new notion of separability for clustering with linkage functions: we prove
that when the model is consistent with a ground-truth clustering, Grinch is
guaranteed to produce a cluster tree containing the ground-truth, independent
of data arrival order. Our empirical results on benchmark and author
coreference datasets (with standard and learned linkage functions) show that
Grinch is more accurate than other scalable methods, and orders of magnitude
faster than hierarchical agglomerative clustering.
Comment: 23 pages (appendix included), published at KDD 2019
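To make the rotate-based restructuring concrete, here is a minimal Python sketch of incremental insertion in this style; the toy 1-D linkage, the simplified rotate condition, and the omission of grafting are all expository assumptions, not the authors' implementation.

# Illustrative sketch of Grinch-style incremental insertion with a simplified
# "rotate" repair step. The toy linkage and all names are assumptions for
# exposition, not the authors' reference implementation.

class Node:
    def __init__(self, point=None, left=None, right=None):
        self.point, self.left, self.right, self.parent = point, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self

    def leaves(self):
        if self.point is not None:            # leaf node
            return [self.point]
        return self.left.leaves() + self.right.leaves()

def linkage(a, b):
    # Toy average linkage over 1-D points; Grinch supports arbitrary
    # user-specified linkage functions over point sets.
    la, lb = a.leaves(), b.leaves()
    return -sum(abs(x - y) for x in la for y in lb) / (len(la) * len(lb))

def nearest_leaf(node, x):
    if node.point is not None:
        return node
    l, r = nearest_leaf(node.left, x), nearest_leaf(node.right, x)
    return l if abs(l.point - x) <= abs(r.point - x) else r

def sibling(node):
    p = node.parent
    return p.left if p.right is node else p.right

def insert(root, x):
    # Attach x beside its nearest leaf, then move the attachment point
    # upward while the linkage prefers pairing x with the sibling subtree.
    v = Node(point=x)
    target = nearest_leaf(root, x)
    while (target.parent is not None
           and linkage(v, sibling(target)) > linkage(v, target)):
        target = target.parent                # simplified rotate step
    parent = target.parent
    internal = Node(left=target, right=v)     # new internal node above target
    if parent is None:
        return internal                       # target was the root
    if parent.left is target:
        parent.left = internal
    else:
        parent.right = internal
    internal.parent = parent
    return root

# Usage: grow a tree from a stream of points, in arrival order.
root = Node(point=0.0)
for pt in [0.1, 5.0, 5.2, 0.05]:
    root = insert(root, pt)
print(sorted(root.leaves()))                  # all five points, now in a tree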
Incremental Non-Greedy Clustering at Scale
Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution put several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for arbitrary user-specified similarity functions.
Hierarchical clusterings are often desired because they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide both fine-grained clusters and a way to represent uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions. These methods often impose requirements on the similarity functions used and/or empirically tend to over-merge clusters, which can lead to inaccurate clusterings.
In this thesis, we present incremental and scalable methods for hierarchical clustering that empirically satisfy the above desiderata. Our work aims to represent uncertainty and meaningful alternative clusterings, to efficiently reconsider past decisions in the incremental case, and to use parallelism to scale to massive datasets. Our method, Grinch, handles incrementally arriving data in a non-greedy fashion, reconsidering past decisions through tree-structure rearrangements (e.g., rotations and grafts) invoked in accordance with the user's specified similarity function. To achieve scalability to massive datasets, our method, SCC, builds a hierarchical clustering in a level-wise, bottom-up manner. Certain clustering decisions are made independently in parallel within each level, and a global similarity-threshold schedule prevents greedy over-merging. We show how SCC can be combined with the tree-structure rearrangements in Grinch to form a mini-batch algorithm that is both scalable and incremental. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which can better represent uncertainty in clustering by representing overlapping clusters. We introduce an efficient bottom-up method for DAG-structured clustering, Llama. For each of the proposed methods, we provide both a theoretical and an empirical analysis. Empirically, our methods achieve state-of-the-art results on clustering benchmarks in both the batch and the incremental settings, including multiple-point improvements in dendrogram purity and scalability to billions of points.
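As a rough illustration of SCC's level-wise scheme, the sketch below agglomerates clusters one level at a time under a decreasing similarity-threshold schedule. The pairwise best-partner merge rule, the toy average-linkage similarity, and the threshold values are simplifying assumptions rather than the thesis's exact procedure.

# Illustrative sketch of SCC-style level-wise agglomeration: at every level,
# clusters may merge with their best partner, and a decreasing similarity-
# threshold schedule prevents greedy over-merging. In SCC proper, the
# within-level decisions are made independently in parallel; this sketch
# runs them sequentially for clarity.

import numpy as np

def avg_sim(a, b):
    # Average pairwise similarity (negative Euclidean distance) between two
    # clusters, each an (m, d) array of points.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return -d.mean()

def scc_sketch(points, thresholds):
    clusters = [points[i:i + 1] for i in range(len(points))]
    level_sizes = [[len(c) for c in clusters]]
    for tau in thresholds:                        # one pass per level
        merged, used = [], [False] * len(clusters)
        for i, ci in enumerate(clusters):
            if used[i]:
                continue
            used[i] = True
            # Best partner for ci among the clusters not yet consumed.
            best_j, best_s = None, -np.inf
            for j in range(i + 1, len(clusters)):
                if not used[j]:
                    s = avg_sim(ci, clusters[j])
                    if s > best_s:
                        best_j, best_s = j, s
            if best_j is not None and best_s >= tau:
                merged.append(np.vstack([ci, clusters[best_j]]))
                used[best_j] = True
            else:
                merged.append(ci)
        clusters = merged
        level_sizes.append([len(c) for c in clusters])
    return clusters, level_sizes

# Usage: similarities are negative distances, so the schedule loosens toward
# more negative thresholds at higher levels (an assumed schedule).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
_, sizes = scc_sketch(pts, thresholds=[-0.5, -2.0, -8.0])
print(sizes)   # [[1, 1, 1, 1], [2, 2], [2, 2], [4]]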
Objective-Based Hierarchical Clustering of Deep Embedding Vectors
We initiate a comprehensive experimental study of objective-based
hierarchical clustering methods on massive datasets consisting of deep
embedding vectors from computer vision and NLP applications. This includes a
large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word
embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from
several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our
study includes datasets with up to 4.5 million entries with embedding
dimensions up to 2048.
In order to address the challenge of scaling up hierarchical clustering to
such large datasets we propose a new practical hierarchical clustering
algorithm B++&C. It gives a 5%/20% improvement on average for the popular
Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared
to a wide range of classic methods and recent heuristics. We also introduce a
theoretical algorithm B2SAT&C which achieves a 0.74-approximation for the
CKMM objective in polynomial time. This is the first substantial improvement
over the trivial 2/3-approximation achieved by a random binary tree. Prior to
this work, the best poly-time approximation of approximately 2/3 + 0.0004 was due
to Charikar et al. (SODA'19).
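As background for these objectives, the (unnormalized) Moseley-Wang revenue of a fixed binary tree can be computed directly from its structure: each leaf pair contributes its similarity times the number of leaves outside the subtree rooted at the pair's least common ancestor. The sketch below assumes a nested-tuple tree encoding and a toy similarity table; it is not code from the paper.

# Minimal sketch: evaluate the (unnormalized) Moseley-Wang revenue of a binary
# cluster tree. A pair (i, j) contributes sim(i, j) * (n - |leaves under the
# pair's least common ancestor|). The nested-tuple encoding is an assumed
# convenience, not the paper's representation.

def mw_revenue(tree, sim, n):
    # tree: a leaf id, or a (left, right) tuple of subtrees.
    # Returns (set of leaf ids under tree, revenue accumulated inside tree).
    if not isinstance(tree, tuple):
        return {tree}, 0.0
    left_set, rev_left = mw_revenue(tree[0], sim, n)
    right_set, rev_right = mw_revenue(tree[1], sim, n)
    merged = left_set | right_set
    # Pairs split across the two children have this node as their LCA.
    cross = sum(sim[frozenset((i, j))] for i in left_set for j in right_set)
    return merged, rev_left + rev_right + cross * (n - len(merged))

# Usage: a 4-leaf tree that groups the two high-similarity pairs together.
sim = {frozenset(p): s for p, s in [((0, 1), 0.9), ((2, 3), 0.8),
                                    ((0, 2), 0.1), ((0, 3), 0.1),
                                    ((1, 2), 0.1), ((1, 3), 0.1)]}
_, revenue = mw_revenue(((0, 1), (2, 3)), sim, n=4)
print(revenue)   # ~3.4; higher is better under Moseley-Wang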
Reasoning About User Feedback Under Identity Uncertainty in Knowledge Base Construction
Intelligent, automated systems that are intertwined with everyday life---such as Google Search and virtual assistants like Amazon's Alexa or Apple's Siri---are often powered in part by knowledge bases (KBs), i.e., structured data repositories of entities, their attributes, and the relationships among them. Despite a wealth of research focused on automated KB construction methods, KBs are inevitably imperfect, with errors stemming from various points in the construction pipeline. Making matters more challenging, new data is created daily and must be integrated with existing KBs so that they remain up-to-date. As the primary consumers of KBs, human users have tremendous potential to aid in KB construction by contributing feedback that identifies spurious and missing entity attributes and relations. However, correctly integrating user feedback with an existing KB is complicated by the necessity to resolve identity uncertainty, i.e., uncertainty regarding to which real-world entity a piece of data refers. Identity uncertainty abounds in the collection of raw evidence from which a KB is built. Moreover, it also gives rise to identity uncertainty in user feedback itself when the KB entities to which feedback refers are later split or merged.
In this dissertation, we present a continuous reasoning framework capable of integrating user feedback with a KB under identity uncertainty. To begin, we introduce Grinch, an online entity resolution (ER) algorithm---with provable correctness guarantees---capable of merging and splitting KB entities as new data arrives. We show that Grinch is efficient and achieves state-of-the-art performance in ER as well as in clustering. Next, we propose a method for using Grinch to resolve identity uncertainty in a KB's underlying data as well as in user feedback. Our approach is based on representing user feedback as mentions, i.e., first-class KB objects that participate in all parts of KB construction. Furthermore, we introduce a structured representation for feedback composed of packaging and payload, which facilitates recovery from KB errors that stem from both identity uncertainty and noisy data. Finally, we evaluate our framework's efficacy using data from the KB that supports OpenReview.net---a deployed conference management system that solicits feedback from users. The demands of OpenReview.net lead us to develop XGrinch-Shallow (XGS), a variant of Grinch that builds trees with arbitrary branching factors and consequently instantiates 60% fewer internal nodes than Grinch. Empirically, we show that XGS is efficient and can effectively use user feedback to improve the correctness and completeness of the OpenReview.net KB. We conclude with 7 concrete suggestions for future research on this topic.
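The feedback-as-mentions design with a packaging/payload split can be pictured with a small data-structure sketch; every field name below is an illustrative assumption, not the dissertation's actual schema.

# Illustrative sketch of feedback-as-mentions with a packaging/payload split.
# All class and field names are assumptions for exposition.

from dataclasses import dataclass, field

@dataclass
class Mention:
    # A first-class KB object: raw evidence and user feedback share this
    # type, so both flow through the same entity-resolution pipeline.
    mention_id: str
    attributes: dict

@dataclass
class FeedbackMention(Mention):
    # "Packaging" anchors the feedback to the entity the user saw, so the
    # feedback can be re-resolved if that entity is later split or merged.
    packaging: dict = field(default_factory=dict)
    # "Payload" carries the asserted correction itself.
    payload: dict = field(default_factory=dict)

# Usage: a user corrects an author's affiliation on an entity they viewed.
fb = FeedbackMention(
    mention_id="feedback-001",
    attributes={"source": "user"},
    packaging={"observed_entity": "ent-42",
               "observed_attrs": {"name": "J. Smith"}},
    payload={"assert": {"affiliation": "UMass Amherst"}},
)
print(fb.payload["assert"])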
A New Scalable, Portable, and Memory-Efficient Predictive Analytics Framework for Predicting Time-to-Event Outcomes in Healthcare
Time-to-event outcomes are prevalent in medical research. To handle these outcomes, as well as censored observations, statistical and survival regression methods are widely used under assumptions of linear association; however, clinicopathological features often exhibit nonlinear correlations. Machine learning (ML) algorithms have recently been adapted to handle nonlinear correlations effectively. One drawback of ML models is that they can model idiosyncratic features of a training dataset. Due to this overfitting, ML models perform well on training data but considerably worse on test data. The features that we choose indirectly influence the performance of ML prediction models. With the expansion of big data in biomedical informatics, appropriate feature engineering and feature selection are vital to ML success. In addition, ensemble learning algorithms help decrease bias and variance by combining the predictions of multiple models.
In this study, we constructed a new scalable, portable, and memory-efficient predictive analytics framework that fits four components (feature engineering, survival analysis, feature selection, and ensemble learning) together. Our framework first employs feature engineering techniques, such as binarization, discretization, transformation, and normalization, on the raw dataset. The normalized feature set is then applied to Cox survival regression, which identifies features highly correlated with the outcome. The resultant feature set is passed to the eXtreme Gradient Boosting (XGBoost) and Recursive Feature Elimination algorithms. XGBoost uses a gradient-boosted decision tree algorithm in which new models are created sequentially to predict the residuals of prior models; these predictions are then added together to make the final prediction.
In our experiments, we analyzed a cohort of cardiac surgery patients drawn from a multi-hospital academic health system. The model evaluated 72 perioperative variables that impact readmission within 30 days of discharge, derived 48 significant features, and demonstrated optimal predictive ability with feature sets ranging from 16 to 24. The areas under the receiver operating characteristic curve observed for the feature set of 16 were 0.8816 and 0.9307 at the 35th and 151st iterations, respectively. Our model showed improved performance compared to state-of-the-art models and may thus be more useful for decision support in clinical settings.
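A minimal sketch of the four-stage pipeline might look as follows, assuming scikit-learn, lifelines, and xgboost as stand-ins for the study's tooling; the synthetic data, the eight-feature Cox cutoff, and all model parameters are illustrative.

# Minimal sketch of the four-stage pipeline: feature engineering -> Cox
# survival regression -> recursive feature elimination -> XGBoost ensemble.
# Libraries, data, and every parameter here are illustrative assumptions.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n, p = 500, 10
X = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"f{i}" for i in range(p)])

# Stage 1: feature engineering (normalization shown; binarization,
# discretization, and transformation would slot in here as well).
X[:] = StandardScaler().fit_transform(X)

# Synthetic time-to-event outcome: duration, censoring indicator, and a
# binary 30-day readmission label derived from the duration.
duration = rng.exponential(scale=np.exp(-X["f0"].to_numpy()))
event = rng.integers(0, 2, size=n)
readmit30 = (duration < 0.5).astype(int)

# Stage 2: Cox survival regression ranks features by association with the
# time-to-event outcome; keep the eight most significant (illustrative).
surv = X.assign(duration=duration, event=event)
cox = CoxPHFitter().fit(surv, duration_col="duration", event_col="event")
keep = cox.summary["p"].nsmallest(8).index

# Stages 3-4: recursive feature elimination wrapped around a gradient-boosted
# tree ensemble (XGBoost), predicting 30-day readmission.
selector = RFE(XGBClassifier(n_estimators=50, eval_metric="logloss"),
               n_features_to_select=5).fit(X[keep], readmit30)
print([f for f, kept in zip(keep, selector.support_) if kept])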