Scalable Hierarchical Clustering with Tree Grafting
We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical
clustering with general linkage functions that compute arbitrary similarity
between two point sets. The key components of Grinch are its rotate and graft
subroutines that efficiently reconfigure the hierarchy as new points arrive,
supporting discovery of clusters with complex structure. Grinch is motivated by
a new notion of separability for clustering with linkage functions: we prove
that when the model is consistent with a ground-truth clustering, Grinch is
guaranteed to produce a cluster tree containing the ground-truth, independent
of data arrival order. Our empirical results on benchmark and author
coreference datasets (with standard and learned linkage functions) show that
Grinch is more accurate than other scalable methods, and orders of magnitude
faster than hierarchical agglomerative clustering.
Comment: 23 pages (appendix included), published at KDD 2019
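To make the rotate-based restructuring concrete, here is a minimal Python sketch of incremental insertion in this style; the toy 1-D linkage, the simplified rotate condition, and the omission of grafting are all expository assumptions, not the authors' implementation.

# Illustrative sketch of Grinch-style incremental insertion with a simplified
# "rotate" repair step. The toy linkage and all names are assumptions for
# exposition, not the authors' reference implementation.

class Node:
    def __init__(self, point=None, left=None, right=None):
        self.point, self.left, self.right, self.parent = point, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self

    def leaves(self):
        if self.point is not None:            # leaf node
            return [self.point]
        return self.left.leaves() + self.right.leaves()

def linkage(a, b):
    # Toy average linkage over 1-D points; Grinch supports arbitrary
    # user-specified linkage functions over point sets.
    la, lb = a.leaves(), b.leaves()
    return -sum(abs(x - y) for x in la for y in lb) / (len(la) * len(lb))

def nearest_leaf(node, x):
    if node.point is not None:
        return node
    l, r = nearest_leaf(node.left, x), nearest_leaf(node.right, x)
    return l if abs(l.point - x) <= abs(r.point - x) else r

def sibling(node):
    p = node.parent
    return p.left if p.right is node else p.right

def insert(root, x):
    # Attach x beside its nearest leaf, then move the attachment point
    # upward while the linkage prefers pairing x with the sibling subtree.
    v = Node(point=x)
    target = nearest_leaf(root, x)
    while (target.parent is not None
           and linkage(v, sibling(target)) > linkage(v, target)):
        target = target.parent                # simplified rotate step
    parent = target.parent
    internal = Node(left=target, right=v)     # new internal node above target
    if parent is None:
        return internal                       # target was the root
    if parent.left is target:
        parent.left = internal
    else:
        parent.right = internal
    internal.parent = parent
    return root

# Usage: grow a tree from a stream of points, in arrival order.
root = Node(point=0.0)
for pt in [0.1, 5.0, 5.2, 0.05]:
    root = insert(root, pt)
print(sorted(root.leaves()))                  # all five points, now in a tree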
Incremental Non-Greedy Clustering at Scale
Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution put several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for arbitrary user-specified similarity functions.
Hierarchical clusterings are often desired because they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide both fine-grained clusters and a way to represent uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions. These methods often impose requirements on the similarity functions used and/or empirically tend to over-merge clusters, which can lead to inaccurate clusterings.
In this thesis, we present incremental and scalable methods for hierarchical clustering that empirically satisfy the above desiderata. Our work aims to represent uncertainty and meaningful alternative clusterings, to efficiently reconsider past decisions in the incremental case, and to use parallelism to scale to massive datasets. Our method, Grinch, handles incrementally arriving data in a non-greedy fashion, reconsidering past decisions through tree-structure rearrangements (e.g., rotations and grafts) invoked in accordance with the user's specified similarity function. To achieve scalability to massive datasets, our method, SCC, builds a hierarchical clustering in a level-wise, bottom-up manner. Certain clustering decisions are made independently in parallel within each level, and a global similarity-threshold schedule prevents greedy over-merging. We show how SCC can be combined with the tree-structure rearrangements in Grinch to form a mini-batch algorithm that is both scalable and incremental. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which can better represent uncertainty in clustering by representing overlapping clusters. We introduce an efficient bottom-up method for DAG-structured clustering, Llama. For each of the proposed methods, we provide both a theoretical and an empirical analysis. Empirically, our methods achieve state-of-the-art results on clustering benchmarks in both the batch and the incremental settings, including multiple-point improvements in dendrogram purity and scalability to billions of points.
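As a rough illustration of SCC's level-wise scheme, the sketch below agglomerates clusters one level at a time under a decreasing similarity-threshold schedule. The pairwise best-partner merge rule, the toy average-linkage similarity, and the threshold values are simplifying assumptions rather than the thesis's exact procedure.

# Illustrative sketch of SCC-style level-wise agglomeration: at every level,
# clusters may merge with their best partner, and a decreasing similarity-
# threshold schedule prevents greedy over-merging. In SCC proper, the
# within-level decisions are made independently in parallel; this sketch
# runs them sequentially for clarity.

import numpy as np

def avg_sim(a, b):
    # Average pairwise similarity (negative Euclidean distance) between two
    # clusters, each an (m, d) array of points.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return -d.mean()

def scc_sketch(points, thresholds):
    clusters = [points[i:i + 1] for i in range(len(points))]
    level_sizes = [[len(c) for c in clusters]]
    for tau in thresholds:                        # one pass per level
        merged, used = [], [False] * len(clusters)
        for i, ci in enumerate(clusters):
            if used[i]:
                continue
            used[i] = True
            # Best partner for ci among the clusters not yet consumed.
            best_j, best_s = None, -np.inf
            for j in range(i + 1, len(clusters)):
                if not used[j]:
                    s = avg_sim(ci, clusters[j])
                    if s > best_s:
                        best_j, best_s = j, s
            if best_j is not None and best_s >= tau:
                merged.append(np.vstack([ci, clusters[best_j]]))
                used[best_j] = True
            else:
                merged.append(ci)
        clusters = merged
        level_sizes.append([len(c) for c in clusters])
    return clusters, level_sizes

# Usage: similarities are negative distances, so the schedule loosens toward
# more negative thresholds at higher levels (an assumed schedule).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
_, sizes = scc_sketch(pts, thresholds=[-0.5, -2.0, -8.0])
print(sizes)   # [[1, 1, 1, 1], [2, 2], [2, 2], [4]]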
Objective-Based Hierarchical Clustering of Deep Embedding Vectors
We initiate a comprehensive experimental study of objective-based
hierarchical clustering methods on massive datasets consisting of deep
embedding vectors from computer vision and NLP applications. This includes a
large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word
embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from
several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our
study includes datasets with up to 4.5 million entries with embedding
dimensions up to 2048.
In order to address the challenge of scaling up hierarchical clustering to
such large datasets we propose a new practical hierarchical clustering
algorithm B++&C. It gives a 5%/20% improvement on average for the popular
Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared
to a wide range of classic methods and recent heuristics. We also introduce a
theoretical algorithm B2SAT&C which achieves a 0.74-approximation for the
CKMM objective in polynomial time. This is the first substantial improvement
over the trivial 2/3-approximation achieved by a random binary tree. Prior to
this work, the best poly-time approximation of approximately 2/3 + 0.0004 was due
to Charikar et al. (SODA'19).
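As background for these objectives, the (unnormalized) Moseley-Wang revenue of a fixed binary tree can be computed directly from its structure: each leaf pair contributes its similarity times the number of leaves outside the subtree rooted at the pair's least common ancestor. The sketch below assumes a nested-tuple tree encoding and a toy similarity table; it is not code from the paper.

# Minimal sketch: evaluate the (unnormalized) Moseley-Wang revenue of a binary
# cluster tree. A pair (i, j) contributes sim(i, j) * (n - |leaves under the
# pair's least common ancestor|). The nested-tuple encoding is an assumed
# convenience, not the paper's representation.

def mw_revenue(tree, sim, n):
    # tree: a leaf id, or a (left, right) tuple of subtrees.
    # Returns (set of leaf ids under tree, revenue accumulated inside tree).
    if not isinstance(tree, tuple):
        return {tree}, 0.0
    left_set, rev_left = mw_revenue(tree[0], sim, n)
    right_set, rev_right = mw_revenue(tree[1], sim, n)
    merged = left_set | right_set
    # Pairs split across the two children have this node as their LCA.
    cross = sum(sim[frozenset((i, j))] for i in left_set for j in right_set)
    return merged, rev_left + rev_right + cross * (n - len(merged))

# Usage: a 4-leaf tree that groups the two high-similarity pairs together.
sim = {frozenset(p): s for p, s in [((0, 1), 0.9), ((2, 3), 0.8),
                                    ((0, 2), 0.1), ((0, 3), 0.1),
                                    ((1, 2), 0.1), ((1, 3), 0.1)]}
_, revenue = mw_revenue(((0, 1), (2, 3)), sim, n=4)
print(revenue)   # ~3.4; higher is better under Moseley-Wang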
Reasoning About User Feedback Under Identity Uncertainty in Knowledge Base Construction
Intelligent, automated systems that are intertwined with everyday life---such as Google Search and virtual assistants like Amazon's Alexa or Apple's Siri---are often powered in part by knowledge bases (KBs), i.e., structured data repositories of entities, their attributes, and the relationships among them. Despite a wealth of research focused on automated KB construction methods, KBs are inevitably imperfect, with errors stemming from various points in the construction pipeline. Making matters more challenging, new data is created daily and must be integrated with existing KBs so that they remain up-to-date. As the primary consumers of KBs, human users have tremendous potential to aid in KB construction by contributing feedback that identifies spurious and missing entity attributes and relations. However, correctly integrating user feedback with an existing KB is complicated by the necessity to resolve identity uncertainty, i.e., uncertainty regarding to which real-world entity a piece of data refers. Identity uncertainty abounds in the collection of raw evidence from which a KB is built. Moreover, it also gives rise to identity uncertainty in user feedback itself when the KB entities to which feedback refers are later split or merged.
In this dissertation, we present a continuous reasoning framework capable of integrating user feedback with a KB under identity uncertainty. To begin, we introduce Grinch, an online entity resolution (ER) algorithm---with provable correctness guarantees---capable of merging and splitting KB entities as new data arrives. We show that Grinch is efficient and achieves state-of-the-art performance in ER as well as in clustering. Next, we propose a method for using Grinch to resolve identity uncertainty in a KB's underlying data as well as in user feedback. Our approach is based on representing user feedback as mentions, i.e., first-class KB objects that participate in all parts of KB construction. Furthermore, we introduce a structured representation for feedback composed of packaging and payload, which facilitates recovery from KB errors that stem from both identity uncertainty and noisy data. Finally, we evaluate our framework's efficacy using data from the KB that supports OpenReview.net---a deployed conference management system that solicits feedback from users. The demands of OpenReview.net lead us to develop XGrinch-Shallow (XGS), a variant of Grinch that builds trees with arbitrary branching factors and consequently instantiates 60% fewer internal nodes than Grinch. Empirically, we show that XGS is efficient and can effectively use user feedback to improve the correctness and completeness of the OpenReview.net KB. We conclude with 7 concrete suggestions for future research on this topic.
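The feedback-as-mentions design with a packaging/payload split can be pictured with a small data-structure sketch; every field name below is an illustrative assumption, not the dissertation's actual schema.

# Illustrative sketch of feedback-as-mentions with a packaging/payload split.
# All class and field names are assumptions for exposition.

from dataclasses import dataclass, field

@dataclass
class Mention:
    # A first-class KB object: raw evidence and user feedback share this
    # type, so both flow through the same entity-resolution pipeline.
    mention_id: str
    attributes: dict

@dataclass
class FeedbackMention(Mention):
    # "Packaging" anchors the feedback to the entity the user saw, so the
    # feedback can be re-resolved if that entity is later split or merged.
    packaging: dict = field(default_factory=dict)
    # "Payload" carries the asserted correction itself.
    payload: dict = field(default_factory=dict)

# Usage: a user corrects an author's affiliation on an entity they viewed.
fb = FeedbackMention(
    mention_id="feedback-001",
    attributes={"source": "user"},
    packaging={"observed_entity": "ent-42",
               "observed_attrs": {"name": "J. Smith"}},
    payload={"assert": {"affiliation": "UMass Amherst"}},
)
print(fb.payload["assert"])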
A New Scalable, Portable, and Memory-Efficient Predictive Analytics Framework for Predicting Time-to-Event Outcomes in Healthcare
Time-to-event outcomes are prevalent in medical research. To handle these outcomes, as well as censored observations, statistical and survival regression methods are widely used under assumptions of linear association; however, clinicopathological features often exhibit nonlinear correlations. Machine learning (ML) algorithms have recently been adapted to handle nonlinear correlations effectively. One drawback of ML models is that they can model idiosyncratic features of a training dataset. Due to this overfitting, ML models perform well on training data but considerably worse on test data. The features that we choose indirectly influence the performance of ML prediction models. With the expansion of big data in biomedical informatics, appropriate feature engineering and feature selection are vital to ML success. In addition, ensemble learning algorithms help decrease bias and variance by combining the predictions of multiple models.
In this study, we constructed a new scalable, portable, and memory-efficient predictive analytics framework that fits four components (feature engineering, survival analysis, feature selection, and ensemble learning) together. Our framework first employs feature engineering techniques, such as binarization, discretization, transformation, and normalization, on the raw dataset. The normalized feature set is then applied to Cox survival regression, which identifies features highly correlated with the outcome. The resultant feature set is passed to the eXtreme Gradient Boosting (XGBoost) and Recursive Feature Elimination algorithms. XGBoost uses a gradient-boosted decision tree algorithm in which new models are created sequentially to predict the residuals of prior models; these predictions are then added together to make the final prediction.
In our experiments, we analyzed a cohort of cardiac surgery patients drawn from a multi-hospital academic health system. The model evaluated 72 perioperative variables that impact readmission within 30 days of discharge, derived 48 significant features, and demonstrated optimal predictive ability with feature sets ranging from 16 to 24. The areas under the receiver operating characteristic curve observed for the feature set of 16 were 0.8816 and 0.9307 at the 35th and 151st iterations, respectively. Our model showed improved performance compared to state-of-the-art models and may thus be more useful for decision support in clinical settings.
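A minimal sketch of the four-stage pipeline might look as follows, assuming scikit-learn, lifelines, and xgboost as stand-ins for the study's tooling; the synthetic data, the eight-feature Cox cutoff, and all model parameters are illustrative.

# Minimal sketch of the four-stage pipeline: feature engineering -> Cox
# survival regression -> recursive feature elimination -> XGBoost ensemble.
# Libraries, data, and every parameter here are illustrative assumptions.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n, p = 500, 10
X = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"f{i}" for i in range(p)])

# Stage 1: feature engineering (normalization shown; binarization,
# discretization, and transformation would slot in here as well).
X[:] = StandardScaler().fit_transform(X)

# Synthetic time-to-event outcome: duration, censoring indicator, and a
# binary 30-day readmission label derived from the duration.
duration = rng.exponential(scale=np.exp(-X["f0"].to_numpy()))
event = rng.integers(0, 2, size=n)
readmit30 = (duration < 0.5).astype(int)

# Stage 2: Cox survival regression ranks features by association with the
# time-to-event outcome; keep the eight most significant (illustrative).
surv = X.assign(duration=duration, event=event)
cox = CoxPHFitter().fit(surv, duration_col="duration", event_col="event")
keep = cox.summary["p"].nsmallest(8).index

# Stages 3-4: recursive feature elimination wrapped around a gradient-boosted
# tree ensemble (XGBoost), predicting 30-day readmission.
selector = RFE(XGBClassifier(n_estimators=50, eval_metric="logloss"),
               n_features_to_select=5).fit(X[keep], readmit30)
print([f for f, kept in zip(keep, selector.support_) if kept])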