Incremental Non-Greedy Clustering at Scale
Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution put several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for any user-specified similarity function.
Hierarchical clusterings are often desired, as they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide both fine-grained clusters and a representation of uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions. These methods often impose requirements on the similarity function and/or empirically tend to over-merge clusters, which can lead to inaccurate clusterings.
In this thesis, we present incremental and scalable methods for hierarchical clustering that empirically satisfy the above desiderata. Our work aims to represent uncertainty and meaningful alternative clusterings, to efficiently reconsider past decisions in the incremental case, and to use parallelism to scale to massive datasets. Our method, Grinch, handles incrementally arriving data in a non-greedy fashion by reconsidering past decisions using tree-structure re-arrangements (e.g., rotations and grafts) invoked in accordance with the user's specified similarity function. To achieve scalability to massive datasets, our method, SCC, builds hierarchical clusterings in a level-wise, bottom-up manner. Certain clustering decisions are made independently in parallel within each level, and a global similarity threshold schedule prevents greedy over-merging. We show how SCC can be combined with the tree-structure re-arrangements in Grinch to form a mini-batch algorithm achieving both scalable and incremental performance. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which can better represent uncertainty in clustering by representing overlapping clusters. We introduce an efficient bottom-up method for DAG-structured clustering, Llama. For each of the proposed methods, we provide both a theoretical and empirical analysis. Empirically, our methods achieve state-of-the-art results on clustering benchmarks in both the batch and the incremental settings, including multiple-point improvements in dendrogram purity and scalability to billions of points.
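To make the rotate-style re-arrangements concrete, below is a minimal Python sketch of non-greedy incremental insertion in the spirit of Grinch. The Node class, the average-linkage function, and the lift-on-rotation rule are illustrative assumptions for exposition, not the thesis's actual data structures or its full rotate/graft procedures.

class Node:
    def __init__(self, point=None, children=()):
        self.point = point              # data point (set on leaves only)
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(node):
    # All leaf descendants of `node`.
    if not node.children:
        return [node]
    return [l for c in node.children for l in leaves(c)]

def avg_linkage(a, b, sim):
    # Average pairwise similarity between the leaves of two subtrees.
    la, lb = leaves(a), leaves(b)
    return sum(sim(x.point, y.point) for x in la for y in lb) / (len(la) * len(lb))

def sibling(node):
    return next(c for c in node.parent.children if c is not node)

def insert(root, x, sim):
    # Greedy step: splice x in next to its most similar leaf.
    new = Node(point=x)
    if root is None:
        return new
    near = max(leaves(root), key=lambda l: sim(l.point, x))
    parent = near.parent
    joined = Node(children=(near, new))
    if parent is None:
        root = joined
    else:
        parent.children[parent.children.index(near)] = joined
        joined.parent = parent
    # Non-greedy step: while x is more similar to its aunt than to its
    # sibling, lift x one level up and pair the rejected sibling with the
    # aunt, revising the earlier placement in light of the new point.
    while new.parent is not None and new.parent.parent is not None:
        p, gp = new.parent, new.parent.parent
        sib, aunt = sibling(new), sibling(p)
        if avg_linkage(new, aunt, sim) <= avg_linkage(new, sib, sim):
            break
        inner = Node(children=(sib, aunt))
        inner.parent = gp
        gp.children = [inner, new]
        new.parent = gp
    return root

For scalar points one could run root = insert(root, x, lambda a, b: -abs(a - b)) over a stream of values; the actual methods additionally support grafts, which move whole subtrees rather than single points.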
Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR Decomposition
Cross-encoder models, which jointly encode and score a query-item pair, are
prohibitively expensive for direct k-nearest neighbor (k-NN) search.
Consequently, k-NN search typically employs a fast approximate retrieval (e.g.
using BM25 or dual-encoder vectors), followed by reranking with a
cross-encoder; however, the retrieval approximation often incurs a
detrimental loss in recall. ANNCUR (Yadav et al., 2022), a recent work,
tackles this problem by using the cross-encoder alone, making search
efficient with a relatively small number of anchor items and a CUR
matrix factorization. While
ANNCUR's one-time selection of anchors tends to approximate the cross-encoder
distances on average, doing so forfeits the capacity to accurately estimate
distances to items near the query, leading to regret in the crucial end-task:
recall of top-k items. In this paper, we propose ADACUR, a method that
adaptively, iteratively, and efficiently minimizes the approximation error for
the practically important top-k neighbors. It does so by iteratively performing
k-NN search using the anchors available so far, then adding these retrieved
nearest neighbors to the anchor set for the next round. Empirically, on
multiple datasets, in comparison to previous traditional and state-of-the-art
methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed
approach ADACUR consistently reduces recall error (by up to 70% in the
important k = 1 setting) while using no more compute than its competitors.
Comment: Findings of EMNLP 2023
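The adaptive loop is easy to express in Python. In this hedged sketch, cross_encoder(query, i) denotes an exact (expensive) cross-encoder call, and item_emb is a precomputed matrix of item representations (e.g., rows of a CUR/low-rank factorization of an offline score matrix); these names and the regression-based score extrapolation are illustrative assumptions rather than the paper's exact estimator.

import numpy as np

def adaptive_knn(query, cross_encoder, item_emb, k=1, n_rounds=3, n_init=50):
    n_items = item_emb.shape[0]
    rng = np.random.default_rng(0)
    anchors = [int(i) for i in rng.choice(n_items, size=n_init, replace=False)]
    scores = {i: cross_encoder(query, i) for i in anchors}   # exact CE calls
    for _ in range(n_rounds):
        # Fit the query's exact anchor scores as a linear function of the
        # anchors' rows, then extrapolate approximate scores to all items.
        A = item_emb[anchors]                                # (|anchors|, d)
        y = np.array([scores[i] for i in anchors])
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        approx = item_emb @ w                                # (n_items,)
        # Score the current approximate top-k exactly and promote them to
        # anchors, concentrating accuracy near the query's neighborhood.
        for i in np.argsort(-approx)[:k]:
            i = int(i)
            if i not in scores:
                scores[i] = cross_encoder(query, i)
                anchors.append(i)
    # Return the best items among those scored exactly.
    return sorted(scores, key=scores.get, reverse=True)[:k]

The design point mirrors the abstract: each round spends its few exact cross-encoder calls where the approximation matters most, namely the current top-k, instead of committing to a one-time anchor selection.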
Entity Linking and Discovery via Arborescence-based Supervised Clustering
Previous work has shown promising results in performing entity linking by
measuring not only the affinities between mentions and entities but also those
amongst mentions. In this paper, we present novel training and inference
procedures that fully utilize mention-to-mention affinities by building minimum
arborescences (i.e., directed spanning trees) over mentions and entities across
documents in order to make linking decisions. We also show that this method
gracefully extends to entity discovery, enabling the clustering of mentions
that do not have an associated entity in the knowledge base. We evaluate our
approach on the Zero-Shot Entity Linking dataset and MedMentions, the largest
publicly available biomedical dataset, and show significant improvements in
performance for both entity linking and discovery compared to identically
parameterized models. We further show significant efficiency improvements with
only a small loss in accuracy compared to previous work, which uses more
computationally expensive models.
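As a rough illustration of arborescence-based inference (not the paper's exact construction or its training procedure), one can build a directed graph with a virtual root over entities and mentions, negate affinities so that higher affinity means lower cost, and extract a minimum spanning arborescence; here networkx supplies the Chu-Liu/Edmonds-style solver, and the affinity dictionaries are assumed inputs.

import networkx as nx

def link_mentions(entities, mentions, ent_men_aff, men_men_aff):
    # entities, mentions: lists of ids.
    # ent_men_aff[(e, m)], men_men_aff[(m1, m2)]: affinities (higher = closer).
    # Assumes every mention has at least one incoming candidate edge.
    G = nx.DiGraph()
    ROOT = "__root__"
    for e in entities:
        G.add_edge(ROOT, e, weight=0.0)     # free edge: root -> entity
    for (e, m), a in ent_men_aff.items():
        G.add_edge(e, m, weight=-a)         # negate: we minimize total weight
    for (m1, m2), a in men_men_aff.items():
        G.add_edge(m1, m2, weight=-a)       # mention-to-mention affinities
    arb = nx.minimum_spanning_arborescence(G)
    # Each mention is linked to the entity heading its branch, so a mention
    # may attach to an entity indirectly through other mentions.
    assignment = {}
    def descend(node, entity):
        for child in arb.successors(node):
            assignment[child] = entity
            descend(child, entity)
    for e in entities:
        descend(e, e)
    return assignment

Entity discovery would additionally allow branches headed by no known entity (clusters of NIL mentions); that extension is omitted from this sketch.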
On Array Noncomputable Degrees, Maximal Pairs and Simplicity Properties
In this thesis, we contribute to topics related to array noncomputable
(a.n.c.) Turing degrees, maximal pairs and simplicity properties. The
outline is as follows. In Chapter 2, we introduce a subclass of the a.n.c.
Turing degrees, the so-called completely array noncomputable (c.a.n.c. for
short) Turing degrees. Here, a computably enumerable (c.e.) Turing degree a
is c.a.n.c. if every c.e. set A ∈ a is weak truth-table (wtt) equivalent to
an a.n.c. set. We show
in Section 2.3 that these degrees exist (indeed, there exist infinitely many low
c.a.n.c. degrees) and that they cannot be high. Moreover, we apply some of the
ideas used to show the existence of c.a.n.c. Turing degrees to show the stronger
result that there exists a c.e. Turing degree whose c.e. members are halves of
maximal pairs in the c.e. computably Lipschitz (cl) degrees, thereby solving the
first part of the first open problem given in the paper by Ambos-Spies, Ding,
Fan and Merkle [ASDFM13].
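For orientation, here is one standard characterization of array noncomputability and the notion of a maximal pair, stated from memory in LaTeX; the thesis's precise definitions should be taken as authoritative.

% Domination-style characterization (essentially Downey-Jockusch-Stob).
A c.e.\ degree $\mathbf{a}$ is \emph{array noncomputable} iff for every
function $f \leq_{\mathrm{wtt}} \emptyset'$ there is a function
$g \leq_{T} \mathbf{a}$ such that
\[
  g(n) \geq f(n) \quad \text{for infinitely many } n.
\]
% Maximal pairs: no common upper bound in the given degree structure.
A pair $(A, B)$ is a \emph{maximal pair} (e.g.\ in the cl-degrees) if
there is no set $C$ with $A \leq_{\mathrm{cl}} C$ and
$B \leq_{\mathrm{cl}} C$.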
In Chapter 3, we present an approach to extending the notion of array
noncomputability to the setting of almost-c.e. sets (these are the sets which
correspond to binary representations of left-c.e. reals). This approach was
initiated by the Heidelberg Logic Group and is worked out in detail in an
upcoming paper by Ambos-Spies, Losert and Monath [ASLM18], in the thesis of
Losert [Los18] and in [ASFL+]. In [ASLM18], the authors introduce the class
of sets with the universal similarity property (u.s.p. for short; throughout
this thesis, such sets are simply called u.s.p. sets), which is a strong form
of array noncomputability in the setting of almost-c.e. sets, and they show
that sets with this property exist precisely in the c.e. not totally ω-c.e.
degrees. Then it
is shown that, using u.s.p. sets, one obtains a simplified method for showing
the existence of almost-c.e. sets with a property P (for certain properties P)
that are contained in c.e. not totally ω-c.e. degrees, namely by showing that
u.s.p. sets have property P. This is demonstrated by showing that u.s.p. sets
are computably bounded random (CB-random), thereby extending a result from
Brodhead, Downey and Ng [BDN12]. Moreover, it is shown that the c.e. not
totally ω-c.e. degrees can be characterized as those c.e. degrees which contain
an almost-c.e. set which is not cl-reducible to any complex almost-c.e. set. This
affirmatively answers a conjecture by Greenberg.
For the if-direction of the latter result, we prove a new result on maximal
pairs in the almost-c.e. sets by showing the existence of locally almost-c.e. sets
which are halves of maximal pairs in the almost-c.e. sets such that the second
half can be chosen to be c.e. and arbitrarily sparse. This extends Yun Fan’s
result on maximal pairs [Fan09]. By our result, we also get a new proof of one of
the main results in Barmpalias, Downey and Greenberg [BDG10], namely that
in any c.e. a.n.c. degree there is a left-c.e. real which is not cl-reducible to any
ML-random left-c.e. real.
In this thesis, we give an overview of some of the results from [ASLM18] and
sketch some of the proofs to illustrate this new methodology and, subsequently,
we give a detailed proof of the above maximal pair result.
In Chapter 4, we look at the interaction between a.n.c. wtt-degrees and the
most commonly known simplicity properties by showing that there exists an
a.n.c. wtt-degree which contains an r-maximal set. By this result together with
the result by Ambos-Spies [AS18] that no a.n.c. wtt-degree contains a dense
simple set, we obtain a complete characterization of which classical
simplicity properties may hold in a.n.c. wtt-degrees.
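The classical simplicity properties referred to above can be stated as follows (standard definitions, reproduced from memory; $\overline{A}$ denotes the complement of $A$ and $p_{\overline{A}}$ its principal function, enumerating $\overline{A}$ in increasing order). A coinfinite c.e. set $A$ is:

\begin{itemize}
  \item \emph{maximal} if for every c.e.\ set $W$, either
        $W \cap \overline{A}$ or $\overline{W} \cap \overline{A}$ is finite;
  \item \emph{r-maximal} if the same holds for every computable set $R$
        (equivalently, no computable set splits $\overline{A}$ into two
        infinite parts);
  \item \emph{hypersimple} if no computable function majorizes
        $p_{\overline{A}}$;
  \item \emph{dense simple} if $p_{\overline{A}}$ dominates every
        computable function.
\end{itemize}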
The guiding theme for Chapter 5 is a theorem by Barmpalias, Downey and
Greenberg [BDG10] in which they characterize the c.e. not totally ω-c.e. degrees
as the c.e. degrees which contain a c.e. set which is not wtt-reducible to any
hypersimple set. Ambos-Spies asked what this characterization would look
like if hypersimple sets were replaced by maximal sets; in other words, what
are the c.e. Turing degrees that contain c.e. sets which are not
wtt-reducible to any maximal set? We completely solve this question
on the set level by introducing the new class of eventually uniformly wtt-array
computable (e.u.wtt-a.c.) sets and by showing that the c.e. sets with this property
are precisely those c.e. sets which are wtt-reducible to maximal sets. Indeed,
this characterization can be extended in that we can replace wtt-reducible by
ibT-reducible and maximal sets by dense simple sets. By showing that the c.e.
e.u.wtt-a.c. sets are closed downwards under wtt-reductions and under the join
operation, it follows that the c.e. wtt-degrees containing e.u.wtt-a.c. sets form
an ideal in the upper semilattice of the c.e. wtt-degrees and, further, we obtain
a characterization of the c.e. wtt-degrees which contain c.e. sets that are not
wtt-reducible to any maximal set. Moreover, we give upper and lower bounds
(with respect to ⊆) for the class of the c.e. e.u.wtt-a.c. sets. For the upper bound,
we show that any c.e. e.u.wtt-a.c. set has array computable wtt-degree. For the
lower bound, we introduce the notion of a wtt-superlow set and show that any
wtt-superlow c.e. set is e.u.wtt-a.c. Besides, we show that the wtt-superlow c.e.
sets can be characterized as the c.e. sets whose bounded jump is ω-computably
approximable (ω-c.a. for short); hence, they are precisely the bounded low sets as
introduced in the paper by Anderson, Csima and Lange [ACL17]. Furthermore,
we prove a hierarchy theorem for the wtt-superlow c.e. sets and we show that
there exists a Turing complete set which lies in the intersection of that hierarchy.
Finally, it is shown that the above bounds are strict, i.e., there exist
c.e. e.u.wtt-a.c. sets which are not wtt-superlow, and there exist c.e. sets
whose wtt-degree is array computable but which are not e.u.wtt-a.c. (where
we obtain the
separation even on the level of Turing degrees). The results from Chapter 5 will
be included in a paper which is in preparation by Ambos-Spies, Downey and
Monath [ASDM19].
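The approximation notion underlying Chapters 3 and 5 is standard and, from memory, reads as follows in LaTeX; again, the thesis's own definitions are authoritative.

% omega-c.a. functions: computable approximation with a computable
% bound on the number of mind changes.
A function $f$ is \emph{$\omega$-computably approximable} ($\omega$-c.a.)
if there are computable functions $\hat{f}(x, s)$ and $b(x)$ such that
for all $x$,
\[
  f(x) = \lim_{s \to \infty} \hat{f}(x, s)
  \quad\text{and}\quad
  \left|\{ s : \hat{f}(x, s+1) \neq \hat{f}(x, s) \}\right| \leq b(x).
\]
% Totally omega-c.e. (= totally omega-c.a.) degrees.
A c.e.\ degree $\mathbf{a}$ is \emph{totally $\omega$-c.e.} if every
function $g \leq_{T} \mathbf{a}$ is $\omega$-c.a.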
Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining
Dual encoder models are ubiquitous in modern classification and retrieval.
Crucial for training such dual encoders is an accurate estimation of gradients
from the partition function of the softmax over the large output space; this
requires finding negative targets that contribute most significantly ("hard
negatives"). Since dual encoder model parameters change during training, the
use of traditional static nearest neighbor indexes can be sub-optimal. These
static indexes (1) periodically require expensive re-building of the index,
which in turn requires (2) expensive re-encoding of all targets using updated
model parameters. This paper addresses both of these challenges. First, we
introduce an algorithm that uses a tree structure to approximate the softmax
with provable bounds and that dynamically maintains the tree. Second, we
approximate the effect of a gradient update on target encodings with an
efficient Nyström low-rank approximation. In our empirical study on
datasets with over twenty million targets, our approach cuts error in half
relative to oracle brute-force negative mining. Furthermore, our method
surpasses the prior state-of-the-art while using 150x less accelerator
memory.
Comment: To appear at AISTATS 2023
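To illustrate the second idea, here is a hedged Python sketch of a Nyström-style low-rank refresh of a stale target index: a few landmark targets are re-encoded exactly under the current parameters, and every other target's encoding is corrected by kernel-weighted extrapolation. The landmark selection, RBF kernel, and function names are assumptions for exposition, not the paper's exact estimator.

import numpy as np

def nystrom_refresh(stale_emb, encode_fn, landmarks, gamma=1.0):
    # stale_emb: (n, d) target encodings under the old model parameters.
    # encode_fn(i): re-encodes target i with the *current* parameters.
    # landmarks: small list of target indices to re-encode exactly.
    fresh = np.asarray([encode_fn(i) for i in landmarks])   # (m, d)
    stale = stale_emb[landmarks]                            # (m, d)
    def rbf(X, Y):
        # RBF similarities between rows of X and rows of Y.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    K_nm = rbf(stale_emb, stale)            # (n, m): all targets vs landmarks
    K_mm = rbf(stale, stale)                # (m, m)
    # Nystrom extrapolation: express each target as a kernel-weighted
    # combination of landmarks, then apply those weights to the landmarks'
    # observed encoding changes (fresh - stale), cheaply approximating the
    # effect of the gradient update on every target encoding.
    W = K_nm @ np.linalg.pinv(K_mm)         # (n, m) interpolation weights
    return stale_emb + W @ (fresh - stale)

Only m targets are re-encoded per refresh, so the cost is O(m) encoder calls plus an O(nm) matrix product, instead of re-encoding all n targets.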