Incremental Non-Greedy Clustering at Scale
Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution put several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for any user-specified similarity function.
Hierarchical clusterings are often desired, as they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide both fine-grained clusters and a representation of uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions. These methods often impose requirements on the similarity function and/or empirically tend to over-merge clusters, which can lead to inaccurate clusterings.
In this thesis, we present incremental and scalable methods for hierarchical clustering that empirically satisfy the above desiderata. Our work aims to represent uncertainty and meaningful alternative clusterings, to efficiently reconsider past decisions in the incremental case, and to use parallelism to scale to massive datasets. Our method, Grinch, handles incrementally arriving data in a non-greedy fashion by reconsidering past decisions using tree-structure re-arrangements (e.g., rotations and grafts) invoked in accordance with the user's specified similarity function. To achieve scalability to massive datasets, our method, SCC, builds hierarchical clusterings in a level-wise, bottom-up manner. Certain clustering decisions are made independently in parallel within each level, and a global similarity threshold schedule prevents greedy over-merging. We show how SCC can be combined with the tree-structure re-arrangements in Grinch to form a mini-batch algorithm achieving both scalable and incremental performance. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which can better represent uncertainty in clustering by representing overlapping clusters. We introduce an efficient bottom-up method for DAG-structured clustering, Llama. For each of the proposed methods, we provide both a theoretical and empirical analysis. Empirically, our methods achieve state-of-the-art results on clustering benchmarks in both the batch and the incremental settings, including multiple-point improvements in dendrogram purity and scalability to billions of points.
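To make the rotate-style re-arrangements concrete, below is a minimal Python sketch of non-greedy incremental insertion in the spirit of Grinch. The Node class, the average-linkage function, and the lift-on-rotation rule are illustrative assumptions for exposition, not the thesis's actual data structures or its full rotate/graft procedures.

class Node:
    def __init__(self, point=None, children=()):
        self.point = point              # data point (set on leaves only)
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(node):
    # All leaf descendants of `node`.
    if not node.children:
        return [node]
    return [l for c in node.children for l in leaves(c)]

def avg_linkage(a, b, sim):
    # Average pairwise similarity between the leaves of two subtrees.
    la, lb = leaves(a), leaves(b)
    return sum(sim(x.point, y.point) for x in la for y in lb) / (len(la) * len(lb))

def sibling(node):
    return next(c for c in node.parent.children if c is not node)

def insert(root, x, sim):
    # Greedy step: splice x in next to its most similar leaf.
    new = Node(point=x)
    if root is None:
        return new
    near = max(leaves(root), key=lambda l: sim(l.point, x))
    parent = near.parent
    joined = Node(children=(near, new))
    if parent is None:
        root = joined
    else:
        parent.children[parent.children.index(near)] = joined
        joined.parent = parent
    # Non-greedy step: while x is more similar to its aunt than to its
    # sibling, lift x one level up and pair the rejected sibling with the
    # aunt, revising the earlier placement in light of the new point.
    while new.parent is not None and new.parent.parent is not None:
        p, gp = new.parent, new.parent.parent
        sib, aunt = sibling(new), sibling(p)
        if avg_linkage(new, aunt, sim) <= avg_linkage(new, sib, sim):
            break
        inner = Node(children=(sib, aunt))
        inner.parent = gp
        gp.children = [inner, new]
        new.parent = gp
    return root

For scalar points one could run root = insert(root, x, lambda a, b: -abs(a - b)) over a stream of values; the actual methods additionally support grafts, which move whole subtrees rather than single points.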
Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR Decomposition
Cross-encoder models, which jointly encode and score a query-item pair, are
prohibitively expensive for direct k-nearest neighbor (k-NN) search.
Consequently, k-NN search typically employs a fast approximate retrieval (e.g.
using BM25 or dual-encoder vectors), followed by reranking with a
cross-encoder; however, the retrieval approximation often incurs a
detrimental loss in recall. ANNCUR (Yadav et al., 2022), a recent work,
tackles this problem by using the cross-encoder alone, making search
efficient with a relatively small number of anchor items and a CUR
matrix factorization. While
ANNCUR's one-time selection of anchors tends to approximate the cross-encoder
distances on average, doing so forfeits the capacity to accurately estimate
distances to items near the query, leading to regret in the crucial end-task:
recall of top-k items. In this paper, we propose ADACUR, a method that
adaptively, iteratively, and efficiently minimizes the approximation error for
the practically important top-k neighbors. It does so by iteratively performing
k-NN search using the anchors available so far, then adding these retrieved
nearest neighbors to the anchor set for the next round. Empirically, on
multiple datasets, in comparison to previous traditional and state-of-the-art
methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed
approach ADACUR consistently reduces recall error (by up to 70% in the
important k = 1 setting) while using no more compute than its competitors.
Comment: Findings of EMNLP 2023
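The adaptive loop is easy to express in Python. In this hedged sketch, cross_encoder(query, i) denotes an exact (expensive) cross-encoder call, and item_emb is a precomputed matrix of item representations (e.g., rows of a CUR/low-rank factorization of an offline score matrix); these names and the regression-based score extrapolation are illustrative assumptions rather than the paper's exact estimator.

import numpy as np

def adaptive_knn(query, cross_encoder, item_emb, k=1, n_rounds=3, n_init=50):
    n_items = item_emb.shape[0]
    rng = np.random.default_rng(0)
    anchors = [int(i) for i in rng.choice(n_items, size=n_init, replace=False)]
    scores = {i: cross_encoder(query, i) for i in anchors}   # exact CE calls
    for _ in range(n_rounds):
        # Fit the query's exact anchor scores as a linear function of the
        # anchors' rows, then extrapolate approximate scores to all items.
        A = item_emb[anchors]                                # (|anchors|, d)
        y = np.array([scores[i] for i in anchors])
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        approx = item_emb @ w                                # (n_items,)
        # Score the current approximate top-k exactly and promote them to
        # anchors, concentrating accuracy near the query's neighborhood.
        for i in np.argsort(-approx)[:k]:
            i = int(i)
            if i not in scores:
                scores[i] = cross_encoder(query, i)
                anchors.append(i)
    # Return the best items among those scored exactly.
    return sorted(scores, key=scores.get, reverse=True)[:k]

The design point mirrors the abstract: each round spends its few exact cross-encoder calls where the approximation matters most, namely the current top-k, instead of committing to a one-time anchor selection.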
Entity Linking and Discovery via Arborescence-based Supervised Clustering
Previous work has shown promising results in performing entity linking by
measuring not only the affinities between mentions and entities but also those
amongst mentions. In this paper, we present novel training and inference
procedures that fully utilize mention-to-mention affinities by building minimum
arborescences (i.e., directed spanning trees) over mentions and entities across
documents in order to make linking decisions. We also show that this method
gracefully extends to entity discovery, enabling the clustering of mentions
that do not have an associated entity in the knowledge base. We evaluate our
approach on the Zero-Shot Entity Linking dataset and MedMentions, the largest
publicly available biomedical dataset, and show significant improvements in
performance for both entity linking and discovery compared to identically
parameterized models. We further show significant efficiency improvements with
only a small loss in accuracy compared to previous work, which uses more
computationally expensive models.
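As a rough illustration of arborescence-based inference (not the paper's exact construction or its training procedure), one can build a directed graph with a virtual root over entities and mentions, negate affinities so that higher affinity means lower cost, and extract a minimum spanning arborescence; here networkx supplies the Chu-Liu/Edmonds-style solver, and the affinity dictionaries are assumed inputs.

import networkx as nx

def link_mentions(entities, mentions, ent_men_aff, men_men_aff):
    # entities, mentions: lists of ids.
    # ent_men_aff[(e, m)], men_men_aff[(m1, m2)]: affinities (higher = closer).
    # Assumes every mention has at least one incoming candidate edge.
    G = nx.DiGraph()
    ROOT = "__root__"
    for e in entities:
        G.add_edge(ROOT, e, weight=0.0)     # free edge: root -> entity
    for (e, m), a in ent_men_aff.items():
        G.add_edge(e, m, weight=-a)         # negate: we minimize total weight
    for (m1, m2), a in men_men_aff.items():
        G.add_edge(m1, m2, weight=-a)       # mention-to-mention affinities
    arb = nx.minimum_spanning_arborescence(G)
    # Each mention is linked to the entity heading its branch, so a mention
    # may attach to an entity indirectly through other mentions.
    assignment = {}
    def descend(node, entity):
        for child in arb.successors(node):
            assignment[child] = entity
            descend(child, entity)
    for e in entities:
        descend(e, e)
    return assignment

Entity discovery would additionally allow branches headed by no known entity (clusters of NIL mentions); that extension is omitted from this sketch.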
On Array Noncomputable Degrees, Maximal Pairs and Simplicity Properties
In this thesis, we contribute to topics related to array noncomputable
(a.n.c.) Turing degrees, maximal pairs and simplicity properties. The
outline is as follows. In Chapter 2, we introduce a subclass of the a.n.c.
Turing degrees, the so-called completely array noncomputable (c.a.n.c. for
short) Turing degrees. Here, a computably enumerable (c.e.) Turing degree a
is c.a.n.c. if every c.e. set A ∈ a is weak truth-table (wtt) equivalent to
an a.n.c. set. We show
in Section 2.3 that these degrees exist (indeed, there exist infinitely many low
c.a.n.c. degrees) and that they cannot be high. Moreover, we apply some of the
ideas used to show the existence of c.a.n.c. Turing degrees to show the stronger
result that there exists a c.e. Turing degree whose c.e. members are halves of
maximal pairs in the c.e. computably Lipschitz (cl) degrees, thereby solving the
first part of the first open problem given in the paper by Ambos-Spies, Ding,
Fan and Merkle [ASDFM13].
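For orientation, here is one standard characterization of array noncomputability and the notion of a maximal pair, stated from memory in LaTeX; the thesis's precise definitions should be taken as authoritative.

% Domination-style characterization (essentially Downey-Jockusch-Stob).
A c.e.\ degree $\mathbf{a}$ is \emph{array noncomputable} iff for every
function $f \leq_{\mathrm{wtt}} \emptyset'$ there is a function
$g \leq_{T} \mathbf{a}$ such that
\[
  g(n) \geq f(n) \quad \text{for infinitely many } n.
\]
% Maximal pairs: no common upper bound in the given degree structure.
A pair $(A, B)$ is a \emph{maximal pair} (e.g.\ in the cl-degrees) if
there is no set $C$ with $A \leq_{\mathrm{cl}} C$ and
$B \leq_{\mathrm{cl}} C$.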
In Chapter 3, we present an approach to extending the notion of array
noncomputability to the setting of almost-c.e. sets (these are the sets which
correspond to binary representations of left-c.e. reals). This approach was
initiated by the Heidelberg Logic Group and is worked out in detail in an
upcoming paper by Ambos-Spies, Losert and Monath [ASLM18], in the thesis of
Losert [Los18] and in [ASFL+]. In [ASLM18], the authors introduce the class
of sets with the universal similarity property (u.s.p. for short; throughout
this thesis, such sets are simply called u.s.p. sets), which is a strong form
of array noncomputability in the setting of almost-c.e. sets, and they show
that sets with this property exist precisely in the c.e. not totally ω-c.e.
degrees. Then it
is shown that, using u.s.p. sets, one obtains a simplified method for showing
the existence of almost-c.e. sets with a property P (for certain properties P)
that are contained in c.e. not totally ω-c.e. degrees, namely by showing that
u.s.p. sets have property P. This is demonstrated by showing that u.s.p. sets
are computably bounded random (CB-random), thereby extending a result from
Brodhead, Downey and Ng [BDN12]. Moreover, it is shown that the c.e. not
totally ω-c.e. degrees can be characterized as those c.e. degrees which contain
an almost-c.e. set which is not cl-reducible to any complex almost-c.e. set. This
affirmatively answers a conjecture by Greenberg.
For the if-direction of the latter result, we prove a new result on maximal
pairs in the almost-c.e. sets by showing the existence of locally almost-c.e. sets
which are halves of maximal pairs in the almost-c.e. sets such that the second
half can be chosen to be c.e. and arbitrarily sparse. This extends Yun Fan’s
result on maximal pairs [Fan09]. By our result, we also get a new proof of one of
the main results in Barmpalias, Downey and Greenberg [BDG10], namely that
in any c.e. a.n.c. degree there is a left-c.e. real which is not cl-reducible to any
ML-random left-c.e. real.
In this thesis, we give an overview of some of the results from [ASLM18] and
sketch some of the proofs to illustrate this new methodology and, subsequently,
we give a detailed proof of the above maximal pair result.
In Chapter 4, we look at the interaction between a.n.c. wtt-degrees and the
most commonly known simplicity properties by showing that there exists an
a.n.c. wtt-degree which contains an r-maximal set. By this result together with
the result by Ambos-Spies [AS18] that no a.n.c. wtt-degree contains a dense
simple set, we obtain a complete characterization of which classical
simplicity properties may hold in a.n.c. wtt-degrees.
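The classical simplicity properties referred to above can be stated as follows (standard definitions, reproduced from memory; $\overline{A}$ denotes the complement of $A$ and $p_{\overline{A}}$ its principal function, enumerating $\overline{A}$ in increasing order). A coinfinite c.e. set $A$ is:

\begin{itemize}
  \item \emph{maximal} if for every c.e.\ set $W$, either
        $W \cap \overline{A}$ or $\overline{W} \cap \overline{A}$ is finite;
  \item \emph{r-maximal} if the same holds for every computable set $R$
        (equivalently, no computable set splits $\overline{A}$ into two
        infinite parts);
  \item \emph{hypersimple} if no computable function majorizes
        $p_{\overline{A}}$;
  \item \emph{dense simple} if $p_{\overline{A}}$ dominates every
        computable function.
\end{itemize}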
The guiding theme for Chapter 5 is a theorem by Barmpalias, Downey and
Greenberg [BDG10] in which they characterize the c.e. not totally ω-c.e. degrees
as the c.e. degrees which contain a c.e. set which is not wtt-reducible to any
hypersimple set. Ambos-Spies asked what this characterization would look
like if hypersimple sets were replaced by maximal sets; in other words, what
are the c.e. Turing degrees that contain c.e. sets which are not
wtt-reducible to any maximal set? We completely solve this question
on the set level by introducing the new class of eventually uniformly wtt-array
computable (e.u.wtt-a.c.) sets and by showing that the c.e. sets with this property
are precisely those c.e. sets which are wtt-reducible to maximal sets. Indeed,
this characterization can be extended in that we can replace wtt-reducible by
ibT-reducible and maximal sets by dense simple sets. By showing that the c.e.
e.u.wtt-a.c. sets are closed downwards under wtt-reductions and under the join
operation, it follows that the c.e. wtt-degrees containing e.u.wtt-a.c. sets form
an ideal in the upper semilattice of the c.e. wtt-degrees and, further, we obtain
a characterization of the c.e. wtt-degrees which contain c.e. sets that are not
wtt-reducible to any maximal set. Moreover, we give upper and lower bounds
(with respect to ⊆) for the class of the c.e. e.u.wtt-a.c. sets. For the upper bound,
we show that any c.e. e.u.wtt-a.c. set has array computable wtt-degree. For the
lower bound, we introduce the notion of a wtt-superlow set and show that any
wtt-superlow c.e. set is e.u.wtt-a.c. Besides, we show that the wtt-superlow c.e.
sets can be characterized as the c.e. sets whose bounded jump is ω-computably
approximable (ω-c.a. for short); hence, they are precisely the bounded low sets as
introduced in the paper by Anderson, Csima and Lange [ACL17]. Furthermore,
we prove a hierarchy theorem for the wtt-superlow c.e. sets and we show that
there exists a Turing complete set which lies in the intersection of that hierarchy.
Finally, it is shown that the above bounds are strict, i.e., there exist
c.e. e.u.wtt-a.c. sets which are not wtt-superlow, and there exist c.e. sets
whose wtt-degree is array computable but which are not e.u.wtt-a.c. (where
we obtain the
separation even on the level of Turing degrees). The results from Chapter 5 will
be included in a paper which is in preparation by Ambos-Spies, Downey and
Monath [ASDM19].
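The approximation notion underlying Chapters 3 and 5 is standard and, from memory, reads as follows in LaTeX; again, the thesis's own definitions are authoritative.

% omega-c.a. functions: computable approximation with a computable
% bound on the number of mind changes.
A function $f$ is \emph{$\omega$-computably approximable} ($\omega$-c.a.)
if there are computable functions $\hat{f}(x, s)$ and $b(x)$ such that
for all $x$,
\[
  f(x) = \lim_{s \to \infty} \hat{f}(x, s)
  \quad\text{and}\quad
  \left|\{ s : \hat{f}(x, s+1) \neq \hat{f}(x, s) \}\right| \leq b(x).
\]
% Totally omega-c.e. (= totally omega-c.a.) degrees.
A c.e.\ degree $\mathbf{a}$ is \emph{totally $\omega$-c.e.} if every
function $g \leq_{T} \mathbf{a}$ is $\omega$-c.a.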
Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining
Dual encoder models are ubiquitous in modern classification and retrieval.
Crucial for training such dual encoders is an accurate estimation of gradients
from the partition function of the softmax over the large output space; this
requires finding negative targets that contribute most significantly ("hard
negatives"). Since dual encoder model parameters change during training, the
use of traditional static nearest neighbor indexes can be sub-optimal. These
static indexes (1) periodically require expensive re-building of the index,
which in turn requires (2) expensive re-encoding of all targets using updated
model parameters. This paper addresses both of these challenges. First, we
introduce an algorithm that uses a tree structure to approximate the softmax
with provable bounds and that dynamically maintains the tree. Second, we
approximate the effect of a gradient update on target encodings with an
efficient Nyström low-rank approximation. In our empirical study on
datasets with over twenty million targets, our approach cuts error in half
relative to oracle brute-force negative mining. Furthermore, our method
surpasses the prior state-of-the-art while using 150x less accelerator
memory.
Comment: To appear at AISTATS 2023
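To illustrate the second idea, here is a hedged Python sketch of a Nyström-style low-rank refresh of a stale target index: a few landmark targets are re-encoded exactly under the current parameters, and every other target's encoding is corrected by kernel-weighted extrapolation. The landmark selection, RBF kernel, and function names are assumptions for exposition, not the paper's exact estimator.

import numpy as np

def nystrom_refresh(stale_emb, encode_fn, landmarks, gamma=1.0):
    # stale_emb: (n, d) target encodings under the old model parameters.
    # encode_fn(i): re-encodes target i with the *current* parameters.
    # landmarks: small list of target indices to re-encode exactly.
    fresh = np.asarray([encode_fn(i) for i in landmarks])   # (m, d)
    stale = stale_emb[landmarks]                            # (m, d)
    def rbf(X, Y):
        # RBF similarities between rows of X and rows of Y.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    K_nm = rbf(stale_emb, stale)            # (n, m): all targets vs landmarks
    K_mm = rbf(stale, stale)                # (m, m)
    # Nystrom extrapolation: express each target as a kernel-weighted
    # combination of landmarks, then apply those weights to the landmarks'
    # observed encoding changes (fresh - stale), cheaply approximating the
    # effect of the gradient update on every target encoding.
    W = K_nm @ np.linalg.pinv(K_mm)         # (n, m) interpolation weights
    return stale_emb + W @ (fresh - stale)

Only m targets are re-encoded per refresh, so the cost is O(m) encoder calls plus an O(nm) matrix product, instead of re-encoding all n targets.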