On Efficient Range-Summability of IID Random Variables in Two or Higher Dimensions
d-dimensional (for d > 1) efficient range-summability (dD-ERS) of random variables (RVs) is a fundamental algorithmic problem with applications to two important families of database problems, namely, fast approximate wavelet tracking (FAWT) on data streams and approximate answering of range-sum queries over a data cube. Whether efficient solutions exist for the dD-ERS problem, or for the latter database problem, have been two long-standing open questions; both are answered in this work. Specifically, we propose a novel solution framework for dD-ERS on RVs that have a Gaussian or Poisson distribution. Our dD-ERS solutions are the first to have polylogarithmic time complexities. Furthermore, we develop a novel k-wise independence theory that allows our dD-ERS solutions to achieve both high computational efficiency and strong provable independence guarantees. Finally, we show that under a sufficient and likely necessary condition, certain existing solutions for 1D-ERS can be generalized to higher dimensions.
A Dyadic Simulation Approach to Efficient Range-Summability
Efficient range-summability (ERS) of a long list of random variables is a fundamental algorithmic problem with applications to three important database problems, namely, data stream processing, space-efficient histogram maintenance (SEHM), and approximate nearest neighbor search (ANNS). In this work, we propose a novel dyadic simulation framework and use it to develop three novel ERS solutions, namely the Gaussian dyadic simulation tree (DST), Cauchy-DST, and Random-Walk-DST. We also propose novel rejection sampling techniques to make these solutions computationally efficient. Furthermore, we develop a novel k-wise independence theory that allows our ERS solutions to achieve both high computational efficiency and strong provable independence guarantees.
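The core idea behind a dyadic simulation tree can be illustrated with a minimal sketch for the Gaussian case: the root holds the sum of all n = 2^k iid N(0,1) variables, and each child's sum is drawn from its exact conditional distribution given the parent (the left half of 2m iid standard normals, conditioned on their total p, is N(p/2, m/2)). The class name, the string-based per-node seeding, and the O(log^2 n) recursive descent below are illustrative assumptions, not the paper's actual construction, which uses k-wise independent hashing and rejection sampling.

```python
import math
import random

def _z(tag):
    # Deterministic standard-normal draw keyed by a node tag; a stand-in
    # (assumption) for the k-wise independent hashing used in the paper.
    return random.Random(tag).gauss(0.0, 1.0)

class GaussianDST:
    """Toy Gaussian dyadic simulation tree over n = 2**k iid N(0,1) RVs.

    Every node implicitly holds the sum of the variables it covers; a
    child's sum is drawn from its conditional distribution given the
    parent's sum, so all queries see one consistent realization.
    """

    def __init__(self, k, seed=0):
        self.k = k          # n = 2**k underlying variables
        self.seed = seed

    def _node_sum(self, level, idx):
        m = 2 ** (self.k - level)            # variables covered by this node
        if level == 0:
            return math.sqrt(m) * _z(f"{self.seed}:root")  # root ~ N(0, n)
        p = self._node_sum(level - 1, idx // 2)            # parent covers 2m
        z = _z(f"{self.seed}:{level - 1}:{idx // 2}")      # one split per parent
        # Given parent sum p over 2m iid N(0,1) RVs, the left-half sum is
        # N(p/2, m/2); the right half is the exact remainder.
        left = p / 2 + math.sqrt(m / 2) * z
        return left if idx % 2 == 0 else p - left

    def range_sum(self, l, r, level=0, idx=0):
        """Sum X_l + ... + X_{r-1} by visiting O(log n) dyadic nodes."""
        lo = idx * 2 ** (self.k - level)
        hi = lo + 2 ** (self.k - level)
        if r <= lo or hi <= l:
            return 0.0
        if l <= lo and hi <= r:
            return self._node_sum(level, idx)
        return (self.range_sum(l, r, level + 1, 2 * idx)
                + self.range_sum(l, r, level + 1, 2 * idx + 1))
```

Because each right child is defined as parent minus left child, adjacent range sums compose exactly: range_sum(l, m) + range_sum(m, r) equals range_sum(l, r) for any split point m.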
Rethinking Similarity Search: Embracing Smarter Mechanisms over Smarter Data
In this vision paper, we propose a shift in perspective for improving the
effectiveness of similarity search. Rather than focusing solely on enhancing
the data quality, particularly machine learning-generated embeddings, we
advocate for a more comprehensive approach that also enhances the underpinning
search mechanisms. We highlight three novel avenues that call for a
redefinition of the similarity search problem: exploiting implicit data
structures and distributions, engaging users in an iterative feedback loop, and
moving beyond a single query vector. These novel pathways have gained relevance
in emerging applications such as large-scale language models, video clip
retrieval, and data labeling. We discuss the corresponding research challenges
posed by these new problem areas and share insights from our preliminary
discoveries.
RECIPE: Rateless Erasure Codes Induced by Protocol-Based Encoding
LT (Luby transform) codes are a celebrated family of rateless erasure codes
(RECs). Most existing LT codes were designed for applications in which a
centralized encoder possesses all message blocks and is solely responsible for
encoding them into codewords. Distributed LT codes, in which message blocks are
physically scattered across multiple locations (encoders) that need to
collaboratively perform the encoding, have never been systematically studied
before, despite their growing importance in applications. In this work, we present
the first systematic study of LT codes in the distributed setting, and make the
following three major contributions. First, we show that only a proper subset
of LT codes are feasible in the distributed setting, and give the necessary
and sufficient condition for such feasibility. Second, we propose a distributed
encoding protocol that can efficiently implement any feasible code. The
protocol is parameterized by a so-called action probability array (APA) that is
only a few KBs in size, and any feasible code corresponds to a valid APA
setting and vice versa. Third, we propose two heuristic search algorithms that
have led to the discovery of feasible codes that are much more efficient than
the state of the art. Comment: Accepted by IEEE ISIT 202
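The centralized baseline that distributed LT encoding generalizes can be sketched briefly: an encoder samples a degree d from a soliton distribution, picks d message blocks uniformly at random, and emits their XOR as one codeword. The sketch below uses the ideal soliton distribution and integer-valued blocks for simplicity; real LT codes use the robust soliton distribution and byte-string blocks, and the function names are illustrative.

```python
import random

def ideal_soliton(n, rng):
    # Sample a degree from the ideal soliton distribution over {1, ..., n}:
    # P(1) = 1/n, and P(d) = 1 / (d * (d - 1)) for d = 2..n.
    u, cum = rng.random(), 1.0 / n
    if u < cum:
        return 1
    for d in range(2, n + 1):
        cum += 1.0 / (d * (d - 1))
        if u < cum:
            return d
    return n

def lt_encode_one(blocks, rng):
    """Produce one LT codeword: the XOR of d uniformly chosen blocks."""
    d = ideal_soliton(len(blocks), rng)
    chosen = rng.sample(range(len(blocks)), d)
    codeword = 0
    for i in chosen:
        codeword ^= blocks[i]
    return chosen, codeword
```

In the distributed setting studied in the paper, no single party holds all of `blocks`, which is why only a proper subset of such codes remains feasible and a coordination protocol (the APA-parameterized one above) is needed.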
MP-RW-LSH: An Efficient Multi-Probe LSH Solution to ANNS-L1
Approximate Nearest Neighbor Search (ANNS) is a fundamental algorithmic problem, with numerous applications in many areas of computer science. Locality-Sensitive Hashing (LSH) is one of the most popular solution approaches for ANNS. A common shortcoming of many LSH schemes is that since they probe only a single bucket in a hash table, they need to use a large number of hash tables to achieve a high query accuracy. For ANNS-L2, a multi-probe scheme was proposed to overcome this drawback by strategically probing multiple buckets in a hash table. In this work, we propose MP-RW-LSH, the first and so far only multi-probe LSH solution to ANNS in L1 distance, and show that it achieves a better tradeoff between scalability and query efficiency than all existing LSH-based solutions. We also explain why a state-of-the-art ANNS-L1 solution called Cauchy projection LSH (CP-LSH) is fundamentally not suitable for multi-probe extension. Finally, as a use case, we construct, using MP-RW-LSH as the underlying "ANNS-L1 engine", a new ANNS-E (E for edit distance) solution that beats the state of the art.
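The single-probe-versus-multi-probe tradeoff can be made concrete with a toy sketch. Below is one classical p-stable (Gaussian) LSH function for L2 with naive probing of adjacent buckets; this is NOT MP-RW-LSH itself, which targets L1 via random-walk hashes, and the class and parameter names are assumptions for illustration only.

```python
import random

class MultiProbeL2LSH:
    """One p-stable LSH function h(v) = floor((a . v + b) / w) for L2,
    with naive multi-probing of neighboring buckets (toy stand-in)."""

    def __init__(self, dim, w, rng):
        self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection
        self.b = rng.uniform(0.0, w)                        # random offset
        self.w = w                                          # bucket width

    def bucket(self, v):
        s = sum(ai * vi for ai, vi in zip(self.a, v)) + self.b
        return int(s // self.w)

    def probes(self, v, extra=1):
        # Multi-probe idea: examine the home bucket plus `extra` neighbors
        # on each side, instead of building additional hash tables.
        b = self.bucket(v)
        return [b] + [b + s * k for k in range(1, extra + 1) for s in (1, -1)]
```

A near neighbor whose projection lands just across a bucket boundary is missed by a single probe but caught by probing the adjacent buckets, which is why multi-probing lets a scheme use far fewer hash tables for the same recall.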