Cache-oblivious index for approximate string matching
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies $O((n \log^k n)/B)$ disk pages and finds all k-error matches with $O((|P| + occ)/B + \log^k n \log\log_B n)$ I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require $\Omega(|P| + occ + \mathrm{poly}(\log n))$ I/Os. The second index reduces the space to $O((n \log n)/B)$ disk pages, and the I/O complexity is $O((|P| + occ)/B + \log^{k(k+1)} n \log\log n)$. © 2011 Elsevier B.V. All rights reserved.
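To make the notion of a k-error match concrete, here is a minimal in-memory sketch, assuming unit-cost edit distance (insertions, deletions, substitutions). It is the classical column-by-column dynamic program that scans the whole text in O(|P| n) time, not the cache-oblivious index described in the abstract; the function name `k_error_matches` is hypothetical.

```python
def k_error_matches(text: str, pattern: str, k: int) -> list[int]:
    """Return 1-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`.

    Brute-force baseline for illustration only (assumption: unit-cost edits).
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]   # value for (i-1, j-1)
        col[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]     # value for (i, j-1)
            col[i] = min(
                prev_diag + (pattern[i - 1] != ch),  # match or substitute
                old + 1,                             # extra text character
                col[i - 1] + 1,                      # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_error_matches("abcdefg", "cde", 1))
```

An index replaces this linear scan with a precomputed structure so that the query cost depends on |P| and the number of occurrences rather than on n.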
10091 Abstracts Collection -- Data Structures
From February 28th to March 5th 2010, the Dagstuhl Seminar 10091 "Data
Structures" was held in Schloss Dagstuhl - Leibniz Center for
Informatics. It brought together 45 international researchers to
discuss recent developments concerning data structures in terms of
research, but also in terms of new technologies that impact how data
can be stored, updated, and retrieved. During the seminar a fair
number of participants presented their current research and open
problems were discussed. This document first briefly describes the
seminar topics and then gives the abstracts of the presentations given
during the seminar.
Computationally efficient algorithms for the two-dimensional Kolmogorov-Smirnov test
Goodness-of-fit statistics measure the compatibility of random samples against some theoretical or reference probability distribution function. The classical one-dimensional Kolmogorov-Smirnov test is a non-parametric statistic for comparing two empirical distributions which defines the largest absolute difference between the two cumulative distribution functions as a measure of disagreement. Adapting this test to more than one dimension is a challenge because there are $2^d - 1$ independent ways of ordering a cumulative distribution function in d dimensions. We discuss Peacock's version of the Kolmogorov-Smirnov test for two-dimensional data sets which computes the differences between cumulative distribution functions in $4n^2$ quadrants. We also examine Fasano and Franceschini's variation of Peacock's test, Cooke's algorithm for Peacock's test, and ROOT's version of the two-dimensional Kolmogorov-Smirnov test. We establish a lower-bound limit on the work for computing Peacock's test of $\Omega(n^2 \lg n)$, introducing optimal algorithms for both this and Fasano and Franceschini's test, and show that Cooke's algorithm is not a faithful implementation of Peacock's test. We also discuss and evaluate parallel algorithms for Peacock's test.
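The one-dimensional statistic that these two-dimensional variants generalize can be sketched as follows. This is a minimal illustration of the two-sample case, assuming the statistic is the largest absolute difference between the two empirical CDFs evaluated at every observed value; the function name `ks_statistic` is hypothetical and no p-value is computed.

```python
from bisect import bisect_right


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|.

    Both empirical CDFs are step functions, so it suffices to evaluate
    them at the observed sample values.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / n  # fraction of a that is <= x
        f_b = bisect_right(b_sorted, x) / m  # fraction of b that is <= x
        d = max(d, abs(f_a - f_b))
    return d


if __name__ == "__main__":
    print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

In one dimension there is a single natural ordering, which is why sorting is enough; the difficulty discussed above is that no such unique ordering exists in two or more dimensions.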
The two-dimensional Kolmogorov-Smirnov test
Goodness-of-fit statistics measure the compatibility of random samples against some theoretical
probability distribution function. The classical one-dimensional Kolmogorov-Smirnov test is a
non-parametric statistic for comparing two empirical distributions which defines the largest absolute
difference between the two cumulative distribution functions as a measure of disagreement.
Adapting this test to more than one dimension is a challenge because there are $2^d - 1$ independent
ways of defining a cumulative distribution function when d dimensions are involved. In this paper
three variations on the Kolmogorov-Smirnov test for multi-dimensional data sets are surveyed:
Peacock’s test [1] that computes in $O(n^3)$; Fasano and Franceschini’s test [2] that computes in
$O(n^2)$; Cooke’s test that computes in $O(n^2)$.
We prove that Cooke’s algorithm runs in $O(n^2)$, contrary to his claims that it runs in $O(n \lg n)$.
We also compare these algorithms with ROOT’s version of the Kolmogorov-Smirnov test.
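The cubic cost quoted for Peacock's test can be seen from a direct brute-force sketch: every pair of an x-coordinate and a y-coordinate from the combined sample serves as an origin, and each origin needs a linear pass to count the points in its four quadrants. The code below is an illustration under stated assumptions, not the implementation surveyed in the paper: the function names are hypothetical, and the tie convention (points on a boundary fall in the lower or left quadrant) is a choice made here.

```python
Point = tuple[float, float]


def quadrant_counts(points: list[Point], x0: float, y0: float) -> list[int]:
    """Count points in the four quadrants around origin (x0, y0).

    Assumed convention: a point on a boundary line belongs to the
    lower/left side (strict '>' selects the upper/right side).
    """
    counts = [0, 0, 0, 0]
    for x, y in points:
        counts[2 * (x > x0) + (y > y0)] += 1
    return counts


def peacock_statistic(s1: list[Point], s2: list[Point]) -> float:
    """Brute-force two-sample statistic: the largest difference in quadrant
    fractions over all origins (x_i, y_j) drawn from the combined sample."""
    n1, n2 = len(s1), len(s2)
    combined = s1 + s2
    xs = [p[0] for p in combined]
    ys = [p[1] for p in combined]
    d = 0.0
    for x0 in xs:            # n candidate x-coordinates
        for y0 in ys:        # n candidate y-coordinates -> n^2 origins
            c1 = quadrant_counts(s1, x0, y0)   # O(n) pass per origin
            c2 = quadrant_counts(s2, x0, y0)
            for q in range(4):
                d = max(d, abs(c1[q] / n1 - c2[q] / n2))
    return d


if __name__ == "__main__":
    print(peacock_statistic([(0, 0), (1, 1)], [(0, 1), (1, 0)]))
```

Restricting the origins to the n observed points instead of all n^2 coordinate pairs is the idea behind the quadratic variant attributed to Fasano and Franceschini.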
Succinct Indices for Range Queries with applications to Orthogonal Range Maxima
We consider the problem of preprocessing points in 2D, each endowed with
a priority, to answer the following queries: given an axis-parallel rectangle,
determine the point with the largest priority in the rectangle. Using the ideas
of the effective entropy of range maxima queries and succinct
indices for range maxima queries, we obtain a structure that uses O(N) words
and answers the above query in time. This is a direct
improvement of Chazelle's result from FOCS 1985 for this problem -- Chazelle
required words to answer queries in
time for any constant . Comment: To appear in ICALP 201
LIPIcs, Volume 244, ESA 2022, Complete Volume
Algorithms and Data Structures for Geometric Intersection Query Problems
University of Minnesota Ph.D. dissertation. September 2017. Major: Computer Science. Advisor: Ravi Janardan. 1 computer file (PDF); xi, 126 pages. The focus of this thesis is the topic of geometric intersection queries (GIQ), which has been very well studied by the computational geometry community and the database community. In a GIQ problem, the user is not interested in the entire input geometric dataset, but only in a small subset of it, and requests an informative summary of that small subset of data. Formally, the goal is to preprocess a set A of n geometric objects into a data structure so that given a query geometric object q, a certain aggregation function can be applied efficiently on the objects of A intersecting q. The classical aggregation functions studied in the literature are reporting or counting the objects of A intersecting q. In many applications, the same set A is queried several times, in which case one would like to answer a query faster by preprocessing A into a data structure. The goal is to organize the data into a data structure which occupies a small amount of space and yet responds to any user query in real time. In this thesis the study of the GIQ problems was conducted from the point of view of a computational geometry researcher. Given a model of computation and a GIQ problem, what are the best possible upper bounds (resp., lower bounds) on the space and the query time that can be achieved by a data structure? Also, what is the relative hardness of various GIQ problems and aggregate functions? Here relative hardness means that given two GIQ problems A and B (or, two aggregate functions f(A, q) and g(A, q)), which of them can be answered faster by a computer (assuming data structures for both of them occupy asymptotically the same amount of space)? This thesis presents results which increase our understanding of the above questions.
For many GIQ problems, data structures with optimal (or near-optimal) space and query time bounds have been achieved. The geometric settings studied are primarily orthogonal range searching, where the input is points and the query is an axes-aligned rectangle, and the dual setting of rectangle stabbing, where the input is a set of axes-aligned rectangles and the query is a point. The aggregation functions studied are primarily reporting, top-k, and approximate counting. Most of the data structures are built for the internal memory model (word-RAM or pointer machine model), but in some settings they are generic enough to be efficient in the I/O-model as well.
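The problem statement can be made concrete with a naive baseline for orthogonal range searching: no preprocessing, and every query scans all n points. This is a sketch for illustration only, not one of the thesis's data structures; the `Point` and `Rect` types, the `weight` field used for top-k, and the function names are all assumptions introduced here, and counting is exact rather than approximate.

```python
import heapq
from typing import NamedTuple


class Point(NamedTuple):
    x: float
    y: float
    weight: float  # hypothetical attribute used to rank points for top-k


class Rect(NamedTuple):
    x_lo: float
    x_hi: float
    y_lo: float
    y_hi: float


def report(points: list[Point], q: Rect) -> list[Point]:
    """Reporting: all points inside the axes-aligned query rectangle."""
    return [p for p in points
            if q.x_lo <= p.x <= q.x_hi and q.y_lo <= p.y <= q.y_hi]


def count(points: list[Point], q: Rect) -> int:
    """Counting: number of points inside the query rectangle (exact)."""
    return len(report(points, q))


def top_k(points: list[Point], q: Rect, k: int) -> list[Point]:
    """Top-k: the k heaviest points inside the query rectangle."""
    return heapq.nlargest(k, report(points, q), key=lambda p: p.weight)


if __name__ == "__main__":
    pts = [Point(1, 1, 5.0), Point(2, 3, 9.0), Point(4, 4, 1.0), Point(9, 9, 7.0)]
    query = Rect(0, 5, 0, 5)
    print(report(pts, query), count(pts, query), top_k(pts, query, 1))
```

Each query here costs time linear in n regardless of the output size; the data structures the abstract refers to trade preprocessing and space for query times that depend mainly on the size of the answer.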
LIPIcs, Volume 274, ESA 2023, Complete Volume