
83 research outputs found

Cache-oblivious index for approximate string matching

Author: Hon WK
Lam TW
Shah R
Tam SL
Vitter JS
Publication venue: 'Elsevier BV'
Publication date: 01/01/2011

This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies O((n log^k n)/B) disk pages and finds all k-error matches with O((|P| + occ)/B + log^k n log log_B n) I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require Ω(|P| + occ + poly(log n)) I/Os. The second index reduces the space to O((n log n)/B) disk pages, and the I/O complexity is O((|P| + occ)/B + log^{k(k+1)} n log log n). © 2011 Elsevier B.V. All rights reserved.
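
To make the k-error matching problem concrete, here is a minimal in-memory sketch based on the classical dynamic-programming approach (often attributed to Sellers). It is not the cache-oblivious index described in the abstract, only a baseline that defines what a k-error match is. The function name `k_error_matches` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def k_error_matches(text, pattern, k):
    """Return 1-based end positions j in `text` such that some substring
    ending at j is within edit distance k of `pattern`.

    Baseline O(|text| * |pattern|) scan, not an index.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # value for (i-1, j-1)
        col[0] = 0           # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]     # value for (i, j-1)
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in text
                         col[i - 1] + 1)    # missing character in text
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_error_matches("abcdefg", "bcd", 0))
    print(k_error_matches("abcdefg", "bcd", 1))
```

An index such as the one in the paper aims to avoid this full scan of T for every pattern.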

Elsevier - Publisher Connector

HKU Scholars Hub

10091 Abstracts Collection -- Data Structures

Author: Arge Lars
Demaine Erik D.
Seidel Raimund
Publication venue: Dagstuhl Seminar Proceedings. 10091 - Data Structures
Publication date: 01/01/2010

From February 28th to March 5th 2010, the Dagstuhl Seminar 10091 "Data Structures" was held in Schloss Dagstuhl - Leibniz Center for Informatics. It brought together 45 international researchers to discuss recent developments concerning data structures in terms of research, but also in terms of new technologies that impact how data can be stored, updated, and retrieved. During the seminar a fair number of participants presented their current research, and open problems were discussed. This document first briefly describes the seminar topics and then gives the abstracts of the presentations given during the seminar.

Dagstuhl Research Online Publication Server

Computationally efficient algorithms for the two-dimensional Kolmogorov-Smirnov test

Author: A Katsuki
H Sakai
Hazama R
I Ogawa
K Kishimoto
K Mukaida
Kishimoto T
R Hazama
S Umehara
S Yoshida
T Itamura
T Kishimoto
Umehara S
Publication venue: 'IOP Publishing'
Publication date: 01/01/2008

Goodness-of-fit statistics measure the compatibility of random samples against some theoretical or reference probability distribution function. The classical one-dimensional Kolmogorov-Smirnov test is a non-parametric statistic for comparing two empirical distributions which defines the largest absolute difference between the two cumulative distribution functions as a measure of disagreement. Adapting this test to more than one dimension is a challenge because there are 2^d - 1 independent ways of ordering a cumulative distribution function in d dimensions. We discuss Peacock's version of the Kolmogorov-Smirnov test for two-dimensional data sets, which computes the differences between cumulative distribution functions in 4n^2 quadrants. We also examine Fasano and Franceschini's variation of Peacock's test, Cooke's algorithm for Peacock's test, and ROOT's version of the two-dimensional Kolmogorov-Smirnov test. We establish a lower bound of Ω(n^2 lg n) on the work for computing Peacock's test, introduce optimal algorithms for both this and Fasano and Franceschini's test, and show that Cooke's algorithm is not a faithful implementation of Peacock's test. We also discuss and evaluate parallel algorithms for Peacock's test.
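
The quadrant-based idea can be illustrated with a brute-force sketch in the spirit of Peacock's two-sample statistic: every pair (x_i, y_j) of pooled coordinates serves as an origin, and the fractions of each sample falling in the four quadrants around it are compared. This is not one of the optimal algorithms from the paper. The function name `peacock_statistic_bruteforce` and the tie-handling rule (points on a boundary count as left or below) are assumptions.

```python
def peacock_statistic_bruteforce(a, b):
    """Largest absolute difference in quadrant fractions between two
    samples of (x, y) points, taken over all origins (x_i, y_j) built
    from the pooled coordinates. Cubic-time illustration only.
    """
    pooled = a + b
    xs = sorted({p[0] for p in pooled})
    ys = sorted({p[1] for p in pooled})

    def quadrant_fractions(sample, x0, y0):
        counts = [0, 0, 0, 0]
        for x, y in sample:
            right = 1 if x > x0 else 0   # assumption: ties go left
            up = 1 if y > y0 else 0      # assumption: ties go below
            counts[2 * right + up] += 1
        return [c / len(sample) for c in counts]

    best = 0.0
    for x0 in xs:
        for y0 in ys:
            fa = quadrant_fractions(a, x0, y0)
            fb = quadrant_fractions(b, x0, y0)
            best = max(best, max(abs(u - v) for u, v in zip(fa, fb)))
    return best


if __name__ == "__main__":
    same = [(0, 0), (1, 2), (2, 1)]
    print(peacock_statistic_bruteforce(same, same))
    print(peacock_statistic_bruteforce([(0, 0), (1, 1)], [(10, 10), (11, 11)]))
```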

CiteSeerX

Crossref

Brunel University Research Archive

The two-dimensional Kolmogorov-Smirnov test

Author: Hobson PR
Lopes RHC
Reid ID
Publication venue: 'Proceedings of Science Open Reviewed'
Publication date: 01/01/2007

Goodness-of-fit statistics measure the compatibility of random samples against some theoretical probability distribution function. The classical one-dimensional Kolmogorov-Smirnov test is a non-parametric statistic for comparing two empirical distributions which defines the largest absolute difference between the two cumulative distribution functions as a measure of disagreement. Adapting this test to more than one dimension is a challenge because there are 2^d − 1 independent ways of defining a cumulative distribution function when d dimensions are involved. In this paper three variations on the Kolmogorov-Smirnov test for multi-dimensional data sets are surveyed: Peacock’s test [1] that computes in O(n^3); Fasano and Franceschini’s test [2] that computes in O(n^2); Cooke’s test that computes in O(n^2). We prove that Cooke’s algorithm runs in O(n^2), contrary to his claims that it runs in O(n lg n). We also compare these algorithms with ROOT’s version of the Kolmogorov-Smirnov test.
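
As a point of reference for the one-dimensional case described above, the following minimal sketch computes the two-sample statistic as the largest absolute difference between the two empirical cumulative distribution functions. The function name `ks_two_sample` is an assumption, and the sketch returns only the statistic, not a significance level.

```python
from bisect import bisect_right


def ks_two_sample(a, b):
    """Largest absolute difference between the empirical CDFs of the
    non-empty samples a and b.

    Both CDFs are step functions that change only at sample values,
    so it is enough to evaluate them at every observed value.
    """
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)  # fraction of a that is <= x
        fb = bisect_right(sb, x) / len(sb)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


if __name__ == "__main__":
    print(ks_two_sample([1, 3], [2, 4]))
```

In one dimension there is a single natural ordering of the data, which is what makes this computation simple; the surveyed two-dimensional variants have to consider several quadrant orderings instead.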

CiteSeerX

Brunel University Research Archive

Succinct Indices for Range Queries with applications to Orthogonal Range Maxima

Author: C. Makris
G.S. Brodal
H. Yuan
J. Barbay
J. JáJá
M. Karpinski
M.A. Bender
M.J. Golin
P. Bose
T.M. Chan
Y. Nekrich
Publication date: 01/01/2012

We consider the problem of preprocessing N points in 2D, each endowed with a priority, to answer the following queries: given an axis-parallel rectangle, determine the point with the largest priority in the rectangle. Using the ideas of the effective entropy of range maxima queries and succinct indices for range maxima queries, we obtain a structure that uses O(N) words and answers the above query in O(log N log log N) time. This is a direct improvement of Chazelle's result from FOCS 1985 for this problem; Chazelle required O(N/ε) words to answer queries in O((log N)^{1+ε}) time for any constant ε > 0. Comment: To appear in ICALP 201
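
The query itself can be pinned down with a linear-scan baseline. This sketch is only a specification of what the query returns, not the succinct structure from the abstract. The function name `range_max` and the representation of points as `(x, y, priority)` tuples are assumptions.

```python
def range_max(points, x1, x2, y1, y2):
    """Return the (x, y, priority) tuple with the largest priority inside
    the closed rectangle [x1, x2] x [y1, y2], or None if it is empty.

    O(N) per query; a preprocessed structure aims to do far better.
    """
    best = None
    for x, y, priority in points:
        if x1 <= x <= x2 and y1 <= y <= y2:
            if best is None or priority > best[2]:
                best = (x, y, priority)
    return best


if __name__ == "__main__":
    pts = [(1, 1, 5), (2, 3, 9), (4, 4, 7), (6, 1, 10)]
    print(range_max(pts, 0, 5, 0, 5))
```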

arXiv.org e-Print Archive

CiteSeerX

Crossref

Leicester Research Archive

LIPIcs, Volume 244, ESA 2022, Complete Volume

Author: Chechik Shiri
Herman Grzegorz
Navarro Gonzalo
Rotenberg Eva
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual European Symposium on Algorithms (ESA 2022)
Publication date: 01/01/2022

LIPIcs, Volume 244, ESA 2022, Complete Volume

Dagstuhl Research Online Publication Server

Algorithms and Data Structures for Geometric Intersection Query Problems

Author: Saladi Rahul
Publication date: 01/09/2017

University of Minnesota Ph.D. dissertation. September 2017. Major: Computer Science. Advisor: Ravi Janardan. 1 computer file (PDF); xi, 126 pages. The focus of this thesis is the topic of geometric intersection queries (GIQ), which has been very well studied by the computational geometry community and the database community. In a GIQ problem, the user is not interested in the entire input geometric dataset, but only in a small subset of it, and requests an informative summary of that small subset of data. Formally, the goal is to preprocess a set A of n geometric objects into a data structure so that given a query geometric object q, a certain aggregation function can be applied efficiently on the objects of A intersecting q. The classical aggregation functions studied in the literature are reporting or counting the objects of A intersecting q. In many applications, the same set A is queried several times, in which case one would like to answer a query faster by preprocessing A into a data structure. The goal is to organize the data into a data structure which occupies a small amount of space and yet responds to any user query in real time. In this thesis the study of the GIQ problems was conducted from the point of view of a computational geometry researcher. Given a model of computation and a GIQ problem, what are the best possible upper bounds (resp., lower bounds) on the space and the query time that can be achieved by a data structure? Also, what is the relative hardness of various GIQ problems and aggregate functions? Here relative hardness means that given two GIQ problems A and B (or, two aggregate functions f(A, q) and g(A, q)), which of them can be answered faster by a computer (assuming data structures for both of them occupy asymptotically the same amount of space)? This thesis presents results which increase our understanding of the above questions.
For many GIQ problems, data structures with optimal (or near-optimal) space and query time bounds have been achieved. The geometric settings studied are primarily orthogonal range searching, where the input is points and the query is an axes-aligned rectangle, and the dual setting of rectangle stabbing, where the input is a set of axes-aligned rectangles and the query is a point. The aggregation functions studied are primarily reporting, top-k, and approximate counting. Most of the data structures are built for the internal memory model (word-RAM or pointer machine model), but in some settings they are generic enough to be efficient in the I/O-model as well.
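
The rectangle stabbing setting with reporting, counting, and top-k aggregation can be sketched as a brute-force scan. This only illustrates the query semantics and is not one of the data structures from the thesis. The function names, the `(x1, y1, x2, y2, weight)` tuple layout, and the use of a weight to rank rectangles for top-k are assumptions.

```python
import heapq


def stabbed(rects, q):
    """Report all rectangles (x1, y1, x2, y2, weight) that contain q."""
    qx, qy = q
    return [r for r in rects if r[0] <= qx <= r[2] and r[1] <= qy <= r[3]]


def stab_count(rects, q):
    """Count the rectangles containing q (exact, not approximate)."""
    return len(stabbed(rects, q))


def stab_top_k(rects, q, k):
    """Return the k heaviest rectangles containing q, heaviest first."""
    return heapq.nlargest(k, stabbed(rects, q), key=lambda r: r[4])


if __name__ == "__main__":
    rects = [(0, 0, 4, 4, 3), (1, 1, 2, 2, 8), (3, 3, 6, 6, 5), (0, 0, 9, 9, 1)]
    print(stab_count(rects, (1.5, 1.5)))
    print(stab_top_k(rects, (1.5, 1.5), 2))
```

Each query here costs time linear in the number of rectangles; the thesis asks how much preprocessing and space are needed to answer the same queries faster.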

University of Minnesota Digital Conservancy

LIPIcs, Volume 274, ESA 2023, Complete Volume

Author: Farach-Colton Martin
Herman Grzegorz
Puglisi Simon J.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual European Symposium on Algorithms (ESA 2023)
Publication date: 01/01/2023

LIPIcs, Volume 274, ESA 2023, Complete Volume

Dagstuhl Research Online Publication Server