Succinct Indices for Range Queries with applications to Orthogonal Range Maxima
We consider the problem of preprocessing points in 2D, each endowed with
a priority, to answer the following queries: given an axis-parallel rectangle,
determine the point with the largest priority in the rectangle. Using the ideas
of the \emph{effective entropy} of range maxima queries and \emph{succinct
indices} for range maxima queries, we obtain a structure that uses O(N) words
and answers the above query in time. This is a direct
improvement of Chazelle's result from FOCS 1985 for this problem -- Chazelle
required words to answer queries in
time for any constant .
Comment: To appear in ICALP 201
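To make the query in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the succinct structure the abstract describes; it scans every point on each query and only pins down what an orthogonal range maxima query returns. The names `PriorityPoint` and `range_max` are hypothetical, chosen for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class PriorityPoint(NamedTuple):
    """A 2D point with a priority (hypothetical type for illustration)."""
    x: float
    y: float
    priority: float


def range_max(points: Sequence[PriorityPoint],
              x_lo: float, x_hi: float,
              y_lo: float, y_hi: float) -> Optional[PriorityPoint]:
    """Return the highest-priority point inside the closed rectangle
    [x_lo, x_hi] x [y_lo, y_hi], or None if the rectangle is empty.

    Linear scan per query: a baseline, not the indexed structure.
    """
    best: Optional[PriorityPoint] = None
    for p in points:
        inside = x_lo <= p.x <= x_hi and y_lo <= p.y <= y_hi
        if inside and (best is None or p.priority > best.priority):
            best = p
    return best


pts = [
    PriorityPoint(1, 1, 5),
    PriorityPoint(2, 3, 9),
    PriorityPoint(4, 2, 7),
    PriorityPoint(6, 6, 10),
]
result = range_max(pts, 0, 5, 0, 4)  # the point (6, 6) lies outside the rectangle
```

A preprocessed index aims to answer the same query without touching every point, trading storage for query time.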
Fast Computation of Output-Sensitive Maxima in a Word RAM
In this paper, we study the problem of computing the maxima of a set of n points in three dimensions with integer coordinates and show that in a word RAM, the maxima can be found in $O(n \log\log_{n/h} n)$ deterministic time, where h is the output size. For $h = n^{1-\alpha}$ this is $O(n \log(1/\alpha))$. This improves the previous $O(n \log\log h)$ time algorithm and can be considered surprising since it gives a linear time algorithm when $\alpha > 0$ is a constant, which is faster than the current best deterministic and randomized integer sorting algorithms. We observe that improving this running time is most likely difficult since it requires breaking a number of important barriers, even if randomization is allowed. Additionally, we show that the same deterministic running time could be achieved for performing n point location queries in an arrangement of size h. Finally, our maxima result can be extended to higher dimensions by paying a $\log_{n/h} n$ factor penalty per dimension. This has further interesting consequences; for example, it preserves the linear running time when $h \le n^{1-\alpha}$, for a constant $\alpha > 0$, and thus it shows that for a variety of input distributions the maxima can be computed in linear expected time without knowing the distribution.
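As a brief check of the special case quoted in this abstract, a change of base with $h = n^{1-\alpha}$ (so that $n/h = n^{\alpha}$) gives:

\[
\log_{n/h} n \;=\; \frac{\log n}{\log (n/h)} \;=\; \frac{\log n}{\alpha \log n} \;=\; \frac{1}{\alpha},
\qquad\text{hence}\qquad
O\!\left(n \log\log_{n/h} n\right) \;=\; O\!\left(n \log(1/\alpha)\right).
\]

For any constant $\alpha > 0$ the factor $\log(1/\alpha)$ is itself a constant, which is why the bound is linear in that regime.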
Efficient Indexing for Structured and Unstructured Data
The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources such as text feeds, biological sequencers, internet traffic over routers, sensors, and many others. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce the query time by preprocessing the data. The diversity of data sources in the real world makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured (e.g., relational tables) or unstructured (e.g., free text). Moreover, increasingly many applications need to seamlessly analyze both kinds of data, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, etc. Probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g., sensor networks. This dissertation aims to propose efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs and full-text document search. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation.
On Geometric Range Searching, Approximate Counting and Depth Problems
In this thesis we deal with problems connected to range searching,
which is one of the central areas of computational geometry.
The dominant problems in this area are
halfspace range searching, simplex range searching and orthogonal range searching and
research into these problems has spanned decades.
For many range searching problems, the best possible
data structures cannot offer fast (i.e., polylogarithmic) query
times if we limit ourselves to near linear storage.
Even worse, it is conjectured (and proved in some cases)
that only very small improvements to these might be possible.
This inefficiency has encouraged many researchers to seek alternatives through approximations.
In this thesis we continue this line of research and focus on
relative approximation of range counting problems.
One important problem where it is possible to achieve significant speedup
through approximation is halfspace range counting in 3D.
Here we continue the previous research done
and obtain the first optimal data structure for approximate halfspace range counting in 3D.
Our data structure has the slight advantage of being Las Vegas (the result is always correct) in contrast
to the previous methods that were Monte Carlo (the correctness holds with high probability).
Another series of problems where approximation can provide us with
substantial speedup comes from robust statistics.
We recognize three problems here:
approximate Tukey depth, regression depth and simplicial depth queries.
In 2D, we obtain an optimal data structure capable of approximating
the regression depth of a query hyperplane.
We also offer a linear space data structure which can answer approximate
Tukey depth queries efficiently in 3D.
These data structures are obtained by applying our ideas for the
approximate halfspace counting problem.
Approximating the simplicial depth turns out to be much more
difficult, however.
Computing the simplicial depth of a given point is more computationally
challenging than most other definitions of data depth.
In 2D we obtain the first data structure which uses near linear space
and can answer approximate simplicial depth queries in polylogarithmic time.
As applications of this result, we provide two non-trivial methods to
approximate the simplicial depth of a given point in higher dimensions.
Along the way, we establish a tight combinatorial relationship between
the Tukey depth of any given point and its simplicial depth.
Another problem investigated in this thesis is the dominance reporting problem,
an important special case of orthogonal range reporting.
In three dimensions, we solve this
problem in the pointer machine model and the external memory model
by offering the first optimal data structures in these models of computation.
Also, in the RAM model and for points from
an integer grid we reduce the space complexity of the fastest
known data structure to optimal.
Using known techniques in the literature, we can use our
results to obtain solutions for the orthogonal range searching problem as well.
The query complexity offered by our orthogonal range reporting data structures
matches the most efficient query complexities
known in the literature but our space bounds are lower than the previous methods in the external
memory model and RAM model where the input is a subset of an integer grid.
The results also yield improved orthogonal range searching in
higher dimensions (which shows the significance
of the dominance reporting problem).
Intersection searching is a generalization of range searching where
we deal with more complicated geometric objects instead of points.
We investigate the rectilinear disjoint polygon counting problem
which is a specialized intersection counting problem.
We provide a linear-size data structure capable of counting
the number of disjoint rectilinear polygons
intersecting any rectilinear polygon of constant size.
The query time (as well as some other properties of our data structure) resembles
the classical simplex range searching data structures.
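The dominance reporting problem mentioned in this abstract can be stated in a few lines of Python. The sketch below is a brute-force definition only, assuming the common convention that a query point q dominates p when every coordinate of p is at most the corresponding coordinate of q; it is not one of the thesis's data structures, and the names `dominates` and `dominance_report` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def dominates(q: Point3, p: Point3) -> bool:
    """True if q dominates p, i.e. p is no larger than q in every coordinate
    (assumed convention for this sketch)."""
    return all(pc <= qc for pc, qc in zip(p, q))


def dominance_report(points: Sequence[Point3], q: Point3) -> List[Point3]:
    """Report every input point dominated by q.

    Linear scan per query; an indexed structure would aim for query time
    that depends on the output size rather than on len(points).
    """
    return [p for p in points if dominates(q, p)]


pts = [(1, 2, 3), (4, 1, 1), (2, 2, 2), (5, 5, 5)]
reported = dominance_report(pts, (3, 3, 3))
```

A dominance query is an orthogonal range query whose box is unbounded from below in every dimension, which is why it is described as a special case of orthogonal range reporting.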
Algorithms and Data Structures for Geometric Intersection Query Problems
University of Minnesota Ph.D. dissertation. September 2017. Major: Computer Science. Advisor: Ravi Janardan. 1 computer file (PDF); xi, 126 pages.
The focus of this thesis is the topic of geometric intersection queries (GIQ), which has been very well studied by the computational geometry community and the database community. In a GIQ problem, the user is not interested in the entire input geometric dataset, but only in a small subset of it, and requests an informative summary of that small subset of data. Formally, the goal is to preprocess a set A of n geometric objects into a data structure so that given a query geometric object q, a certain aggregation function can be applied efficiently on the objects of A intersecting q. The classical aggregation functions studied in the literature are reporting or counting the objects of A intersecting q. In many applications, the same set A is queried several times, in which case one would like to answer a query faster by preprocessing A into a data structure. The goal is to organize the data into a data structure which occupies a small amount of space and yet responds to any user query in real time. In this thesis the study of the GIQ problems was conducted from the point of view of a computational geometry researcher. Given a model of computation and a GIQ problem, what are the best possible upper bounds (resp., lower bounds) on the space and the query time that can be achieved by a data structure? Also, what is the relative hardness of various GIQ problems and aggregate functions? Here relative hardness means that given two GIQ problems A and B (or, two aggregate functions f(A, q) and g(A, q)), which of them can be answered faster by a computer (assuming data structures for both of them occupy asymptotically the same amount of space)? This thesis presents results which increase our understanding of the above questions.
For many GIQ problems, data structures with optimal (or near-optimal) space and query time bounds have been achieved. The geometric settings studied are primarily orthogonal range searching where the input is points and the query is an axes-aligned rectangle, and the dual setting of rectangle stabbing where the input is a set of axes-aligned rectangles and the query is a point. The aggregation functions studied are primarily reporting, top-k, and approximate counting. Most of the data structures are built for the internal memory model (word-RAM or pointer machine model), but in some settings they are generic enough to be efficient in the I/O-model as well.
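The GIQ framing above (a fixed set A, a query object q, and an aggregation over the objects of A intersecting q) can be sketched for the rectangle stabbing setting as follows. This is a brute-force illustration, not a structure from the thesis: the `Rect` type, its `weight` field (an assumed ranking key for top-k), and the function names are all hypothetical, and exact counting stands in for approximate counting.

```python
import heapq
from typing import List, NamedTuple, Sequence


class Rect(NamedTuple):
    """Axis-aligned rectangle with a weight used only for top-k ranking
    (the weight is an assumption of this sketch)."""
    x_lo: float
    x_hi: float
    y_lo: float
    y_hi: float
    weight: float


def report(rects: Sequence[Rect], qx: float, qy: float) -> List[Rect]:
    """Reporting: all rectangles of A stabbed by (contain) the query point."""
    return [r for r in rects
            if r.x_lo <= qx <= r.x_hi and r.y_lo <= qy <= r.y_hi]


def count(rects: Sequence[Rect], qx: float, qy: float) -> int:
    """Counting: how many rectangles the query point stabs (exact here)."""
    return len(report(rects, qx, qy))


def top_k(rects: Sequence[Rect], qx: float, qy: float, k: int) -> List[Rect]:
    """Top-k: the k heaviest stabbed rectangles, in decreasing weight."""
    return heapq.nlargest(k, report(rects, qx, qy), key=lambda r: r.weight)


A = [
    Rect(0, 4, 0, 4, 1.0),
    Rect(1, 3, 1, 3, 5.0),
    Rect(2, 6, 2, 6, 3.0),
    Rect(5, 7, 5, 7, 9.0),
]
n_hits = count(A, 2.5, 2.5)
best_two = top_k(A, 2.5, 2.5, k=2)
```

All three aggregations share the same intersection test and differ only in the summary they return, which is the sense in which one can compare the relative hardness of aggregate functions over a common geometric setting.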