Optimal Color Range Reporting in One Dimension
Color (or categorical) range reporting is a variant of the orthogonal range
reporting problem in which every point in the input is assigned a \emph{color}.
While the answer to an orthogonal point reporting query contains all points in
the query range, the answer to a color reporting query contains only the
distinct colors of the points in that range. In this paper we describe an
$O(N)$-space data structure that answers one-dimensional color reporting
queries in optimal time, measured in terms of the number of colors in the
answer; here $N$ is the number of points in the data structure. Our result can
also be dynamized and extended to the external memory model.
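To make the problem statement concrete, here is a minimal Python sketch of a simple baseline for one-dimensional color reporting. It is not the data structure from the paper: it uses the well-known "previous occurrence of the same color" trick and scans every point in the query range, so its query time grows with the number of points in the range rather than with the number of colors. The function names `build_prev` and `color_report` are made up for this illustration.

```python
from bisect import bisect_left, bisect_right


def build_prev(colors):
    """For each point i, store the index of the previous point that has
    the same color, or -1 if there is none (points are in sorted order)."""
    last_seen = {}
    prev = []
    for i, c in enumerate(colors):
        prev.append(last_seen.get(c, -1))
        last_seen[c] = i
    return prev


def color_report(xs, colors, prev, a, b):
    """Return the distinct colors of points x with a <= x <= b.

    xs must be sorted; colors[i] is the color of xs[i].
    Baseline only: this scans every point in the range.
    """
    lo = bisect_left(xs, a)
    hi = bisect_right(xs, b)
    # A point is the leftmost occurrence of its color inside the range
    # exactly when its previous same-colored point lies left of the range.
    return [colors[i] for i in range(lo, hi) if prev[i] < lo]


xs = [1, 3, 4, 7, 9, 12]
colors = ["red", "blue", "red", "green", "blue", "red"]
prev = build_prev(colors)
print(color_report(xs, colors, prev, 3, 9))
```

Each color in the range is reported exactly once, by its leftmost occurrence. The inefficiency of this baseline is the scan over all points in the range, which is the cost an optimal structure has to avoid.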
On Optimal Top-K String Retrieval
Let a set of (string) documents be given. The top-$k$ document retrieval
problem is to index this set such that, when a pattern and a parameter $k$
come as a query, the index returns the $k$ most relevant documents to the
pattern. Hon et al. \cite{HSV09} gave the first linear-space framework to
solve this problem, and its query time was subsequently improved by Navarro
and Nekrich \cite{NN12}. These results are
powerful enough to support arbitrary relevance functions like frequency,
proximity, PageRank, etc. In many applications like desktop or email search,
the data resides on disk and hence disk-bound indexes are needed. Despite
continued progress on this problem in terms of theoretical, practical and
compression aspects, any non-trivial bounds in the external memory model have so
far been elusive. Internal memory (or RAM) solutions to this problem decompose
the problem into subproblems and thus incur an additive factor in the query
cost. In external memory, these approaches lead to more I/Os than the optimal
I/O term, which depends on the block size. We re-interpret the problem, in a
way that avoids this additive factor, as interval stabbing with priority over
a tree-shaped structure. This leads us to a linear-space index in external
memory supporting top-$k$ queries (with unsorted outputs) in a near-optimal
number of I/Os. We then obtain a second index, with a different space bound,
that achieves optimal I/Os.
Comment: 3 figures
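As an illustration of the top-$k$ document retrieval problem itself (not of the index described in the abstract), the following Python sketch answers a query by brute force, using term frequency as the relevance function. It rescans every document on each query, which is exactly the cost an index is meant to avoid. The names `count_occurrences` and `top_k_documents`, and the choice to break ties by lower document number, are assumptions made for this example.

```python
import heapq


def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Return the indices of the k documents in which pattern occurs most
    often, most relevant first. Documents without a match are skipped."""
    scored = []
    for doc_id, text in enumerate(docs):
        freq = count_occurrences(text, pattern)
        if freq > 0:
            scored.append((freq, doc_id))
    # Higher frequency wins; ties go to the lower document number.
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [doc_id for _, doc_id in best]


docs = ["banana", "bandana", "cabana", "apple"]
print(top_k_documents(docs, "ana", 2))
```

Other relevance functions mentioned in the abstract, such as proximity or PageRank, would replace the frequency score here; the query interface stays the same.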
Efficient Indexing for Structured and Unstructured Data
The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources such as text feeds, biological sequencers, internet traffic over routers, sensors and many others. To mine useful information from these sources, users have to query the data. Indexing techniques aim to reduce the query time by preprocessing the data. The diversity of real-world data sources makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured (e.g., relational tables) or unstructured (e.g., free text). Moreover, an increasing number of applications need to analyze both kinds of data seamlessly, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, and so on. Probabilistic models have recently been proposed for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g., sensor networks. This dissertation proposes efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs and full-text document search. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation.