10,271 research outputs found
Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes
Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string
over S. A "secondary index" for x answers alphabet range queries of the form:
Given a range [a_l,a_r] over S, return the set I_{[a_l;a_r]} = {i |x_i \in
[a_l; a_r]}. Secondary indexes are heavily used in relational databases and
scientific data analysis. It is well-known that the obvious solution, storing a
dictionary for the position set associated with each character, does not always
give optimal query time. In this paper we give the first theoretically optimal
data structure for the secondary indexing problem. In the I/O model, the amount
of data read when answering a query is within a constant factor of the minimum
space needed to represent I_{[a_l;a_r]}, assuming that the size of internal
memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space
usage of the data structure is O(n log |S|) bits in the worst case, and we
further show how to bound the size of the data structure in terms of the 0-th
order entropy of x. We show how to support updates achieving various time-space
trade-offs.
We also consider an approximate version of the basic secondary indexing
problem where a query reports a superset of I_{[a_l;a_r]} containing each
element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon >
0 is the false positive probability. For this problem the amount of data that
needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}|
log(1/epsilon)) bits.Comment: 16 page
On the Benefits of Non-Canonical Filtering in Publish/Subscribe Systems
Current matching approaches in pub/sub systems only allow conjunctive subscriptions. Arbitrary subscriptions have to be transformed into canonical expressions, e.g., DNFs, and need to be treated as several conjunctive subscriptions. This technique is known from database systems and allows us to apply more efficient filtering algorithms. Since pub/sub systems are the contrary to traditional database systems, it is questionable if filtering several canonical subscriptions is the most efficient and scalable way of dealing with arbitrary subscriptions. In this paper we show that our filtering approach supporting arbitrary Boolean subscriptions is more scalable and efficient than current matching algorithms requiring transformations of subscriptions into DNFs
Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes
Full-text search engines are important tools for information retrieval. Term
proximity is an important factor in relevance score measurement. In a proximity
full-text search, we assume that a relevant document contains query terms near
each other, especially if the query terms are frequently occurring words. A
methodology for high-performance full-text query execution is discussed. We
build additional indexes to achieve better efficiency. For a word that occurs
in the text, we include in the indexes some information about nearby words.
What types of additional indexes do we use? How do we use them? These questions
are discussed in this work. We present the results of experiments showing that
the average time of search query execution is 44-45 times less than that
required when using ordinary inverted indexes.
This is a pre-print of a contribution "Veretennikov A.B. Proximity Full-Text
Search with a Response Time Guarantee by Means of Additional Indexes" published
in "Arai K., Kapoor S., Bhatia R. (eds) Intelligent Systems and Applications.
IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868"
published by Springer, Cham. The final authenticated version is available
online at: https://doi.org/10.1007/978-3-030-01054-6_66. The work was supported
by Act 211 Government of the Russian Federation, contract no 02.A03.21.0006.Comment: Alexander B. Veretennikov. Chair of Calculation Mathematics and
Computer Science, INSM. Ural Federal Universit
Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores
Modern business applications and scientific databases call for inherently
dynamic data storage environments. Such environments are characterized by two
challenging features: (a) they have little idle system time to devote on
physical design; and (b) there is little, if any, a priori workload knowledge,
while the query and data workload keeps changing dynamically. In such
environments, traditional approaches to index building and maintenance cannot
apply. Database cracking has been proposed as a solution that allows on-the-fly
physical data reorganization, as a collateral effect of query processing.
Cracking aims to continuously and automatically adapt indexes to the workload
at hand, without human intervention. Indexes are built incrementally,
adaptively, and on demand. Nevertheless, as we show, existing adaptive indexing
methods fail to deliver workload-robustness; they perform much better with
random workloads than with others. This frailty derives from the inelasticity
with which these approaches interpret each query as a hint on how data should
be stored. Current cracking schemes blindly reorganize the data within each
query's range, even if that results into successive expensive operations with
minimal indexing benefit. In this paper, we introduce stochastic cracking, a
significantly more resilient approach to adaptive indexing. Stochastic cracking
also uses each query as a hint on how to reorganize data, but not blindly so;
it gains resilience and avoids performance bottlenecks by deliberately applying
certain arbitrary choices in its decision-making. Thereby, we bring adaptive
indexing forward to a mature formulation that confers the workload-robustness
previous approaches lacked. Our extensive experimental study verifies that
stochastic cracking maintains the desired properties of original database
cracking while at the same time it performs well with diverse realistic
workloads.Comment: VLDB201
Indexing arbitrary-length -mers in sequencing reads
We propose a lightweight data structure for indexing and querying collections
of NGS reads data in main memory. The data structure supports the interface
proposed in the pioneering work by Philippe et al. for counting and locating
-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array),
based on finding overlapping reads, is competitive to the existing algorithms
in the space use, query times, or both. The main applications of our index
include variant calling, error correction and analysis of reads from RNA-seq
experiments
- …