RkNN Query Processing in Distributed Spatial Infrastructures: A Performance Study
The Reverse k-Nearest Neighbor (RkNN) problem, i.e., finding all objects in a dataset that have a given query point among their k-nearest neighbors, has received increasing attention in recent years. RkNN queries are of particular interest in a wide range of applications, such as decision support systems, resource allocation, profile-based marketing, and location-based services. With the current growing volume of spatial data, it is difficult to perform RkNN queries efficiently in spatial data-intensive applications because of limited computational capability and storage resources. In this paper, we investigate how to design and implement distributed RkNN query algorithms using shared-nothing spatial cloud infrastructures such as SpatialHadoop and LocationSpark. SpatialHadoop is a framework that inherently supports spatial indexing on top of Hadoop to perform spatial queries efficiently. LocationSpark is a recent spatial data processing system built on top of Spark. We have evaluated the performance of the distributed RkNN query algorithms on both SpatialHadoop and LocationSpark with big real-world datasets. The experiments demonstrate the efficiency and scalability of our proposal in both distributed spatial data management systems, showing the performance advantages of LocationSpark.
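As a rough illustration of the query semantics only (not the paper's distributed algorithms), the RkNN definition can be sketched by brute force: a point belongs to the answer if the query point is among its k nearest neighbors. All names here are hypothetical.

```python
from math import dist

def knn(points, p, k):
    """Return the k nearest neighbors of p among points, excluding p itself
    (points are assumed distinct in this sketch)."""
    others = [q for q in points if q != p]
    return sorted(others, key=lambda q: dist(p, q))[:k]

def rknn(points, query, k):
    """Reverse k-NN: all points that have `query` among their k nearest
    neighbors, where neighbors are drawn from the points plus the query."""
    universe = points + [query]
    return [p for p in points if query in knn(universe, p, k)]

# Toy example with 1-D points (as tuples): only the two leftmost points
# have (0.5,) as their nearest neighbor.
pts = [(0.0,), (1.0,), (5.0,), (6.0,)]
print(rknn(pts, (0.5,), k=1))  # [(0.0,), (1.0,)]
```

Real systems avoid this quadratic scan by pruning with spatial indexes, which is what SpatialHadoop and LocationSpark provide at cluster scale.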
Indexing arbitrary-length k-mers in sequencing reads
We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with existing algorithms in space usage, query times, or both. The main applications of our index include variant calling, error correction, and analysis of reads from RNA-seq experiments.
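The counting and locating interface can be made concrete with a minimal hash-based k-mer index, a simple baseline rather than PgSA's suffix-array approach; the function names are illustrative.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map each k-mer to its occurrences as (read_id, offset) pairs,
    supporting both locate (the list) and count (its length) queries."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].append((rid, i))
    return index

reads = ["ACGTAC", "GTACGT"]
idx = build_kmer_index(reads, k=3)
print(idx["GTA"])       # locate: [(0, 2), (1, 0)]
print(len(idx["GTA"]))  # count: 2
```

A hash index like this fixes k at build time and stores every occurrence explicitly; supporting arbitrary-length k-mers in compact space is exactly the harder problem the paper addresses.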
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
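One of the effects described above can be sketched in a few lines: when a collection is highly repetitive, the gap-encoded form of an inverted list is itself very regular, so a second layer such as run-length coding compresses it much further. This is a simplified illustration of the idea, not the paper's actual Lempel-Ziv or grammar-based encoders.

```python
def gaps(postings):
    """Gap-encode a sorted posting list (the usual practice)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(seq):
    """Run-length encode a sequence as [value, run_count] pairs."""
    out = []
    for x in seq:
        if out and out[-1][0] == x:
            out[-1][1] += 1
        else:
            out.append([x, 1])
    return out

# A term occurring in every version of a document shows up at regular
# doc-id intervals, so its 30-entry gap list collapses to two runs.
postings = list(range(10, 100, 3))
print(run_length(gaps(postings)))  # [[10, 1], [3, 29]]
```

On a non-repetitive collection the gaps would be irregular and run-length coding would gain little, which matches the paper's framing that the advantage is specific to the versioned, near-copy setting.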
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.

Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Reverse-Safe Data Structures for Text Indexing
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method into data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.
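The z-reverse-safe definition can be checked by brute force on tiny inputs: count how many texts are indistinguishable from the original under every counting query of length at most d. This toy checker (all names hypothetical) only illustrates why truncating d raises z; it has nothing to do with the paper's efficient construction.

```python
from itertools import product

def answers(text, d):
    """Occurrence counts of every substring of length <= d."""
    ans = {}
    for m in range(1, d + 1):
        for i in range(len(text) - m + 1):
            p = text[i:i + m]
            ans[p] = ans.get(p, 0) + 1
    return ans

def reverse_safety(text, d, alphabet="ab"):
    """z for this text and d: the number of same-length texts whose
    answers to all queries of length <= d coincide (brute force)."""
    target = answers(text, d)
    return sum(1 for t in product(alphabet, repeat=len(text))
               if answers("".join(t), d) == target)

# Restricting queries to single letters leaves 6 consistent texts
# (any arrangement of two a's and two b's); allowing full-length
# queries pins down the text uniquely.
print(reverse_safety("abab", d=1))  # 6
print(reverse_safety("abab", d=4))  # 1
```

The trade-off the paper optimizes is visible even here: a smaller d gives more privacy (larger z) but answers fewer useful queries.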
A Density-Based Approach to the Retrieval of Top-K Spatial Textual Clusters
Keyword-based web queries with local intent retrieve web content that is relevant to supplied keywords and that represents points of interest near the query location. Two broad categories of such queries exist. The first
encompasses queries that retrieve single spatial web objects that each satisfy
the query arguments. Most proposals belong to this category. The second
category, to which this paper's proposal belongs, encompasses queries that
support exploratory user behavior and retrieve sets of objects that represent
regions of space that may be of interest to the user. Specifically, the paper
proposes a new type of query, namely the top-k spatial textual clusters (k-STC) query, that returns the top-k clusters that (i) are located closest to a
given query location, (ii) contain the most relevant objects with regard to
given query keywords, and (iii) have an object density that exceeds a given
threshold. To compute this query, we propose a basic algorithm that relies on
on-line density-based clustering and exploits an early stop condition. To
improve the response time, we design an advanced approach that includes three
techniques: (i) an object skipping rule, (ii) spatially gridded posting lists,
and (iii) a fast range query algorithm. An empirical study on real data demonstrates that the paper's proposals offer scalability and are capable of excellent performance.
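The three ranking criteria can be illustrated with a simplified scorer: clusters below the density threshold are discarded, and the rest are ranked by a combined distance/keyword-relevance score. The score formula and all names are assumptions for illustration, not the paper's actual ranking function or clustering algorithm.

```python
from math import dist

def top_k_clusters(clusters, qloc, qkeywords, k, min_density):
    """Rank clusters passing the density threshold; lower score is better.
    A cluster is a dict with 'center', 'objects' (keyword sets), 'area'."""
    qk = set(qkeywords)
    ranked = []
    for c in clusters:
        density = len(c["objects"]) / c["area"]
        if density < min_density:          # criterion (iii)
            continue
        # criterion (ii): mean fraction of query keywords each object covers
        rel = sum(len(qk & o) / len(qk) for o in c["objects"]) / len(c["objects"])
        # criterion (i): distance, discounted by relevance (toy combination)
        score = dist(qloc, c["center"]) * (1.0 - rel + 1e-9)
        ranked.append((score, c))
    ranked.sort(key=lambda t: t[0])
    return [c for _, c in ranked[:k]]

near = {"center": (1.0, 0.0), "objects": [{"cafe"}, {"cafe", "wifi"}], "area": 1.0}
far = {"center": (9.0, 0.0), "objects": [{"cafe"}] * 4, "area": 1.0}
sparse = {"center": (0.5, 0.0), "objects": [{"cafe"}], "area": 10.0}
best = top_k_clusters([near, far, sparse], (0.0, 0.0), ["cafe"], k=1, min_density=1.0)
print(best[0]["center"])  # (1.0, 0.0): dense, relevant, and closest
```

The sparse cluster is closest but fails the density threshold, which is exactly the filtering role criterion (iii) plays in the k-STC query.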
Improved ESP-index: a practical self-index for highly repetitive texts
While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real-world repetitive texts remains a challenge. ESP-index is a grammar-based self-index built on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds on parsing discrepancies between different appearances of the same subtexts in a text. Although ESP-index performs efficient top-down searches of query texts, it has a serious issue with the binary searches used to find appearances of variables for a query text, which slow down query searches. We present an improved ESP-index (ESP-index-I) that leverages the idea behind succinct data structures for large alphabets. While ESP-index-I retains the same efficiency as ESP-index for top-down searches, it avoids the binary searches by using fast rank/select operations. We experimentally test ESP-index-I on its ability to search query texts and extract subtexts from large-scale real-world repetitive texts, and we show that ESP-index-I performs better than other possible approaches.

Comment: This is the full version of a paper accepted to the 11th International Symposium on Experimental Algorithms (SEA2014)
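The rank/select operations that ESP-index-I substitutes for binary search can be modeled simply: rank(s, i) counts occurrences of symbol s in the first i positions, and select(s, j) returns the position of the j-th occurrence. The position-list implementation below is a reference model only; succinct indexes achieve these operations in compressed space, typically with wavelet-tree-style structures.

```python
from bisect import bisect_right

class RankSelect:
    """Rank/select over a sequence via per-symbol position lists.
    Simple and fast, but not space-efficient like succinct structures."""
    def __init__(self, seq):
        self.pos = {}
        for i, s in enumerate(seq):
            self.pos.setdefault(s, []).append(i)

    def rank(self, s, i):
        """Number of occurrences of symbol s in seq[:i]."""
        return bisect_right(self.pos.get(s, []), i - 1)

    def select(self, s, j):
        """Index of the j-th (1-based) occurrence of symbol s."""
        return self.pos[s][j - 1]

rs = RankSelect([3, 1, 4, 1, 5, 1])
print(rs.rank(1, 4))    # occurrences of 1 in the first 4 symbols -> 2
print(rs.select(1, 3))  # the 3rd occurrence of 1 is at index 5
```

Replacing a binary search with a constant-time (or near-constant-time) select lookup is the source of the speedup the abstract describes.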