Hercules Against Data Series Similarity Search
We propose Hercules, a parallel tree-based technique for exact similarity
search on massive disk-based data series collections. We present novel index
construction and query answering algorithms that leverage different
summarization techniques, carefully schedule costly operations, optimize memory
and disk accesses, and exploit the multi-threading and SIMD capabilities of
modern hardware to perform CPU-intensive calculations. We demonstrate the
superiority and robustness of Hercules with an extensive experimental
evaluation against state-of-the-art techniques, using many synthetic and real
datasets, and query workloads of varying difficulty. The results show that
Hercules performs up to one order of magnitude faster than the best competitor
(which is not always the same). Moreover, Hercules is the only index that
outperforms the optimized scan on all scenarios, including the hard query
workloads on disk-based datasets. This paper was published in the Proceedings
of the VLDB Endowment, Volume 15, Number 10, June 2022.
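To illustrate how a summarization can speed up exact similarity search, here is a minimal sketch. It is not the Hercules algorithm: it assumes the well-known Piecewise Aggregate Approximation (PAA) as the summary, Euclidean distance, and a flat in-memory list instead of a disk-based tree. All function names are hypothetical. The PAA distance scaled by sqrt(n / w) never exceeds the true Euclidean distance, so any candidate whose lower bound is already no better than the best distance found so far can be skipped without losing exactness.

```python
import math

def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    width = n // segments
    return [sum(series[i * width:(i + 1) * width]) / width for i in range(segments)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paa_lower_bound(paa_q, paa_c, n):
    """Lower bound on the Euclidean distance between the two original series of length n."""
    w = len(paa_q)
    return math.sqrt(n / w) * euclidean(paa_q, paa_c)

def exact_nn(query, collection, segments=4):
    """Exact 1-NN search; returns (index, distance, number of full distance computations)."""
    n = len(query)
    paa_q = paa(query, segments)
    # In a real index the summaries would be precomputed and stored; here they are built on the fly.
    summaries = [paa(c, segments) for c in collection]
    best_idx, best_dist, computed = -1, math.inf, 0
    for i, candidate in enumerate(collection):
        if paa_lower_bound(paa_q, summaries[i], n) >= best_dist:
            continue  # pruned: this candidate cannot beat the best-so-far answer
        computed += 1
        d = euclidean(query, candidate)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, computed
```

The third return value counts how many full distance computations survived pruning, which is the quantity a summarization-based index tries to keep small.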
Selecting adequate samples for approximate decision support queries
For highly selective queries, a simple random sample of records drawn from a large data warehouse may not contain a sufficient number of records that satisfy the query conditions. Efficient sampling schemes for such queries require innovative techniques that can access records that are relevant to each specific query. In drawing the sample, it is advantageous to know what would be an adequate sample size for a given query. This paper proposes methods for picking adequate samples that ensure approximate query results with a desired level of accuracy. A special index based on a structure known as the k-MDI Tree is used to draw samples. An unbiased estimator named inverse simple random sampling without replacement is adapted to estimate adequate sample sizes for queries. The methods are evaluated experimentally on a large real-life data set. The results of the evaluation show that adequate sample sizes can be determined such that errors in the outputs of most queries are within the acceptable limit of 5%.
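The idea behind inverse simple random sampling without replacement can be sketched as follows. This is the textbook form of the estimator, not the adaptation proposed in the paper, and it scans a shuffled list rather than a k-MDI Tree. The function name and parameters are hypothetical. Records are drawn without replacement until k of them satisfy the query predicate; if that takes n draws, (k - 1) / (n - 1) is an unbiased estimate of the query's selectivity, provided k is at least 2 and the population holds at least k matching records.

```python
import random

def inverse_srswor_selectivity(records, predicate, k, rng):
    """Draw without replacement until k records satisfy predicate.

    Returns (selectivity_estimate, draws), where the estimate is (k - 1) / (draws - 1).
    """
    if k < 2:
        raise ValueError("k must be at least 2 for the (k - 1) / (n - 1) estimator")
    order = list(range(len(records)))
    rng.shuffle(order)  # a random permutation is equivalent to drawing without replacement
    hits = 0
    for draws, idx in enumerate(order, start=1):
        if predicate(records[idx]):
            hits += 1
            if hits == k:
                return (k - 1) / (draws - 1), draws
    raise ValueError("fewer than k matching records in the population")

# Usage: estimate the selectivity of a predicate that matches 5% of the records.
rng = random.Random(0)
records = list(range(1000))
estimate, draws = inverse_srswor_selectivity(records, lambda x: x < 50, k=5, rng=rng)
```

A single estimate is noisy for small k; a larger k trades more draws for lower variance.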
Semantics and efficient evaluation of partial tree-pattern queries on XML
Current applications export and exchange XML data on the web. Usually, XML data are queried using keyword queries or using the standard structured query language XQuery, the core of which consists of the navigational query language XPath. In this context, one major challenge is the querying of the data when the structure of the data sources is complex or not fully known to the user. Another challenge is the integration of multiple data sources that export data with structural differences and irregularities. In this dissertation, a query language for XML called Partial Tree-Pattern Query (PTPQ) language is considered. PTPQs generalize and strictly contain Tree-Pattern Queries (TPQs) and can express a broad structural fragment of XPath. Because of their expressive power and flexibility, they are useful for querying XML documents whose structure is complex or not fully known to the user, and for integrating XML data sources with different structures. The dissertation focuses on three issues. The first one is the design of efficient non-main-memory evaluation methods for PTPQs. The second one is the assignment of semantics to PTPQs so that they return meaningful answers. The third one is the development of techniques for answering TPQs using materialized views.
Non-main-memory XML query evaluation can be done in two modes (which also define two evaluation models). In the first mode, data are preprocessed and indexes, called inverted lists, are built for them. In the second mode, data are unindexed and arrive continuously in the form of a stream. Existing algorithms cannot be used directly or indirectly to efficiently compute PTPQs in either mode. Initially, the problem of efficiently evaluating partial path queries in the inverted lists model has been addressed. Partial path queries form a subclass of PTPQs which is not contained in the class of TPQs. Three novel algorithms for evaluating partial path queries, including a holistic one, have been designed. The analytical and experimental results show that the holistic algorithm outperforms the other two. These results have been extended into holistic and non-holistic approaches for PTPQs in the inverted lists model. The experiments show again the superiority of the holistic approach. The dissertation has also addressed the problem of evaluating PTPQs in the streaming model, and two original efficient streaming algorithms for PTPQs have been designed. Compared to the only known streaming algorithm that supports an extension of TPQs, the experimental results show that the proposed algorithms perform better by orders of magnitude while consuming a much smaller fraction of memory space.
An original approach for assigning semantics to PTPQs has also been devised. The novel semantics seamlessly applies to keyword queries and to queries with structural restrictions. In contrast to previous approaches that operate locally on data, the proposed approach operates globally on structural summaries of data to extract tree patterns. An experimental evaluation shows that, compared to previous approaches, our approach achieves perfect recall for XML documents with both complete and incomplete data. It also shows better precision compared to approaches with similar recall.
Finally, the dissertation has addressed the problem of answering XML queries using exclusively materialized views. An original approach for materializing views in the context of the inverted lists model has been suggested. Necessary and sufficient conditions have been provided for tree-pattern query answerability in terms of view-to-query homomorphisms. A time- and space-efficient algorithm was designed for deciding query answerability, and a technique for computing queries over view materializations using stack-based holistic algorithms was developed. Further, optimizations were developed which (a) minimize the storage space and avoid redundancy by materializing views as bitmaps, and (b) optimize the evaluation of the queries over the views by applying bitwise operations on view materializations. The experimental results show that the proposed approach obtains considerably higher hit rates than previous approaches, significantly speeds up query evaluation relative to evaluation without views, and scales very smoothly in terms of storage space and computational overhead.
Forecasting the cost of processing multi-join queries via hashing for main-memory databases (Extended version)
Database management systems (DBMSs) carefully optimize complex multi-join
queries to avoid expensive disk I/O. As servers today feature tens or hundreds
of gigabytes of RAM, a significant fraction of many analytic databases becomes
memory-resident. Even after careful tuning for an in-memory environment, a
linear disk I/O model such as the one implemented in PostgreSQL may make query
response time predictions that are up to 2X slower than the optimal multi-join
query plan over memory-resident data. This paper introduces a memory I/O cost
model to identify good evaluation strategies for complex query plans with
multiple hash-based equi-joins over memory-resident data. The proposed cost
model is carefully validated for accuracy using three different systems,
including an Amazon EC2 instance, to control for hardware-specific differences.
Prior work in parallel query evaluation has advocated right-deep and bushy
trees for multi-join queries due to their greater parallelization and
pipelining potential. A surprising finding is that the conventional wisdom from
shared-nothing disk-based systems does not directly apply to the modern
shared-everything memory hierarchy. As corroborated by our model, the
performance gap between the optimal left-deep and right-deep query plan can
grow to about 10X as the number of joins in the query increases.
Comment: 15 pages, 8 figures, extended version of the paper to appear in
SoCC'1
Performance and scalability of indexed subgraph query processing methods
Graph data management systems have become very popular
as graphs are the natural data model for many applications.
One of the main problems addressed by these systems is subgraph
query processing; i.e., given a query graph, return all
graphs that contain the query. The naive method for processing
such queries is to perform a subgraph isomorphism
test against each graph in the dataset. This obviously does
not scale, as subgraph isomorphism is NP-Complete. Thus,
many indexing methods have been proposed to reduce the
number of candidate graphs that have to undergo the subgraph
isomorphism test. In this paper, we identify a set of
key factors (parameters) that influence the performance of
related methods: namely, the number of nodes per graph,
the graph density, the number of distinct labels, the number
of graphs in the dataset, and the query graph size. We then
conduct comprehensive and systematic experiments that analyze
the sensitivity of the various methods on the values of
the key parameters. Our aims are twofold: first to derive
conclusions about the algorithms’ relative performance, and,
second, to stress-test all algorithms, deriving insights as to
their scalability, and highlight how both performance and
scalability depend on the above factors. We choose six well-established
indexing methods, namely Grapes, CT-Index,
GraphGrepSX, gIndex, Tree+∆, and gCode, as representative
approaches of the overall design space, including the
most recent and best performing methods. We report on
their index construction time and index size, and on query
processing performance in terms of time and false positive
ratio. We employ both real and synthetic datasets. Specifically,
four real datasets of different characteristics are used:
AIDS, PDBS, PCM, and PPI. In addition, we generate a
large number of synthetic graph datasets, enabling us to
systematically study the algorithms’ performance and scalability
versus the aforementioned key parameters.
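The filter-then-verify pattern that the indexing methods above share can be sketched as follows. This is an illustration under simplifying assumptions, not any of the six evaluated methods: graphs are undirected with node labels, the "index" is just a per-graph label count, and verification is a plain backtracking search for a label-preserving, injective, edge-preserving mapping. All names are hypothetical. Graphs that pass the filter but fail verification are the false positives that the false positive ratio measures.

```python
from collections import Counter

def make_graph(labels, edges):
    """labels: {node: label}; edges: iterable of (u, v) pairs, treated as undirected."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {"labels": dict(labels), "adj": adj}

def label_filter(query, graph):
    """Cheap necessary condition: the graph has at least as many nodes of each label."""
    need = Counter(query["labels"].values())
    have = Counter(graph["labels"].values())
    return all(have[label] >= count for label, count in need.items())

def is_subgraph(query, graph):
    """Backtracking subgraph isomorphism test (exponential in the worst case)."""
    q_nodes = list(query["labels"])
    mapping, used = {}, set()

    def extend(i):
        if i == len(q_nodes):
            return True
        u = q_nodes[i]
        for v in graph["labels"]:
            if v in used or graph["labels"][v] != query["labels"][u]:
                continue
            # Every already-mapped neighbour of u must map to a neighbour of v.
            if all(mapping[w] in graph["adj"][v] for w in query["adj"][u] if w in mapping):
                mapping[u] = v
                used.add(v)
                if extend(i + 1):
                    return True
                del mapping[u]
                used.discard(v)
        return False

    return extend(0)

def subgraph_query(query, dataset):
    """Return (candidate indices after filtering, answer indices after verification)."""
    candidates = [i for i, g in enumerate(dataset) if label_filter(query, g)]
    answers = [i for i in candidates if is_subgraph(query, dataset[i])]
    return candidates, answers
```

Real indexes replace the label count with richer features such as paths, trees, or frequent subgraphs, which prune more candidates at the cost of larger indexes and longer construction times.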