713 research outputs found
Fast in-memory XPath search using compressed indexes
A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.Peer reviewe
Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of Data
El actual diluvio de datos está inundando la web con grandes volúmenes de datos representados en RDF, dando lugar a la denominada 'Web de Datos'. En esta tesis proponemos, en primer lugar, un estudio profundo de aquellos textos que nos permitan abordar un conocimiento global de la estructura real de los conjuntos de datos RDF, HDT, que afronta la representación eficiente de grandes volúmenes de datos RDF a través de estructuras optimizadas para su almacenamiento y transmisión en red. HDT representa efizcamente un conjunto de datos RDF a través de su división en tres componentes: la cabecera (Header), el diccionario (Dictionary) y la estructura de sentencias RDF (Triples). A continuación, nos centramos en proveer estructuras eficientes de dichos componentes, ocupando un espacio comprimido al tiempo que se permite el acceso directo a cualquier dat
Fast in-memory XPath search using compressed indexes
Artículo de publicación ISIExtensible Markup Language (XML) documents consist of text data plus structured data (markup). XPath
allows to query both text and structure. Evaluating such hybrid queries is challenging. We present a system
for in-memory evaluation of XPath search queries, that is, queries with text and structure predicates,
yet without advanced features such as backward axes, arithmetics, and joins. We show that for this query
fragment, which contains Forward Core XPath, our system, dubbed Succinct XML Self-Index (‘SXSI’),
outperforms existing systems by 1–3 orders of magnitude. SXSI is based on state-of-the-art indexes for text
and structure data. It combines two novelties. On one hand, it represents the XML data in a compact indexed
form, which allows it to handle larger collections in main memory while supporting powerful search and
navigation operations over the text and the structure. On the other hand, it features an execution engine that
uses tree automata and cleverly chooses evaluation orders that leverage the speeds of the respective indexes.
SXSI is modular and allows seamless replacement of its indexes. This is demonstrated through experiments
with (1) a text index specialized for search of bio sequences, and (2) a word-based text index specialized for
natural language search.Fondecyt, Chile 1-11006
SIQXC: Schema Independent Queryable XML Compression for Smartphones
The explosive growth of XML use over the last decade has led to a lot of research on how to best store and access it. This growth has resulted in XML being described as a de facto standard for storage and exchange of data over the web. However, XML has high redundancy because of its self-‐ describing nature making it verbose. The verbose nature of XML poses a storage problem. This has led to much research devoted to XML compression. It has become of more interest since the use of resource constrained devices is also on the rise. These devices are limited in storage space, processing power and also have finite energy. Therefore, these devices cannot cope with storing and processing large XML documents. XML queryable compression methods could be a solution but none of them has a query processor that runs on such devices. Currently, wireless connections are used to alleviate the problem but they have adverse effects on the battery life. They are therefore not a sustainable solution.
This thesis describes an attempt to address this problem by proposing a queryable compressor (SIQXC) with a query processor that runs in a resource constrained environment thereby lowering wireless connection dependency yet alleviating the storage problem. It applies a novel simple 2 tuple integer encoding system, clustering and gzip. SIQXC achieves an average compression ratio of 70% which is higher than most queryable XML compressors and also supports a wide range of XPATH operators making it competitive approach. It was tested through a practical implementation evaluated against the real data that is usually used for XML benchmarking. The evaluation covered the compression ratio, compression time and query evaluation accuracy and response time. SIQXC allows users to some extent locally store and manipulate the otherwise verbose XML on their Smartphones
Query Evaluation in the Presence of Fine-grained Access Control
Access controls are mechanisms to enhance security by protecting
data from unauthorized accesses. In contrast to traditional access
controls that grant access rights at the granularity of the whole
tables or views, fine-grained access controls specify access
controls at finer granularity, e.g., individual nodes in XML
databases and individual tuples in relational databases.
While there is a voluminous literature on specifying and modeling
fine-grained access controls, less work has been done to address
the performance issues of database systems with fine-grained
access controls. This thesis addresses the performance issues of
fine-grained access controls and proposes corresponding solutions.
In particular, the following issues are addressed: effective
storage of massive access controls, efficient query planning for
secure query evaluation, and accurate cardinality estimation for
access controlled data.
Because fine-grained access controls specify access rights from
each user to each piece of data in the system, they are
effectively a massive matrix of the size as the product of the
number of users and the size of data. Therefore, fine-grained
access controls require a very compact encoding to be feasible.
The proposed storage system in this thesis achieves an
unprecedented level of compactness by leveraging the high
correlation of access controls found in real system data. This
correlation comes from two sides: the structural similarity of
access rights between data, and the similarity of access patterns
from different users. This encoding can be embedded into a
linearized representation of XML data such that a query evaluation
framework is able to compute the answer to the access controlled
query with minimal disk I/O to the access controls.
Query optimization is a crucial component for database systems.
This thesis proposes an intelligent query plan caching mechanism
that has lower amortized cost for query planning in the presence
of fine-grained access controls. The rationale behind this query
plan caching mechanism is that the queries, customized by
different access controls from different users, may share common
upper-level join trees in their optimal query plans. Since join
plan generation is an expensive step in query optimization,
reusing the upper-level join trees will reduce query optimization
significantly. The proposed caching mechanism is able to match
efficient query plans to access controlled query plans with
minimal runtime cost.
In case of a query plan cache miss, the optimizer needs to
optimize an access controlled query from scratch. This depends on
accurate cardinality estimation on the size of the intermediate
query results. This thesis proposes a novel sampling scheme that
has better accuracy than traditional cardinality estimation
techniques
DescribeX: A Framework for Exploring and Querying XML Web Collections
This thesis introduces DescribeX, a powerful framework that is capable of
describing arbitrarily complex XML summaries of web collections, providing
support for more efficient evaluation of XPath workloads. DescribeX permits the
declarative description of document structure using all axes and language
constructs in XPath, and generalizes many of the XML indexing and summarization
approaches in the literature. DescribeX supports the construction of
heterogeneous summaries where different document elements sharing a common
structure can be declaratively defined and refined by means of path regular
expressions on axes, or axis path regular expression (AxPREs). DescribeX can
significantly help in the understanding of both the structure of complex,
heterogeneous XML collections and the behaviour of XPath queries evaluated on
them.
Experimental results demonstrate the scalability of DescribeX summary
refinements and stabilizations (the key enablers for tailoring summaries) with
multi-gigabyte web collections. A comparative study suggests that using a
DescribeX summary created from a given workload can produce query evaluation
times orders of magnitude better than using existing summaries. DescribeX's
light-weight approach of combining summaries with a file-at-a-time XPath
processor can be a very competitive alternative, in terms of performance, to
conventional fully-fledged XML query engines that provide DB-like functionality
such as security, transaction processing, and native storage.Comment: PhD thesis, University of Toronto, 2008, 163 page
Implementation of Web Query Languages Reconsidered
Visions of the next generation Web such as the "Semantic Web" or the "Web 2.0" have triggered the emergence of a multitude of data formats. These formats have different characteristics as far as the shape of data is concerned (for example tree- vs. graph-shaped). They are accompanied by a puzzlingly large number of query languages each limited to one data format. Thus, a key feature of the Web, namely to make it possible to access anything published by anyone, is compromised.
This thesis is devoted to versatile query languages capable of accessing data in a variety of Web formats. The issue is addressed from three angles: language design, common, yet uniform semantics, and common, yet uniform evaluation. % Thus it is divided in three parts:
First, we consider the query language Xcerpt as an example of the advocated class of versatile Web query languages. Using this concrete exemplar allows us to clarify and discuss the vision of versatility in detail.
Second, a number of query languages, XPath, XQuery, SPARQL, and Xcerpt, are translated into a common intermediary language, CIQLog. This language has a purely logical semantics, which makes it easily amenable to optimizations. As a side effect, this provides the, to the best of our knowledge, first logical semantics for XQuery and SPARQL. It is a very useful tool for understanding the commonalities and differences of the considered languages.
Third, the intermediate logical language is translated into a query algebra, CIQCAG. The core feature of CIQCAG is that it scales from tree- to graph-shaped data and queries without efficiency losses when tree-data and -queries are considered: it is shown that, in these cases, optimal complexities are achieved. CIQCAG is also shown to evaluate each of the aforementioned query languages with a complexity at least as good as the best known evaluation methods so far. For example, navigational XPath is evaluated with space complexity O(q d) and time complexity O(q n) where q is the query size, n the data size, and d the depth of the (tree-shaped) data.
CIQCAG is further shown to provide linear time and space evaluation of tree-shaped queries for a larger class of graph-shaped data than any method previously proposed. This larger class of graph-shaped data, called continuous-image graphs, short CIGs, is introduced for the first time in this thesis. A (directed) graph is a CIG if its nodes can be totally ordered in such a manner that, for this order, the children of any node form a continuous interval.
CIQCAG achieves these properties by employing a novel data structure, called sequence map, that allows an efficient evaluation of tree-shaped queries, or of tree-shaped cores of graph-shaped queries on any graph-shaped data. While being ideally suited to trees and CIGs, the data structure gracefully degrades to unrestricted graphs. It yields a remarkably efficient evaluation on graph-shaped data that only a few edges prevent from being trees or CIGs
- …