    XQuery Optimization Based on Program Slicing

    XQuery has become the standard query language for XML. The effort invested in this language has produced mature and efficient implementations of XQuery processors. In practice, however, the efficiency of XQuery programs depends strongly on the programmer's ability to combine different queries, which often involve several XML sources that may in turn be distributed across different branches of an organization. Techniques that reduce both the amount of data loaded and the intermediate structures computed by queries are therefore a necessity. In this work we propose a novel technique that allows the programmer to automatically optimize a query so that unnecessary intermediate computations are avoided. In addition, it identifies the paths in the source XML documents that are actually required to resolve the query.

    Efficient main memory-based XML stream processing

    Applications that process XML documents as files or streams are naturally main-memory based. This makes main memory the bottleneck for scalability. This doctoral thesis addresses this problem and presents a toolkit for effective buffer management in main memory-based XML stream processors. XML document projection is an established technique for reducing the buffer requirements of main memory-based XML processors, where only data relevant to query evaluation is loaded into main memory buffers. We present a novel implementation of this task, in which we use string matching algorithms designed for efficient keyword search in flat strings to navigate in tree-structured data. We then introduce an extension of the XQuery language, called FluX, that supports event-based query processing. Purely event-based queries of this language can be executed on streaming XML data in a very direct way. We develop an algorithm to efficiently rewrite XQueries into FluX. This algorithm is capable of exploiting order constraints derived from schemata to reduce the amount of buffering in query evaluation. During streaming query evaluation, we continuously purge buffers of data that is no longer relevant. By combining static query analysis with a dynamic analysis of the buffer contents, we effectively reduce the size of memory buffers. We have confirmed the efficacy of these techniques by extensive experiments and by publication at international venues. To compare our contributions to related work in a systematic manner, we contribute an abstract framework for XML stream processing. This framework gives us a bigger-picture view of the factors influencing main memory consumption.
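The idea of XML document projection described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it uses the standard library's event-based parser rather than string matching algorithms, and it supports only simple absolute child-axis paths. The function name `project` and the sample path `/bib/book/title` are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def project(xml_text, paths):
    """Return a projected copy of the document, or None if nothing is relevant.

    `paths` are simple absolute child-axis paths such as "/bib/book/title"
    (an assumed, simplified path language, not full XPath).
    """
    wanted = [tuple(p.strip("/").split("/")) for p in paths]

    def relevant(cur):
        # Keep `cur` if it leads towards a wanted path or lies at/below one.
        for w in wanted:
            n = min(len(cur), len(w))
            if cur[:n] == w[:n]:
                return True
        return False

    def selected(cur):
        # True if `cur` is at or below a wanted path (its text is needed).
        return any(cur[:len(w)] == w for w in wanted)

    stack = []   # tag path of the element currently being parsed
    out = []     # projected copies mirroring `stack`; None means "skipped"
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            copy = None
            if relevant(tuple(stack)):
                if not out:
                    copy = ET.Element(elem.tag, dict(elem.attrib))
                    root = copy
                elif out[-1] is not None:
                    copy = ET.SubElement(out[-1], elem.tag, dict(elem.attrib))
            out.append(copy)
        else:
            copy = out.pop()
            # Text is only guaranteed to be complete at the "end" event.
            if copy is not None and selected(tuple(stack)):
                copy.text = elem.text
            stack.pop()
            elem.clear()  # release the parsed content; only the copy is kept
    return root


doc = ("<bib><book year='1999'><title>T1</title><author>A</author></book>"
       "<book><title>T2</title><price>20</price></book></bib>")
projected = project(doc, ["/bib/book/title"])
```

In this sketch, `author` and `price` elements are never copied into the projected tree, so a query that only touches book titles would be evaluated over a smaller buffer. Tail text and mixed content are ignored for brevity.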

    Bulk Data in Main Memory-based XQuery Evaluation

    XQuery processors that load the input into main memory suffer from huge memory demands. Yet for the evaluation of many queries, large parts of the input are actually irrelevant. In XML document projection, this data is recognized and not loaded in the first place. However, there are also queries where little can be gained by projection. We have observed that these queries tend to require large parts of the input only for generating output. This suggests that such “bulk” data may be stored and treated differently from data that is actually traversed in query evaluation. In this paper, we present a technique to recognize bulk data while loading XML documents for the evaluation of composition-free XQuery. Our approach is coupled with XML document projection, and utilizes a finite automaton that is expressly suited for matching path expressions. We show in an exploratory analysis that bulk data arises in practice, and discuss ongoing work along the line of bulk-bypassing in main memory-based XQuery engines.

    Complex Event Processing with XChangeEQ

    The emergence of event-driven architectures, automation of business processes, drastic cost reductions in sensor technology, and a growing need to monitor IT systems (as well as other systems) due to legal, contractual, or operational considerations lead to an increasing generation of events. This development is accompanied by a growing demand for managing and processing events in an automated and systematic way. Complex Event Processing (CEP) encompasses the (automatable) tasks involved in making sense of all events in a system by deriving higher-level knowledge from lower-level events while the events occur, i.e., in a timely, online fashion and permanently. At the core of CEP are queries which monitor streams of "simple" events for so-called complex events, that is, events or situations that manifest themselves in certain combinations of several events occurring (or not occurring) over time and that cannot be detected by looking only at single events. Querying events is fundamentally different from traditional querying and reasoning with database or Web data, since event queries are standing queries that are evaluated permanently over time against incoming streams of event data. In order to express complex events that are of interest to a particular application or user in a convenient, concise, cost-effective and maintainable manner, special-purpose Event Query Languages (EQLs) are needed. This thesis investigates practical and theoretical issues related to querying complex events, covering the spectrum from language design through declarative semantics to operational semantics for incremental query evaluation. Its central topic is the development of the high-level event query language XChangeEQ.
In contrast to previous data stream and event query languages, XChangeEQ's language design recognizes the four querying dimensions of data extraction, event composition, temporal relationships, and, for non-monotonic queries involving negation or aggregation, event accumulation. XChangeEQ deals with complex structured data in event messages, thus addressing the need to query events communicated in XML formats over the Web. It supports deductive rules as an abstraction and reasoning mechanism for events. To achieve full coverage of the four querying dimensions, it builds upon a separation of concerns between them, which makes it easy to use and highly expressive. A recurrent theme in the formal foundations of XChangeEQ is that, despite the fundamental differences between traditional database queries and event queries, many well-known results from databases and logic programming are, with some important changes, applicable to event queries. Declarative semantics for XChangeEQ are given as a (Tarski-style) model theory with accompanying fixpoint theory. This approach accounts well for (1) data in events and (2) deductive rules defining new events from existing ones, two aspects often neglected in previous work on the semantics of EQLs. For the evaluation of event queries, this work introduces operational semantics based on an extended and tailored form of relational algebra and query plans with materialization points. Materialization points account for storing and maintaining information about those received events that are relevant for, i.e., can contribute to, future query answers, as well as for an incremental evaluation that avoids recomputing certain intermediate results. Efficient state maintenance in incremental evaluation is approached by "differentiating" algebra expressions, i.e., by deriving expressions for computing only the changes to materialization points.
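The notion of "differentiating" an algebra expression can be illustrated with the standard delta rule for a join under insertions only: Δ(R ⋈ S) = (ΔR ⋈ S) ∪ (R ⋈ ΔS) ∪ (ΔR ⋈ ΔS). The Python sketch below is a generic illustration of that rule, not the thesis's extended algebra. The names `join` and `delta_join` and the sample tuples are assumptions.

```python
def join(xs, ys, key):
    """Nested-loop equi-join of two lists of tuples on `key`."""
    return [(x, y) for x in xs for y in ys if key(x) == key(y)]


def delta_join(r, s, delta_r, delta_s, key):
    """New join results caused by inserting delta_r into r and delta_s into s.

    Only the changes are computed; the previously materialized result of
    join(r, s, key) is not recomputed. Deletions are not handled here.
    """
    return (join(delta_r, s, key)
            + join(r, delta_s, key)
            + join(delta_r, delta_s, key))


# Hypothetical stored events (r, s) and newly arrived events (dr, ds).
r = [(1, "a"), (2, "b")]
s = [(1, "x")]
dr = [(1, "c")]
ds = [(2, "y"), (1, "z")]
by_id = lambda t: t[0]

materialized = join(r, s, by_id)              # result stored before the update
changes = delta_join(r, s, dr, ds, by_id)     # work done for the update
updated = materialized + changes              # same tuples as a full recomputation
```

The stored result plays the role of a materialization point in this sketch: each update touches only the new tuples and the stored inputs, rather than re-evaluating the whole expression.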
Knowing how long an event is relevant is a prerequisite for performing garbage collection during event query evaluation and is also of central importance for developing cost-based query planners. To this end, this thesis introduces a notion of relevance of events (to a given query plan) and develops methods for determining temporal relevance, a particularly useful form based on time-related information.
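Garbage collection based on temporal relevance can be sketched for one simple kind of query: "an event of type A followed by an event of type B within a time window". The Python sketch below is a hypothetical illustration, not XChangeEQ syntax or the thesis's method. The class name `WithinJoin`, the event types, and the assumption that timestamps arrive in non-decreasing order are all introduced here.

```python
from collections import deque


class WithinJoin:
    """Detect pairs (a, b) with a of type `first`, b of type `second`,
    and 0 <= b.time - a.time <= window."""

    def __init__(self, first, second, window):
        self.first, self.second, self.window = first, second, window
        self.buffer = deque()  # stored `first` events as (time, payload), oldest first

    def on_event(self, etype, time, payload=None):
        # Garbage collection: a stored event older than time - window can no
        # longer match this or any later event (assumes ordered timestamps).
        while self.buffer and self.buffer[0][0] < time - self.window:
            self.buffer.popleft()
        matches = []
        if etype == self.second:
            matches = [(a, (time, payload)) for a in self.buffer]
        if etype == self.first:
            self.buffer.append((time, payload))
        return matches


query = WithinJoin("order", "shipment", window=10)
query.on_event("order", 0, "o1")
query.on_event("order", 5, "o2")
found = query.on_event("shipment", 12, "s1")  # "o1" has expired by now
```

Here the window bounds how long each stored event stays relevant, so the buffer size depends on the event rate within the window rather than on the length of the stream.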