    In real-life applications, different subsets of data may have distinct statistical properties; for example, websites may have diverse visitation rates, and different categories of stocks may have dissimilar price fluctuation patterns. For such applications, it can be fruitful to drop the commonly made single-execution-plan assumption and instead execute a query using several plans, each optimally serving a subset of data with particular statistical properties. Furthermore, in dynamic environments, data properties may change continuously, thus calling for adaptivity. The intriguing question is: can we have an execution strategy that (1) is plan-based, to leverage all the benefits of traditional plan-based systems, (2) supports multiple plans, each customized for a different subset of data, and yet (3) is as adaptive as “plan-less” systems like Eddies? While the recently proposed Query Mesh (QM) approach provides a foundation for such an execution paradigm, it does not address the question of adaptivity required for highly dynamic environments. In this work, we fill this gap by proposing Self-Tuning Query Mesh (ST-QM), an adaptive solution for content-based multi-plan execution engines. ST-QM addresses adaptive query processing by abstracting it as a concept drift problem, a well-known subject in machine learning. Such an abstraction allows adaptivity candidates (i.e., the cases indicating a change in the environment) to be discarded early in the process if they are insignificant or not “worthwhile” to adapt to, thus minimizing the adaptivity overhead. A unique feature of our approach is that all logical transformations to the execution strategy are translated into a single inexpensive physical operation: the classifier change. Our experimental evaluation using a continuous query engine shows the performance benefits of the ST-QM approach over the alternatives, namely the non-adaptive and the Eddies-based solutions.
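
To make the idea of discarding adaptivity candidates early more concrete, here is a minimal sketch. It is not the ST-QM algorithm: the `RouteStats` class, the `should_adapt` function, the selectivity-based drift measure, and all thresholds are assumptions chosen for illustration only. It shows a two-stage filter in which a candidate is dropped if the observed change is too small (insignificant) or if the estimated gain does not exceed the cost of adapting (not worthwhile).

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    """Running statistics for one route (hypothetical structure)."""
    expected_selectivity: float  # value assumed when the route was chosen
    passed: int = 0              # tuples that survived the route
    seen: int = 0                # tuples sent down the route

    def observe(self, passed: bool) -> None:
        self.seen += 1
        self.passed += int(passed)

    @property
    def observed_selectivity(self) -> float:
        # Fall back to the expectation until something has been observed.
        return self.passed / self.seen if self.seen else self.expected_selectivity


def should_adapt(stats: RouteStats, est_gain: float, adapt_cost: float,
                 min_samples: int = 100, drift_threshold: float = 0.2) -> bool:
    """Return True only for candidates that are significant and worthwhile."""
    if stats.seen < min_samples:
        return False  # too little evidence to call it a drift
    drift = abs(stats.observed_selectivity - stats.expected_selectivity)
    if drift < drift_threshold:
        return False  # insignificant change: discard the candidate early
    return est_gain > adapt_cost  # adapt only if it pays off
```

In this sketch, a positive result would be the point at which an engine swaps its classifier; how the gain and the adaptation cost are estimated is left open here.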

    Multi-route query processing and optimization

    A modern query optimizer typically picks a single query plan for all data based on overall data statistics. However, many have observed that real-life datasets tend to have non-uniform distributions. Selecting a single query plan may result in ineffective query execution for possibly large portions of the actual data. In addition, most stream query processing systems, given the volume of data, cannot precisely model the system state, much less account for uncertainty due to continuous variations. Such systems select a single query plan based upon imprecise statistics. In this paper, we present "Query Mesh" (or QM), a practical alternative to state-of-the-art data stream processing approaches. The main idea of QM is to compute multiple routes (i.e., query plans), each designed for a particular subset of the data with distinct statistical properties. We use the terms "plans" and "routes" interchangeably in our work. A classifier model is induced and used to assign the best route to process incoming tuples based upon their data characteristics. We formulate the QM search space and analyze its complexity. Because the search space is substantial, we propose several cost-based query optimization heuristics designed to effectively find nearly optimal QMs. We propose the Self-Routing Fabric (SRF) infrastructure, which supports query execution with multiple plans without physically constructing their topologies or using a central router like Eddy. We also consider how to support uncertain route specification and execution in QM, which can occur when imprecise statistics lead to more than one optimal route for a subset of data. Our experimental results indicate that QM consistently provides better query execution performance and incurs negligible overhead compared to the alternative state-of-the-art data stream approaches.
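
The routing idea can be illustrated with a small sketch. Everything in it is hypothetical: the two filter operators, the `sector` attribute, and the hand-written `classify` rule stand in for a classifier that the paper induces from data, and the Self-Routing Fabric is not modelled. The sketch only shows why assigning different operator orderings to different subsets of tuples can reduce the work done per tuple.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]

# Hypothetical selection operators; a real engine would also route through joins.
OPERATORS: Dict[str, Callable[[Row], bool]] = {
    "price_filter": lambda r: r["price"] > 100,
    "volume_filter": lambda r: r["volume"] > 1000,
}

# Two routes (plans) over the same operators, differing only in order.
ROUTES: Dict[str, List[str]] = {
    "price_first": ["price_filter", "volume_filter"],
    "volume_first": ["volume_filter", "price_filter"],
}


def classify(row: Row) -> str:
    """Stand-in for the induced classifier: pick a route from tuple content."""
    return "price_first" if row["sector"] == 0 else "volume_first"


def execute(row: Row, route: str) -> Tuple[bool, int]:
    """Run a tuple along a route; return (survived, operators evaluated)."""
    evaluated = 0
    for name in ROUTES[route]:
        evaluated += 1
        if not OPERATORS[name](row):
            return False, evaluated  # dropped early, later operators skipped
    return True, evaluated
```

For a tuple such as `{"sector": 0, "price": 50, "volume": 5000}`, the route chosen by `classify` drops it after one operator, whereas the `volume_first` route would evaluate both operators before dropping it.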

    Evaluation of data stream processing techniques and systems

    Traditional data processing systems and techniques are not suitable in contexts with a continuous input flow of highly dynamic data (for example, in monitoring applications, sensor networks, etc.). Thus, unlike in a traditional database management system, we cannot assume that queries will be executed over a finite, static set of stored data. On the contrary, we must consider situations in which it is not possible to store the complete input data set and in which queries run continuously as new data arrive, rather than as a one-time query that yields a single answer. This new setting has implications in many respects, both for query processing algorithms, architectures, and data management techniques and for the definition of new query languages and operators suited to handling unbounded data sets. Over time, various proposals and systems have emerged for processing these data flows (called "data streams" in the literature). Several existing systems will be considered, such as STREAM (Stanford University), Cougar (Cornell University), Aurora/Borealis (Brandeis University, Brown University, MIT), TinyDB (Berkeley University), NiagaraCQ (University of Wisconsin-Madison), TelegraphCQ (University of California, Berkeley), Nile (Purdue University), Gigascope (AT&T Labs), ATLAS (INRIA - Institut National de Recherche en Informatique et en Automatique), and PLACE (Purdue University), with the systems deemed most relevant analyzed in more detail. The most notable characteristics of each approach will be determined, as well as the state of the art regarding the application of these systems in newer settings, such as mobile environments, distributed environments, and vehicular networks.
The objectives of this Master's thesis are: (1) to become familiar with the problems of data stream processing; (2) to study the existing approaches; and (3) to compare the most significant proposals, both qualitatively and experimentally in those cases where the prototype is accessible. As a result of the work, a document will be produced containing the study and the results of the comparison, serving as a summary of the state of the art in the field. Submission of this work for evaluation to a relevant conference or journal will be considered.

    Self-tuning query mesh for adaptive multi-route query processing

    Cost-Based Optimization of Integration Flows

    Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. Application areas include real-time ETL and data synchronization between operational systems. Because of growing data volumes, highly distributed IT infrastructures, and strict requirements for data consistency and up-to-date query results, many instances of integration flows are executed over time. Due to this high load and blocking synchronous source systems, the performance of the central integration platform is crucial for an IT infrastructure. To meet these performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce the concept of periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization to on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays during which optimization opportunities are missed. This approach ensures low optimization overhead and fast workload adaptation.
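
The contrast between periodical and on-demand re-optimization can be sketched as follows. This is an illustrative assumption, not the technique of the work summarized above: the names `OperatorStats`, `FlowOptimizer`, and `record_instance` are hypothetical, statistics are maintained with a simple exponential moving average, and the optimizer is reduced to the classical rule of ordering independent filters by ascending cost / (1 - selectivity). The point is that statistics are updated incrementally after every flow instance, while the plan is recomputed only when a cheap optimality check on the current plan fails.

```python
from typing import Dict, List


class OperatorStats:
    """Incrementally maintained cost and selectivity of one operator."""

    def __init__(self, cost: float, selectivity: float, alpha: float = 0.5) -> None:
        self.cost = cost
        self.selectivity = selectivity
        self.alpha = alpha  # weight of the newest observation (assumed value)

    def update(self, cost: float, selectivity: float) -> None:
        a = self.alpha
        self.cost = (1 - a) * self.cost + a * cost
        self.selectivity = (1 - a) * self.selectivity + a * selectivity

    def rank(self) -> float:
        dropped = 1.0 - self.selectivity
        return float("inf") if dropped <= 0 else self.cost / dropped


def optimize(stats: Dict[str, OperatorStats]) -> List[str]:
    """Full optimization: order operators by ascending rank."""
    return sorted(stats, key=lambda name: stats[name].rank())


def still_optimal(plan: List[str], stats: Dict[str, OperatorStats]) -> bool:
    """Cheap check: ranks along the current plan must be non-decreasing."""
    ranks = [stats[name].rank() for name in plan]
    return all(x <= y for x, y in zip(ranks, ranks[1:]))


class FlowOptimizer:
    def __init__(self, stats: Dict[str, OperatorStats]) -> None:
        self.stats = stats
        self.plan = optimize(stats)
        self.reoptimizations = 0

    def record_instance(self, name: str, cost: float, selectivity: float) -> bool:
        """Fold in one observation; re-optimize only if the plan is violated."""
        self.stats[name].update(cost, selectivity)
        if still_optimal(self.plan, self.stats):
            return False  # no re-optimization step spent
        self.plan = optimize(self.stats)
        self.reoptimizations += 1
        return True
```

A periodical variant would instead call `optimize` every fixed number of instances regardless of whether the statistics had changed enough to matter.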

    Toward timely, predictable and cost-effective data analytics

    Modern industrial, government, and academic organizations are collecting massive amounts of data at an unprecedented scale and pace. The ability to perform timely, predictable and cost-effective analytical processing of such large data sets in order to extract deep insights is now a key ingredient for success. Traditional database systems (DBMS) are, however, not the first choice for servicing these modern applications, despite 40 years of database research. This is due to the fact that modern applications exhibit different behavior from the one assumed by DBMS: a) timely data exploration as a new trend is characterized by ad-hoc queries and a short user interaction period, leaving little time for DBMS to do good performance tuning, b) accurate statistics representing relevant summary information about distributions of ever increasing data are frequently missing, resulting in suboptimal plan decisions and consequently poor and unpredictable query execution performance, and c) cloud service providers - a major winner in the data analytics game due to the low cost of (shared) storage - have shifted the control over data storage from DBMS to the cloud providers, making it harder for DBMS to optimize data access. This thesis demonstrates that database systems can still provide timely, predictable and cost-effective analytical processing, if they use an agile and adaptive approach. In particular, DBMS need to adapt at three levels (to workload, data and hardware characteristics) in order to stabilize and optimize performance and cost when faced with requirements posed by modern data analytics applications. Workload-driven data ingestion is introduced with NoDB as a means to enable efficient data exploration and reduce the data-to-insight time (i.e., the time to load the data and tune the system) by doing these steps lazily and incrementally as a side-effect of posed queries as opposed to mandatory first steps. 
Data-driven runtime access path decision making, introduced with Smooth Scan, alleviates suboptimal query execution by postponing the decision on access paths from query optimization, where statistics are heavily exploited, to query execution, where the system can obtain more details about data distributions. Smooth Scan uses access path morphing from one physical alternative to another to fit the observed data distributions, which removes the need for a priori access path decisions and substantially improves the predictability of DBMS. Hardware-driven query execution, introduced with Skipper, enables the usage of cold storage devices (CSD) as a cost-effective solution for storing the ever increasing customer data. Skipper uses an out-of-order CSD-driven query execution model based on multi-way joins coupled with efficient cache and I/O scheduling policies to hide the non-uniform access latencies of CSD. This thesis advocates runtime adaptivity as a key to dealing with the rising uncertainty about workload characteristics that modern data analytics applications exhibit. Overall, the techniques introduced in this thesis through the three levels of adaptivity (workload, data and hardware-driven adaptivity) increase the usability of database systems and the user satisfaction in the case of big data exploration, making low-cost data analytics a reality.