Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution
Cloud-based data analysis is nowadays common practice because of the lower
system management overhead as well as the pay-as-you-go pricing model. The
pricing model, however, is not always suitable for query processing as heavy
use results in high costs. For example, in query-as-a-service systems, where
users are charged per processed byte, collections of queries accessing the same
data frequently can become expensive. The problem is compounded by the limited
options for the user to optimize query execution when using declarative
interfaces such as SQL. In this paper, we show how, without modifying existing
systems and without the involvement of the cloud provider, it is possible to
significantly reduce the overhead, and hence the cost, of query-as-a-service
systems. Our approach is based on query rewriting so that multiple concurrent
queries are combined into a single query. Our experiments show that the
aggregate amount of work done by shared execution is smaller than with a
query-at-a-time approach. Since queries are charged per byte processed, the
cost of executing a group of queries is often the same as executing a single
one of them. As an example, we demonstrate that shared execution of the
TPC-H benchmark is up to 100x cheaper in Amazon Athena and up to 16x cheaper in
Google BigQuery than a query-at-a-time approach, while achieving higher
throughput.
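The rewriting idea can be illustrated with a small sketch; the function and query shapes below are an illustration under assumed inputs, not the paper's actual rewriter. Several aggregate queries over the same table are folded into one query with per-query conditional aggregates, so a pay-per-byte engine scans the table only once:

```python
# Sketch (hypothetical helper): merge several single-aggregate queries over
# the same table into one combined query using conditional aggregation.
def combine_queries(table, queries):
    # queries: list of (aggregate_expression, predicate) pairs, one per query
    select_items = [
        f"SUM(CASE WHEN {pred} THEN {expr} ELSE 0 END) AS q{i}"
        for i, (expr, pred) in enumerate(queries)
    ]
    return f"SELECT {', '.join(select_items)} FROM {table}"

# Two queries that would each scan lineitem become a single scan.
sql = combine_queries("lineitem", [
    ("l_extendedprice", "l_shipdate >= DATE '1994-01-01'"),
    ("l_quantity",      "l_discount > 0.05"),
])
```

Since a per-byte-billed engine charges for the bytes scanned rather than the number of result columns, the combined query costs roughly the same as any one of its members.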
Flight Profiling System
Final degree project carried out in collaboration with Amadeus IT Group. The main goal of the project is to develop from scratch a dynamic flight target detector,
ultimately called PaxFinder (Passengers Finder). This goal evolved over the course of the internship. At the beginning, the objective was far more generic:
"The objective of the internship is to study the public data available from web-based
social networks so Amadeus products can be tailored to fit each individual". The first
part was clear: the public data from people's social network profiles
should be studied in order to be used in this project. The second part was narrowed down to
developing a dynamic flight target detector through the analysis of passenger profiles.
Some requirements and limitations were established at the beginning in order
to define and limit the scope of the project. The first requirement was to
integrate Crescando into the project in order to exploit its capabilities. The second
was to work only with profile data coming from social networks and Amadeus
repositories; information such as a traveler's travel history and background, for
example, was off limits. Given the goal and the requirements of the project, the approach taken to build
the dynamic flight target detector was to first develop a flight profiling system and
then a profile search system.
The tasks that the project had to go through in order to reach the objectives were
the following:
· Studying which social networks contain data relevant for the planned
usage.
· Studying how data can be extracted, and building tools to perform the
extractions from social networks as well as Amadeus databases.
· Getting familiar with Crescando.
· Participating in the definition of the data model and creating a new
Crescando database using the existing administration tools.
· Defining the parameters of a profile.
· Defining a methodology to classify and compare profiles.
· Developing PaxFinder.
Any target detector has to be tested properly. This process is composed of three
steps: first, offline testing; second, a study with real users; and third, an online
version operating at scale on real user data. In this project we have carried out the offline testing. This testing should have been done
with real data coming from Amadeus repositories and social network extractions.
However, problems related to data consistency in the Amadeus repositories and to the
social networks' privacy policies meant that the data stored in Crescando had to be
manipulated. Nevertheless, the study of the most important social networks has been completed, as well as
their respective data extraction tools. The documentation of this work can be found in
the annexes attached at the end of this document.
10381 Summary and Abstracts Collection -- Robust Query Processing
Dagstuhl seminar 10381 on robust query processing (held 19.09.10 -
24.09.10) brought together a diverse set of researchers and practitioners
with a broad range of expertise for the purpose of fostering discussion
and collaboration regarding causes, opportunities, and solutions for
achieving robust query processing.
The seminar strove to build a unified view across
the loosely-coupled system components responsible for
the various stages of database query processing.
Participants were chosen for their experience with database
query processing and, where possible, their prior work in academic
research or in product development towards robustness in database query
processing.
In order to pave the way to motivate, measure, and protect future advances
in robust query processing, seminar 10381 focused on developing tests
for measuring the robustness of query processing.
In these proceedings, we first review the seminar topics, goals,
and results, then present abstracts or notes of some of the seminar break-out
sessions.
We also include, as an appendix,
the robust query processing reading list that
was collected and distributed to participants before the seminar began,
as well as summaries of a few of those papers that were
contributed by some participants.
From Cooperative Scans to Predictive Buffer Management
In analytical applications, database systems often need to sustain workloads
with multiple concurrent scans hitting the same table. The Cooperative Scans
(CScans) framework, which introduces an Active Buffer Manager (ABM) component
into the database architecture, has been the most effective and elaborate
response to this problem, and was initially developed in the X100 research
prototype. We now report on the experiences of integrating Cooperative
Scans into its industrial-strength successor, the Vectorwise database product.
During this implementation we invented a simpler optimization of concurrent
scan buffer management, called Predictive Buffer Management (PBM). PBM is based
on the observation that in a workload with long-running scans, the buffer
manager has quite a bit of information on the workload in the immediate future,
such that an approximation of the ideal OPT algorithm becomes feasible. In the
evaluation on both synthetic benchmarks as well as a TPC-H throughput run we
compare the benefits of naive buffer management (LRU) versus CScans, PBM and
OPT; showing that PBM achieves benefits close to Cooperative Scans, while
incurring much lower architectural impact. Comment: VLDB201
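The core observation behind PBM can be reduced to a few lines; the sketch below is an illustration under assumed data structures, not the Vectorwise implementation. With long-running sequential scans registered, the buffer manager can estimate when each cached page will next be needed and, like Belady's OPT policy, evict the page with the most distant next use:

```python
# Sketch (hypothetical structures): approximate the OPT eviction policy
# using the known positions of registered sequential scans.
def pick_victim(buffered_pages, scans):
    # scans: list of (cursor, last_page) pairs over increasing page ids
    def next_use(page):
        # distance until each forward scan will reach this page
        distances = [page - cur for cur, last in scans if cur <= page <= last]
        return min(distances) if distances else float("inf")
    # evict the page whose next use lies furthest in the future
    return max(buffered_pages, key=next_use)

# A page already behind every active scan will never be read again,
# so it is the first eviction candidate.
victim = pick_victim({3, 10, 50}, [(5, 100)])   # page 3 is behind the scan
```

Plain LRU would instead evict based on past accesses, which for concurrent scans is a poor predictor of future use.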
Predictable performance and high query concurrency for data analytics
Conventional data warehouses employ the query-at-a-time model, which maps each query to a distinct physical plan. When several queries execute concurrently, this model introduces contention and thrashing, because the physical plans, unaware of each other, compete for access to the underlying I/O and computation resources. As a result, while modern systems can efficiently optimize and evaluate a single complex data analysis query, their performance suffers significantly and can be highly erratic when multiple complex queries run at the same time. We present in this paper Cjoin, a new design that substantially improves throughput in large-scale data analytics systems processing many concurrent join queries. In contrast to the conventional query-at-a-time model, our approach employs a single physical plan that shares I/O, computation, and tuple storage across all in-flight join queries. We use an "always on" pipeline of non-blocking operators, managed by a controller that continuously examines the current query mix and optimizes the pipeline on the fly. Our design enables data analytics engines to scale gracefully to large data sets, provide predictable execution times, and reduce contention. We implemented Cjoin as an extension to the PostgreSQL DBMS. This prototype outperforms conventional commercial systems by an order of magnitude for tens to hundreds of concurrent queries.
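A common way to realize such a shared plan, sketched below purely as an illustration (the tuple format and predicate handling are assumptions, not Cjoin's actual operator code), is to tag each tuple with a bitmap recording which in-flight queries it still satisfies, so one operator pipeline serves every query at once:

```python
# Sketch (hypothetical shared operator): one pass over the data evaluates
# the predicates of all in-flight queries, tagging tuples with a bitmap.
def shared_filter(tuples, predicates):
    # predicates: one boolean function per in-flight query
    for t in tuples:
        bitmap = 0
        for q, pred in enumerate(predicates):
            if pred(t):
                bitmap |= 1 << q        # query q still wants this tuple
        if bitmap:                      # drop tuples no query wants
            yield t, bitmap

rows = [{"price": 5}, {"price": 50}]
preds = [lambda r: r["price"] > 1, lambda r: r["price"] > 10]
out = list(shared_filter(rows, preds))  # [({'price': 5}, 0b01), ({'price': 50}, 0b11)]
```

Downstream shared operators only inspect the bitmap, so adding a query to the mix adds a predicate, not a whole new plan.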
In-Memory-Datenmanagement in betrieblichen Anwendungssystemen
In-memory databases keep the entire data set permanently in main memory. Read accesses can therefore be served far faster than in traditional database systems, since no I/O accesses to disk are required. For write accesses, mechanisms have been developed that guarantee persistence and thus transactional safety. In-memory databases have been under development for quite some time and have proven themselves in specialized applications. With the increasing memory density of DRAM chips, hardware systems whose main memory can hold a company's complete operational data set have become economically affordable. This raises the question of whether in-memory databases can also be used in operational business application systems. Hasso Plattner, who developed the in-memory database HANA, is a protagonist of this approach. He sees considerable potential for new concepts in the development of business information systems. For example, a transactional and an analytical application could run on the same data set, i.e., a separation into operational databases on the one hand and data-warehouse systems on the other would no longer be necessary in business information processing (Plattner and Zeier 2011). Not all database figures agree, however. Larry Ellison has dismissed the idea of operational in-memory use as "wacko", in a manner more media-effective than seriously argued (Bube 2010). Stonebraker (2011) does see a future for in-memory databases in business applications, but still considers a separation of OLTP and OLAP workloads sensible. [From the introduction]
Sharing Data and Work Across Concurrent Analytical Queries
Today's data deluge enables organizations to collect massive data, and analyze it with an ever-increasing number of concurrent queries. Traditional data warehouses (DWs) face a challenging problem in executing this task, due to their query-centric model: each query is optimized and executed independently. This model results in high contention for resources. Thus, modern DWs depart from the query-centric model toward execution models involving sharing of common data and work. Our goal is to show when and how a DW should employ sharing. We evaluate experimentally two sharing methodologies, based on their original prototype systems, that exploit work sharing opportunities among concurrent queries at run-time: Simultaneous Pipelining (SP), which shares intermediate results of common sub-plans, and Global Query Plans (GQP), which build and evaluate a single query plan with shared operators. First, after a short review of sharing methodologies, we show that SP and GQP are orthogonal techniques. SP can be applied to shared operators of a GQP, reducing response times by 20%-48% in workloads with numerous common sub-plans. Second, we corroborate previous results on the negative impact of SP on performance for cases of low concurrency. We attribute this behavior to a bottleneck caused by the push-based communication model of SP. We show that pull-based communication for SP eliminates the overhead of sharing altogether for low concurrency, and scales better on multi-core machines than push-based SP, further reducing response times by 82%-86% for high concurrency. Third, we perform an experimental analysis of SP, GQP and their combination, and show when each one is beneficial. We identify a trade-off between low and high concurrency. In the former case, traditional query-centric operators with SP perform better, while in the latter case, GQP with shared operators enhanced by SP give the best results.
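The essence of Simultaneous Pipelining can be sketched minimally; the class and keys below are assumptions for illustration, and real SP pipelines tuples to consumers as they are produced rather than materializing a finished result as this sketch does. The idea is that a common sub-plan is evaluated once and its output is reused by every concurrent query that contains it:

```python
# Sketch (hypothetical): reuse the result of a common sub-plan across
# concurrent queries instead of re-evaluating it per query.
class SubPlanCache:
    def __init__(self):
        self.results = {}
        self.evaluations = 0          # how many times work was actually done

    def evaluate(self, subplan_key, evaluate_fn):
        if subplan_key not in self.results:
            self.evaluations += 1
            self.results[subplan_key] = evaluate_fn()   # run the sub-plan once
        return self.results[subplan_key]                # share with later queries

cache = SubPlanCache()
r1 = cache.evaluate("scan+filter(orders)", lambda: [1, 2, 3])
r2 = cache.evaluate("scan+filter(orders)", lambda: [1, 2, 3])  # reused, not recomputed
```

The push-versus-pull distinction discussed in the abstract concerns how such shared results reach their consumers: pushing them eagerly burdens the producer at low concurrency, while letting consumers pull avoids that overhead.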
Tre cronache veneziane inedite della Houghton Library di Harvard
The article describes the witnesses of three Venetian chronicles held by the
Houghton Library of Harvard University in Cambridge (Massachusetts). The paper manuscript Ital. 67, dating to the 16th century, is acephalous and
contains a history of Venice from 1106 to the 15th century. The story ends, in fact, by mentioning
the noble captain Piero Loredan (1372 - 28 October 1438). The codex belonged to
the Ward M. Canaday couple, who donated it to the Houghton Library in 1964. The paper
manuscript Ital. 178 dates to the 15th century (the terminus post quem is 1417) and contains a
history of Venice from the origins to the fifteenth century. It is mutilated in the final part. The
codex belonged first to Walter Sneyd (1809-1888), then to Charles William Previté-Orton
(1877-1947). It is not possible at the moment to indicate the exact date when the manuscript
became part of the collection of the Houghton Library, where it has been housed since 1996.
The paper manuscript Riant 12 dates to the 17th century and contains a chronicle of Venice
from its foundation until 1432. The codex belonged to Count Paul Edouard Didier Riant
(1836-1888) and entered the library of Harvard University in 1899.