Forecasting the cost of processing multi-join queries via hashing for main-memory databases (Extended version)
Database management systems (DBMSs) carefully optimize complex multi-join
queries to avoid expensive disk I/O. As servers today feature tens or hundreds
of gigabytes of RAM, a significant fraction of many analytic databases becomes
memory-resident. Even after careful tuning for an in-memory environment, a
linear disk I/O model such as the one implemented in PostgreSQL may pick
multi-join query plans that are up to 2X slower than the optimal plan over
memory-resident data.
model to identify good evaluation strategies for complex query plans with
multiple hash-based equi-joins over memory-resident data. The proposed cost
model is carefully validated for accuracy using three different systems,
including an Amazon EC2 instance, to control for hardware-specific differences.
Prior work in parallel query evaluation has advocated right-deep and bushy
trees for multi-join queries due to their greater parallelization and
pipelining potential. A surprising finding is that the conventional wisdom from
shared-nothing disk-based systems does not directly apply to the modern
shared-everything memory hierarchy. As corroborated by our model, the
performance gap between the optimal left-deep and right-deep query plan can
grow to about 10X as the number of joins in the query increases.
Comment: 15 pages, 8 figures, extended version of the paper to appear in SoCC'1
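To make the flavor of such a cost model concrete, here is a toy Python sketch that charges each hash join a linear amount of memory traffic for its build and probe inputs; the constants, cardinalities, and the linear form are assumptions for illustration, not the validated model from the paper.

```python
# Toy memory-I/O cost sketch for a chain of hash equi-joins.
# Illustrative only: the constants and the linear form are assumptions,
# not the cost model proposed in the paper.

C_BUILD = 2.0   # assumed relative memory cost per build-side tuple
C_PROBE = 1.0   # assumed relative memory cost per probe-side tuple

def hash_join_cost(build_tuples: int, probe_tuples: int) -> float:
    """Memory traffic of one hash join: build a table, then probe it."""
    return C_BUILD * build_tuples + C_PROBE * probe_tuples

def plan_cost(joins: list[tuple[int, int]]) -> float:
    """Sum per-join costs for a plan given (build, probe) cardinalities."""
    return sum(hash_join_cost(b, p) for b, p in joins)

# Example: a plan that builds hash tables on small inputs versus a
# variant that builds on large intermediates (made-up cardinalities).
small_builds = [(1_000, 1_000_000), (5_000, 1_000_000), (2_000, 1_000_000)]
large_builds = [(1_000_000, 1_000), (1_000_000, 5_000), (1_000_000, 2_000)]
print(plan_cost(small_builds), plan_cost(large_builds))
```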
HDTQ: Managing RDF Datasets in Compressed Space
HDT (Header-Dictionary-Triples) is a compressed representation of RDF data that supports retrieval features without prior decompression. Yet, RDF datasets often contain additional graph information, such as the origin, version or validity time of a triple. Traditional HDT is not capable of handling these additional parameters. This work introduces HDTQ (HDT Quads), an extension of HDT that is able to represent quadruples (or quads) while still being highly compact and queryable. Two HDTQ-based approaches are introduced: Annotated Triples and Annotated Graphs, and their performance is compared to the leading open-source RDF stores on the market. Results show that HDTQ achieves the best compression rates and is a competitive alternative to well-established systems.
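One way to picture the Annotated Triples variant is as a bitmap per distinct triple marking the graphs it belongs to; the Python sketch below illustrates that concept only and is not HDTQ's actual compressed bit-sequence encoding.

```python
# Minimal sketch of the Annotated-Triples idea: keep one set of
# distinct triples and, per triple, a bitmap marking the graphs (the
# quads' fourth component) in which it appears. A conceptual toy, not
# HDTQ's compact encoding.

class AnnotatedTriples:
    def __init__(self, graphs: list[str]):
        self.graph_id = {g: i for i, g in enumerate(graphs)}
        self.bitmaps: dict[tuple[str, str, str], int] = {}

    def add_quad(self, s: str, p: str, o: str, g: str) -> None:
        # Set the bit of graph g for triple (s, p, o).
        self.bitmaps.setdefault((s, p, o), 0)
        self.bitmaps[(s, p, o)] |= 1 << self.graph_id[g]

    def graphs_of(self, s: str, p: str, o: str) -> list[str]:
        bits = self.bitmaps.get((s, p, o), 0)
        return [g for g, i in self.graph_id.items() if bits >> i & 1]

store = AnnotatedTriples(["v1", "v2"])
store.add_quad("ex:alice", "foaf:knows", "ex:bob", "v1")
store.add_quad("ex:alice", "foaf:knows", "ex:bob", "v2")
print(store.graphs_of("ex:alice", "foaf:knows", "ex:bob"))  # ['v1', 'v2']
```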
NUMA obliviousness through memory mapping
With the rise of multi-socket multi-core CPUs, a lot of effort is being put
into how best to exploit their abundant CPU power. In a shared-memory
setting, each socket is equipped with its own memory module, and memory
accesses across sockets follow a non-uniform memory access (NUMA) pattern.
Memory access across sockets is relatively expensive compared to memory
access within a socket. One of the common solutions to minimize cross-socket
memory access is to partition the data, such that data affinity is
maintained per socket.
In this paper we explore the role of memory-mapped storage to provide
transparent data access in a NUMA environment, without the need for explicit
data partitioning. We compare the performance of a database engine in a
distributed setting in a multi-socket environment with a database engine in
a NUMA-oblivious setting. We show that though the operating system tries to
keep data affinity to local sockets, significant remote memory access still
occurs as the number of threads increases. Hence, setting explicit process
and memory affinity results in robust execution of NUMA-oblivious plans. We
use micro-experiments and SQL queries from the TPC-H benchmark to provide an
in-depth experimental exploration of the landscape on a four-socket Intel
machine.
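As a concrete illustration of the "explicit process and memory affinity" knob discussed above, here is a minimal Linux sketch using Python's os.sched_setaffinity; the socket-0 core range is an assumption about the machine's topology, and memory binding itself is typically set externally, for instance via numactl.

```python
# Minimal sketch of explicit process affinity on Linux. The core range
# for socket 0 is an assumption about the topology (check lscpu);
# memory binding is usually applied from outside the process, e.g.
# `numactl --membind=0 --cpunodebind=0 ./engine`.
import os

SOCKET0_CORES = set(range(0, 16))  # assumed: cores 0-15 live on socket 0

os.sched_setaffinity(0, SOCKET0_CORES)  # pin this process to socket 0
print(os.sched_getaffinity(0))          # verify the effective CPU mask
```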
Preferences of Hungarian consumers for quality, access and price attributes of health care services — result of a discrete choice experiment
In 2010, a household survey was carried out in Hungary among 1037 respondents to study consumer preferences and willingness to pay for health care services. In this paper, we use the data from the discrete choice experiments included in the survey to elicit the preferences of health care consumers regarding the choice of health care providers. Regression analysis is used to estimate the effect of improvements in service attributes (quality, access, and price) on patients’ choice, as well as the differences among socio-demographic groups. We also estimate the marginal willingness to pay for improvements in attribute levels by calculating marginal rates of substitution. The results show that respondents from a village or the capital, and those with low education or bad health status, are more driven by changes in the price attribute when choosing between health care providers. Respondents value the good skills and reputation of the physician and the attitude of the personnel most, followed by modern equipment and maintenance of the office/hospital. Access attributes (travelling and waiting time) are less important. The method of discrete choice experiment is useful to reveal patients’ preferences, and might support the development of an evidence-based and sustainable health policy on patient payments.
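For concreteness, such willingness-to-pay estimates are typically derived from a conditional logit model as the marginal rate of substitution between an attribute and price; the notation below is ours, stated as the standard form rather than the paper's exact specification.

```latex
% Standard conditional-logit form for marginal willingness to pay;
% notation is ours, not necessarily the paper's.
U_{ij} = \beta_{\mathrm{price}}\, \mathrm{price}_{ij}
       + \sum_k \beta_k\, x_{ijk} + \varepsilon_{ij},
\qquad
\mathrm{MWTP}_k = -\frac{\beta_k}{\beta_{\mathrm{price}}}
```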
A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics (Extended Version)
There has been a significant amount of excitement and recent work on GPU-based
database systems. Previous work has claimed that these systems can perform
orders of magnitude better than CPU-based database systems on analytical
workloads such as those found in decision support and business intelligence
applications. A hardware expert would view these claims with suspicion. Given
the general notion that database operators are memory-bandwidth bound, one
would expect the maximum gain to be roughly equal to the ratio of the memory
bandwidth of GPU to that of CPU. In this paper, we adopt a model-based approach
to understand when and why the performance gains of running queries on GPUs vs
on CPUs vary from the bandwidth ratio (which is roughly 16x on modern
hardware). We propose Crystal, a library of parallel routines that can be
combined together to run full SQL queries on a GPU with minimal materialization
overhead. We implement individual query operators to show that while the
speedups for selection, projection, and sort are near the bandwidth ratio,
joins achieve less speedup due to differences in hardware capabilities.
Interestingly, we show on a popular analytical workload that the full-query
performance gain from running on a GPU exceeds the bandwidth ratio, despite
individual operators having speedups below it, as a result of the limitations
of vectorizing chained operators on CPUs, yielding a 25x speedup for GPUs
over CPUs on the benchmark.
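The bandwidth-ratio argument above can be made concrete with a back-of-envelope model: for a memory-bound operator that reads its input once, the best-case GPU speedup is the ratio of memory bandwidths. The figures below are assumed ballpark numbers for modern hardware, not measurements from the paper.

```python
# Back-of-envelope check of the bandwidth-bound argument. Bandwidth
# numbers are assumed, ballpark figures, not measured values.
GPU_BW_GBPS = 900.0   # assumed HBM bandwidth of a modern GPU
CPU_BW_GBPS = 56.0    # assumed DRAM bandwidth of a server CPU socket

def scan_time_ms(bytes_scanned: float, bw_gbps: float) -> float:
    """Time to stream the data once at the given bandwidth."""
    return bytes_scanned / (bw_gbps * 1e9) * 1e3

data = 8 * 100_000_000            # 100M 8-byte values
cpu = scan_time_ms(data, CPU_BW_GBPS)
gpu = scan_time_ms(data, GPU_BW_GBPS)
print(f"CPU {cpu:.1f} ms, GPU {gpu:.1f} ms, ratio {cpu/gpu:.1f}x")  # ~16x
```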
H2O: A Hands-free Adaptive Store
Modern state-of-the-art database systems are designed around a single data storage layout. This is a fixed decision that drives the whole architectural design of a database system, e.g., row-stores or column-stores. However, none of those choices is a universally good solution; different workloads require different storage layouts and data access methods in order to achieve good performance. In this paper, we present the H2O system which introduces two novel concepts. First, it is flexible to support multiple storage layouts and data access patterns in a single engine. Second, and most importantly, it decides on-the-fly, i.e., during query processing, which design is best for classes of queries and the respective data parts. At any given point in time, parts of the data might be materialized in various patterns purely depending on the query workload; as the workload changes and with every single query, the storage and access patterns continuously adapt. In this way, H2O makes no a priori and fixed decisions on how data should be stored, allowing each single query to enjoy a storage and access pattern which is tailored to its specific properties. We present a detailed analysis of H2O using both synthetic benchmarks and realistic scientific workloads. We demonstrate that while existing systems cannot achieve maximum performance across all workloads, H2O can always match the best case performance without requiring any tuning or workload knowledge.
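To illustrate the kind of per-query layout decision such a system faces, here is a toy heuristic; both the rule and the 0.3 threshold are our assumptions for illustration, not H2O's actual on-the-fly adaptation mechanism.

```python
# Toy layout-decision heuristic in the spirit of an adaptive store:
# narrow projections over wide tables favor a columnar layout, queries
# touching most columns favor a row layout. Our illustration, not H2O.

def choose_layout(columns_referenced: int, total_columns: int) -> str:
    fraction = columns_referenced / total_columns
    # Assumed threshold: below it, column-at-a-time access wins.
    return "column-store" if fraction < 0.3 else "row-store"

print(choose_layout(2, 50))   # column-store: scan only what is needed
print(choose_layout(45, 50))  # row-store: wide tuples are cheaper row-wise
```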
Four Lessons in Versatility or How Query Languages Adapt to the Web
Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled new kinds of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: some providers choose XML, others RDF, again others JSON or OWL for their data, even in similar domains. This fracturing stifles innovation, as application builders have to cope not only with one Web stack (e.g., XML technology) but with several, each of considerable complexity. With Xcerpt we have developed a rule- and pattern-based query language that aims to shield application builders from much of this complexity: in a single query language, XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply to querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear-time and linear-space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards more convenient, yet highly efficient data access in a “Web of Data”.
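The interval-labeling evaluation mentioned above can be sketched for the tree-shaped (XML) case: one depth-first pass assigns each node a (pre, post) interval, and ancestorship reduces to interval containment. The sketch below covers only this classic tree case; the paper's contribution generalizes such labelings to graphs.

```python
# Pre/post interval labeling for a tree: u is an ancestor of v iff
# u's interval strictly encloses v's. Classic technique for tree-shaped
# XML; the graph generalization from the paper is not shown here.

def label(tree: dict, root: str) -> dict[str, tuple[int, int]]:
    intervals, counter = {}, [0]
    def dfs(node: str) -> None:
        pre = counter[0]; counter[0] += 1
        for child in tree.get(node, []):
            dfs(child)
        intervals[node] = (pre, counter[0]); counter[0] += 1
    dfs(root)
    return intervals

def is_ancestor(iv: dict, u: str, v: str) -> bool:
    return iv[u][0] < iv[v][0] and iv[v][1] < iv[u][1]

doc = {"book": ["title", "chapter"], "chapter": ["section"]}
iv = label(doc, "book")
print(is_ancestor(iv, "book", "section"))   # True
print(is_ancestor(iv, "title", "section"))  # False
```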