Analytical Query Execution Optimized for all Layers of Modern Hardware
Analytical database queries are at the core of business intelligence and decision support. To analyze the vast amounts of data available today, query execution needs to be orders of magnitude faster. Hardware advances have made a profound impact on database design and implementation. The large main memory capacity allows queries to execute exclusively in memory and shifts the bottleneck from disk access to memory bandwidth. In the new setting, to optimize query performance, databases must be aware of an unprecedented multitude of complicated hardware features. This thesis focuses on the design and implementation of highly efficient database systems by optimizing analytical query execution for all layers of modern hardware. The hardware layers include the network across multiple machines, main memory and the NUMA interconnection across multiple processors, the multiple levels of caches across multiple processor cores, and the execution pipeline within each core. For the network layer, we introduce a distributed join algorithm that minimizes the network traffic. For the memory hierarchy, we describe partitioning variants that are aware of the dynamics of the CPU caches and the NUMA interconnection. To improve the memory access rate of linear scans, we optimize lightweight compression variants and evaluate their trade-offs. To accelerate query execution within the core pipeline, we introduce advanced SIMD vectorization techniques generalizable across multiple operators. We evaluate our algorithms and techniques on both mainstream hardware and on many-integrated-core platforms, and combine our techniques in a new query engine design that can better utilize the features of many-core CPUs. In the era of hardware becoming increasingly parallel and datasets consistently growing in size, this thesis can serve as a compass for developing hardware-conscious databases with truly high-performance analytical query execution.
Compact and efficient representations of graphs
[Abstract] In this thesis we study the problem of creating compact and efficient representations
of graphs. We propose new data structures to store and query graph data from
diverse domains, paying special attention to the design of efficient solutions for
attributed and RDF graphs.
We have designed a new tool to generate graphs from arbitrary data through
a rule definition system. It is a general-purpose solution that, to the best of our
knowledge, is the first with these characteristics. Another contribution of this work
is a very compact representation for attributed graphs, providing efficient access
to the properties and links of the graph. We also study the problem of graph
distribution in a parallel environment using compact structures, proposing nine
different alternatives that are experimentally compared. We further propose a novel
RDF indexing technique that supports efficient SPARQL query resolution in compressed
space. Finally, we present a new compact structure to store ternary relationships
whose design is focused on the efficient representation of RDF data.
All of these proposals were experimentally evaluated on widely accepted
datasets, obtaining competitive results when compared against other
alternatives from the State of the Art.
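For context on the compact graph layouts discussed above, the sketch below shows compressed sparse row (CSR) adjacency, a standard baseline for space-efficient graph storage. The thesis's own structures (compact attributed-graph and RDF indexes) are more elaborate; this is only an illustration of the contiguous-storage idea, and all names here are assumptions.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Compressed sparse row adjacency: all edges live in one contiguous array,
// and the out-neighbors of vertex v occupy targets[offsets[v] .. offsets[v+1]).
// This gives O(1) access to a vertex's neighbor list with ~one word per edge.
struct CSRGraph {
    std::vector<uint32_t> offsets;  // size n + 1, cumulative out-degrees
    std::vector<uint32_t> targets;  // size m, edge destinations

    static CSRGraph build(uint32_t n,
                          const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
        CSRGraph g;
        g.offsets.assign(n + 1, 0);
        for (const auto& e : edges) g.offsets[e.first + 1]++;  // count degrees
        for (uint32_t v = 0; v < n; ++v)                       // prefix sums
            g.offsets[v + 1] += g.offsets[v];
        g.targets.resize(edges.size());
        std::vector<uint32_t> cursor(g.offsets.begin(), g.offsets.end() - 1);
        for (const auto& e : edges)                            // scatter edges
            g.targets[cursor[e.first]++] = e.second;
        return g;
    }

    // Neighbors of v as a contiguous [begin, end) span.
    std::pair<const uint32_t*, const uint32_t*> neighbors(uint32_t v) const {
        return {targets.data() + offsets[v], targets.data() + offsets[v + 1]};
    }
};
```

Compact structures like those in the thesis refine this trade-off further, e.g. by compressing the arrays themselves while keeping neighbor access efficient.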
Optimal column layout for hybrid workloads
Data-intensive analytical applications need to support both efficient reads and writes. However, a data layout that is good for an update-heavy workload is usually not well-suited for a read-mostly one, and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column’s data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32x higher throughput for update-intensive workloads and up to 2.14x higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization.
http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdf
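The read-side benefit that such a layout optimizer reasons about is partition pruning: with sorted partition boundaries, a range query only touches the partitions that overlap it, so finer partitions skip more data at the cost of more expensive updates. The sketch below is a hedged illustration of that pruning computation, not Casper's actual cost model; it assumes queries fall within the column's domain.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Partition i covers the key range [bounds[i], bounds[i+1]); bounds is sorted.
// Returns how many partitions a range query [lo, hi] must read.
// Assumes bounds.front() <= lo and hi < bounds.back().
size_t partitions_touched(const std::vector<int>& bounds, int lo, int hi) {
    if (hi < lo) return 0;
    // First partition whose upper bound exceeds lo.
    size_t first = std::upper_bound(bounds.begin() + 1, bounds.end(), lo)
                   - bounds.begin() - 1;
    // Last partition whose lower bound does not exceed hi.
    size_t last = std::upper_bound(bounds.begin(), bounds.end() - 1, hi)
                  - bounds.begin() - 1;
    return last - first + 1;
}
```

An optimizer in this spirit would weigh the blocks read per query (as above) against per-partition update costs over the expected workload mix when choosing the split points.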
Qd-tree: Learning Data Layouts for Big Data Analytics
Corporations today collect data at an unprecedented and accelerating scale,
making it increasingly important to query large datasets efficiently.
Technologies such as columnar block-based data organization and compression
have become standard practice in most commercial database systems. However, the
problem of best assigning records to data blocks on storage is still open. For
example, today's systems usually partition data by arrival time into row
groups, or range/hash partition the data based on selected fields. For a given
workload, however, such techniques are unable to optimize for the important
metric of the number of blocks accessed by a query. This metric directly
relates to the I/O cost, and therefore performance, of most analytical queries.
Further, they are unable to exploit additional available storage to drive this
metric down further.
In this paper, we propose a new framework called the query-data routing tree,
or qd-tree, to address this problem, and propose two algorithms for its
construction based on greedy heuristics and deep reinforcement learning.
Experiments over benchmark and real workloads show that a qd-tree can provide
physical speedups of more than an order of magnitude compared to current
blocking schemes, and can reach within 2X of the lower bound for data skipping
based on selectivity, while providing complete semantic descriptions of created
blocks.
Comment: ACM SIGMOD 202
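The routing idea behind a qd-tree can be illustrated with a minimal sketch (the structure and names below are assumptions, not the paper's exact design): each internal node holds a cut predicate on one attribute, records are routed to the side their values satisfy, leaves are storage blocks, and a range query descends only into subtrees that can contain matches, skipping whole blocks.

```cpp
#include <memory>
#include <vector>

// Internal nodes cut on "record[attr] < threshold"; leaves carry a block id.
struct QdNode {
    int attr = -1, threshold = 0;
    int block_id = -1;  // >= 0 only on leaves
    std::unique_ptr<QdNode> left, right;
};

// Route a record to the storage block it belongs in.
int route(const QdNode* n, const std::vector<int>& rec) {
    while (n->block_id < 0)
        n = (rec[n->attr] < n->threshold) ? n->left.get() : n->right.get();
    return n->block_id;
}

// Count blocks a single-attribute range query [lo, hi] must read.
int blocks_touched(const QdNode* n, int attr, int lo, int hi) {
    if (n->block_id >= 0) return 1;
    if (n->attr != attr)  // cut on another attribute: both sides may match
        return blocks_touched(n->left.get(), attr, lo, hi)
             + blocks_touched(n->right.get(), attr, lo, hi);
    int total = 0;
    if (lo < n->threshold)   // left subtree holds values below the cut
        total += blocks_touched(n->left.get(), attr, lo, hi);
    if (hi >= n->threshold)  // right subtree holds values at or above it
        total += blocks_touched(n->right.get(), attr, lo, hi);
    return total;
}
```

The paper's contribution is learning which cuts to make (greedily or via reinforcement learning) so that the blocks-touched count is minimized for a given workload.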