Learning a Partitioning Advisor with Deep Reinforcement Learning
Commercial data analytics products such as Microsoft Azure SQL Data Warehouse
or Amazon Redshift provide ready-to-use scale-out database solutions for
OLAP-style workloads in the cloud. While the provisioning of a database cluster
is usually fully automated by cloud providers, customers typically still have
to make important design decisions which were traditionally made by the
database administrator, such as selecting the partitioning schemes.
In this paper we introduce a learned partitioning advisor for analytical
OLAP-style workloads based on Deep Reinforcement Learning (DRL). The main idea
is that a DRL agent learns its decisions based on experience by monitoring the
rewards for different workloads and partitioning schemes. We evaluate our
learned partitioning advisor experimentally on different
database schemata and workloads of varying complexity. In the evaluation, we
show that our advisor is not only able to find partitionings that outperform
existing approaches for automated partitioning design but that it can also
easily adjust to different deployments. This is especially important in cloud
setups where customers can easily migrate their cluster to a new set of
(virtual) machines.
The End of a Myth: Distributed Transactions Can Scale
The common wisdom is that distributed transactions do not scale. But what if
distributed transactions could be made scalable using the next generation of
networks and a redesign of distributed databases? Developers would no longer
need to worry about co-partitioning schemes to achieve decent
performance. Application development would become easier as data placement
would no longer determine how scalable an application is. Hardware provisioning
would be simplified as the system administrator can expect a linear scale-out
when adding more machines rather than some complex sub-linear function, which
is highly application specific.
In this paper, we present the design of our novel scalable database system
NAM-DB and show that distributed transactions with the very common Snapshot
Isolation guarantee can indeed scale using the next generation of RDMA-enabled
network technology without any inherent bottlenecks. Our experiments with the
TPC-C benchmark show that our system scales linearly to over 6.5 million
new-order (14.5 million total) distributed transactions per second on 56
machines.
Comment: 12 pages
Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables
In this paper, we propose Multi-Modal Databases (MMDBs), a new class
of database systems that can seamlessly query text and tables using SQL. To
enable seamless querying of textual data using SQL in an MMDB, we propose to
extend relational databases with so-called multi-modal operators (MMOps) which
are based on the advances of recent large language models such as GPT-3. The
main idea of MMOps is that they allow text collections to be treated as tables
without the need to manually transform the data. As we show in our evaluation,
our MMDB prototype can not only outperform state-of-the-art approaches such as
text-to-table in terms of accuracy and performance but it also requires
significantly less training data to fine-tune the model for an unseen text
collection.
The End of Slow Networks: It's Time for a Redesign
Next generation high-performance RDMA-capable networks will require a
fundamental rethinking of the design and architecture of modern distributed
DBMSs. These systems are commonly designed and optimized under the assumption
that the network is the bottleneck: the network is slow and "thin", and thus
needs to be avoided as much as possible. Yet this assumption no longer holds
true. With InfiniBand FDR 4x, the bandwidth available to transfer data across
the network is in the same ballpark as the bandwidth of one memory channel, and it
increases even further with the most recent EDR standard. Moreover, with
continuing advances in RDMA, latency is improving at a similar pace. In this
paper, we first argue that the "old" distributed database design is not capable
of taking full advantage of the network. Second, we propose architectural
redesigns for OLTP, OLAP and advanced analytical frameworks to take better
advantage of the improved bandwidth, latency and RDMA capabilities. Finally,
for each of the workload categories, we show that remarkable performance
improvements can be achieved.
COSTREAM: Learned Cost Models for Operator Placement in Edge-Cloud Environments
In this work, we present COSTREAM, a novel learned cost model for Distributed
Stream Processing Systems that provides accurate predictions of the execution
costs of a streaming query in an edge-cloud environment. The cost model can be
used to find an initial placement of operators across heterogeneous hardware,
which is particularly important in these environments. In our evaluation, we
demonstrate that COSTREAM can produce highly accurate cost estimates for the
initial operator placement and even generalize to unseen placements, queries,
and hardware. When using COSTREAM to optimize the placements of streaming
operators, a median speed-up of around 21x can be achieved compared to
baselines.
Comment: This paper has been accepted by IEEE ICDE 202
AnyDB: An Architecture-less DBMS for Any Workload
In this paper, we propose a radical new approach for scale-out distributed
DBMSs. Instead of hard-baking an architectural model, such as a shared-nothing
architecture, into the distributed DBMS design, we aim for a new class of
so-called architecture-less DBMSs. The main idea is that an architecture-less
DBMS can mimic any architecture on a per-query basis on-the-fly without any
additional overhead for reconfiguration. Our initial results show that our
architecture-less DBMS AnyDB can provide significant speed-ups across varying
workloads compared to a traditional DBMS implementing a static architecture.
Comment: Submitted to 11th Annual Conference on Innovative Data Systems
Research (CIDR 21)
FITing-Tree: A Data-aware Index Structure
Index structures are one of the most important tools that DBAs leverage to
improve the performance of analytics and transactional workloads. However,
building several indexes over large datasets can often become prohibitive and
consume valuable system resources. In fact, a recent study showed that indexes
created as part of the TPC-C benchmark can account for 55% of the total memory
available in a modern DBMS. This overhead consumes valuable and expensive main
memory, and limits the amount of space available to store new data or process
existing data.
In this paper, we present FITing-Tree, a novel form of a learned index which
uses piece-wise linear functions with a bounded error specified at construction
time. This error knob provides a tunable parameter that allows a DBA to FIT an
index to a dataset and workload by being able to balance lookup performance and
space consumption. To navigate this tradeoff, we provide a cost model that
helps determine an appropriate error parameter given either (1) a lookup
latency requirement (e.g., 500ns) or (2) a storage budget (e.g., 100MB). Using
a variety of real-world datasets, we show that our index is able to provide
performance that is comparable to full index structures while reducing the
storage footprint by orders of magnitude.
Comment: 18 pages
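The idea of a piece-wise linear index with a bounded error can be illustrated
with a minimal sketch. It is not the FITing-Tree implementation: the greedy
segmentation below and the names build_segments, lookup and max_error are
assumptions made for illustration only. Sorted, distinct keys are split into
linear segments so that each predicted position is off by at most max_error
slots, and a lookup then searches only a small window around the prediction.

```python
import bisect
import math


def _pick_slope(lo, hi):
    # A one-point segment has no upper slope bound; any slope works, use lo.
    return lo if math.isinf(hi) else (lo + hi) / 2.0


def build_segments(keys, max_error):
    """Greedily split sorted, distinct keys into linear segments.

    Returns a list of (start_key, start_pos, slope) such that, for every key
    in a segment, start_pos + slope * (key - start_key) is within max_error
    of the key's true position. Hypothetical helper, not the paper's code.
    """
    segments = []
    start_key, start_pos = keys[0], 0
    lo, hi = 0.0, math.inf  # feasible slope interval for the open segment
    for pos in range(1, len(keys)):
        dx = keys[pos] - start_key  # > 0 because keys are sorted and distinct
        dy = pos - start_pos
        new_lo = max(lo, (dy - max_error) / dx)
        new_hi = min(hi, (dy + max_error) / dx)
        if new_lo > new_hi:
            # No single slope fits this point too: close the segment here.
            segments.append((start_key, start_pos, _pick_slope(lo, hi)))
            start_key, start_pos = keys[pos], pos
            lo, hi = 0.0, math.inf
        else:
            lo, hi = new_lo, new_hi
    segments.append((start_key, start_pos, _pick_slope(lo, hi)))
    return segments


def lookup(keys, segments, starts, key, max_error):
    """Return the position of key in keys, or None if it is absent.

    starts is the list of segment start keys, i.e. [s[0] for s in segments].
    """
    i = bisect.bisect_right(starts, key) - 1
    if i < 0:
        return None  # key is smaller than every indexed key
    start_key, start_pos, slope = segments[i]
    pred = start_pos + slope * (key - start_key)
    # Only the window [pred - max_error, pred + max_error] must be searched.
    lo = min(max(0, math.floor(pred - max_error)), len(keys))
    hi = min(len(keys), math.ceil(pred + max_error) + 1)
    j = bisect.bisect_left(keys, key, lo, hi)
    return j if j < hi and keys[j] == key else None
```

In this sketch a larger max_error yields fewer segments (less space) but a
wider search window per lookup, which is the tradeoff the error knob exposes.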
An End-to-end Neural Natural Language Interface for Databases
The ability to extract insights from new data sets is critical for decision
making. Visual interactive tools play an important role in data exploration
since they provide non-technical users with an effective way to visually
compose queries and comprehend the results. Natural language has recently
gained traction as an alternative query interface to databases with the
potential to enable non-expert users to formulate complex questions and
information needs efficiently and effectively. However, understanding natural
language questions and translating them accurately to SQL is a challenging
task, and thus Natural Language Interfaces for Databases (NLIDBs) have not yet
made their way into practical tools and commercial products.
In this paper, we present DBPal, a novel data exploration tool with a natural
language interface. DBPal leverages recent advances in deep models to make
query understanding more robust in the following ways: First, DBPal uses a deep
model to translate natural language statements to SQL, making the translation
process more robust to paraphrasing and other linguistic variations. Second, to
support the users in phrasing questions without knowing the database schema and
the query features, DBPal provides a learned auto-completion model that
suggests partial query extensions to users during query formulation and thus
helps them write complex queries.