ArcaDB: A Container-based Disaggregated Query Engine for Heterogeneous Computational Environments
Modern enterprises rely on data management systems to collect, store, and
analyze vast amounts of data related to their operations. Nowadays, clusters
and hardware accelerators (e.g., GPUs, TPUs) have become a necessity to scale
with the data processing demands in many applications related to social media,
bioinformatics, surveillance systems, remote sensing, and medical informatics.
Given this new scenario, the architecture of data analytics engines must evolve
to take advantage of these new technological trends. In this paper, we present
ArcaDB: a disaggregated query engine that leverages container technology to
place operators at compute nodes that fit their performance profile. In ArcaDB,
a query plan is dispatched to worker nodes that have different computing
characteristics. Each operator is annotated with the preferred type of compute
node for execution, and ArcaDB ensures that the operator gets picked up by the
appropriate workers. We have implemented a prototype version of ArcaDB using
Java, Python, and Docker containers. We have also completed a preliminary
performance study of this prototype, using images and scientific data. This
study shows that ArcaDB can speed up query performance by a factor of 3.5x in
comparison with a shared-nothing, symmetric arrangement.
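The annotation-and-dispatch idea described above can be sketched as follows; this is a minimal illustration, and the names (`Operator`, `Worker`, `dispatch`) are assumptions for exposition, not ArcaDB's actual API:

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Operator:
    name: str
    preferred_node: str   # annotation: "cpu", "gpu", ...

@dataclass
class Worker:
    node_type: str
    assigned: list = field(default_factory=list)

def dispatch(plan, workers):
    """Route each annotated operator to a worker matching its performance
    profile, falling back to a CPU worker when no match exists."""
    by_type = defaultdict(list)
    for w in workers:
        by_type[w.node_type].append(w)
    for op in plan:
        candidates = by_type.get(op.preferred_node) or by_type["cpu"]
        # simple least-loaded choice among matching workers
        target = min(candidates, key=lambda w: len(w.assigned))
        target.assigned.append(op.name)
    return workers

workers = [Worker("cpu"), Worker("gpu"), Worker("cpu")]
plan = [Operator("scan", "cpu"), Operator("infer", "gpu"), Operator("filter", "cpu")]
dispatch(plan, workers)
```

In this toy version the dispatcher pushes operators to workers; ArcaDB's pull-based scheme (workers picking up matching operators) would invert the loop, but the matching criterion is the same.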
DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory (Extended Version)
We present Dinomo, a novel key-value store for disaggregated persistent
memory (DPM). Dinomo is the first key-value store for DPM that simultaneously
achieves high common-case performance, scalability, and lightweight online
reconfiguration. We observe that previously proposed key-value stores for DPM
had architectural limitations that prevent them from achieving all three goals
simultaneously. Dinomo uses a novel combination of techniques such as ownership
partitioning, disaggregated adaptive caching, selective replication, and
lock-free and log-free indexing to achieve these goals. Compared to a
state-of-the-art DPM key-value store, Dinomo achieves at least 3.8x better
throughput on various workloads at scale and higher scalability, while
providing fast reconfiguration.
Comment: This is an extended version of the full paper to appear in PVLDB 15.13 (VLDB 2023).
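Two of the listed techniques, ownership partitioning and adaptive caching, can be illustrated with a small sketch; all class names here (`SharedPM`, `ComputeNode`, `Dinomo`) are hypothetical stand-ins, and the hash-based ownership assignment is an assumption, not Dinomo's actual protocol:

```python
import hashlib

class SharedPM:
    """Disaggregated persistent memory reachable by every compute node."""
    def __init__(self):
        self.data = {}

class ComputeNode:
    def __init__(self, name, pm):
        self.name, self.pm, self.cache = name, pm, {}
    def put(self, key, value):
        self.pm.data[key] = value   # sole owner: no cross-node locking needed
        self.cache[key] = value
    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # adaptive caching: hot keys stay local
        value = self.pm.data.get(key)
        self.cache[key] = value
        return value

class Dinomo:
    """Ownership partitioning: exactly one node may mutate a given key, so
    writes are synchronization-free; reconfiguration only moves ownership,
    not the data, which stays in the shared persistent memory."""
    def __init__(self, node_names):
        self.pm = SharedPM()
        self.nodes = [ComputeNode(n, self.pm) for n in node_names]
    def owner(self, key):
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]
    def put(self, key, value):
        self.owner(key).put(key, value)
    def get(self, key):
        return self.owner(key).get(key)

kv = Dinomo(["node-a", "node-b"])
kv.put("user:1", "alice")
```

The point of the sketch is the division of labor: data lives in DPM, while each key has a single owning compute node, so ownership handoff during reconfiguration requires no bulk data movement.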
Towards Transaction as a Service
This paper argues for decoupling transaction processing from existing
two-layer cloud-native databases and making it an independent service. By
building a transaction as a service (TaaS) layer, the
transaction processing can be independently scaled for high resource
utilization and can be independently upgraded for development agility.
Accordingly, we architect an execution-transaction-storage three-layer
cloud-native database. By connecting to TaaS, 1) the AP engines can be
empowered with ACID TP capability, 2) multiple standalone TP engine instances
can be incorporated to support multi-master distributed TP for horizontal
scalability, 3) multiple execution engines with different data models can be
integrated to support multi-model transactions, and 4) high performance TP is
achieved through extensive TaaS optimizations and consistent evolution.
Cloud-native databases deserve better architecture: we believe that TaaS
provides a path forward to better cloud-native databases.
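A minimal sketch of what such a service boundary could look like: execution engines call `begin`/`read`/`write`/`commit` and the TaaS layer owns concurrency control. The optimistic version-check validation below is my own stand-in for illustration; the paper does not prescribe this particular scheme:

```python
import itertools

class TaaS:
    """Toy transaction service: engines delegate all transactional state
    here; validation uses a simple optimistic read-version check."""
    def __init__(self):
        self.store = {}          # committed values
        self.versions = {}       # key -> commit counter
        self._txn_ids = itertools.count(1)
        self.txns = {}
    def begin(self):
        tid = next(self._txn_ids)
        self.txns[tid] = {"reads": {}, "writes": {}}
        return tid
    def read(self, tid, key):
        t = self.txns[tid]
        if key in t["writes"]:           # read-your-own-writes
            return t["writes"][key]
        t["reads"][key] = self.versions.get(key, 0)
        return self.store.get(key)
    def write(self, tid, key, value):
        self.txns[tid]["writes"][key] = value
    def commit(self, tid):
        t = self.txns.pop(tid)
        # abort if any key read by this txn was re-committed underneath it
        for key, ver in t["reads"].items():
            if self.versions.get(key, 0) != ver:
                return False
        for key, value in t["writes"].items():
            self.store[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True
```

Because the engines above this interface never touch locks or versions directly, any engine (AP, standalone TP, multi-model) gains ACID semantics by calling the same four methods, which is the architectural point the abstract makes.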
BPF-oF: Storage Function Pushdown Over the Network
Storage disaggregation, wherein storage is accessed over the network, is
popular because it allows applications to independently scale storage capacity
and bandwidth based on dynamic application demand. However, the added network
processing introduced by disaggregation can consume significant CPU resources.
In many storage systems, logical storage operations (e.g., lookups,
aggregations) involve a series of simple but dependent I/O access patterns.
Therefore, one way to reduce the network processing overhead is to execute
dependent series of I/O accesses at the remote storage server, reducing the
back-and-forth communication between the storage layer and the application. We
refer to this approach as remote-storage pushdown. We present BPF-oF, a
new remote-storage pushdown protocol built on top of NVMe-oF, which enables
applications to safely push custom eBPF storage functions to a remote storage
server.
The main challenge in integrating BPF-oF with storage systems is preserving
the benefits of their client-based in-memory caches. We address this challenge
by designing novel caching techniques for storage pushdown, including splitting
queries into separate in-memory and remote-storage phases and periodically
refreshing the client cache with sampled accesses from the remote storage
device. We demonstrate the utility of BPF-oF by integrating it with three
storage systems, including RocksDB, a popular persistent key-value store that
has no existing storage pushdown capability. We show BPF-oF provides
significant speedups in all three systems when accessed over the network, for
example improving RocksDB's throughput by up to 2.8× and tail latency by
up to 2.6×.
Towards Scalable OLTP Over Fast Networks
Online Transaction Processing (OLTP) underpins real-time data processing in many mission-critical applications, from banking to e-commerce.
These applications typically issue short-duration, latency-sensitive transactions that demand immediate processing.
High-volume applications, such as Alibaba's e-commerce platform, achieve peak transaction rates as high as 70 million transactions per second, exceeding the capacity of a single machine.
Instead, distributed OLTP database management systems (DBMS) are deployed across multiple powerful machines.
Historically, such distributed OLTP DBMSs have been primarily designed to avoid network communication, a paradigm largely unchanged since the 1980s.
However, fast networks challenge the conventional belief that network communication is the main bottleneck.
In particular, emerging network technologies, like Remote Direct Memory Access (RDMA), radically alter how data can be accessed over a network.
RDMA's primitives allow direct access to the memory of a remote machine within an order of magnitude of local memory access.
This development invalidates the notion that network communication is the primary bottleneck.
Given that traditional distributed database systems have been designed with the premise that the network is slow, they cannot efficiently exploit these fast network primitives, which requires us to reconsider how we design distributed OLTP systems.
This thesis focuses on the challenges RDMA presents and its implications on the design of distributed OLTP systems.
First, we examine distributed architectures to understand data access patterns and scalability in modern OLTP systems.
Drawing on these insights, we advocate a distributed storage engine optimized for high-speed networks.
The storage engine serves as the foundation of a database, ensuring efficient data access through three central components: indexes, synchronization primitives, and buffer management (caching).
With the introduction of RDMA, the landscape of data access has undergone a significant transformation.
This requires a comprehensive redesign of the storage engine components to exploit the potential of RDMA and similar high-speed network technologies.
Thus, as the second contribution, we design RDMA-optimized tree-based indexes — especially applicable for disaggregated databases to access remote data efficiently.
We then turn our attention to the unique challenges of RDMA.
One-sided RDMA, one of the network primitives introduced by RDMA, presents a performance advantage in enabling remote memory access while bypassing the remote CPU and the operating system.
This allows the remote CPU to process transactions uninterrupted, with no requirement to be on hand for network communication. However, bypassing the remote CPU also bypasses traditional CPU-driven synchronization, so specialized one-sided RDMA synchronization primitives are required.
We found that existing RDMA one-sided synchronization schemes are unscalable or, even worse, fail to synchronize correctly, leading to hard-to-detect data corruption.
As our third contribution, we address this issue by offering guidelines to build scalable and correct one-sided RDMA synchronization primitives.
Finally, recognizing that maintaining all data in memory becomes economically unattractive, we propose a distributed buffer manager design that efficiently utilizes cost-effective NVMe flash storage.
By leveraging low-latency RDMA messages, our buffer manager provides a transparent memory abstraction, accessing the aggregated DRAM and NVMe storage across nodes.
Central to our approach is a distributed caching protocol that dynamically caches data.
With this approach, our system can outperform RDMA-enabled in-memory distributed databases while managing larger-than-memory datasets efficiently.
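The core of a correct one-sided synchronization primitive is that mutual exclusion must rest entirely on the NIC's atomic verbs, since no remote CPU is involved. The sketch below simulates this with threads standing in for client machines; `RemoteMemory` and its `cas` verb are illustrative assumptions modeling NIC-level atomic compare-and-swap, not a real RDMA API:

```python
import threading

class RemoteMemory:
    """Simulated remote NIC exposing one-sided verbs: READ, WRITE, and an
    atomic COMPARE-AND-SWAP (the internal lock models NIC atomicity only;
    the remote CPU never participates)."""
    def __init__(self, size):
        self.words = [0] * size
        self._nic = threading.Lock()
    def cas(self, addr, expected, new):
        with self._nic:
            old = self.words[addr]
            if old == expected:
                self.words[addr] = new
            return old
    def read(self, addr): return self.words[addr]
    def write(self, addr, val): self.words[addr] = val

def lock_acquire(mem, addr, client_id):
    """Spin with one-sided CAS until our id is installed in the lock word."""
    while mem.cas(addr, 0, client_id) != 0:
        pass

def lock_release(mem, addr, client_id):
    assert mem.cas(addr, client_id, 0) == client_id

mem = RemoteMemory(2)   # word 0: lock word, word 1: shared counter

def client(cid):
    for _ in range(1000):
        lock_acquire(mem, 0, cid)
        mem.write(1, mem.read(1) + 1)   # critical section on remote memory
        lock_release(mem, 0, cid)

clients = [threading.Thread(target=client, args=(cid,)) for cid in (1, 2)]
for t in clients: t.start()
for t in clients: t.join()
```

Storing the holder's id in the lock word (rather than a bare flag) is one of the correctness points the thesis's guidelines concern: it lets a release detect that it still owns the lock, catching the silent-corruption failure mode mentioned above.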
Floki: a proactive data forwarding system for direct inter-function communication for serverless workflows
Serverless computing emerges as an architecture choice to build and run containerized data-intensive pipelines. It leaves the tedious work of infrastructure management and operations to the cloud provider, allowing developers to focus on their core business logic by decomposing their jobs into small containerized functions. To increase platform scalability and flexibility, providers take advantage of hardware disaggregation and require inter-function communication to go through shared object storage. Despite its data persistence and recovery advantages, object storage is expensive in terms of performance and resources when dealing with data-intensive workloads. In this paper, we present Floki, a data forwarding system for direct inter-function data exchange that proactively enables point-to-point communication between producer-consumer pairs of containerized functions through fixed-size memory buffers, pipes, and sockets. Compared with state-of-practice object storage, Floki shows up to a 74.95× end-to-end time improvement, reducing the largest data sharing time from 12.55 to 4.33 minutes, while requiring up to 50,738× fewer disk resources, releasing up to roughly 96 GB of space. This work was partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P, the Ministry of Science under contracts PID2019-107255GB-C21/AEI/10.13039/501100011033 and PID-126248OB-I00, and the Generalitat de Catalunya under contract 2014SGR1051.
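The contrast between storage-mediated and direct forwarding can be shown with an OS pipe, one of the three channels the abstract names. This is a minimal sketch with threads standing in for the producer and consumer functions; a real deployment would forward across containers, which this simplification ignores:

```python
import os
import threading

def producer(write_fd, payload):
    """Producer function: streams its intermediate result straight into
    the channel instead of uploading it to shared object storage."""
    with os.fdopen(write_fd, "wb") as w:
        w.write(payload)

def consumer(read_fd, out):
    """Consumer function: reads directly from the channel; no object-store
    download and no on-disk copy of the intermediate data ever exists."""
    with os.fdopen(read_fd, "rb") as r:
        out.append(r.read())

read_fd, write_fd = os.pipe()
received = []
t1 = threading.Thread(target=producer, args=(write_fd, b"intermediate-result"))
t2 = threading.Thread(target=consumer, args=(read_fd, received))
t1.start(); t2.start()
t1.join(); t2.join()
```

The disk-resource savings the abstract reports follow from exactly this property: the intermediate data flows through a bounded kernel buffer and is never materialized in storage.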
Welcome to Zombieland: Practical and Energy-efficient Memory Disaggregation in a Datacenter
In this paper, we propose an effortless way to disaggregate the CPU-memory couple, two of the most important resources in cloud computing. Instead of redesigning each resource board, the disaggregation is done at the power supply domain level. In other words, CPU and memory still share the same board, but their power supply domains are separated. Beyond this disaggregation, we make the two following contributions: (1) the prototyping of a new ACPI sleep state (called zombie and noted Sz) which allows suspending a server (thus saving energy) while keeping its memory remotely accessible; and (2) the prototyping of a rack-level system software which allows the transparent utilization of the entire rack's resources (avoiding resource waste). We experimentally evaluate the effectiveness of our solution and show that it can improve the energy efficiency of state-of-the-art consolidation techniques by up to 86%, with minimal additional complexity.
Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms
The enormous quantity of data produced every day together with advances in
data analytics has led to a proliferation of data management and analysis
systems. Typically, these systems are built around highly specialized
monolithic operators optimized for the underlying hardware. While effective in
the short term, such an approach makes the operators cumbersome to port and
adapt, which is increasingly required due to the speed at which algorithms and
hardware evolve. To address this limitation, we present Modularis, an execution
layer for data analytics based on sub-operators, i.e., composable building
blocks resembling traditional database operators but at a finer granularity. To
demonstrate the advantages of our approach, we use Modularis to build a
distributed query processing system supporting relational queries running on an
RDMA cluster, a serverless cloud platform, and a smart storage engine.
Modularis requires minimal code changes to execute queries across these three
diverse hardware platforms, showing that the sub-operator approach reduces the
amount and complexity of the code. In fact, changes in the platform affect only
sub-operators that depend on the underlying hardware. We show the end-to-end
performance of Modularis by comparing it with a framework for SQL processing
(Presto), a commercial cluster database (SingleStore), as well as
Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these
systems, proving that the design and architectural advantages of a modular
design can be achieved without degrading performance. We also compare Modularis
with a hand-optimized implementation of a join for RDMA clusters. We show that
Modularis has the advantage of being easily extensible to a wider range of join
variants and group-by queries, none of which are supported in the hand-tuned
join.
Comment: Accepted at PVLDB vol. 1
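The sub-operator idea, finer-grained composable building blocks assembled into full relational operators, can be sketched as follows. The decomposition shown (scan, filter, project, hash-aggregate as separate units) is my illustrative choice, not Modularis's actual sub-operator set:

```python
def scan(rows):
    """Sub-operator: produce tuples from a source (here, an in-memory list;
    a platform-specific variant would read from RDMA or smart storage)."""
    yield from rows

def filt(pred):
    """Sub-operator factory: keep tuples satisfying a predicate."""
    def op(upstream):
        return (r for r in upstream if pred(r))
    return op

def project(cols):
    """Sub-operator factory: keep only the named columns."""
    def op(upstream):
        return ({c: r[c] for c in cols} for r in upstream)
    return op

def hash_agg(key, agg_col):
    """Sub-operator factory: grouped sum over one column."""
    def op(upstream):
        groups = {}
        for r in upstream:
            groups[r[key]] = groups.get(r[key], 0) + r[agg_col]
        return groups
    return op

def compose(source, *sub_ops):
    """Chain sub-operators into a pipeline, like a physical query plan."""
    stream = source
    for op in sub_ops:
        stream = op(stream)
    return stream

rows = [{"dept": "a", "sal": 10}, {"dept": "b", "sal": 5}, {"dept": "a", "sal": 7}]
result = compose(scan(rows),
                 filt(lambda r: r["sal"] > 4),
                 project(["dept", "sal"]),
                 hash_agg("dept", "sal"))
```

Porting to new hardware then means swapping only the hardware-facing sub-operators (here, `scan`) while the rest of the pipeline composes unchanged, which is the portability argument the abstract makes.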
Modern data analytics in the cloud era
Cloud computing has been the groundbreaking technology of the last decade. The ease of use of the managed environment, in combination with a nearly infinite amount of resources and a pay-per-use price model, enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is designed, deployed, and used. This thesis focuses on database systems deployed in the cloud environment.
We identify three major interaction points of the database engine with the environment that show changed requirements compared to traditional on-premise data warehouse solutions. First, software is deployed on elastic resources. Consequently, systems should support elasticity in order to match workload requirements and be cost-effective. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing partition reassignments in the case of elastic scaling. Furthermore, we introduce a buffer pre-heating strategy that mitigates the cold start after scaling and yields an immediate performance benefit from the newly added resources. Second, cloud-based systems are accessible and available from nearly everywhere. Consequently, data is frequently ingested from numerous endpoints, which differs from the bulk loads or ETL pipelines of a traditional data warehouse solution. Many users do not define database constraints in order to avoid transaction aborts due to conflicts or to speed up data ingestion. To mitigate this issue, we introduce the concept of PatchIndexes, which allow the definition of approximate constraints. PatchIndexes maintain exceptions to constraints, make them usable in query optimization and execution, and offer efficient update support. The concept can be applied to arbitrary constraints, and we provide examples of approximate uniqueness and approximate sorting constraints. Moreover, we show how PatchIndexes can be exploited to define advanced constraints like an approximate multi-key partitioning, which offers robust query performance over workloads with different partition key requirements. Third, data-centric workloads have changed over the last decade. Besides traditional SQL workloads for business intelligence, data science workloads are of significant importance nowadays.
In these cases, the database system often acts only as a data provider, while the computational effort takes place in data science or machine learning (ML) environments. As this workflow has several drawbacks, we pursue the goal of pushing advanced analytics towards the database engine and introduce the Grizzly framework as a DataFrame-to-SQL transpiler. Based on this, we identify user-defined functions (UDFs) and machine learning inference as important tasks that would benefit from a deeper engine integration, and we investigate approaches to push these operations towards the database engine.
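The PatchIndex idea for approximate uniqueness can be sketched in a few lines: the column is treated as if it were unique, with a small "patch" of exception row ids maintained on the side. The class below is an illustrative reduction of the concept, not the thesis's actual implementation:

```python
class PatchIndex:
    """Approximate-uniqueness sketch: 'seen' acts as the unique fast path,
    while 'exceptions' records the row ids that violate the constraint.
    Updates never abort; they just grow the patch."""
    def __init__(self, values):
        self.seen = {}          # value -> first row id
        self.exceptions = set() # row ids that duplicate an earlier value
        for rid, v in enumerate(values):
            self.insert(rid, v)
    def insert(self, rid, value):
        if value in self.seen:
            self.exceptions.add(rid)   # tolerate the violation, record it
        else:
            self.seen[value] = rid
    def unique_lookup(self, value):
        """Fast path usable by the optimizer for non-patched values."""
        return self.seen.get(value)

# Column with one duplicate: row 2 repeats the value of row 0.
idx = PatchIndex(["x", "y", "x", "z"])
```

A query can then use the unique-index plan for the bulk of the data and fall back to a scan only for rows in the (small) exception set, which is how the approximate constraint stays usable for optimization despite dirty ingested data.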
Designing key-value stores for emerging memory and disaggregation technologies
With the increasing convergence of applications to the cloud, cloud-based key-value stores (KVSs) should offer high performance, scalability, elasticity, utilization, and crash resilience. However, conventional storage technologies and monolithic server models make it challenging to achieve these goals. The transition to the new emerging memory and disaggregation technologies, such as PM (Persistent Memory), RDMA (Remote Direct Memory Access), and CXL (Compute Express Link), can readily offer opportunities to achieve these goals. However, these new technologies have distinct characteristics from the conventional technologies. Thus, to efficiently and reliably utilize them, KVSs must be carefully designed to avoid sub-optimal design choices without compromising their inherent hardware-guaranteed benefits. In this dissertation, we seek to answer the following question: how can we achieve a high-performance, scalable, elastic, and crash-recoverable KVS for disaggregated persistent memory (DPM)? In particular, we explore solutions to achieve these goals by introducing new indexing, caching, and partitioning techniques. We design new indexing data structures for a high-performance, scalable, and crash-recoverable data storage at PM, employ caching strategies for high performance by reducing expensive accesses to DPM, and tailor partitioning techniques to achieve elastic, scalable resource deployment. This dissertation first presents Recipe, a principled approach for converting concurrent DRAM indexes to crash-consistent indexes for PM. The main insight behind Recipe is that isolation provided by a certain class of concurrent DRAM indexes can be translated to crash consistency when the same index is used in PM. We present a set of conditions that enable the identification of this class of DRAM indexes, and the actions to be taken to convert each index to be persistent. 
Next, we present Dinomo, the first key-value store for DPM based on RDMA interconnects that simultaneously achieves high common-case performance, scalability, and elasticity. Dinomo uses a novel combination of techniques such as ownership partitioning, disaggregated adaptive caching, selective replication, and lock-free and log-free PM indexing to achieve these goals. Finally, we present Shift, a cache-conscious KVS design for CXL disaggregated memory. Shift sheds new light on existing PM indexes and partitioning schemes originally proposed for different system domains to achieve a high-performance, scalable, elastic, crash-recoverable KVS for CXL disaggregated memory. Furthermore, Shift employs a lock intention log to make the PM indexes partial-failure-resilient, and non-hierarchical processing to take advantage of both the KN cache and direct accesses to CXL disaggregated memory.
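The crash-consistency discipline behind Recipe, persist the data before making it visible, can be sketched on a toy persistent-memory model. `SimPM` and `insert_crash_consistent` are illustrative assumptions; real PM uses cache-line flush and fence instructions (e.g. clwb/sfence) rather than these method calls:

```python
class SimPM:
    """Toy persistent memory: stores land in a volatile cache model and
    reach the persistence domain only after an explicit flush+fence."""
    def __init__(self):
        self.cache = {}      # what the CPU sees
        self.persisted = {}  # what survives power loss
    def store(self, addr, val):
        self.cache[addr] = val
    def flush_fence(self, addr):
        self.persisted[addr] = self.cache[addr]
    def crash(self):
        # Volatile cache contents are lost; only persisted state remains.
        self.cache = dict(self.persisted)

def insert_crash_consistent(pm, slot, value, valid_flag):
    """Recipe-style ordering: persist the value before flipping the
    visibility flag, so a reader after recovery never observes a flag
    pointing at a torn or missing value."""
    pm.store(slot, value)
    pm.flush_fence(slot)
    pm.store(valid_flag, 1)
    pm.flush_fence(valid_flag)

pm = SimPM()
insert_crash_consistent(pm, "s0", "k=5", "s0_valid")
```

Recipe's observation is that indexes whose concurrent readers already tolerate in-flight writers need only this ordering discipline added at their store points to become crash-consistent on PM.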