ArcaDB: A Container-based Disaggregated Query Engine for Heterogeneous Computational Environments
Modern enterprises rely on data management systems to collect, store, and
analyze vast amounts of data related to their operations. Nowadays, clusters
and hardware accelerators (e.g., GPUs, TPUs) have become a necessity to scale
with the data processing demands in many applications related to social media,
bioinformatics, surveillance systems, remote sensing, and medical informatics.
Given this new scenario, the architecture of data analytics engines must evolve
to take advantage of these new technological trends. In this paper, we present
ArcaDB: a disaggregated query engine that leverages container technology to
place operators at compute nodes that fit their performance profile. In ArcaDB,
a query plan is dispatched to worker nodes that have different computing
characteristics. Each operator is annotated with the preferred type of compute
node for execution, and ArcaDB ensures that the operator gets picked up by the
appropriate workers. We have implemented a prototype version of ArcaDB using
Java, Python, and Docker containers. We have also completed a preliminary
performance study of this prototype, using images and scientific data. This
study shows that ArcaDB can speed up query performance by a factor of 3.5x in
comparison with a shared-nothing, symmetric arrangement.
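The annotation-and-dispatch idea described above can be sketched as follows; this is a minimal illustration, and the names (`Operator`, `Worker`, `dispatch`) are assumptions for exposition, not ArcaDB's actual API:

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Operator:
    name: str
    preferred_node: str   # annotation: "cpu", "gpu", ...

@dataclass
class Worker:
    node_type: str
    assigned: list = field(default_factory=list)

def dispatch(plan, workers):
    """Route each annotated operator to a worker matching its performance
    profile, falling back to a CPU worker when no match exists."""
    by_type = defaultdict(list)
    for w in workers:
        by_type[w.node_type].append(w)
    for op in plan:
        candidates = by_type.get(op.preferred_node) or by_type["cpu"]
        # simple least-loaded choice among matching workers
        target = min(candidates, key=lambda w: len(w.assigned))
        target.assigned.append(op.name)
    return workers

workers = [Worker("cpu"), Worker("gpu"), Worker("cpu")]
plan = [Operator("scan", "cpu"), Operator("infer", "gpu"), Operator("filter", "cpu")]
dispatch(plan, workers)
```

In this toy version the dispatcher pushes operators to workers; ArcaDB's pull-based scheme (workers picking up matching operators) would invert the loop, but the matching criterion is the same.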
DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory (Extended Version)
We present Dinomo, a novel key-value store for disaggregated persistent
memory (DPM). Dinomo is the first key-value store for DPM that simultaneously
achieves high common-case performance, scalability, and lightweight online
reconfiguration. We observe that previously proposed key-value stores for DPM
had architectural limitations that prevent them from achieving all three goals
simultaneously. Dinomo uses a novel combination of techniques such as ownership
partitioning, disaggregated adaptive caching, selective replication, and
lock-free and log-free indexing to achieve these goals. Compared to a
state-of-the-art DPM key-value store, Dinomo achieves at least 3.8x better
throughput on various workloads at scale and higher scalability, while
providing fast reconfiguration.
Comment: This is an extended version of the full paper to appear in PVLDB 15.13 (VLDB 2023).
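Two of the listed techniques, ownership partitioning and adaptive caching, can be illustrated with a small sketch; all class names here (`SharedPM`, `ComputeNode`, `Dinomo`) are hypothetical stand-ins, and the hash-based ownership assignment is an assumption, not Dinomo's actual protocol:

```python
import hashlib

class SharedPM:
    """Disaggregated persistent memory reachable by every compute node."""
    def __init__(self):
        self.data = {}

class ComputeNode:
    def __init__(self, name, pm):
        self.name, self.pm, self.cache = name, pm, {}
    def put(self, key, value):
        self.pm.data[key] = value   # sole owner: no cross-node locking needed
        self.cache[key] = value
    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # adaptive caching: hot keys stay local
        value = self.pm.data.get(key)
        self.cache[key] = value
        return value

class Dinomo:
    """Ownership partitioning: exactly one node may mutate a given key, so
    writes are synchronization-free; reconfiguration only moves ownership,
    not the data, which stays in the shared persistent memory."""
    def __init__(self, node_names):
        self.pm = SharedPM()
        self.nodes = [ComputeNode(n, self.pm) for n in node_names]
    def owner(self, key):
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]
    def put(self, key, value):
        self.owner(key).put(key, value)
    def get(self, key):
        return self.owner(key).get(key)

kv = Dinomo(["node-a", "node-b"])
kv.put("user:1", "alice")
```

The point of the sketch is the division of labor: data lives in DPM, while each key has a single owning compute node, so ownership handoff during reconfiguration requires no bulk data movement.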
Towards Transaction as a Service
This paper argues for decoupling transaction processing from existing
two-layer cloud-native databases and making it an independent service. By
building a transaction as a service (TaaS) layer, the
transaction processing can be independently scaled for high resource
utilization and can be independently upgraded for development agility.
Accordingly, we architect an execution-transaction-storage three-layer
cloud-native database. By connecting to TaaS, 1) the AP engines can be
empowered with ACID TP capability, 2) multiple standalone TP engine instances
can be incorporated to support multi-master distributed TP for horizontal
scalability, 3) multiple execution engines with different data models can be
integrated to support multi-model transactions, and 4) high performance TP is
achieved through extensive TaaS optimizations and consistent evolution.
Cloud-native databases deserve better architecture: we believe that TaaS
provides a path forward to better cloud-native databases.
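A minimal sketch of what such a service boundary could look like: execution engines call `begin`/`read`/`write`/`commit` and the TaaS layer owns concurrency control. The optimistic version-check validation below is my own stand-in for illustration; the paper does not prescribe this particular scheme:

```python
import itertools

class TaaS:
    """Toy transaction service: engines delegate all transactional state
    here; validation uses a simple optimistic read-version check."""
    def __init__(self):
        self.store = {}          # committed values
        self.versions = {}       # key -> commit counter
        self._txn_ids = itertools.count(1)
        self.txns = {}
    def begin(self):
        tid = next(self._txn_ids)
        self.txns[tid] = {"reads": {}, "writes": {}}
        return tid
    def read(self, tid, key):
        t = self.txns[tid]
        if key in t["writes"]:           # read-your-own-writes
            return t["writes"][key]
        t["reads"][key] = self.versions.get(key, 0)
        return self.store.get(key)
    def write(self, tid, key, value):
        self.txns[tid]["writes"][key] = value
    def commit(self, tid):
        t = self.txns.pop(tid)
        # abort if any key read by this txn was re-committed underneath it
        for key, ver in t["reads"].items():
            if self.versions.get(key, 0) != ver:
                return False
        for key, value in t["writes"].items():
            self.store[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True
```

Because the engines above this interface never touch locks or versions directly, any engine (AP, standalone TP, multi-model) gains ACID semantics by calling the same four methods, which is the architectural point the abstract makes.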
BPF-oF: Storage Function Pushdown Over the Network
Storage disaggregation, wherein storage is accessed over the network, is
popular because it allows applications to independently scale storage capacity
and bandwidth based on dynamic application demand. However, the added network
processing introduced by disaggregation can consume significant CPU resources.
In many storage systems, logical storage operations (e.g., lookups,
aggregations) involve a series of simple but dependent I/O access patterns.
Therefore, one way to reduce the network processing overhead is to execute
dependent series of I/O accesses at the remote storage server, reducing the
back-and-forth communication between the storage layer and the application. We
refer to this approach as remote-storage pushdown. We present BPF-oF, a
new remote-storage pushdown protocol built on top of NVMe-oF, which enables
applications to safely push custom eBPF storage functions to a remote storage
server.
The main challenge in integrating BPF-oF with storage systems is preserving
the benefits of their client-based in-memory caches. We address this challenge
by designing novel caching techniques for storage pushdown, including splitting
queries into separate in-memory and remote-storage phases and periodically
refreshing the client cache with sampled accesses from the remote storage
device. We demonstrate the utility of BPF-oF by integrating it with three
storage systems, including RocksDB, a popular persistent key-value store that
has no existing storage pushdown capability. We show BPF-oF provides
significant speedups in all three systems when accessed over the network, for
example improving RocksDB's throughput by up to 2.8× and tail latency by
up to 2.6×.
Towards Scalable OLTP Over Fast Networks
Online Transaction Processing (OLTP) underpins real-time data processing in many mission-critical applications, from banking to e-commerce.
These applications typically issue short-duration, latency-sensitive transactions that demand immediate processing.
High-volume applications, such as Alibaba's e-commerce platform, achieve peak transaction rates as high as 70 million transactions per second, exceeding the capacity of a single machine.
Instead, distributed OLTP database management systems (DBMS) are deployed across multiple powerful machines.
Historically, such distributed OLTP DBMSs have been primarily designed to avoid network communication, a paradigm largely unchanged since the 1980s.
However, fast networks challenge the conventional belief that network communication is the main bottleneck.
In particular, emerging network technologies, like Remote Direct Memory Access (RDMA), radically alter how data can be accessed over a network.
RDMA's primitives allow direct access to the memory of a remote machine within an order of magnitude of local memory access.
This development invalidates the notion that network communication is the primary bottleneck.
Given that traditional distributed database systems have been designed with the premise that the network is slow, they cannot efficiently exploit these fast network primitives, which requires us to reconsider how we design distributed OLTP systems.
This thesis focuses on the challenges RDMA presents and its implications on the design of distributed OLTP systems.
First, we examine distributed architectures to understand data access patterns and scalability in modern OLTP systems.
Drawing on these insights, we advocate a distributed storage engine optimized for high-speed networks.
The storage engine serves as the foundation of a database, ensuring efficient data access through three central components: indexes, synchronization primitives, and buffer management (caching).
With the introduction of RDMA, the landscape of data access has undergone a significant transformation.
This requires a comprehensive redesign of the storage engine components to exploit the potential of RDMA and similar high-speed network technologies.
Thus, as the second contribution, we design RDMA-optimized tree-based indexes — especially applicable for disaggregated databases to access remote data efficiently.
We then turn our attention to the unique challenges of RDMA.
One-sided RDMA, one of the network primitives introduced by RDMA, presents a performance advantage in enabling remote memory access while bypassing the remote CPU and the operating system.
This allows the remote CPU to process transactions uninterrupted, with no requirement to be on hand for network communication. However, bypassing the remote CPU also bypasses traditional CPU-driven synchronization, so specialized one-sided RDMA synchronization primitives are required.
We found that existing RDMA one-sided synchronization schemes are unscalable or, even worse, fail to synchronize correctly, leading to hard-to-detect data corruption.
As our third contribution, we address this issue by offering guidelines to build scalable and correct one-sided RDMA synchronization primitives.
Finally, recognizing that maintaining all data in memory becomes economically unattractive, we propose a distributed buffer manager design that efficiently utilizes cost-effective NVMe flash storage.
By leveraging low-latency RDMA messages, our buffer manager provides a transparent memory abstraction, accessing the aggregated DRAM and NVMe storage across nodes.
Central to our approach is a distributed caching protocol that dynamically caches data.
With this approach, our system can outperform RDMA-enabled in-memory distributed databases while managing larger-than-memory datasets efficiently.
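The core of a correct one-sided synchronization primitive is that mutual exclusion must rest entirely on the NIC's atomic verbs, since no remote CPU is involved. The sketch below simulates this with threads standing in for client machines; `RemoteMemory` and its `cas` verb are illustrative assumptions modeling NIC-level atomic compare-and-swap, not a real RDMA API:

```python
import threading

class RemoteMemory:
    """Simulated remote NIC exposing one-sided verbs: READ, WRITE, and an
    atomic COMPARE-AND-SWAP (the internal lock models NIC atomicity only;
    the remote CPU never participates)."""
    def __init__(self, size):
        self.words = [0] * size
        self._nic = threading.Lock()
    def cas(self, addr, expected, new):
        with self._nic:
            old = self.words[addr]
            if old == expected:
                self.words[addr] = new
            return old
    def read(self, addr): return self.words[addr]
    def write(self, addr, val): self.words[addr] = val

def lock_acquire(mem, addr, client_id):
    """Spin with one-sided CAS until our id is installed in the lock word."""
    while mem.cas(addr, 0, client_id) != 0:
        pass

def lock_release(mem, addr, client_id):
    assert mem.cas(addr, client_id, 0) == client_id

mem = RemoteMemory(2)   # word 0: lock word, word 1: shared counter

def client(cid):
    for _ in range(1000):
        lock_acquire(mem, 0, cid)
        mem.write(1, mem.read(1) + 1)   # critical section on remote memory
        lock_release(mem, 0, cid)

clients = [threading.Thread(target=client, args=(cid,)) for cid in (1, 2)]
for t in clients: t.start()
for t in clients: t.join()
```

Storing the holder's id in the lock word (rather than a bare flag) is one of the correctness points the thesis's guidelines concern: it lets a release detect that it still owns the lock, catching the silent-corruption failure mode mentioned above.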
Floki: a proactive data forwarding system for direct inter-function communication for serverless workflows
Serverless computing emerges as an architecture choice to build and run containerized data-intensive pipelines. It leaves the tedious work of infrastructure management and operations to the cloud provider, allowing developers to focus on their core business logic by decomposing their jobs into small containerized functions. To increase platform scalability and flexibility, providers take advantage of hardware disaggregation and require inter-function communication to go through shared object storage. Despite its data persistence and recovery advantages, object storage is expensive in terms of performance and resources when dealing with data-intensive workloads. In this paper, we present Floki, a data forwarding system for direct inter-function data exchange that proactively enables point-to-point communication between producer-consumer pairs of containerized functions through fixed-size memory buffers, pipes, and sockets. Compared with state-of-practice object storage, Floki shows up to a 74.95× end-to-end time improvement, reducing the largest data sharing time from 12.55 to 4.33 minutes, while requiring up to 50,738× fewer disk resources, releasing up to roughly 96 GB of space. This work was partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P, the Ministry of Science under contracts PID2019-107255GB-C21/AEI/10.13039/501100011033 and PID-126248OB-I00, and the Generalitat de Catalunya under contract 2014SGR1051.
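The contrast between storage-mediated and direct forwarding can be shown with an OS pipe, one of the three channels the abstract names. This is a minimal sketch with threads standing in for the producer and consumer functions; a real deployment would forward across containers, which this simplification ignores:

```python
import os
import threading

def producer(write_fd, payload):
    """Producer function: streams its intermediate result straight into
    the channel instead of uploading it to shared object storage."""
    with os.fdopen(write_fd, "wb") as w:
        w.write(payload)

def consumer(read_fd, out):
    """Consumer function: reads directly from the channel; no object-store
    download and no on-disk copy of the intermediate data ever exists."""
    with os.fdopen(read_fd, "rb") as r:
        out.append(r.read())

read_fd, write_fd = os.pipe()
received = []
t1 = threading.Thread(target=producer, args=(write_fd, b"intermediate-result"))
t2 = threading.Thread(target=consumer, args=(read_fd, received))
t1.start(); t2.start()
t1.join(); t2.join()
```

The disk-resource savings the abstract reports follow from exactly this property: the intermediate data flows through a bounded kernel buffer and is never materialized in storage.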
Welcome to Zombieland: Practical and Energy-efficient Memory Disaggregation in a Datacenter
In this paper, we propose an effortless way to disaggregate the CPU-memory couple, two of the most important resources in cloud computing. Instead of redesigning each resource board, the disaggregation is done at the power supply domain level. In other words, CPU and memory still share the same board, but their power supply domains are separated. Beyond this disaggregation, we make the two following contributions: (1) the prototyping of a new ACPI sleep state (called zombie and noted Sz) which allows suspending a server (thus saving energy) while keeping its memory remotely accessible; and (2) the prototyping of a rack-level system software which allows the transparent utilization of the entire rack's resources (avoiding resource waste). We experimentally evaluate the effectiveness of our solution and show that it can improve the energy efficiency of state-of-the-art consolidation techniques by up to 86%, with minimal additional complexity.
Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms
The enormous quantity of data produced every day together with advances in
data analytics has led to a proliferation of data management and analysis
systems. Typically, these systems are built around highly specialized
monolithic operators optimized for the underlying hardware. While effective in
the short term, such an approach makes the operators cumbersome to port and
adapt, which is increasingly required due to the speed at which algorithms and
hardware evolve. To address this limitation, we present Modularis, an execution
layer for data analytics based on sub-operators, i.e., composable building
blocks resembling traditional database operators but at a finer granularity. To
demonstrate the advantages of our approach, we use Modularis to build a
distributed query processing system supporting relational queries running on an
RDMA cluster, a serverless cloud platform, and a smart storage engine.
Modularis requires minimal code changes to execute queries across these three
diverse hardware platforms, showing that the sub-operator approach reduces the
amount and complexity of the code. In fact, changes in the platform affect only
sub-operators that depend on the underlying hardware. We show the end-to-end
performance of Modularis by comparing it with a framework for SQL processing
(Presto), a commercial cluster database (SingleStore), as well as
Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these
systems, proving that the design and architectural advantages of a modular
design can be achieved without degrading performance. We also compare Modularis
with a hand-optimized implementation of a join for RDMA clusters. We show that
Modularis has the advantage of being easily extensible to a wider range of join
variants and group-by queries, none of which are supported in the hand-tuned
join.
Comment: Accepted at PVLDB vol. 1
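The sub-operator idea, finer-grained composable building blocks assembled into full relational operators, can be sketched as follows. The decomposition shown (scan, filter, project, hash-aggregate as separate units) is my illustrative choice, not Modularis's actual sub-operator set:

```python
def scan(rows):
    """Sub-operator: produce tuples from a source (here, an in-memory list;
    a platform-specific variant would read from RDMA or smart storage)."""
    yield from rows

def filt(pred):
    """Sub-operator factory: keep tuples satisfying a predicate."""
    def op(upstream):
        return (r for r in upstream if pred(r))
    return op

def project(cols):
    """Sub-operator factory: keep only the named columns."""
    def op(upstream):
        return ({c: r[c] for c in cols} for r in upstream)
    return op

def hash_agg(key, agg_col):
    """Sub-operator factory: grouped sum over one column."""
    def op(upstream):
        groups = {}
        for r in upstream:
            groups[r[key]] = groups.get(r[key], 0) + r[agg_col]
        return groups
    return op

def compose(source, *sub_ops):
    """Chain sub-operators into a pipeline, like a physical query plan."""
    stream = source
    for op in sub_ops:
        stream = op(stream)
    return stream

rows = [{"dept": "a", "sal": 10}, {"dept": "b", "sal": 5}, {"dept": "a", "sal": 7}]
result = compose(scan(rows),
                 filt(lambda r: r["sal"] > 4),
                 project(["dept", "sal"]),
                 hash_agg("dept", "sal"))
```

Porting to new hardware then means swapping only the hardware-facing sub-operators (here, `scan`) while the rest of the pipeline composes unchanged, which is the portability argument the abstract makes.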
Modern data analytics in the cloud era
Cloud computing has been the groundbreaking technology of the last decade. The ease of use of the managed environment, in combination with a nearly infinite amount of resources and a pay-per-use price model, enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is designed, deployed, and used. This thesis focuses on database systems deployed in the cloud environment.
We identify three major interaction points of the database engine with the environment that show changed requirements compared to traditional on-premise data warehouse solutions. First, software is deployed on elastic resources. Consequently, systems should support elasticity in order to match workload requirements and be cost-effective. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing partition reassignments in the case of elastic scaling. Furthermore, we introduce a buffer pre-heating strategy that mitigates the cold start after scaling and yields an immediate performance benefit from the newly added resources. Second, cloud-based systems are accessible and available from nearly everywhere. Consequently, data is frequently ingested from numerous endpoints, which differs from the bulk loads or ETL pipelines of a traditional data warehouse solution. Many users do not define database constraints in order to avoid transaction aborts due to conflicts or to speed up data ingestion. To mitigate this issue, we introduce the concept of PatchIndexes, which allow the definition of approximate constraints. PatchIndexes maintain exceptions to constraints, make them usable in query optimization and execution, and offer efficient update support. The concept can be applied to arbitrary constraints, and we provide examples of approximate uniqueness and approximate sorting constraints. Moreover, we show how PatchIndexes can be exploited to define advanced constraints like an approximate multi-key partitioning, which offers robust query performance over workloads with different partition key requirements. Third, data-centric workloads have changed over the last decade. Besides traditional SQL workloads for business intelligence, data science workloads are of significant importance nowadays.
In these cases, the database system often acts only as a data provider, while the computational effort takes place in data science or machine learning (ML) environments. As this workflow has several drawbacks, we pursue the goal of pushing advanced analytics towards the database engine and introduce the Grizzly framework as a DataFrame-to-SQL transpiler. Based on this, we identify user-defined functions (UDFs) and machine learning inference as important tasks that would benefit from a deeper engine integration, and we investigate approaches to push these operations towards the database engine.
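The PatchIndex idea for approximate uniqueness can be sketched in a few lines: the column is treated as if it were unique, with a small "patch" of exception row ids maintained on the side. The class below is an illustrative reduction of the concept, not the thesis's actual implementation:

```python
class PatchIndex:
    """Approximate-uniqueness sketch: 'seen' acts as the unique fast path,
    while 'exceptions' records the row ids that violate the constraint.
    Updates never abort; they just grow the patch."""
    def __init__(self, values):
        self.seen = {}          # value -> first row id
        self.exceptions = set() # row ids that duplicate an earlier value
        for rid, v in enumerate(values):
            self.insert(rid, v)
    def insert(self, rid, value):
        if value in self.seen:
            self.exceptions.add(rid)   # tolerate the violation, record it
        else:
            self.seen[value] = rid
    def unique_lookup(self, value):
        """Fast path usable by the optimizer for non-patched values."""
        return self.seen.get(value)

# Column with one duplicate: row 2 repeats the value of row 0.
idx = PatchIndex(["x", "y", "x", "z"])
```

A query can then use the unique-index plan for the bulk of the data and fall back to a scan only for rows in the (small) exception set, which is how the approximate constraint stays usable for optimization despite dirty ingested data.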
Designing key-value stores for emerging memory and disaggregation technologies
With the increasing convergence of applications to the cloud, cloud-based key-value stores (KVSs) should offer high performance, scalability, elasticity, utilization, and crash resilience. However, conventional storage technologies and monolithic server models make it challenging to achieve these goals. The transition to the new emerging memory and disaggregation technologies, such as PM (Persistent Memory), RDMA (Remote Direct Memory Access), and CXL (Compute Express Link), can readily offer opportunities to achieve these goals. However, these new technologies have distinct characteristics from the conventional technologies. Thus, to efficiently and reliably utilize them, KVSs must be carefully designed to avoid sub-optimal design choices without compromising their inherent hardware-guaranteed benefits. In this dissertation, we seek to answer the following question: how can we achieve a high-performance, scalable, elastic, and crash-recoverable KVS for disaggregated persistent memory (DPM)? In particular, we explore solutions to achieve these goals by introducing new indexing, caching, and partitioning techniques. We design new indexing data structures for a high-performance, scalable, and crash-recoverable data storage at PM, employ caching strategies for high performance by reducing expensive accesses to DPM, and tailor partitioning techniques to achieve elastic, scalable resource deployment. This dissertation first presents Recipe, a principled approach for converting concurrent DRAM indexes to crash-consistent indexes for PM. The main insight behind Recipe is that isolation provided by a certain class of concurrent DRAM indexes can be translated to crash consistency when the same index is used in PM. We present a set of conditions that enable the identification of this class of DRAM indexes, and the actions to be taken to convert each index to be persistent. 
Next, we present Dinomo, the first key-value store for DPM based on RDMA interconnects that simultaneously achieves high common-case performance, scalability, and elasticity. Dinomo uses a novel combination of techniques such as ownership partitioning, disaggregated adaptive caching, selective replication, and lock-free and log-free PM indexing to achieve these goals. Finally, we present Shift, a cache-conscious KVS design for CXL disaggregated memory. Shift sheds new light on existing PM indexes and partitioning schemes originally proposed for different system domains to achieve a high-performance, scalable, elastic, crash-recoverable KVS for CXL disaggregated memory. Furthermore, Shift employs a lock intention log to make the PM indexes partial-failure-resilient, and non-hierarchical processing to take advantage of both the KN cache and direct accesses to CXL disaggregated memory.
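The crash-consistency discipline behind Recipe, persist the data before making it visible, can be sketched on a toy persistent-memory model. `SimPM` and `insert_crash_consistent` are illustrative assumptions; real PM uses cache-line flush and fence instructions (e.g. clwb/sfence) rather than these method calls:

```python
class SimPM:
    """Toy persistent memory: stores land in a volatile cache model and
    reach the persistence domain only after an explicit flush+fence."""
    def __init__(self):
        self.cache = {}      # what the CPU sees
        self.persisted = {}  # what survives power loss
    def store(self, addr, val):
        self.cache[addr] = val
    def flush_fence(self, addr):
        self.persisted[addr] = self.cache[addr]
    def crash(self):
        # Volatile cache contents are lost; only persisted state remains.
        self.cache = dict(self.persisted)

def insert_crash_consistent(pm, slot, value, valid_flag):
    """Recipe-style ordering: persist the value before flipping the
    visibility flag, so a reader after recovery never observes a flag
    pointing at a torn or missing value."""
    pm.store(slot, value)
    pm.flush_fence(slot)
    pm.store(valid_flag, 1)
    pm.flush_fence(valid_flag)

pm = SimPM()
insert_crash_consistent(pm, "s0", "k=5", "s0_valid")
```

Recipe's observation is that indexes whose concurrent readers already tolerate in-flight writers need only this ordering discipline added at their store points to become crash-consistent on PM.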