80 research outputs found

    Cheap Data Analytics using Cold Storage Devices

    Enterprise databases use storage tiering to lower capital and operational expenses. In such a setting, data waterfalls from an SSD-based high-performance tier when it is "hot" (frequently accessed) to a disk-based capacity tier and finally to a tape-based archival tier when "cold" (rarely accessed). To address the unprecedented growth in the amount of cold data, hardware vendors introduced new devices named Cold Storage Devices (CSD) explicitly targeted at cold data workloads. With access latencies in the tens of seconds and costs as low as $0.01/GB/month, CSD provide a middle ground between the low-latency (ms), high-cost, HDD-based capacity tier and the high-latency (min to h), low-cost, tape-based archival tier. Driven by the price/performance aspect of CSD, this paper makes a case for using CSD as a replacement for both the capacity and archival tiers of enterprise databases. Although CSD offer major cost savings, we show that current database systems can suffer a severe performance drop when CSD are used as a replacement for HDD, due to the mismatch between the design assumptions made by the query execution engine and the actual storage characteristics of the CSD. We then build a CSD-driven query execution framework, called Skipper, that modifies both the database execution engine and the CSD scheduling algorithms to be aware of each other. Using results from our implementation of the architecture based on PostgreSQL and OpenStack Swift, we show that Skipper is capable of completely masking the high latency overhead of CSD, thereby opening up CSD for wider adoption as a storage tier for cheap data analytics over cold data.

    Towards Scalable OLTP Over Fast Networks

    Online Transaction Processing (OLTP) underpins real-time data processing in many mission-critical applications, from banking to e-commerce. These applications typically issue short-duration, latency-sensitive transactions that demand immediate processing. High-volume applications, such as Alibaba's e-commerce platform, achieve peak transaction rates as high as 70 million transactions per second, exceeding the capacity of a single machine. Instead, distributed OLTP database management systems (DBMS) are deployed across multiple powerful machines. Historically, such distributed OLTP DBMSs have been primarily designed to avoid network communication, a paradigm largely unchanged since the 1980s. However, fast networks challenge the conventional belief that network communication is the main bottleneck. In particular, emerging network technologies, like Remote Direct Memory Access (RDMA), radically alter how data can be accessed over a network. RDMA's primitives allow direct access to the memory of a remote machine within an order of magnitude of local memory access. This development invalidates the notion that network communication is the primary bottleneck. Given that traditional distributed database systems have been designed with the premise that the network is slow, they cannot efficiently exploit these fast network primitives, which requires us to reconsider how we design distributed OLTP systems. This thesis focuses on the challenges RDMA presents and its implications on the design of distributed OLTP systems. First, we examine distributed architectures to understand data access patterns and scalability in modern OLTP systems. Drawing on these insights, we advocate a distributed storage engine optimized for high-speed networks. The storage engine serves as the foundation of a database, ensuring efficient data access through three central components: indexes, synchronization primitives, and buffer management (caching). 
With the introduction of RDMA, the landscape of data access has undergone a significant transformation. This requires a comprehensive redesign of the storage engine components to exploit the potential of RDMA and similar high-speed network technologies. Thus, as the second contribution, we design RDMA-optimized tree-based indexes, which are especially applicable for disaggregated databases that need to access remote data efficiently. We then turn our attention to the unique challenges of RDMA. One-sided RDMA, one of the network primitives introduced by RDMA, offers a performance advantage by enabling remote memory access while bypassing the remote CPU and the operating system. This allows the remote CPU to process transactions uninterrupted, with no requirement to be on hand for network communication. However, because traditional CPU-driven primitives are bypassed, specialized one-sided RDMA synchronization primitives are required. We found that existing one-sided RDMA synchronization schemes are unscalable or, even worse, fail to synchronize correctly, leading to hard-to-detect data corruption. As our third contribution, we address this issue by offering guidelines to build scalable and correct one-sided RDMA synchronization primitives. Finally, recognizing that maintaining all data in memory becomes economically unattractive, we propose a distributed buffer manager design that efficiently utilizes cost-effective NVMe flash storage. By leveraging low-latency RDMA messages, our buffer manager provides a transparent memory abstraction, accessing the aggregated DRAM and NVMe storage across nodes. Central to our approach is a distributed caching protocol that dynamically caches data. With this approach, our system can outperform RDMA-enabled in-memory distributed databases while managing larger-than-memory datasets efficiently.

    Toward timely, predictable and cost-effective data analytics

    Modern industrial, government, and academic organizations are collecting massive amounts of data at an unprecedented scale and pace. The ability to perform timely, predictable and cost-effective analytical processing of such large data sets in order to extract deep insights is now a key ingredient for success. Traditional database systems (DBMS) are, however, not the first choice for servicing these modern applications, despite 40 years of database research. This is due to the fact that modern applications exhibit different behavior from the one assumed by DBMS: a) timely data exploration as a new trend is characterized by ad-hoc queries and a short user interaction period, leaving little time for DBMS to do good performance tuning, b) accurate statistics representing relevant summary information about distributions of ever increasing data are frequently missing, resulting in suboptimal plan decisions and consequently poor and unpredictable query execution performance, and c) cloud service providers - a major winner in the data analytics game due to the low cost of (shared) storage - have shifted the control over data storage from DBMS to the cloud providers, making it harder for DBMS to optimize data access. This thesis demonstrates that database systems can still provide timely, predictable and cost-effective analytical processing, if they use an agile and adaptive approach. In particular, DBMS need to adapt at three levels (to workload, data and hardware characteristics) in order to stabilize and optimize performance and cost when faced with requirements posed by modern data analytics applications. Workload-driven data ingestion is introduced with NoDB as a means to enable efficient data exploration and reduce the data-to-insight time (i.e., the time to load the data and tune the system) by doing these steps lazily and incrementally as a side-effect of posed queries as opposed to mandatory first steps. 
Data-driven runtime access path decision making, introduced with Smooth Scan, alleviates suboptimal query execution by postponing the decision on access paths from query optimization, where statistics are heavily exploited, to query execution, where the system can obtain more details about data distributions. Smooth Scan uses access path morphing from one physical alternative to another to fit the observed data distributions, which removes the need for a priori access path decisions and substantially improves the predictability of DBMS. Hardware-driven query execution, introduced with Skipper, enables the usage of cold storage devices (CSD) as a cost-effective solution for storing the ever increasing customer data. Skipper uses an out-of-order CSD-driven query execution model based on multi-way joins, coupled with efficient cache and I/O scheduling policies, to hide the non-uniform access latencies of CSD. This thesis advocates runtime adaptivity as a key to dealing with the rising uncertainty about workload characteristics that modern data analytics applications exhibit. Overall, the techniques introduced in this thesis through the three levels of adaptivity (workload, data and hardware-driven adaptivity) increase the usability of database systems and the user satisfaction in the case of big data exploration, making low-cost data analytics a reality.
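The idea of deciding the access path at execution time can be illustrated with a toy sketch. It is not the Smooth Scan algorithm from the thesis: the function name `adaptive_scan`, the `switch_fraction` threshold, and the in-memory list standing in for a table are all assumptions made for illustration. The scan starts by following an index and, once it has fetched more rows than the threshold allows, morphs into a sequential scan over the part of the table it has not read yet.

```python
from typing import Callable, List, Sequence, Tuple


def adaptive_scan(
    table: Sequence[int],
    index_row_ids: Sequence[int],
    predicate: Callable[[int], bool],
    switch_fraction: float = 0.2,
) -> Tuple[List[int], str]:
    """Toy runtime access path switch (hypothetical, for illustration only).

    `index_row_ids` must be sorted in ascending order and must list exactly
    the positions of the rows that satisfy `predicate`.
    """
    # Number of index-driven fetches tolerated before giving up on the index.
    budget = int(len(table) * switch_fraction)
    results: List[int] = []
    next_unread = 0  # first table position not yet covered by the scan

    for fetched, row_id in enumerate(index_row_ids):
        if fetched >= budget:
            # Observed selectivity is worse than assumed: morph into a
            # sequential scan over the remainder of the table.
            results.extend(row for row in table[next_unread:] if predicate(row))
            return results, "morphed"
        results.append(table[row_id])
        next_unread = row_id + 1

    return results, "index"


table = list(range(100))
evens = [i for i in range(100) if i % 2 == 0]
rows, mode = adaptive_scan(table, evens, lambda x: x % 2 == 0)
print(mode, len(rows))
```

Both paths return the same rows; only the way they are fetched changes, which is the property that makes a mid-query switch safe in this simplified setting.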

    Transactional and analytical data management on persistent memory

    The increasing number of smart devices and sensors, as well as social media, is causing the volume of data, and thus the required processing speed, to grow steadily. At the same time, many applications need to store data persistently or even comply with strict transactional guarantees. The novel storage technology Persistent Memory (PMem), with its unique properties, seems to be a natural candidate to meet these requirements efficiently. Compared to DRAM, it is more scalable, less expensive, and durable. In contrast to disks, it is significantly faster and directly addressable. Therefore, this dissertation investigates the deliberate employment of PMem to fit the needs of modern applications. After presenting the fundamentals of how PMem works and how to work with it, we focus primarily on three aspects of data management. First, we disassemble several persistent data and index structures into their underlying design primitives to reveal the trade-offs for various access patterns. This allows us to identify their best use cases and weaknesses, and also to gain general insights into the design of PMem-based data structures. Second, we propose two storage layouts that target analytical workloads and enable efficient query execution on arbitrary attributes. While the first approach employs a linked list of multi-dimensional clustered blocks that potentially span several storage layers, the second approach is a multi-dimensional index that caches nodes in DRAM.
Third, we show how to improve stream and event processing systems involving transactional state management using the preceding data structures and insights. In this context, we propose a novel Transactional Stream Processing (TSP) model with appropriate consistency and concurrency protocols adapted to PMem. Together, the discussed aspects are intended to provide a foundation for developing even more sophisticated PMem-enabled systems. At the same time, they show how data management tasks can take advantage of PMem by opening up new application domains, improving performance, scalability, and recovery guarantees, reducing code complexity, and reducing economic and environmental costs.

    Memory Power Consumption in Main-Memory Database Systems

    In main-memory database systems, memory can consume a substantial amount of power, comparable to that of the processors. However, existing memory power-saving mechanisms are much less effective than processor power management. Unless the system is almost idle, memory power consumption will be high. The reason for poor memory power proportionality is that the bulk of memory power consumption is attributable to background power, which is determined by memory power state residency. The memory workload in existing systems is evenly distributed over the memory modules and also in time, which precludes the occurrence of long idle intervals. As a result, deep low-power states, which could significantly reduce background power consumption, are rarely entered. In this work, we aim to reduce the memory power consumption of main-memory database systems. We start by investigating and explaining the patterns of memory power consumption, under various workloads. We then propose two techniques, implemented at the database system level, that skew memory traffic, creating long periods of idleness in a subset of memory modules. This allows those modules to enter low-power states, reducing overall memory power consumption. We prototyped these techniques in DimmStore, an experimental database system. The first technique is rate-aware data placement, which places data on memory modules according to its access frequency. The background power in the unused or least-used modules is reduced, without affecting background power in the most-used modules. Rate-aware placement saves power and has little performance impact. Under a TPC-C workload, rate-aware placement resulted in memory power savings up to 44%, with a maximum throughput reduction of 10%. The second technique is memory access gating, which targets background power in less-frequently accessed memory modules by inserting periodic idle intervals.
Memory gating reduces power consumption of memory modules for which rate-aware placement alone does not create sufficient idleness to reduce power consumption. With gating, memory accesses to these modules become concentrated outside of the idle intervals, creating the opportunity for low-power state use. However, because it delays memory accesses, memory gating impacts performance. Higher memory power savings and lower performance impact occur in workloads with lower memory access rates. Thus, in the YCSB workload with a medium transaction rate, memory gating reduced memory power by 26%, adding 0.25 ms (30%) of transaction latency, compared to DimmStore without gating. In the more memory-intensive TPC-C workload at a low to medium transaction rate, gating can save 5% of memory power, adding 1.5 ms (60%) of transaction latency, compared to DimmStore without gating.
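Rate-aware placement, as described in this abstract, skews memory traffic by putting frequently accessed data together. A minimal sketch of that idea follows. It is not DimmStore's implementation: the function names `place_by_rate` and `module_rates`, the greedy fill order, and the uniform module capacity are assumptions made for illustration. Items are sorted by access rate and packed into modules in that order, so the hottest data concentrates in the first modules and the remaining ones see little traffic.

```python
from typing import Dict, List


def place_by_rate(
    access_rates: Dict[str, float],
    sizes: Dict[str, int],
    module_capacity: int,
    num_modules: int,
) -> List[List[str]]:
    """Greedy placement sketch (hypothetical): hottest items fill module 0 first."""
    modules: List[List[str]] = [[] for _ in range(num_modules)]
    free = [module_capacity] * num_modules
    current = 0
    # Visit items from most to least frequently accessed.
    for item in sorted(access_rates, key=lambda k: access_rates[k], reverse=True):
        # Advance to the next module once the current one cannot hold the item.
        while current < num_modules and free[current] < sizes[item]:
            current += 1
        if current == num_modules:
            raise ValueError("not enough memory capacity for all items")
        modules[current].append(item)
        free[current] -= sizes[item]
    return modules


def module_rates(
    modules: List[List[str]], access_rates: Dict[str, float]
) -> List[float]:
    """Total access rate that each module receives under a given placement."""
    return [sum(access_rates[i] for i in m) for m in modules]


rates = {"orders": 100.0, "stock": 90.0, "history": 5.0, "archive": 1.0}
sizes = {name: 1 for name in rates}
layout = place_by_rate(rates, sizes, module_capacity=2, num_modules=2)
print(layout, module_rates(layout, rates))
```

In this toy layout the second module receives only a small share of the accesses, which is the kind of idleness that would let a module spend more time in a low-power state.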

    Doctor of Philosophy

    We develop a novel framework for friend-to-friend (f2f) distributed services (F3DS) by which applications can easily offer peer-to-peer (p2p) services among social peers with resource sharing governed by approximated levels of social altruism. Our framework differs significantly from typical p2p collaboration in that it provides a foundation for distributed applications to cooperate based on pre-existing trust and altruism among social peers. With the goal of facilitating the approximation of relative levels of altruism among social peers within F3DS, we introduce a new metric: SocialDistance. SocialDistance is a synthetic metric that combines direct levels of altruism between peers with an altruism decay for each hop to approximate indirect levels of altruism. The resulting multihop altruism levels are used by F3DS applications to proportion and prioritize the sharing of resources with other social peers. We use SocialDistance to implement a novel flash file/patch distribution method, SocialSwarm. SocialSwarm uses the SocialDistance metric as part of its resource allocation to overcome the necessity of (and inefficiency created by) resource bartering among friends participating in a BitTorrent swarm. We find that SocialSwarm achieves an average file download time reduction of 25% to 35% in comparison with standard BitTorrent under a variety of configurations and conditions, including file sizes, maximum SocialDistance, as well as leech and seed counts. The most socially connected peers yield up to a 47% decrease in download completion time in comparison with average nonsocial BitTorrent swarms. We also use the F3DS framework to implement a novel malware detection application, F3DS Antivirus (F3AV), and evaluate it on the Amazon cloud. We show that with f2f sharing of resources, F3AV achieves a 65% increase in the detection rate of 0- to 1-day-old malware among social peers as compared to the average of individual scanners.
Furthermore, we show that F3AV provides the greatest diversity of malware scanners (and thus malware protection) to social hubs, those nodes that are positioned to provide strategic defense against socially aware malware.
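The abstract describes SocialDistance only at a high level: direct altruism between peers combined with a per-hop decay. One plausible reading can be sketched as follows. The exact formula is not given in the abstract, so everything here is an assumption: the function name `indirect_altruism`, altruism levels in the range 0 to 1, multiplying the levels along a path, applying `decay` once for every hop beyond the first, and keeping the best path within `max_hops`.

```python
from typing import Dict


def indirect_altruism(
    graph: Dict[str, Dict[str, float]],
    source: str,
    max_hops: int = 3,
    decay: float = 0.5,
) -> Dict[str, float]:
    """Approximate multihop altruism levels from `source` (hypothetical formula).

    `graph[a][b]` is the direct altruism level (0..1) that peer `a` has
    toward friend `b`.
    """
    best: Dict[str, float] = {source: 1.0}
    frontier: Dict[str, float] = {source: 1.0}
    for hop in range(1, max_hops + 1):
        # Direct friends are not decayed; each further hop is.
        hop_factor = 1.0 if hop == 1 else decay
        candidates: Dict[str, float] = {}
        for peer, level in frontier.items():
            for friend, direct in graph.get(peer, {}).items():
                value = level * direct * hop_factor
                if value > candidates.get(friend, 0.0):
                    candidates[friend] = value
        # Only peers whose level improved need to be expanded further.
        frontier = {}
        for peer, value in candidates.items():
            if value > best.get(peer, 0.0):
                best[peer] = value
                frontier[peer] = value
    del best[source]
    return best


friends = {
    "alice": {"bob": 0.8},
    "bob": {"carol": 0.5},
    "carol": {"dave": 1.0},
}
print(indirect_altruism(friends, "alice", max_hops=2))
```

An application could then use the resulting levels to decide how to proportion resources among peers, for example by serving peers with higher levels first.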