
    Leveraging Non-Volatile Memory in Modern Storage Management Architectures

    Non-volatile memory technologies (NVM) introduce a novel class of devices that combine characteristics of both storage and main memory. Like storage, NVM is not only persistent, but also denser and cheaper than DRAM. Like DRAM, NVM is byte-addressable and has lower access latency than storage. In recent years, NVM has gained a lot of attention both in academia and in the data management industry, with views ranging from skepticism to over-excitement. Some critics claim that NVM is neither cheap enough to replace flash-based SSDs nor fast enough to replace DRAM, while others see it simply as a storage device. Supporters of NVM argue that its low latency and byte-addressability require radical changes and a complete rewrite of storage management architectures. This thesis takes a moderate stance between these two views. We consider that, while NVM might not replace flash-based SSDs or DRAM in the near future, it has the potential to reduce the gap between them. Furthermore, treating NVM as a regular storage medium does not fully leverage its byte-addressability and low latency. On the other hand, completely redesigning systems to be NVM-centric is impractical. Proposals that attempt to leverage NVM to simplify storage management result in completely new architectures that face the same challenges already well understood and addressed by traditional architectures. Therefore, we take three common storage management architectures as a starting point and propose incremental changes to enable them to better leverage NVM. First, in the context of log-structured merge-trees, we investigate the impact of storing data in NVM and devise methods to enable small-granularity accesses and NVM-aware caching policies. Second, in the context of B+Trees, we propose extending the buffer pool and describe a technique based on optimistic consistency to handle corrupted pages in NVM. Third, we employ NVM to provide larger capacity at reduced cost in an index+log key-value store, and combine it with other techniques to build a system that achieves low tail latency.
    This thesis aims to describe and evaluate these techniques in order to enable storage management architectures to leverage NVM and achieve increased performance and lower costs without major architectural changes.
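
    One of the techniques summarized above, a buffer-pool extension with optimistic consistency, can be illustrated with a minimal sketch: pages cached in NVM carry a checksum, reads verify it, and a mismatch triggers a repair from the durable copy. The names below (NvmPageCache, DurableStorage) and the repair path are hypothetical simplifications for illustration, not the thesis's actual design.

```python
import zlib

class DurableStorage:
    """Stand-in for the durable copy of the database (e.g., an SSD-resident file)."""
    def __init__(self):
        self.pages = {}
    def load(self, page_id):
        return self.pages[page_id]
    def store(self, page_id, payload):
        self.pages[page_id] = payload

class NvmPageCache:
    """Hypothetical buffer-pool extension holding pages in NVM with per-page checksums."""
    def __init__(self, storage):
        self.storage = storage
        self.nvm = {}  # page_id -> (payload, checksum); a dict stands in for NVM

    def put(self, page_id, payload):
        # On real NVM this write would be followed by cache-line flushes and a
        # persistence barrier; here we only record the payload and its checksum.
        self.nvm[page_id] = (payload, zlib.crc32(payload))
        self.storage.store(page_id, payload)

    def get(self, page_id):
        payload, checksum = self.nvm[page_id]
        if zlib.crc32(payload) != checksum:
            # Optimistic consistency: assume pages are usually intact and repair
            # lazily from the durable copy only when a checksum check fails.
            payload = self.storage.load(page_id)
            self.nvm[page_id] = (payload, zlib.crc32(payload))
        return payload

storage = DurableStorage()
cache = NvmPageCache(storage)
cache.put(1, b"hello page")
cache.nvm[1] = (b"corrupted!", cache.nvm[1][1])   # simulate a corrupted NVM page
print(cache.get(1))                               # b'hello page', repaired from storage
```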

    Enabling efficient OS paging for main-memory OLTP databases


    Adaptive Cache Mode Selection for Queries over Raw Data

    Caching intermediate query results for future reuse is a common technique for improving the performance of analytics over raw data sources. An important design choice in this regard is whether to lazily cache only the offsets of satisfying tuples or to eagerly cache the entire tuples. Lazily cached offsets have the benefit of a smaller memory requirement and lower initial caching overhead, but they are much more expensive to reuse. In this paper, we explore this tradeoff and show that neither the lazy nor the eager caching mode is optimal for all situations. Instead, the ideal caching mode depends on the workload, the dataset, and the cache size. We further show that choosing a sub-optimal caching mode can result in a performance penalty of over 200%. We solve this problem with an adaptive online approach that uses information about query history, cache behavior, and cache size to choose the optimal caching mode automatically. Experiments on TPC-H based workloads show that our approach keeps execution time within at most 16% of the optimal caching mode, and within just 4% on average.
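
    A minimal sketch of the lazy/eager distinction described above, assuming an in-memory CSV stands in for the raw data source; the toy data and function names are invented for illustration and are not the paper's implementation.

```python
import csv, io

RAW = "id,price\n1,10\n2,75\n3,42\n4,90\n"     # toy stand-in for a raw CSV data source

def scan(predicate):
    """Full scan over the raw data: the cost both caching modes try to avoid on reuse."""
    for offset, row in enumerate(csv.DictReader(io.StringIO(RAW))):
        if predicate(row):
            yield offset, row

def build_cache(predicate, eager):
    # Eager mode caches the satisfying tuples themselves; lazy mode caches only their
    # offsets, which needs less memory but must touch the raw data again on reuse.
    return [(row if eager else offset) for offset, row in scan(predicate)]

def reuse_cache(cache, eager):
    if eager:
        return cache                                    # tuples are ready to use
    rows = list(csv.DictReader(io.StringIO(RAW)))       # lazy: pay the raw-data access again
    return [rows[offset] for offset in cache]

satisfies = lambda row: int(row["price"]) > 50
for eager in (True, False):
    print("eager" if eager else "lazy", reuse_cache(build_cache(satisfies, eager), eager))
```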

    SAHARA: Memory footprint reduction of cloud databases with automated table partitioning

    Enterprises increasingly move their databases into the cloud. As a result, database-as-a-service providers are challenged to meet the performance guarantees assured in service-level agreements (SLAs) while keeping hardware costs as low as possible. Being cost-effective is particularly crucial for cloud databases, where the provisioned amount of DRAM dominates the hardware costs. A way to decrease the memory footprint is to leverage access skew in the workload by moving rarely accessed cold data to cheaper storage layers and retaining only frequently accessed hot data in main memory. In this paper, we present SAHARA, an advisor that proposes a table partitioning for column stores with minimal memory footprint while still adhering to all performance SLAs. SAHARA collects lightweight workload statistics, classifies data as hot and cold, and calculates optimal or near-optimal range partitioning layouts with low optimization time using a novel cost model. We integrated SAHARA into a commercial cloud database and show, in experiments on real-world and synthetic benchmarks, a memory footprint reduction of 2.5× while still fulfilling all performance SLAs provided by the customer or advertised by the DBaaS provider.
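
    A minimal sketch of the hot/cold classification step described above, assuming per-value access counters on a single partitioning column; the bucket-based greedy heuristic and all names are hypothetical stand-ins for SAHARA's actual statistics and cost model.

```python
from collections import Counter

# Toy access log for a partitioning column; a real system would collect such
# statistics from lightweight per-range counters rather than raw values.
access_log = [2024, 2024, 2023, 2024, 2019, 2024, 2023, 2024]

def hot_ranges(values, bucket_width, hot_fraction=0.8):
    """Greedily pick the most frequently accessed value ranges until they cover
    `hot_fraction` of all accesses; remaining ranges are candidates for cold storage."""
    counts = Counter(v // bucket_width for v in values)
    total = sum(counts.values())
    hot, covered = [], 0
    for bucket, count in counts.most_common():
        if covered / total >= hot_fraction:
            break
        hot.append((bucket * bucket_width, (bucket + 1) * bucket_width - 1))
        covered += count
    return hot

# Ranges kept in DRAM; everything else can move to a cheaper storage layer.
print(hot_ranges(access_log, bucket_width=2))
```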

    ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data

    As data continues to be generated at exponentially growing rates in heterogeneous formats, fast analytics to extract meaningful information is becoming increasingly important. Systems widely use in-memory caching as one of their primary techniques to speed up data analytics. However, caches in data analytics systems cannot rely on simple caching policies and a fixed data layout to achieve good performance. Different datasets and workloads require different layouts and policies to achieve optimal performance. This paper presents ReCache, a cache-based performance accelerator that is reactive to the cost and heterogeneity of diverse raw data formats. Using timing measurements of caching operations and selection operators in a query plan, ReCache accounts for the widely varying costs of reading, parsing, and caching data in nested and tabular formats. Combining these measurements with information about frequently accessed data fields in the workload, ReCache automatically decides whether a nested or relational column-oriented layout would lead to better query performance. Furthermore, ReCache keeps track of commonly utilized operators to make informed cache admission and eviction decisions. Experiments on synthetic and real-world datasets show that our caching techniques decrease caching overhead for individual queries by an average of 59%. Furthermore, over the entire workload, ReCache reduces execution time by 19-75% compared to existing techniques.
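
    A minimal sketch of the layout decision described above, assuming the workload repeatedly aggregates a single hot field from JSON documents; the timing-based rule and all names are hypothetical simplifications of ReCache's reactive approach, not its implementation.

```python
import json, time

# Toy raw data: JSON documents with one nested, frequently read field.
docs = ['{"user": {"id": %d, "orders": [{"price": %d}]}}' % (i, i * 3) for i in range(1000)]

nested_cache = [json.loads(d) for d in docs]                               # nested layout
column_cache = [d["user"]["orders"][0]["price"] for d in nested_cache]     # columnar layout

def query_nested(cache):
    # Navigate the nested structure for every document on each query.
    return sum(d["user"]["orders"][0]["price"] for d in cache)

def query_columnar(cache):
    # The hot field is already a flat column; the query is a plain aggregation.
    return sum(cache)

def timed(fn, cache):
    start = time.perf_counter()
    fn(cache)
    return time.perf_counter() - start

# Reactive decision: time both layouts on the observed access pattern (one hot
# field, aggregated) and admit whichever layout answers the workload faster.
layout = "columnar" if timed(query_columnar, column_cache) < timed(query_nested, nested_cache) else "nested"
print("cache the", layout, "layout")
```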

    Performance Evaluation and Benchmarking of Event Processing Systems

    Doctoral thesis in Information Sciences and Technologies presented to the Faculdade de Ciências e Tecnologia da Universidade de Coimbra. This thesis aims at studying, comparing, and improving the performance and scalability of event processing (EP) systems. In the last 15 years, event processing systems have gained increased attention from academia and industry, having found application in a number of mission-critical scenarios and motivated the onset of several research projects and specialized startups. Nonetheless, there has been a general lack of information, evaluation methodologies, and tools concerning the performance of EP platforms. Until recently, it was not clear which factors impact their performance most, whether the systems would scale well and adapt to changes in load conditions, or whether they had any serious limitations. Moreover, the lack of standardized benchmarks hindered any objective comparison among the diverse platforms. In this thesis, we tackle these problems on several fronts. First, we developed FINCoS, a set of benchmarking tools for load generation and performance measurement of event processing systems. The framework has been designed to be independent of any particular workload or product, so that it can be reused in multiple performance studies and benchmark kits. FINCoS has been made publicly available under the terms of the GNU General Public License and is currently hosted in the Standard Performance Evaluation Corporation (SPEC) repository of peer-reviewed tools for quantitative system evaluation and analysis. We then defined a set of microbenchmarks and used them to conduct an extensive performance study of three EP systems. This analysis helped identify critical factors affecting the performance of event processing platforms and exposed important limitations of the products, such as poor utilization of resources, thrashing or failures in the presence of memory shortages, and absent or incipient query-plan sharing capabilities. With these results in hand, we moved our focus to performance enhancement. To improve resource utilization, we proposed novel algorithms and evaluated alternative data organization schemes that not only substantially reduce memory consumption, but are also significantly more efficient at the microarchitectural level. Our experimental evaluation corroborated the efficacy of the proposed optimizations: together they provided a 6-fold reduction in memory usage and an order-of-magnitude increase in query throughput. In addition, we addressed the problem of memory-constrained applications by introducing SlideM, an optimal buffer management algorithm that selectively offloads sliding-window state to disk when main memory becomes insufficient. We also developed a strategy based on SlideM to share computational resources when processing multiple aggregation queries over overlapping sliding windows. Our experimental results demonstrate that, contrary to common sense, storing window data on disk can be appropriate even for applications with very high event arrival rates. We concluded this thesis by proposing the Pairs benchmark. Pairs was designed to assess the ability of EP platforms to process increasingly large numbers of simultaneous queries and event arrival rates while providing quick answers.
The benchmark workload exercises several features that appear repeatedly in most event processing applications, including event filtering, aggregation, correlation, and pattern detection. Furthermore, unlike previous proposals in related areas, Pairs allows evaluating important aspects of event processing systems such as adaptivity and query scalability. In general, we expect that the findings and proposals presented in this thesis serve to broaden the understanding of the performance of event processing platforms and open avenues for additional improvements in the current generation of EP systems. FCT Nº 45121/200
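
    A minimal sketch of the idea behind offloading sliding-window state to disk when memory runs short, as SlideM does; the class name, the spill format, and the fixed in-memory budget below are hypothetical simplifications, not the algorithm proposed in the thesis.

```python
import pickle, tempfile
from collections import deque

class SpillableWindow:
    """Sliding window that keeps the newest events in memory and spills the
    oldest ones to disk once an in-memory budget is exceeded (hypothetical
    simplification of the buffer-management idea behind SlideM)."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = deque()                  # newest events, the ones queries touch most
        self.spill = tempfile.TemporaryFile()  # older window state, offloaded to disk
        self.spilled_offsets = deque()         # file offsets of spilled events, oldest first

    def append(self, event):
        self.memory.append(event)
        if len(self.memory) > self.max_in_memory:
            oldest = self.memory.popleft()
            self.spill.seek(0, 2)              # append at the end of the spill file
            self.spilled_offsets.append(self.spill.tell())
            pickle.dump(oldest, self.spill)

    def evict_expired(self):
        # Expire the oldest event in the window, reading it back if it was spilled.
        if self.spilled_offsets:
            self.spill.seek(self.spilled_offsets.popleft())
            return pickle.load(self.spill)
        return self.memory.popleft()

w = SpillableWindow(max_in_memory=2)
for event in range(5):
    w.append(event)
print(w.evict_expired(), w.evict_expired())   # 0 and 1 come back from disk
```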

    Change Management Systems for Seamless Evolution in Data Centers

    Revenue for data centers today is highly dependent on the satisfaction of their enterprise customers. These customers often require various features to migrate their businesses and operations to the cloud. Thus, clouds today introduce new features at a swift pace to onboard new customers and to meet the needs of existing ones. This pace of innovation continues to grow super-linearly; e.g., Amazon deployed 1400 new features in 2017. However, such a rapid pace of evolution adds complexity for both users and the cloud. Clouds struggle to keep up with the deployment speed, and users struggle to learn which features they need and how to use them. The pace of these evolutions has brought us to a tipping point: we can no longer use rules of thumb to deploy new features, and customers need help to identify which features they need. We have built two systems, Janus and Cherrypick, to address these problems. Janus helps data center operators roll out new changes to the data center network. It automatically adapts to the data center topology, routing, traffic, and failure settings. The system reduces the risk of new deployments for network operators, as they can now pick deployment strategies that are less likely to impact users’ performance. Cherrypick finds near-optimal cloud configurations for big data analytics. It allows users to search through the new machine types that clouds are constantly introducing and find ones with near-optimal performance that meet their budget. Cherrypick can adapt to new big-data frameworks and applications as well as to the new machine types the clouds are constantly introducing. As the pace of cloud innovation increases, it is critical to have tools that allow operators to deploy new changes, as well as tools that enable users to adapt to achieve good performance at low cost. The tools and algorithms discussed in this thesis help accomplish these goals.
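
    A minimal sketch of budget-constrained configuration search in the spirit of Cherrypick's goal (a fast configuration that fits a budget); the candidate machine types, prices, runtimes, and the exhaustive search below are invented for illustration and do not reflect Cherrypick's actual search strategy.

```python
# Hypothetical candidates: each machine type with an invented hourly price and a
# measured runtime for the analytics job on that configuration.
candidates = [
    {"type": "small",  "price_per_hour": 0.10, "runtime_hours": 6.0},
    {"type": "medium", "price_per_hour": 0.25, "runtime_hours": 2.5},
    {"type": "large",  "price_per_hour": 0.60, "runtime_hours": 1.2},
]
budget = 0.70   # maximum cost per job run, in invented currency units

def job_cost(config):
    return config["price_per_hour"] * config["runtime_hours"]

# Keep only configurations the budget allows, then pick the fastest of those.
feasible = [c for c in candidates if job_cost(c) <= budget]
best = min(feasible, key=lambda c: c["runtime_hours"])
print(best["type"], round(job_cost(best), 2))      # medium 0.62
```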

    TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study

    TPC-E is a new OLTP benchmark recently approved by the Transaction Processing Performance Council (TPC). In this paper, we compare TPC-E with the familiar TPC-C benchmark in order to understand the behavior of the new TPC-E benchmark. In particular, we compare the I/O access patterns of the two benchmarks by analyzing two OLTP disk traces. We find that (i) TPC-E is more read-intensive, with a 9.7:1 I/O read-to-write ratio, while TPC-C sees a 1.9:1 read-to-write ratio; and (ii) although TPC-E uses pseudo-realistic data, TPC-E’s I/O access pattern is as random as TPC-C’s. The latter suggests that, like TPC-C, TPC-E can benefit from SSDs, which have superior random I/O support. To verify this, we replay both disk traces on an Intel X25-E SSD and see dramatic improvements for both TPC-C and TPC-E.
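
    The reported ratios translate directly into read fractions; a short sketch using only the 9.7:1 and 1.9:1 figures quoted above.

```python
# Convert the reported I/O read-to-write ratios into the fraction of I/Os that are reads.
def read_fraction(read_to_write_ratio):
    return read_to_write_ratio / (read_to_write_ratio + 1)

for benchmark, ratio in [("TPC-E", 9.7), ("TPC-C", 1.9)]:
    print(f"{benchmark}: {read_fraction(ratio):.0%} of I/Os are reads")
# TPC-E: 91% of I/Os are reads; TPC-C: 66% of I/Os are reads
```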

    A shared-disk parallel cluster file system

    Dissertation presented to obtain the degree of Doctor in Informatics at the Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia. Today, clusters are the de facto cost-effective platform both for high-performance computing (HPC) and for IT environments. HPC and IT are quite different environments, and their differences include, among others, their choices of file systems and storage: HPC favours parallel file systems geared towards maximum I/O bandwidth, but which are not fully POSIX-compliant and were devised to run on top of (fault-prone) partitioned storage; conversely, IT data centres favour both external disk arrays (to provide highly available storage) and POSIX-compliant file systems, either general-purpose or shared-disk cluster file systems (CFSs). These specialised file systems perform very well in their target environments, provided that applications do not require certain lateral features: for example, parallel file systems offer no file locking, and CFSs offer no high-performance writes over cluster-wide shared files. In brief, none of the above approaches provides high levels of reliability and performance to both worlds. Our pCFS proposal contributes to changing this situation: the rationale is to take advantage of the best of both – the reliability of cluster file systems and the high performance of parallel file systems. We do not claim to provide the absolute best of each, but we aim at full POSIX compliance, a rich feature set, and levels of reliability and performance good enough for broad usage – e.g., traditional as well as HPC applications, support for clustered DBMS engines that may run over regular files, and video streaming. pCFS’ main ideas include:
    · Cooperative caching, a technique that has been used in file systems for distributed disks but, as far as we know, was never used either in SAN-based cluster file systems or in parallel file systems. As a result, pCFS may use all infrastructures (LAN and SAN) to move data.
    · Fine-grain locking, whereby processes running on distinct nodes may define non-overlapping byte-range regions in a file (instead of locking the whole file) and access them in parallel, reading and writing over those regions at the infrastructure’s full speed (provided that no major metadata changes are required).
    A prototype was built on top of GFS (a Red Hat shared-disk CFS): GFS’ kernel code was slightly modified, and two kernel modules and a user-level daemon were added. In the prototype, fine-grain locking is fully implemented and a cluster-wide coherent cache is maintained through data (page fragment) movement over the LAN. Our benchmarks for non-overlapping writers over a single file shared among processes running on different nodes show that pCFS’ bandwidth is 2 times greater than NFS’ while being comparable to that of the Parallel Virtual File System (PVFS), both of which require about 10 times more CPU. pCFS’ bandwidth also surpasses GFS’ (by 600 times for small record sizes, e.g., 4 KB, decreasing to 2 times for large record sizes, e.g., 4 MB), at about the same CPU usage.
    Lusitania, Companhia de Seguros S.A.; Programa IBM Shared University Research (SUR)
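
    A minimal illustration of the fine-grain, byte-range locking idea described above, using POSIX fcntl locks on a local file; pCFS implements the equivalent cluster-wide and in the kernel, so the file path, region size, and writer IDs here are purely illustrative.

```python
import fcntl, os

PATH = "/tmp/pcfs_demo.dat"
REGION = 4096                    # each writer owns a distinct, non-overlapping byte range

def write_region(writer_id, payload):
    """Lock only this writer's byte range, write into it, then release the lock."""
    offset = writer_id * REGION
    with open(PATH, "r+b") as f:
        fcntl.lockf(f, fcntl.LOCK_EX, REGION, offset, os.SEEK_SET)
        try:
            f.seek(offset)
            f.write(payload.ljust(REGION, b"\0"))
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN, REGION, offset, os.SEEK_SET)

# Pre-size the file, then let two "writers" update disjoint regions; with
# byte-range locks they would contend only if their regions overlapped.
with open(PATH, "wb") as f:
    f.truncate(2 * REGION)
write_region(0, b"data from node 0")
write_region(1, b"data from node 1")
```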