
    Implications of non-volatile memory as primary storage for database management systems

    Traditional Database Management System (DBMS) software relies on hard disks for storing relational data. Hard disks are cheap, persistent, and offer huge storage capacities, but their data retrieval latency is extremely high. To hide this latency, DRAM is used as intermediate storage. DRAM is significantly faster than disk, but it is deployed in smaller capacities due to cost and power constraints, and it lacks the persistence that disks provide. Non-Volatile Memory (NVM) is an emerging storage-class technology that promises the best of both worlds: it can offer large storage capacities, owing to better scaling and cost metrics than DRAM, and it is non-volatile (persistent) like hard disks. At the same time, its data retrieval time is much lower than that of hard disks, and it is byte-addressable like DRAM. In this paper, we explore the implications of employing NVM as primary storage for a DBMS; that is, we investigate the modifications a traditional relational DBMS requires to take advantage of NVM features. As a case study, we have modified the storage engine (SE) of PostgreSQL to make efficient use of NVM hardware. We detail the changes and challenges such modifications entail and evaluate them using a comprehensive emulation platform. Results indicate that our modified SE reduces query execution time by up to 40% and 14.4% when compared to disk and NVM storage, with average reductions of 20.5% and 4.5%, respectively.

    The research leading to these results has received funding from the European Union’s 7th Framework Programme under grant agreement number 318633, the Ministry of Science and Technology of Spain under contract TIN2015-65316-P, and a HiPEAC collaboration grant awarded to Naveed Ul Mustafa.
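
    The architectural opportunity the abstract describes is NVM's combination of byte-addressability and persistence: a storage engine can map a table file directly into the address space and update records in place, rather than shuttling fixed-size blocks through a DRAM buffer pool. Below is a minimal, hypothetical sketch of that access pattern; the file path and the use of mmap/flush are illustrative assumptions, not the paper's modified PostgreSQL code (a real NVM engine would use cache-line flush instructions on a DAX-mapped device).

```python
# Minimal sketch (not the paper's PostgreSQL SE): byte-addressable,
# persistent updates to an NVM-backed file via mmap. On real NVM the
# file would sit on a DAX-mounted device and the flush would be a
# cache-line instruction (e.g., CLWB plus a fence) rather than msync.
import mmap
import os

PATH = "/mnt/pmem/table.dat"   # hypothetical DAX-mounted NVM file
SIZE = 4096

fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, SIZE)         # ensure the mapping has backing bytes

with mmap.mmap(fd, SIZE) as m:
    # In-place, byte-granular update: no intermediate DRAM block copy,
    # which is the saving NVM makes possible for a storage engine.
    record = b"tuple-42"
    m[0:len(record)] = record
    m.flush()                  # force the update to persistent media
os.close(fd)
```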

    RAIDX: RAID EXTENDED FOR HETEROGENEOUS ARRAYS

    The computer hard drive market has diversified with the establishment of solid-state disks (SSDs) as an alternative to magnetic hard disks (HDDs). Each technology has its advantages: SSDs are faster than HDDs, but HDDs are cheaper. Our goal is to construct a parallel storage system from HDDs and SSDs such that the parallel system is as fast as the SSDs. Achieving this goal is challenging, since the slow HDDs store more data and become bottlenecks while the SSDs remain idle. RAIDX is a parallel storage system designed for disks of different speeds, capacities, and technologies. The RAIDX hardware consists of an array of disks; the RAIDX software consists of data structures and algorithms that allow the disks to be viewed as a single storage unit whose capacity equals the sum of the capacities of its disks, whose failure rate is lower than that of its individual disks, and whose speed is close to that of its faster disks. RAIDX achieves its performance goals with a novel parallel data organization technique that allows storage data to be moved on the fly without impacting the upper-level file system. We show that storage data accesses satisfy the locality-of-reference principle, whereby only a small fraction of storage data is accessed frequently. RAIDX has a monitoring program that identifies frequently accessed blocks and a migration program that moves them to faster disks. The faster disks act as caches that store the sole copy of frequently accessed data. Experimental evaluation shows that an HDD+SSD RAIDX array is as fast as an all-SSD array when the workload exhibits locality of reference.
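
    The monitor-and-migrate loop the abstract describes lends itself to a compact illustration. The sketch below is a toy version under assumed names and thresholds (HotBlockMigrator, hot_threshold), not RAIDX's actual implementation: it tallies per-block accesses, promotes blocks that cross a hotness threshold to the SSD tier, and demotes the coldest resident block when the SSD is full, so the faster disks hold the sole copy of hot data.

```python
# Toy sketch of the monitor/migration policy the abstract describes;
# class name, threshold, and eviction rule are illustrative assumptions.
from collections import Counter

class HotBlockMigrator:
    def __init__(self, ssd_capacity_blocks, hot_threshold=100):
        self.access_counts = Counter()   # monitor: per-block access tally
        self.on_ssd = set()              # blocks whose sole copy is on SSD
        self.ssd_capacity = ssd_capacity_blocks
        self.hot_threshold = hot_threshold

    def record_access(self, block):
        self.access_counts[block] += 1
        if (self.access_counts[block] >= self.hot_threshold
                and block not in self.on_ssd):
            self.migrate(block)

    def migrate(self, block):
        # If the SSD tier is full, demote the coldest resident block
        # back to an HDD before promoting the newly hot one.
        if len(self.on_ssd) >= self.ssd_capacity:
            coldest = min(self.on_ssd, key=lambda b: self.access_counts[b])
            self.on_ssd.discard(coldest)
        self.on_ssd.add(block)
```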

    Data partitioning and load balancing in parallel disk systems

    Parallel disk systems provide opportunities for exploiting I/O parallelism in two ways: via inter-request and intra-request parallelism. In this paper we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent file system that optimizes striping by taking into account the requirements of the applications, and performs load balancing by judicious file allocation and dynamic redistribution of the data when access patterns change. Our system uses simple but effective heuristics that incur little overhead. We present performance experiments based on synthetic workloads and real-life traces.
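
    As a rough illustration of the two tuning knobs the abstract names, the sketch below pairs round-robin striping (so one large request can be served by several disks in parallel) with a least-loaded allocation heuristic for new files. The class and its scalar "heat" load model are assumptions made for illustration, not the paper's file system.

```python
# Illustrative sketch of striping plus load-balanced file allocation;
# the interface and the load model are invented here.
class ParallelDiskSystem:
    def __init__(self, num_disks, stripe_unit_kb=64):
        self.num_disks = num_disks
        self.stripe_unit_kb = stripe_unit_kb
        self.load = [0.0] * num_disks      # estimated heat per disk

    def disk_for_offset(self, file_offset_kb, start_disk):
        # Striping: consecutive stripe units map to consecutive disks,
        # so a large request engages several disks at once (intra-request
        # parallelism) while small requests spread out (inter-request).
        unit = file_offset_kb // self.stripe_unit_kb
        return (start_disk + unit) % self.num_disks

    def allocate_file(self, expected_heat):
        # Load balancing: start a new file on the coolest disk so that
        # frequently accessed files do not pile up on one spindle.
        coolest = min(range(self.num_disks), key=lambda d: self.load[d])
        self.load[coolest] += expected_heat
        return coolest
```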

    MINIMIZATION OF RESOURCE CONSUMPTION THROUGH WORKLOAD CONSOLIDATION IN LARGE-SCALE DISTRIBUTED DATA PLATFORMS

    The rapid increase in data volumes in many application domains has led to widespread adoption of parallel and distributed data management systems such as parallel databases and MapReduce-based frameworks (e.g., Hadoop). Use of such frameworks is expected to accelerate in the coming years, putting further strain on already-scarce resources like compute power, network bandwidth, and energy. To reduce total execution times, there is a trend towards increasing execution parallelism by spreading data across a large number of machines. However, this often increases total resource consumption, and especially energy consumption, significantly, because of process startup costs and other overheads (e.g., communication). In this dissertation, we develop several data management techniques that minimize resource consumption through workload consolidation.

    We introduce a key metric called query span, i.e., the number of machines involved in the execution of a query or a job, and propose minimizing query span in order to minimize per-query resource consumption. To that end, we develop several workload-driven data partitioning and replica selection algorithms that attempt to minimize the average query span by exploiting the fact that most distributed environments must use replication for fault tolerance. Extensive experiments on various datasets show that judicious data placement and replication can dramatically reduce average query spans, resulting in significant reductions in resource consumption. We demonstrate our results primarily on two applications: a distributed data warehouse system and distributed information retrieval. In the first case, we show that minimizing average query spans can minimize overall resource consumption for a given workload and can also improve the performance of complex analytical queries. In the second case, our approach minimizes the overall search cost and effectively trades off search cost against load imbalance.

    The best case of resource efficiency for any data processing system is achieved when a job or query can be run efficiently on a single machine (i.e., query span = 1). In the final part of the dissertation, we present an in-memory MapReduce system optimized for complex analytics tasks whose input fits in a single machine's memory. We argue that systems like Hadoop, designed to operate across a large number of machines, are not performance-optimal for small and medium-sized complex analytics tasks because of high startup costs, heavy disk activity, and wasteful checkpointing. We have developed a prototype runtime called HONE that is API-compatible with standard (distributed) Hadoop: existing Hadoop code runs without modification on a multi-core shared-memory machine, which lets us take existing Hadoop algorithms and choose the most suitable runtime environment for datasets of varying sizes.

    Overall, our key contributions include the identification of the query span metric and its relationship to overall resource consumption in scale-out architectures, several workload-aware techniques that optimize this metric, a demonstration of the effectiveness of query span minimization in different application scenarios, and HONE, a novel in-memory MapReduce system that takes effective advantage of scale-up architectures on a single machine. Thorough experiments on real and synthetic datasets demonstrate the efficacy of our proposed approaches.
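
    Since replica selection for span minimization is naturally a set-cover-style problem, a short greedy sketch helps make the query span idea concrete. The names and the greedy rule below are assumptions made for illustration; the dissertation's actual algorithms may differ.

```python
# Hedged sketch: greedy replica selection to keep query span small.
# Assumes every needed fragment has at least one replica; this is not
# the dissertation's actual algorithm.
def min_span_machines(needed_fragments, replicas):
    """replicas maps each fragment to the set of machines holding a copy.
    Greedily pick the machine covering the most still-needed fragments,
    so the query touches few machines (a small query span)."""
    uncovered = set(needed_fragments)
    chosen = set()
    while uncovered:
        candidates = {m for f in uncovered for m in replicas[f]}
        best = max(candidates,
                   key=lambda m: sum(1 for f in uncovered if m in replicas[f]))
        chosen.add(best)
        uncovered -= {f for f in uncovered if best in replicas[f]}
    return chosen  # len(chosen) is this query's span

# Example: three fragments, each replicated on two of three machines.
replicas = {"f1": {"m1", "m2"}, "f2": {"m2", "m3"}, "f3": {"m1", "m3"}}
print(min_span_machines({"f1", "f2", "f3"}, replicas))  # span 2, e.g. {'m1', 'm3'}
```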