138 research outputs found

    The state of SQL-on-Hadoop in the cloud

    Get PDF
    Managed Hadoop in the cloud, especially SQL-on-Hadoop, has been gaining attention recently. On Platform-as-a-Service (PaaS), analytical services like Hive and Spark come preconfigured for general-purpose and ready to use. Thus, giving companies a quick entry and on-demand deployment of ready SQL-like solutions for their big data needs. This study evaluates cloud services from an end-user perspective, comparing providers including: Microsoft Azure, Amazon Web Services, Google Cloud, and Rackspace. The study focuses on performance, readiness, scalability, and cost-effectiveness of the different solutions at entry/test level clusters sizes. Results are based on over 15,000 Hive queries derived from the industry standard TPC-H benchmark. The study is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has been recently extended to support SQL-on-Hadoop engines. The ALOJA Project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. The study benchmarks cloud providers across a diverse range instance types, and uses input data scales from 1GB to 1TB, in order to survey the popular entry-level PaaS SQL-on-Hadoop solutions, thereby establishing a common results-base upon which subsequent research can be carried out by the project. Initial results already show the main performance trends to both hardware and software configuration, pricing, similarities and architectural differences of the evaluated PaaS solutions. Whereas some providers focus on decoupling storage and computing resources while offering network-based elastic storage, others choose to keep the local processing model from Hadoop for high performance, but reducing flexibility. Results also show the importance of application-level tuning and how keeping up-to-date hardware and software stacks can influence performance even more than replicating the on-premises model in the cloud.This work is partially supported by the Microsoft Azure for Research program, the European Research Council (ERC) under the EUs Horizon 2020 programme (GA 639595), the Spanish Ministry of Education (TIN2015-65316-P), and the Generalitat de Catalunya (2014-SGR-1051).Peer ReviewedPostprint (author's final draft

    Bridging the Gap between Application and Solid-State-Drives

    Get PDF
    Data storage is one of the important and often critical parts of the computing system in terms of performance, cost, reliability, and energy. Numerous new memory technologies, such as NAND flash, phase change memory (PCM), magnetic RAM (STT-RAM) and Memristor, have emerged recently. Many of them have already entered the production system. Traditional storage optimization and caching algorithms are far from optimal because storage I/Os do not show simple locality. To provide optimal storage we need accurate predictions of I/O behavior. However, the workloads are increasingly dynamic and diverse, making the long and short time I/O prediction challenge. Because of the evolution of the storage technologies and the increasing diversity of workloads, the storage software is becoming more and more complex. For example, Flash Translation Layer (FTL) is added for NAND-flash based Solid State Disks (NAND-SSDs). However, it introduces overhead such as address translation delay and garbage collection costs. There are many recent studies aim to address the overhead. Unfortunately, there is no one-size-fits-all solution due to the variety of workloads. Despite rapidly evolving in storage technologies, the increasing heterogeneity and diversity in machines and workloads coupled with the continued data explosion exacerbate the gap between computing and storage speeds. In this dissertation, we improve the data storage performance from both top-down and bottom-up approach. First, we will investigate exposing the storage level parallelism so that applications can avoid I/O contentions and workloads skew when scheduling the jobs. Second, we will study how architecture aware task scheduling can improve the performance of the application when PCM based NVRAM are equipped. Third, we will develop an I/O correlation aware flash translation layer for NAND-flash based Solid State Disks. Fourth, we will build a DRAM-based correlation aware FTL emulator and study the performance in various filesystems

    RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

    Full text link
    RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in 1990s the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data obviating for need for further redundancy. Varitaions of RAID based on local recovery codes, partial MDS reduce recovery cost. NAND flash Solid State Disks - SSDs have low latency and high bandwidth, are more reliable, consume less power and have a lower TCO than Hard Disk Drives, which are more viable for hyperscalers.Comment: Submitted to ACM Computing Surveys. arXiv admin note: substantial text overlap with arXiv:2306.0876

    Simulating and analyzing commercial workloads and computer systems

    Get PDF

    Cooperative caching for object storage

    Full text link
    Data is increasingly stored in data lakes, vast immutable object stores that can be accessed from anywhere in the data center. By providing low cost and scalable storage, today immutable object-storage based data lakes are used by a wide range of applications with diverse access patterns. Unfortunately, performance can suffer for applications that do not match the access patterns for which the data lake was designed. Moreover, in many of today's (non-hyperscale) data centers, limited bisectional bandwidth will limit data lake performance. Today many computer clusters integrate caches both to address the mismatch between application performance requirements and the capabilities of the shared data lake, and to reduce the demand on the data center network. However, per-cluster caching; i) means the expensive cache resources cannot be shifted between clusters based on demand, ii) makes sharing expensive because data accessed by multiple clusters is independently cached by each of them, and iii) makes it difficult for clusters to grow and shrink if their servers are being used to cache storage. In this dissertation, we present two novel data-center wide cooperative cache architectures, Datacenter-Data-Delivery Network (D3N) and Directory-Based Datacenter-Data-Delivery Network (D4N) that are designed to be part of the data lake itself rather than part of the computer clusters that use it. D3N and D4N distribute caches across the data center to enable data sharing and elasticity of cache resources where requests are transparently directed to nearby cache nodes. They dynamically adapt to changes in access patterns and accelerate workloads while providing the same consistency, trust, availability, and resilience guarantees as the underlying data lake. We nd that exploiting the immutability of object stores significantly reduces the complexity and provides opportunities for cache management strategies that were not feasible for previous cooperative cache systems for le or block-based storage. D3N is a multi-layer cooperative cache that targets workloads with large read-only datasets like big data analytics. It is designed to be easily integrated into existing data lakes with only limited support for write caching of intermediate data, and avoiding any global state by, for example, using consistent hashing for locating blocks and making all caching decisions based purely on local information. Our prototype is performant enough to fully exploit the (5 GB/s read) SSDs and (40, Gbit/s) NICs in our system and improve the runtime of realistic workloads by up to 3x. The simplicity of D3N has enabled us, in collaboration with industry partners, to upstream the two-layer version of D3N into the existing code base of the Ceph object store as a new experimental feature, making it available to the many data lakes around the world based on Ceph. D4N is a directory-based cooperative cache that provides a reliable write tier and a distributed directory that maintains a global state. It explores the use of global state to implement more sophisticated cache management policies and enables application-specific tuning of caching policies to support a wider range of applications than D3N. In contrast to previous cache systems that implement their own mechanism for maintaining dirty data redundantly, D4N re-uses the existing data lake (Ceph) software for implementing a write tier and exploits the semantics of immutable objects to move aged objects to the shared data lake. This design greatly reduces the barrier to adoption and enables D4N to take advantage of sophisticated data lake features such as erasure coding. We demonstrate that D4N is performant enough to saturate the bandwidth of the SSDs, and it automatically adapts replication to the working set of the demands and outperforms the state of art cluster cache Alluxio. While it will be substantially more complicated to integrate the D4N prototype into production quality code that can be adopted by the community, these results are compelling enough that our partners are starting that effort. D3N and D4N demonstrate that cooperative caching techniques, originally designed for file systems, can be employed to integrate caching into today’s immutable object-based data lakes. We find that the properties of immutable object storage greatly simplify the adoption of these techniques, and enable integration of caching in a fashion that enables re-use of existing battle tested software; greatly reducing the barrier of adoption. In integrating the caching in the data lake, and not the compute cluster, this research opens the door to efficient data center wide sharing of data and resources
    corecore