
    Improving the Performance of Big Data Analytics Platforms by Task and I/O Granularity Adjustment

    With the massive increase in the amount of semi-structured and unstructured web data, big data analytics platforms have emerged and started to evolve rapidly. Apache Hadoop was developed for batch processing on large datasets, and systems for interactive and general-purpose applications have been developed alongside NoSQL databases. Numerous efforts have been made to improve the performance of Hadoop and NoSQL databases, including utilizing a new device called NVMM (non-volatile main memory) for NoSQL databases. Nonetheless, their performance is still far from satisfactory due to inadequate granularity for tasks and I/O. In this dissertation, we present novel techniques to improve the performance of Apache Hadoop and the NVMM-based LSM-tree by adjusting task and I/O granularity. First, we analyze YARN container overhead and present a dynamic input split size adjustment scheme, which can logically combine multiple HDFS blocks and increase the input size of each container, thereby enabling a single map wave and reducing the number of containers and their initialization overhead. Experimental results show that we can avoid recurring container overhead by selecting the right size for input splits and reducing the number of containers. Second, we present a novel HDFS block coalescing scheme that mitigates the YARN container overhead. Our assorted block coalescing scheme combines multiple HDFS blocks and creates large input splits of various sizes, reducing the number of containers and their initialization overhead. Our experimental study shows that the block coalescing scheme significantly reduces the container overhead while achieving good load balancing and job scheduling fairness without impairing the degree of overlap between the map and reduce phases. Third, we discuss the design choices of using NVMM for the indexing structures of NoSQL databases and present ZipperDB, a key-value store that redesigns the LSM-tree for byte-addressable persistent memory. To benefit from the byte-addressability of persistent memory, ZipperDB employs byte-addressable persistent SkipLists and performs Zipper Compaction, a novel in-place compaction algorithm that merges two adjacent persistent SkipLists without compromising failure atomicity. The byte-addressable compaction helps mitigate the write amplification problem, which is known to be the root cause of the write stall problem in LSM-trees. Finally, we present ListDB, a write-optimized key-value store for NVMM that overcomes the gap between DRAM and NVMM write latencies and thereby resolves the write stall problem. ListDB consists of three novel techniques: (i) byte-addressable Index-Unified Logging, which incrementally converts write-ahead logs into SkipLists, (ii) Braided SkipList, a simple NUMA-aware SkipList that effectively reduces the NUMA effects of NVMM, and (iii) NUMA-aware Zipper Compaction. Using these three techniques, ListDB makes background flush and compaction fast enough to resolve the infamous write stall problem and shows 1.6x and 25x higher write throughput than PACTree and Intel Pmem-RocksDB, respectively.
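    The single-map-wave idea above can be approximated with a quick calculation: if the input split size is chosen so that the number of splits does not exceed the number of container slots the cluster can run concurrently, every map task starts in the first wave and no container startup cost is paid twice. The sketch below is only an illustration of that arithmetic, not the dissertation's actual scheme; the helper name single_wave_split_size, the cluster figures, and the choice to express the result via the standard split.minsize property are assumptions.

        import math

        HDFS_BLOCK = 128 * 1024 * 1024  # default HDFS block size (128 MiB)

        def single_wave_split_size(total_input_bytes, cluster_mem_mb, container_mb=2048):
            """Pick an input split size so that all map tasks fit in a single wave.

            A wave is one round of concurrently running map containers; if the
            number of splits exceeds the number of container slots, extra waves
            are scheduled and each new container pays its startup cost again.
            """
            slots = max(1, cluster_mem_mb // container_mb)   # concurrent containers
            split_size = math.ceil(total_input_bytes / slots)
            # Round up to a multiple of the HDFS block size so that a split
            # logically combines whole blocks instead of cutting one in half.
            return math.ceil(split_size / HDFS_BLOCK) * HDFS_BLOCK

        # Example: 100 GiB of input on a cluster with 64 GiB of container memory.
        size = single_wave_split_size(100 * 1024**3, cluster_mem_mb=64 * 1024)
        print(f"mapreduce.input.fileinputformat.split.minsize = {size}")

    Setting both split.minsize and split.maxsize to such a value makes FileInputFormat hand out roughly one split per container slot, which mirrors the effect the abstract attributes to logically combining multiple HDFS blocks.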

    Resiliency in numerical algorithm design for extreme scale simulations

    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of requiring an enormous amount of resources and money. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically. Article signed by 36 authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jézéquel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner, and Barbara Wohlmuth.
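    The energy and cost figures quoted in this abstract follow from simple arithmetic. A worked version is shown below; the electricity price of roughly 0.10 EUR per kWh is an assumption inferred from the quoted cost rather than a number stated in the text.

        \begin{align*}
        E &= 20\,\mathrm{MW} \times 48\,\mathrm{h} = 960\,\mathrm{MWh} \approx 10^{6}\,\mathrm{kWh},\\
        C &\approx 10^{6}\,\mathrm{kWh} \times 0.10\,\mathrm{EUR/kWh} = 10^{5}\,\mathrm{EUR},\\
        W &= 10^{18}\,\mathrm{flop/s} \times 48 \times 3600\,\mathrm{s} \approx 1.7 \times 10^{23}\,\mathrm{flop}.
        \end{align*}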

    Efficient resource sharing for big data applications in shared clusters

    Modern data centers are shifting to shared clusters where resources are shared among multiple users and frameworks. A key enabler for such shared clusters is a cluster resource management system that allocates resources among different frameworks. One key problem in these shared clusters is how to efficiently share cluster resources between multiple applications and users in an elastic and non-disruptive manner. Current cluster schedulers typically rely on kill-based preemption to coordinate resource sharing, achieve fairness, and satisfy SLOs during resource contention by simply killing low-priority jobs and restarting them later when resources are available. This simple preemption policy ensures fast service times for high-priority jobs and prevents a single user or application from occupying too many resources and starving others; however, because it does not save the progress of preempted jobs, it causes significant resource waste and delays the response time of long-running or low-priority jobs. The issue of dynamic resource sharing becomes even more problematic when different types of applications run on the same cluster (e.g., batch processing systems running alongside real-time streaming systems). Different application types often have different quality-of-service metrics (e.g., higher throughput versus lower latency), which can make resource sharing among these applications contentious. In this dissertation, we show the impact of kill-based preemption in modern shared clusters and propose two solutions to more efficiently share resources in shared cluster environments, by utilizing checkpoint-based preemption and by supporting elasticity in distributed data stream processing systems.
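    A back-of-the-envelope comparison makes the cost of kill-based preemption concrete. The sketch below is a minimal illustration under assumed numbers, not the scheduler studied in the dissertation: kill-based preemption discards all of a preempted job's progress, while checkpoint-based preemption loses only the work done since the last checkpoint.

        def wasted_work(progress_s, checkpoint_interval_s=None):
            """Seconds of work thrown away when a job is preempted after progress_s.

            With kill-based preemption the job restarts from zero; with
            checkpoint-based preemption it only loses the tail of work done
            since the most recent checkpoint.
            """
            if checkpoint_interval_s is None:          # kill-based preemption
                return progress_s
            return progress_s % checkpoint_interval_s  # checkpoint-based preemption

        # A batch job preempted 50 minutes into its run, with 5-minute checkpoints.
        print(wasted_work(3000))        # kill-based: all 3000 s must be redone
        print(wasted_work(3000, 300))   # checkpoint-based: 0 s (just checkpointed)
        print(wasted_work(3100, 300))   # checkpoint-based: 100 s since last checkpoint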

    Keeping checkpoint/restart viable for exascale systems

    Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, such as checkpoint/restart, the dominant fault-tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
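    To give a flavor of the hash-based incremental checkpointing evaluated in this work, the sketch below hashes fixed-size blocks of application state and writes only the blocks whose digests changed since the previous checkpoint; hash collisions are what make the scheme probabilistic. This is a CPU-only, simplified illustration (the work itself offloads hashing to GPUs), and the block size and in-memory layout are assumptions.

        import hashlib

        BLOCK = 4096  # checkpoint granularity in bytes (assumed)

        def incremental_checkpoint(state, prev_hashes):
            """Return (dirty_blocks, new_hashes) for one incremental checkpoint.

            Only blocks whose SHA-256 digest differs from the previous checkpoint
            need to be persisted, which shrinks the checkpoint commit time when
            most of the application's memory has not changed.
            """
            dirty, new_hashes = {}, {}
            for off in range(0, len(state), BLOCK):
                digest = hashlib.sha256(state[off:off + BLOCK]).hexdigest()
                new_hashes[off] = digest
                if prev_hashes.get(off) != digest:
                    dirty[off] = state[off:off + BLOCK]   # block must be written
            return dirty, new_hashes

        # The first checkpoint writes everything; the second only what changed.
        mem = bytearray(8 * BLOCK)
        dirty, hashes = incremental_checkpoint(bytes(mem), {})
        mem[5 * BLOCK] = 0xFF                             # application dirties one block
        dirty, hashes = incremental_checkpoint(bytes(mem), hashes)
        print(sorted(dirty))                              # -> [20480]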

    Memory Subsystems for Security, Consistency, and Scalability

    In response to the continuous demand for the ability to process ever larger datasets, as well as discoveries in next-generation memory technologies, researchers have been vigorously studying memory-driven computing architectures that allow data-intensive applications to access enormous amounts of pooled non-volatile memory. As applications continue to interact with increasing numbers of components and datasets, existing systems struggle to efficiently enforce the principle of least privilege for security. While non-volatile memory can retain data even after a power loss and allows for large main memory capacity, programmers have to bear the burden of maintaining the consistency of program memory for fault tolerance, as well as handling huge datasets with traditional yet expensive memory management interfaces for scalability. Today's computer systems have become too sophisticated for existing memory subsystems to handle many design requirements. In this dissertation, we introduce three memory subsystems to address challenges in terms of security, consistency, and scalability. Specifically, we propose SMVs to provide threads with fine-grained control over access privileges for a partially shared address space for security, NVthreads to allow programmers to easily leverage non-volatile memory with automatic persistence for consistency, and PetaMem to enable memory-centric applications to freely access memory beyond the traditional process boundary, with support for memory isolation and crash recovery, for security, consistency, and scalability.
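    The consistency burden mentioned above usually means that every update to a persistent structure has to be made failure-atomic by hand, for example with an undo log that is persisted before the data is modified in place. The sketch below is a deliberately simplified, file-backed illustration of that generic pattern; it is not the mechanism used by NVthreads or PetaMem, and the nvmm_region and persist helpers are stand-ins for real persistent-memory mapping and cache-line flushing.

        import mmap, struct

        CELL = 8  # each persistent value occupies one 64-bit slot (assumed layout)

        def nvmm_region(path, size):
            """Map a file as a stand-in for a byte-addressable NVMM region."""
            with open(path, "wb") as f:
                f.truncate(size)
            backing = open(path, "r+b")
            return mmap.mmap(backing.fileno(), size)

        def persist(region):
            """Stand-in for flushing cache lines to persistent memory."""
            region.flush()

        def atomic_store(data, log, slot, value):
            """Failure-atomic in-place update of one slot using an undo log."""
            log[8:16] = data[slot * CELL:(slot + 1) * CELL]   # 1. save old value
            log[0:8] = struct.pack("<q", slot)                #    mark log valid
            persist(log)
            data[slot * CELL:(slot + 1) * CELL] = struct.pack("<q", value)
            persist(data)                                     # 2. persist new value
            log[0:8] = struct.pack("<q", -1)                  # 3. commit: invalidate log
            persist(log)

        def recover(data, log):
            """After a crash, an uncommitted log entry rolls the slot back."""
            slot = struct.unpack("<q", log[0:8])[0]
            if slot >= 0:
                data[slot * CELL:(slot + 1) * CELL] = log[8:16]
                persist(data)
                log[0:8] = struct.pack("<q", -1)
                persist(log)

        data, log = nvmm_region("data.img", 1024), nvmm_region("log.img", 16)
        log[0:8] = struct.pack("<q", -1)                      # start with an empty log
        atomic_store(data, log, slot=3, value=42)
        print(struct.unpack("<q", data[3 * CELL:4 * CELL])[0])  # -> 42
        # recover(data, log) would be called at startup after a crash.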

    Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

    High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10^18 floating-point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance, as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory.
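    The analytic-modeling side of such an evaluation is often built on the classic first-order checkpoint/restart waste model attributed to Young and Daly. The sketch below uses that standard approximation with made-up parameter values to show why the time lost to checkpointing and recomputation grows as the mean time between failures shrinks, which is precisely the gap rollback-avoidance techniques aim to close.

        from math import sqrt

        def optimal_interval(checkpoint_cost_s, mtbf_s):
            """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
            return sqrt(2 * checkpoint_cost_s * mtbf_s)

        def waste_fraction(checkpoint_cost_s, mtbf_s):
            """Approximate fraction of machine time lost to fault tolerance.

            First-order model: C/tau spent writing checkpoints plus tau/(2*MTBF)
            of recomputed work after the average failure, evaluated at the
            optimal interval tau.
            """
            tau = optimal_interval(checkpoint_cost_s, mtbf_s)
            return checkpoint_cost_s / tau + tau / (2 * mtbf_s)

        # Example (assumed numbers): a 10-minute checkpoint on machines whose
        # mean time between failures shrinks from one day to one hour.
        for mtbf_h in (24, 8, 1):
            print(f"MTBF {mtbf_h:2d} h -> waste {waste_fraction(600, mtbf_h * 3600):.0%}")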

    Unified fault-tolerance framework for hybrid task-parallel message-passing applications

    We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any Message Passing Interface (MPI) calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model that unifies task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme. This work was supported by the FI-DGR 2013 scholarship and the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 project (www.montblanc-project.eu), grant agreement no. 610402, and TIN2015-65316-P.
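    The message-logging idea described above can be pictured with a short schematic: every message delivered to a task is also appended to a per-task log, so a task hit by a transient error can be re-executed on its own while its communication is replayed from the log, without touching the other tasks. This is only a sketch of the general protocol under assumed names (TaskLog, run_task); it is not the framework's actual implementation and elides real MPI calls entirely.

        class TaskLog:
            """Per-task log of received messages, kept for transient-error recovery."""
            def __init__(self):
                self.received = []

            def deliver(self, msg):
                self.received.append(msg)   # log the message before the task uses it
                return msg

            def replay(self):
                return iter(self.received)  # re-feed logged messages on re-execution

        def run_task(task_fn, incoming, log, max_retries=3):
            """Run one task; on a transient error, restart only this task and replay."""
            for msg in incoming:
                log.deliver(msg)
            for attempt in range(max_retries + 1):
                try:
                    return task_fn(log.replay())
                except RuntimeError:        # stand-in for a detected transient error
                    continue                # other tasks keep running; only this retries
            raise RuntimeError("task failed after retries")

        # e.g. run_task(lambda msgs: sum(msgs), incoming=[1, 2, 3], log=TaskLog())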