8,577 research outputs found

    Exploring Scientific Application Performance Using Large Scale Object Storage

    Full text link
    One of the major performance and scalability bottlenecks in large scientific applications is parallel reading and writing to supercomputer I/O systems. The usage of parallel file systems and consistency requirements of POSIX, that all the traditional HPC parallel I/O interfaces adhere to, pose limitations to the scalability of scientific applications. Object storage is a widely used storage technology in cloud computing and is more frequently proposed for HPC workload to address and improve the current scalability and performance of I/O in scientific applications. While object storage is a promising technology, it is still unclear how scientific applications will use object storage and what the main performance benefits will be. This work addresses these questions, by emulating an object storage used by a traditional scientific application and evaluating potential performance benefits. We show that scientific applications can benefit from the usage of object storage on large scales.Comment: Preprint submitted to WOPSSS workshop at ISC 201

    Evaluating the benefits of key-value databases for scientific applications

    Get PDF
    The convergence of Big Data applications with High-Performance Computing requires new methodologies to store, manage and process large amounts of information. Traditional storage solutions are unable to scale and that results in complex coding strategies. For example, the brain atlas of the Human Brain Project has the challenge to process large amounts of high-resolution brain images. Given the computing needs, we study the effects of replacing a traditional storage system with a distributed Key-Value database on a cell segmentation application. The original code uses HDF5 files on GPFS through an intricate interface, imposing synchronizations. On the other hand, by using Apache Cassandra or ScyllaDB through Hecuba, the application code is greatly simplified. Thanks to the Key-Value data model, the number of synchronizations is reduced and the time dedicated to I/O scales when increasing the number of nodes.This project/research has received funding from the European Unions Horizon 2020 Framework Programme for Research and Innovation under the Speci c Grant Agreement No. 720270 (Human Brain Project SGA1) and the Speci c Grant Agreement No. 785907 (Human Brain Project SGA2). This work has also been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by Generalitat de Catalunya (contract 2017-SGR-1414).Postprint (author's final draft

    Havens: Explicit Reliable Memory Regions for HPC Applications

    Full text link
    Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, US

    Characterizing Deep-Learning I/O Workloads in TensorFlow

    Full text link
    The performance of Deep-Learning (DL) computing frameworks rely on the performance of data ingestion and checkpointing. In fact, during the training, a considerable high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the tensorFlow prefetcher results in a complete overlap of computation on accelerator and input pipeline on CPU eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to a fast small capacity storage and copy asynchronously the checkpoints to a slower large capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment.Comment: Accepted for publication at pdsw-DISCS 201

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As means of overhead reduction, the library offers a build-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks
    corecore