724 research outputs found

    Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications

    Full text link
    Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two distinct exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) fault tolerance. We discuss novel computational capabilities that EnTK enables and the scientific advantages arising thereof. We propose EnTK as an important addition to the suite of tools in support of production scientific computing

    Performance analysis and optimization of in-situ integration of simulation with data analysis: zipping applications up

    Get PDF
    This paper targets an important class of applications that requires combining HPC simulations with data analysis for online or real-time scientific discovery. We use the state-of-the-art parallel-IO and data-staging libraries to build simulation-time data analysis workflows, and conduct performance analysis with real-world applications of computational fluid dynamics (CFD) simulations and molecular dynamics (MD) simulations. Driven by in-depth performance inefficiency analysis, we design an end-to-end application-level approach to eliminating the interlocks and synchronizations existent in the present methods. Our new approach employs both task parallelism and pipeline parallelism to reduce synchronizations effectively. In addition, we design a fully asynchronous, fine-grain, and pipelining runtime system, which is named Zipper. Zipper is a multi-threaded distributed runtime system and executes in a layer below the simulation and analysis applications. To further reduce the simulation application's stall time and enhance the data transfer performance, we design a concurrent data transfer optimization that uses both HPC network and parallel file system for improved bandwidth. The scalability of the Zipper system has been verified by a performance model and various empirical large scale experiments. The experimental results on an Intel multicore cluster as well as a Knight Landing HPC system demonstrate that the Zipper based approach can outperform the fastest state-of-the-art I/O transport library by up to 220% using 13,056 processor cores

    ASCR/HEP Exascale Requirements Review Report

    Full text link
    This draft report summarizes and details the findings, results, and recommendations derived from the ASCR/HEP Exascale Requirements Review meeting held in June, 2015. The main conclusions are as follows. 1) Larger, more capable computing and data facilities are needed to support HEP science goals in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of the demand at the 2025 timescale is at least two orders of magnitude -- and in some cases greater -- than that available currently. 2) The growth rate of data produced by simulations is overwhelming the current ability, of both facilities and researchers, to store and analyze it. Additional resources and new techniques for data analysis are urgently needed. 3) Data rates and volumes from HEP experimental facilities are also straining the ability to store and analyze large and complex data volumes. Appropriately configured leadership-class facilities can play a transformational role in enabling scientific discovery from these datasets. 4) A close integration of HPC simulation and data analysis will aid greatly in interpreting results from HEP experiments. Such an integration will minimize data movement and facilitate interdependent workflows. 5) Long-range planning between HEP and ASCR will be required to meet HEP's research needs. To best use ASCR HPC resources the experimental HEP program needs a) an established long-term plan for access to ASCR computational and data resources, b) an ability to map workflows onto HPC resources, c) the ability for ASCR facilities to accommodate workflows run by collaborations that can have thousands of individual members, d) to transition codes to the next-generation HPC platforms that will be available at ASCR facilities, e) to build up and train a workforce capable of developing and using simulations and analysis to support HEP scientific research on next-generation systems.Comment: 77 pages, 13 Figures; draft report, subject to further revisio

    Dynamic workflow management for large scale scientific applications

    Get PDF
    The increasing computational and data requirements of scientific applications have made the usage of large clustered systems as well as distributed resources inevitable. Although executing large applications in these environments brings increased performance, the automation of the process becomes more and more challenging. The use of complex workflow management systems has been a viable solution for this automation process. In this thesis, we study a broad range of workflow management tools and compare their capabilities especially in terms of dynamic and conditional structures they support, which are crucial for the automation of complex applications. We then apply some of these tools to two real-life scientific applications: i) simulation of DNA folding, and ii) reservoir uncertainty analysis. Our implementation is based on Pegasus workflow planning tool, DAGMan workflow execution system, Condor-G computational scheduler, and Stork data scheduler. The designed abstract workflows are converted to concrete workflows using Pegasus where jobs are matched to resources; DAGMan makes sure these jobs execute reliably and in the correct order on the remote resources; Condor-G performs the scheduling for the computational tasks and Stork optimizes the data movement between different components. Integrated solution with these tools allows automation of large scale applications, as well as providing complete reliability and efficiency in executing complex workflows. We have also developed a new site selection mechanism on top of these systems, which can choose the most available computing resources for the submission of the tasks. The details of our design and implementation, as well as experimental results are presented

    Failure-awareness and dynamic adaptation in data scheduling

    Get PDF
    Over the years, scientific applications have become more complex and more data intensive. Especially large scale simulations and scientific experiments in areas such as physics, biology, astronomy and earth sciences demand highly distributed resources to satisfy excessive computational requirements. Increasing data requirements and the distributed nature of the resources made I/O the major bottleneck for end-to-end application performance. Existing systems fail to address issues such as reliability, scalability, and efficiency in dealing with wide area data access, retrieval and processing. In this study, we explore data-intensive distributed computing and study challenges in data placement in distributed environments. After analyzing different application scenarios, we develop new data scheduling methodologies and the key attributes for reliability, adaptability and performance optimization of distributed data placement tasks. Inspired by techniques used in microprocessor and operating system architectures, we extend and adapt some of the known low-level data handling and optimization techniques to distributed computing. Two major contributions of this work include (i) a failure-aware data placement paradigm for increased fault-tolerance, and (ii) adaptive scheduling of data placement tasks for improved end-to-end performance. The failure-aware data placement includes early error detection, error classification, and use of this information in scheduling decisions for the prevention of and recovery from possible future errors. The adaptive scheduling approach includes dynamically tuning data transfer parameters over wide area networks for efficient utilization of available network capacity and optimized end-to-end data transfer performance

    Fifth ERCIM workshop on e-mobility

    Get PDF
    • …
    corecore