    Designing an Adaptive Application-Level Checkpoint Management System for Malleable MPI Applications

    Dynamic resource management opens up numerous opportunities in High Performance Computing, improving both system-level services and application performance. Checkpointing can also be regarded as a system-level service and can benefit from this dynamism. By integrating with a malleable resource management system, a checkpointing system gains better resource availability. In addition to providing fault tolerance, it can serve the data redistribution needs of malleable applications when their resources change. We therefore propose iCheck, an adaptive application-level checkpoint management system that uses system-level and application-level dynamism to provide better checkpointing and data redistribution services to applications. Comment: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)

    Decentralized Online Scheduling of Malleable NP-hard Jobs

    In this work, we address an online job scheduling problem in a large distributed computing environment. Each job has a priority and a resource demand, takes an unknown amount of time, and is malleable, i.e., the number of allotted workers can fluctuate during its execution. We subdivide the problem into (a) determining a fair amount of resources for each job and (b) assigning each job to a corresponding number of processing elements. Our approach is fully decentralized, uses lightweight communication, and arranges each job as a binary tree of workers that can grow and shrink as necessary. Using the NP-complete problem of propositional satisfiability (SAT) as a case study, we show experimentally on up to 128 machines (6144 cores) that our approach leads to near-optimal utilization, imposes minimal computational overhead, and performs fair scheduling of incoming jobs within a few milliseconds.
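    The idea of a job arranged as a binary tree of workers that grows and shrinks can be illustrated with a minimal sketch. The heap-style index layout (worker `i` has children `2i+1` and `2i+2`, and a job allotted `v` workers uses indices `0..v-1`) is an assumption made for illustration; the abstract does not specify the layout, and the function names `parent`, `children`, and `resize` are hypothetical.

```python
from typing import List, Optional, Tuple


def parent(i: int) -> Optional[int]:
    """Index of the parent of worker i, or None for the root (assumed heap layout)."""
    return None if i == 0 else (i - 1) // 2


def children(i: int) -> Tuple[int, int]:
    """Indices of the two potential children of worker i (assumed heap layout)."""
    return 2 * i + 1, 2 * i + 2


def resize(current: int, target: int) -> Tuple[List[int], List[int]]:
    """Return (joining, leaving) worker indices when a job's allotment
    changes from `current` to `target` workers.

    Growing adds the next indices in breadth-first order; shrinking
    removes the highest indices, so the remaining tree stays connected.
    """
    if current < 0 or target < 0:
        raise ValueError("worker counts must be non-negative")
    joining = list(range(current, target))   # empty when shrinking
    leaving = list(range(target, current))   # empty when growing
    return joining, leaving
```

    Under this assumed layout, growing a job from 3 to 5 workers adds indices 3 and 4, both of which attach below worker 1, while shrinking only ever removes leaves or whole subtrees at the highest indices.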

    Efficient Scalable Computing through Flexible Applications and Adaptive Workloads

    In this paper we introduce a methodology for dynamic job reconfiguration driven by the programming model runtime in collaboration with the global resource manager. We improve system throughput by exploiting malleability techniques (in terms of the number of MPI ranks) through the reallocation of resources assigned to a job during its execution. In our proposal, the OmpSs runtime reconfigures the number of MPI ranks during the execution of an application in cooperation with the Slurm workload manager. In addition, we take advantage of OmpSs offload semantics to allow application developers to deal with data redistribution. By combining these elements, a job is able to expand in order to exploit idle nodes, or to be shrunk if other queued jobs could then be started. This novel approach adapts the system workload in order to increase throughput and make smarter use of the underlying resources. Our experiments demonstrate that this approach can reduce the total execution time of a practical workload by more than 40% while reducing the amount of resources by 30%. This work is supported by the Projects TIN2014-53495-R and TIN2015-65316-P from MINECO and FEDER. Antonio J. Peña is cofinanced by MINECO under Juan de la Cierva fellowship number IJCI-2015-23266. Special thanks to José I. Aliaga for the conjugate gradient code. Peer Reviewed. Postprint (author's final draft)

    Performance-aware scheduling of parallel applications on non-dedicated clusters

    This work presents an HPC framework that provides new strategies for resource management and job scheduling, based on executing different applications on shared compute nodes to maximize platform utilization. The framework includes a scalable monitoring tool that analyzes the utilization of the platform's compute nodes. We also introduce an extension of CLARISSE, a middleware for data-staging coordination and control on large-scale HPC platforms, that combines the information provided by the monitor with application-level analysis to detect performance degradation in running applications. This degradation, which arises because applications share compute nodes and may compete for their resources, is avoided by means of dynamic application migration. A description of the architecture and a practical evaluation of the proposal show improvements of up to 20% in makespan and 10% in energy consumption compared to a non-optimized execution. This work was partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under the grant TIN2016-79637-P "Towards Unification of HPC and Big Data Paradigms"; and the European Union's Horizon 2020 research and innovation program under Grant No. 801091, project "Exascale programming models for extreme data processing" (ASPIDE).