205 research outputs found

    Multi-Resource List Scheduling of Moldable Parallel Jobs under Precedence Constraints

    The scheduling literature has traditionally focused on a single type of resource (e.g., computing nodes). However, scientific applications in modern High-Performance Computing (HPC) systems process large amounts of data, and hence have diverse requirements on different types of resources (e.g., cores, cache, memory, I/O). All of these resources could potentially be exploited by the runtime scheduler to improve application performance. In this paper, we study multi-resource scheduling to minimize the makespan of computational workflows composed of parallel jobs subject to precedence constraints. The jobs are assumed to be moldable, allowing the scheduler to flexibly select a variable set of resources before execution. We propose a multi-resource, list-based scheduling algorithm, and prove that, on a system with $d$ types of schedulable resources, our algorithm achieves an approximation ratio of $1.619d+2.545\sqrt{d}+1$ for any $d$, and a ratio of $d+O(\sqrt[3]{d^2})$ for large $d$. We also present improved results for independent jobs and for jobs with special precedence constraints (e.g., series-parallel graphs and trees). Finally, we prove a lower bound of $d$ on the approximation ratio of any list scheduling scheme with local priority considerations. To the best of our knowledge, these are the first approximation results for moldable workflows with multiple resource requirements.
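To get a feel for how the general upper bound in this abstract compares with the lower bound of d, the short sketch below simply evaluates the quoted formula for a few values of d. The function name `ratio_bound` is hypothetical, and the code only restates the arithmetic of the abstract; it is not the scheduling algorithm itself.

```python
import math

def ratio_bound(d):
    """Evaluate the quoted upper bound 1.619*d + 2.545*sqrt(d) + 1 for d resource types."""
    if d < 1:
        raise ValueError("d must be a positive number of resource types")
    return 1.619 * d + 2.545 * math.sqrt(d) + 1.0

# Compare the upper bound with the lower bound d stated for list scheduling.
for d in (1, 2, 4, 16, 64):
    bound = ratio_bound(d)
    print(f"d={d:3d}  upper bound={bound:8.3f}  upper bound / d={bound / d:6.3f}")
```

The ratio between this bound and d shrinks toward 1.619 as d grows, which suggests why the abstract reports a separate, tighter asymptotic ratio for large d.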

    When Amdahl Meets Young/Daly

    This paper investigates the optimal number of processors to execute a parallel job, whose speedup profile obeys Amdahl's law, on a large-scale platform subject to fail-stop and silent errors. We combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both error sources. We provide an exact formula to express the execution overhead incurred by a periodic checkpointing pattern of length $T$ and with $P$ processors, and we give first-order approximations for the optimal values $T^*$ and $P^*$ as a function of the individual processor failure rate $\lambda_{\text{ind}}$. A striking result is that $P^*$ is of the order $\lambda_{\text{ind}}^{-1/4}$ if the checkpointing cost grows linearly with the number of processors, and of the order $\lambda_{\text{ind}}^{-1/3}$ if the checkpointing cost stays bounded for any $P$. We conduct an extensive set of simulations to support the theoretical study. The results confirm the accuracy of the first-order approximation under a wide range of parameter settings.
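The two scaling exponents quoted in this abstract can be reproduced numerically under a much simpler model than the paper's. The sketch below assumes fail-stop errors only (no silent errors, no verification), a platform failure rate of P times the individual rate, the classical first-order Young/Daly period sqrt(2C / (P * lambda_ind)), and an Amdahl time factor alpha + (1 - alpha) / P. All function names and parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def overhead(p, alpha, lam_ind, ckpt_cost):
    """First-order expected time per unit of sequential work on p processors."""
    c = ckpt_cost(p)                          # checkpoint cost for p processors
    lam = p * lam_ind                         # platform failure rate (assumed p * lam_ind)
    period = math.sqrt(2.0 * c / lam)         # Young/Daly first-order period
    waste = c / period + lam * period / 2.0   # checkpointing + expected re-execution
    return (alpha + (1.0 - alpha) / p) * (1.0 + waste)

def best_p(alpha, lam_ind, ckpt_cost, p_max=100_000):
    """Brute-force search for the integer processor count with minimal overhead."""
    return min(range(1, p_max + 1),
               key=lambda p: overhead(p, alpha, lam_ind, ckpt_cost))

def scaling_exponent(alpha, ckpt_cost, lam_a=1e-8, lam_b=1e-10):
    """Estimate x such that the best P scales like lam_ind**x, from two failure rates."""
    p_a = best_p(alpha, lam_a, ckpt_cost)
    p_b = best_p(alpha, lam_b, ckpt_cost)
    return math.log(p_b / p_a) / math.log(lam_b / lam_a)

def linear_cost(p):
    return 1.0 * p    # assumed: checkpoint cost grows linearly with p

def bounded_cost(p):
    return 10.0       # assumed: checkpoint cost independent of p

exp_linear = scaling_exponent(0.01, linear_cost)
exp_bounded = scaling_exponent(0.01, bounded_cost)
print(f"linear checkpoint cost:  exponent ~ {exp_linear:.3f}")
print(f"bounded checkpoint cost: exponent ~ {exp_bounded:.3f}")
```

Under these assumptions the estimated exponents come out close to -1/4 and -1/3 respectively, in line with the orders stated in the abstract; the paper's exact formula additionally accounts for silent errors and verification costs.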

    Optimal resilience patterns to cope with fail-stop and silent errors

    This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads.

    Scheduling Independent Tasks with Voltage Overscaling

    In this paper, we discuss several scheduling algorithms to execute independent tasks with voltage overscaling. Given a frequency to execute the tasks, operating at a voltage below threshold leads to significant energy savings but also induces timing errors. A verification mechanism must be enforced to detect these errors. Unlike fail-stop or silent errors, timing errors are deterministic (but unpredictable). For each task, the general strategy is to select a voltage for execution, to check the result, and to select a higher voltage for re-execution if a timing error has occurred, and so on until a correct result is obtained. Switching from one voltage to another incurs a given cost, so it might be efficient to try to execute several tasks at the current voltage before switching to another one. Determining the optimal solution turns out to be unexpectedly difficult. However, we provide the optimal algorithm for a single task, the optimal algorithm when there are only two voltages, and the optimal level algorithm for a set of independent tasks, where a level algorithm is defined as an algorithm that executes all remaining tasks when switching to a given voltage. Furthermore, we show that the optimal level algorithm is in fact globally optimal (among all possible algorithms) when voltage switching costs are linear. Finally, we report a comprehensive set of simulations to assess the potential gain of voltage overscaling algorithms.

    Two-Level Checkpointing and Verifications for Linear Task Graphs

    Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with the double challenge, a two-level checkpointing and rollback recovery approach can be used, with additional verifications for silent error detection. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an external disk). On the contrary, it is possible to use in-memory checkpoints for silent errors, which provide a much smaller checkpointing and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms that are less costly than the guaranteed ones but do not detect all silent errors. In this paper, we show how to combine all of these techniques for HPC applications whose dependency graph forms a linear chain. We present a sophisticated dynamic programming algorithm that returns the optimal solution in polynomial time. Simulation results demonstrate that the combined use of multi-level checkpointing and verifications leads to improved performance compared to the standard single-level checkpointing algorithm.
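The flavor of dynamic programming over a linear chain can be illustrated with a deliberately reduced version of the problem: a single checkpoint level, fail-stop errors only, and no verifications. The sketch below is therefore closer to the single-level baseline mentioned in the abstract than to the paper's two-level algorithm. It assumes exponentially distributed failures of rate `lam`, zero downtime, a recovery of cost `rec` that can itself be hit by a failure, and a mandatory checkpoint after the last task; all names are hypothetical.

```python
import math

def expected_segment_time(work, ckpt, rec, lam):
    """Expected time to run `work` plus a checkpoint under exponential fail-stop errors.

    Assumed model: after each failure, a recovery of cost `rec` is paid and the
    whole segment (work + checkpoint) restarts; there is no downtime.
    """
    return math.exp(lam * rec) * math.expm1(lam * (work + ckpt)) / lam

def place_checkpoints(weights, ckpt, rec, lam):
    """Return (expected makespan, 1-based task indices followed by a checkpoint)."""
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    # best[j]: minimal expected time to finish tasks 1..j with a checkpoint after task j
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # the previous checkpoint sits after task i (0 = chain start)
            cand = best[i] + expected_segment_time(prefix[j] - prefix[i], ckpt, rec, lam)
            if cand < best[j]:
                best[j], prev[j] = cand, i
    # Walk back through the recorded choices to list the checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    return best[n], positions[::-1]

rare_failures = place_checkpoints([10, 10, 10, 10], ckpt=1.0, rec=1.0, lam=1e-9)
frequent_failures = place_checkpoints([10, 10, 10, 10], ckpt=1.0, rec=1.0, lam=0.1)
print(rare_failures[1], frequent_failures[1])
```

With very rare failures the sketch keeps only the final checkpoint, while with frequent failures it checkpoints after every task. The quadratic loop over (i, j) pairs is the part that generalizes; adding a second checkpoint level and verifications would enlarge the state space.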

    Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

    In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing the energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.


    Swivel construction reduces the structural redundancy of a cable-stayed bridge, so its seismic vulnerability during construction is significantly higher than that of a structure built without the swivel method and than that of its own completed state. It is therefore particularly important to study how damage to each component and to the system changes at each stage of the swivel construction of a cable-stayed bridge under different horizontal earthquakes. Based on the construction of the rotary cable-stayed bridge in Haxi Street, a formula for the damage exceedance probability is established from reliability theory, the damage states of the cable-stayed bridge components are calibrated, and a finite element model of the rotating structure is built. The vulnerable parts of the main tower and of the stay cables are identified, and incremental dynamic analysis is carried out. Finally, seismic vulnerability curves are established for the main tower section, the stay cables, and the rotating system. The results show that the vulnerable areas of the H-shaped bridge towers are the abrupt changes in the main tower section near the upper and lower beams, and that the vulnerable stay cables are the long cables anchored to the beam ends and the short cables near the main tower. At the same seismic level, the damage exceedance probability of the vulnerable main tower section under a transverse earthquake is greater than that under a longitudinal earthquake, whereas the damage exceedance probability of the vulnerable stay cables under transverse seismic action is less than that under longitudinal seismic action. For the same damage probability, the ground motion intensity required for the system is up to 0.35g lower than that required for the components. Under the same seismic intensity, the system damage probability is at most 6.60% higher than the component damage probability. The research results have reference significance for the construction of rotating cable-stayed bridges in areas lacking seismic records.

    Assessing general-purpose algorithms to cope with fail-stop and silent errors

    In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bi-criteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bi-criteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via dynamic voltage and frequency scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
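The way a verification before each checkpoint changes the Young/Daly period can be sketched with a first-order waste model. The assumptions here are mine rather than the paper's exact derivation: a fail-stop error is detected immediately and loses half a period on average, a silent error is only caught by the verification at the end of the period and loses the whole period, and errors during checkpoints, verifications, and recoveries are neglected. The function and parameter names are hypothetical.

```python
import math

def first_order_waste(period, ckpt, verif, lam_fail, lam_silent):
    """Approximate fraction of time lost per unit of useful work."""
    resilience_cost = (verif + ckpt) / period    # verification + checkpoint each period
    fail_stop_loss = lam_fail * period / 2.0     # half a period lost on average
    silent_loss = lam_silent * period            # whole period lost, detected at the end
    return resilience_cost + fail_stop_loss + silent_loss

def optimal_period(ckpt, verif, lam_fail, lam_silent):
    """Period minimizing first_order_waste, obtained by setting its derivative to zero."""
    return math.sqrt((verif + ckpt) / (lam_fail / 2.0 + lam_silent))

# With no verification and no silent errors, this reduces to Young/Daly: sqrt(2C / lambda).
young_daly = optimal_period(ckpt=60.0, verif=0.0, lam_fail=1e-5, lam_silent=0.0)

# Illustrative values: adding silent errors and a verification cost shifts the period.
with_silent = optimal_period(ckpt=60.0, verif=10.0, lam_fail=1e-5, lam_silent=1e-5)
print(f"fail-stop only: {young_daly:.1f}, with silent errors: {with_silent:.1f}")
```

Silent errors weigh twice as much as fail-stop errors in the denominator because a full period is lost instead of half of one, which is why the period shortens when the silent error rate grows.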