
    Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

    High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10^18 floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance, as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory.
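
    As a back-of-the-envelope illustration of the kind of analytic modeling involved (a minimal sketch, not the thesis's model), the well-known Young/Daly approximation relates the optimal checkpoint interval to the checkpoint cost and the system's mean time between failures; all numbers below are assumptions for illustration only.

```python
import math

def daly_optimal_interval(checkpoint_cost, mtbf):
    """First-order Young/Daly approximation of the optimal checkpoint
    interval: tau ~ sqrt(2 * delta * M) for checkpoint cost delta and
    mean time between failures M (both in seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def expected_overhead(interval, checkpoint_cost, mtbf):
    """Rough expected fraction of time lost to writing checkpoints
    plus rolling back and recomputing after failures."""
    write_frac = checkpoint_cost / interval                   # checkpoint cost
    rework_frac = (interval / 2.0 + checkpoint_cost) / mtbf   # avg. lost work
    return write_frac + rework_frac

# Illustrative numbers: 5-minute checkpoints; an MTBF of 2 hours today
# versus ~12 minutes on a 10x larger next-generation system.
for mtbf in (7200.0, 720.0):
    tau = daly_optimal_interval(300.0, mtbf)
    print(f"MTBF {mtbf:6.0f} s -> interval {tau:5.0f} s, "
          f"overhead {expected_overhead(tau, 300.0, mtbf):.0%}")
```

    The overhead climbing toward and past 100% as the MTBF shrinks is exactly the trend that motivates rollback avoidance.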

    Lazy Fault Recovery for Redundant MPI

    Distributed systems (DS), in which multiple computers share a workload across a network, are used everywhere, from data-intensive computation to storage and machine learning. DS provide a relatively cheap and efficient solution that combines stability with improved performance for computation-intensive applications. In a DS, faults and failures are the norm, not the exception. Data corruption can occur at any moment, especially since a DS usually consists of hundreds to thousands of units of commodity hardware. The sheer number and commodity quality of these components virtually guarantee that at any given time some of them will not be working and some will not recover from failure. DS can experience problems caused by application bugs, operating system bugs, and failures of disks, memory, connectors, networking, power supplies, and other components; therefore, constant monitoring and failure detection are fundamental, and automatic recovery must be integral to the system. One of the most commonly used programming interfaces for DS is the Message Passing Interface (MPI). Unfortunately, MPI does not natively support fault detection or recovery. In this thesis, we build a recovery mechanism based on replicas that works on top of the asynchronous fault detection implemented in previous work. Results show that our recovery implementation is successful and that the overhead in execution time is minimal.
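
    To make the "lazy" aspect concrete, here is a purely hypothetical illustration in plain Python (invented names, not the thesis's MPI implementation): each primary rank is mirrored by a replica, and a failed primary is replaced only when communication actually targets it.

```python
# Hypothetical sketch of lazy replica-based recovery (names invented;
# not the thesis's MPI code): each primary rank r is mirrored by the
# replica rank r + n_primaries, and a failed primary is replaced only
# when some communication actually needs it.

class LazyReplicaMap:
    def __init__(self, n_primaries):
        self.n = n_primaries
        self.failed = set()    # primaries flagged by the fault detector
        self.promoted = {}     # failed primary rank -> replica rank

    def report_failure(self, rank):
        # Asynchronous detector marks the rank; no recovery work yet.
        self.failed.add(rank)

    def resolve(self, rank):
        # Called on each send/receive: promote the replica lazily,
        # only when the failed rank is actually needed.
        if rank in self.failed and rank not in self.promoted:
            self.promoted[rank] = rank + self.n
        return self.promoted.get(rank, rank)

ranks = LazyReplicaMap(n_primaries=4)
ranks.report_failure(2)   # detector flags rank 2; nothing happens yet
print(ranks.resolve(1))   # -> 1 (healthy rank, untouched)
print(ranks.resolve(2))   # -> 6 (replica promoted on first use)
```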

    Resilience for large ensemble computations

    With the increasing power of supercomputers, ever more detailed models of physical systems can be simulated, and ever larger problem sizes can be considered for any kind of numerical system. During the last twenty years, the performance of the fastest clusters went from the teraFLOPS domain (ASCI RED: 2.3 teraFLOPS) to the pre-exaFLOPS domain (Fugaku: 442 petaFLOPS), and we will soon have the first supercomputer with a peak performance cracking the exaFLOPS barrier (El Capitan: 1.5 exaFLOPS). Ensemble techniques are experiencing a renaissance with the availability of those extreme scales, and recent techniques in particular, such as particle filters, will benefit from it. Current ensemble methods in climate science, such as ensemble Kalman filters, exhibit a linear dependency between the problem size and the ensemble size, while particle filters show an exponential dependency. Nevertheless, with the prospect of massive computing power come challenges such as power consumption and fault tolerance. The mean time between failures shrinks with the number of components in the system, and failures are expected every few hours at exascale. In this thesis, we explore and develop techniques to protect large ensemble computations from failures. We present novel approaches in differential checkpointing, elastic recovery, fully asynchronous checkpointing, and checkpoint compression. Furthermore, we design and implement a fault-tolerant particle filter with pre-emptive particle prefetching and caching. Finally, we design and implement a framework for the automatic validation and application of lossy compression in ensemble data assimilation. Altogether, we present five contributions in this thesis, where the first two improve state-of-the-art checkpointing techniques and the last three address the resilience of ensemble computations. The contributions are stand-alone fault-tolerance techniques; however, they can also be used to improve the properties of each other. For instance, we utilize elastic recovery (2nd contribution) for failure mitigation in an online ensemble data assimilation framework (3rd contribution), and we build our validation framework (5th contribution) on top of our particle filter implementation (4th contribution). We further demonstrate that our contributions improve resilience and performance with experiments on various architectures such as Intel, IBM, and ARM processors.
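
    As a minimal sketch of one of the checkpointing ideas above, differential checkpointing writes only the state that changed since the previous checkpoint. The block size and hashing scheme below are illustrative assumptions, not the thesis's implementation:

```python
import hashlib

BLOCK = 4096  # bytes per block; an illustrative choice

def block_hashes(buf):
    # Fingerprint each fixed-size block of the application state.
    return [hashlib.sha1(buf[i:i + BLOCK]).digest()
            for i in range(0, len(buf), BLOCK)]

def differential_checkpoint(buf, prev_hashes):
    """Return (dirty_blocks, new_hashes): only blocks whose fingerprint
    changed since the last checkpoint need to be written to storage."""
    new_hashes = block_hashes(buf)
    dirty = [(i, bytes(buf[i * BLOCK:(i + 1) * BLOCK]))
             for i, h in enumerate(new_hashes)
             if i >= len(prev_hashes) or h != prev_hashes[i]]
    return dirty, new_hashes

state = bytearray(16 * BLOCK)
_, hashes = differential_checkpoint(state, [])  # first checkpoint: 16 blocks
state[5 * BLOCK] = 0xFF                         # application mutates 1 block
dirty, hashes = differential_checkpoint(state, hashes)
print(len(dirty))                               # -> 1 block to rewrite
```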

    Implications of aero-engine deterioration for a military aircraft's performance

    World developments have led the armed forces of many countries to become more aware of how their increasingly stringent financial budgets are spent. A major expenditure for military authorities is aero-engines. Some in-service deterioration in any mechanical device, such as an aircraft's gas-turbine engine, is inevitable. However, its extent and rate depend upon the qualities of design and manufacture, as well as on the maintenance/repair practices followed by the users. Each mode of deterioration adversely affects performance and shortens the reliable operational life of the engine, thereby resulting in higher life-cycle costs. This adverse effect on life-cycle cost can be reduced by determining realistic fuel and life usage and by better understanding the effects of each such deterioration on operational performance. Subsequently, improvements can be made in the design and manufacture of adversely affected components, as well as in maintenance/repair and operating practices. Using computer simulations of a military aircraft's mission profiles (each consisting of several flight segments), the consequences of engine deterioration for the aircraft's operational effectiveness, fuel usage, and life usage are predicted. These predictions help in making wiser management decisions (such as whether to remove an aero-engine from the aircraft for maintenance or to continue using it with some changes to the aircraft's mission profile) for the various types and extents of engine deterioration. Hence, improved engine utilization, lower overall life-cycle costs and optimal mission operational effectiveness for a squadron of aircraft can be achieved.

    Investigation of near and post stall behaviour of axial compression systems

    The design of modern gas-turbine engines is continuously being improved towards better performance, better efficiency and reduced cost. This trend in aero-engine design requires compression systems that produce higher pressure ratios and thus have more highly loaded blades and closer spacing between blade rows. Such designs are more prone to aerodynamic instabilities, and the consequent stall and surge can be catastrophic. The majority of the research conducted on compressor stall and surge is limited to old designs with lower pressure ratios or to single-stage compression systems. In this thesis, the near and post stall behaviour of a modern multi-stage high-speed intermediate pressure compressor rig and an aero-engine three-shaft compression system are studied in detail. The main objective is to develop and validate reliable CFD models to predict surge and rotating stall and to shed light on the underlying physical mechanisms of these phenomena. CFD computations were performed to gain understanding of the current capability to model the flow behaviour of a multi-stage compressor rig near the stall condition. Two turbulence models were tested and an extensive grid discretization study was performed. In order to improve the prediction of the compressor's stability boundary, a modification to the widely known Spalart-Allmaras turbulence model is proposed. Subsequently, unsteady CFD computations were carried out to evaluate the impact of flow unsteadiness on the performance prediction of this compressor rig. It was found that for operating conditions characterized by non-axisymmetric flow features, an unsteady full-annulus model is required to predict the compressor performance. For low speeds, these flow features develop over a wide range of operating conditions; when the compressor operates at high speeds, they are limited to operating conditions near the stability boundary. The above findings were validated against experimental results. Early stages of this research revealed that numerical calibration of a CFD surge computation in a three-shaft engine is a challenging task due to compressor matching. Hence, an iterative methodology for matching the compressors was introduced and validated against experimental data. This study considered a surge event where the engine was initially operating at mid-power condition. When comparing the numerical results with measured data, it was found that the engine bleed system has a major impact on the aerodynamic loading predictions in the core system; this system therefore needs to be considered by component designers when accounting for robustness to surge loads. The post-stall response of a three-shaft engine compression system initially operating at its design point was also investigated. It was found that the maximum surge over-pressures are caused by the combined effect of a surge-induced shock wave and high-pressure gas travelling towards the core inlet during the surge blow-down period. Furthermore, it was demonstrated that the maximum surge loads are obtained for a surge event initiated by a fuel spike. Finally, a cheaper computational approach to model surge in axial compression systems is proposed, using an unsteady single-passage model to predict the flow behaviour during the surge event. After comparison with full-annulus results for three different scenarios, it was concluded that the single-passage model is capable of predicting the blow-down period of surge, which is characterized by a long period of flow reversal. The model fails to predict the correct time and length scales during surge onset and during the transition from reverse to forward flow at the beginning of recovery, time instants that are characterized by non-axisymmetric flow features. However, the single-passage model correlates well with full-annulus results for estimating average values of static pressure and mass flow during surge. This can drastically reduce simulation times from months to days during compressor surge analysis.
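
    To illustrate the matching problem conceptually (the compressor maps below are invented linear stand-ins, not the thesis's CFD-based procedure), matching two compressors amounts to iterating on the interface condition until mass-flow continuity holds:

```python
# Illustrative compressor-matching iteration: bisect on the interface
# pressure until the upstream compressor delivers exactly the mass flow
# the downstream compressor can swallow. Both maps are invented
# stand-ins for what would be CFD or rig-map evaluations.

def mdot_upstream(p_interface):
    return 100.0 - 0.8 * p_interface   # delivery falls with back-pressure

def mdot_downstream(p_interface):
    return 20.0 + 0.6 * p_interface    # demand rises with inlet pressure

def match(lo=0.0, hi=100.0, tol=1e-6):
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mdot_upstream(mid) > mdot_downstream(mid):
            lo = mid    # upstream over-delivers: raise the back-pressure
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = match()
print(round(p, 3), round(mdot_upstream(p), 3), round(mdot_downstream(p), 3))
```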

    The Terminator: an AI-based framework to handle dependability threats in large-scale distributed systems

    With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by the scale of the resources they possess, containing a large number of sophisticated hardware components that are tightly integrated. This scale and design complexity inherently contribute sources of uncertainty, i.e., dependability threats that perturb the system during application execution. During operation, these HPC systems generate massive amounts of log messages that capture the health status of their various components. Several previous works have leveraged those systems' logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems: (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC system-independent, i.e., the technique can be used to detect errors in any HPC system. In addition, two novel self-supervised transformer neural networks are developed for failure prediction, thereby obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second technique, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine recasts the typical regression formulation of LTTF as a novel multi-class classification problem, using a novel oversampling method for online time-based task training. Results on datasets from six real-world HPC clusters show that our approaches significantly outperform the state-of-the-art methods on various metrics.
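
    As a minimal sketch of sentiment-based error detection on logs (assuming an off-the-shelf classifier from the Hugging Face transformers library; the thesis's actual models, training, and thresholds differ), strongly negative log lines are flagged as candidate errors regardless of the system's log format:

```python
from transformers import pipeline

# Generic off-the-shelf sentiment classifier; the 0.9 threshold and the
# sample log lines are illustrative assumptions.
classifier = pipeline("sentiment-analysis")

log_lines = [
    "node c3-0c1s2n1 health check passed",
    "machine check exception: uncorrectable DIMM error on node c3-0c1s2n1",
    "lustre: connection restored to service MDT0000",
]

for line in log_lines:
    result = classifier(line)[0]
    # Strongly negative sentiment is treated as a candidate error event,
    # independent of any system-specific log vocabulary.
    if result["label"] == "NEGATIVE" and result["score"] > 0.9:
        print("possible error:", line)
```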

    From detection to optimization: impact of soft errors on high-performance computing applications

    As high-performance computing (HPC) continues to progress, constraints on HPC system design force error handling up to higher levels in the software stack. Of the types of errors facing HPC, soft errors that silently corrupt system or application state are among the most severe. Understanding the behavior of HPC applications in the presence of soft errors is critical to the effective utilization of HPC systems. This understanding can guide the development of algorithm-based error detection informed by application characteristics drawn from fault-injection and error-propagation studies. Furthermore, the realization that applications are tolerant of small errors enables optimizations such as lossy compression of high-cost data transfers. Lossy compression introduces small, user-controllable amounts of error when compressing data, reducing data size before expensive transfers and thereby saving time. This dissertation investigates and improves the resiliency of HPC applications to soft errors, and explores lossy compression as a new form of optimization for expensive, time-consuming data transfers.
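
    A typical building block of such fault-injection studies (a minimal NumPy sketch under assumed parameters, not the dissertation's injector) is flipping a single bit in a floating-point value and observing how the error propagates:

```python
import numpy as np

def flip_bit(arr, index, bit):
    """Inject a soft error: flip one bit of the 64-bit representation
    of arr[index] in place (arr must be a float64 array)."""
    raw = arr.view(np.uint64)
    raw[index] ^= np.uint64(1) << np.uint64(bit)

x = np.linspace(0.0, 1.0, 8)
clean = x.sum()
flip_bit(x, index=3, bit=52)        # flip the lowest exponent bit
print(abs(x.sum() - clean))         # magnitude of the propagated error
```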

    Mitigation of failures in high performance computing via runtime techniques

    As machines increase in scale, it is predicted that failure rates of supercomputers will correspondingly increase. Even though the mean time to failure (MTTF) of an individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the shrinking size of transistors has been critical to the increase in capacity of supercomputers, but the smaller transistors become, the more frequently silent data corruptions (SDCs) are likely to occur. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime-system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of the system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions, without user intervention, based on environmental changes; and protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are the development of a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operations to reduce the overhead of checkpointing, a runtime-system technique for automatic checkpoint and restart without user intervention, a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions, and a framework called FlipBack that provides targeted protection against silent data corruption at low cost.
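
    The essence of the semi-blocking idea, overlapping the slow part of a checkpoint with application execution, can be sketched in a few lines (a hypothetical illustration using Python threads, not the thesis's runtime-system implementation):

```python
import copy, pickle, threading

def checkpoint_async(state, path):
    """Semi-blocking checkpoint sketch: a short blocking phase copies
    the state; the slow write to storage then overlaps with execution."""
    snapshot = copy.deepcopy(state)       # brief blocking phase

    def write():
        with open(path, "wb") as f:       # long phase runs off the
            pickle.dump(snapshot, f)      # application's critical path

    writer = threading.Thread(target=write)
    writer.start()
    return writer                         # join before the next checkpoint

state = {"step": 42, "field": [0.0] * 1_000_000}
writer = checkpoint_async(state, "ckpt.pkl")
state["step"] += 1                        # computation continues meanwhile
writer.join()
```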

    Large-Scale Simulations of Complex Turbulent Flows: Modulation of Turbulent Boundary Layer Separation and Optimization of Discontinuous Galerkin Methods for Next-Generation HPC Platforms

    The separation of spatially evolving turbulent boundary layer flow near regions of adverse pressure gradients has been the subject of numerous studies in the context of flow control. Although many studies have demonstrated the efficacy of passive flow control devices, such as vortex generators (VGs), in reducing the size of the separated region, the interactions between the salient flow structures produced by the VG and those of the separated flow are not fully understood. Here, wall-resolved large-eddy simulation of a model problem of flow over a backward-facing ramp is studied, with a submerged, wall-mounted cube used as a canonical VG. In particular, the turbulent transport that results in the modulation of the separated flow over the ramp is investigated by varying the size and location of the VG and the spanwise spacing between multiple VGs, which in turn are expected to modify the interactions between the VG-induced flow structures and those of the separated region. The horseshoe vortices produced by the cube entrain the freestream turbulent flow towards the plane of symmetry. These localized regions of high vorticity correspond to regions of turbulent kinetic energy production, which effectively transfer energy from the freestream to the near-wall regions. Numerical simulations indicate that: (i) the gradients and the fluctuations scale with the size of the cube and thus lead to more effective modulation for large cubes; (ii) for a given cube height, different upstream cube positions affect the behavior of the horseshoe vortex: when placed too close to the leading edge, the horseshoe vortex is not sufficiently strong to affect the large-scale structures of the separated region, and when placed too far, the dispersed core of the streamwise vortex is unable to modulate the flow over the ramp; (iii) if the spanwise spacing between neighboring VGs is too small, the counter-rotating vortices are not sufficiently strong to affect the large-scale structures of the separated region, and if the spacing is too large, the flow modulation is similar to that of an isolated VG. Turbulent boundary layer flows are inherently multiscale, and numerical simulations of such systems often require high spatial and temporal resolution to capture the unsteady flow dynamics accurately. While innovations in computer hardware and distributed computing have enabled advances in the modeling of such large-scale systems, computations of many practical problems of interest remain infeasible, even on the largest supercomputers. The need for high accuracy and the evolving heterogeneous architecture of next-generation high-performance computing centers have spurred interest in the development of high-order methods. While the new class of recovery-assisted discontinuous Galerkin (RADG) methods can provide arbitrarily high orders of accuracy, the large number of degrees of freedom increases the costs associated with the arithmetic operations performed and the amount of data transferred on-node. The purpose of the second part of this thesis is to explore optimization strategies to improve the parallel efficiency of RADG. A cache data-tiling strategy is investigated for polynomial orders 1 through 6, which enhances the arithmetic intensity of RADG to make better use of on-node floating-point capability.
    In addition, a power-aware compute framework is proposed, based on an analysis of the power-performance trade-offs of changing from double- to single-precision floating-point types; energy savings of 5 W per node are observed, which suggests that a transprecision framework will likely offer a better power-performance balance on modern HPC platforms.
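
    The double-to-single-precision trade-off behind that transprecision suggestion is easy to quantify in miniature (a NumPy sketch with assumed data, not the thesis's RADG solver): half the bytes moved, at the price of float32 rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(1_000_000)      # double-precision field (assumed data)

u32 = u.astype(np.float32)              # transprecision variant
print(u.nbytes // u32.nbytes)           # -> 2: half the data to move
rel_err = np.abs(u32.astype(np.float64) - u) / np.abs(u)
print(rel_err.max())                    # ~6e-8: float32 rounding error
```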

    Technoeconomic study of engine deterioration and compressor washing for military gas turbine engines

    Despite spending much of their operating life in clear air, aircraft gas turbine engines are naturally prone to deterioration as they are generally not fitted with air filters. Engines are particularly at risk during takeoff and landing, and whilst operating in areas of pollution, sand, dust storms, etc. The build-up of contaminants, especially on the compressor surfaces, leads to a dramatic reduction in compressor efficiency, which gives rise to a loss of available power, increased fuel consumption and increased exhaust gas temperature. These conditions can lead to flight delays, inspection failures, withdrawal from service, increased operating costs and safety compromises. With the growing interest in life-cycle costs for gas turbine engines, both engine manufacturers and operators are investigating the trade-offs between performance improvements and associated maintenance costs. This report introduces the problem of output and efficiency degradation in two aero gas turbine engines (the T56–A–15 and the F110–GE–129) caused by various deterioration factors. The causes are broadly discussed and the effects on powerplant performance are simulated and analyzed. One of the key factors leading to performance losses during operation of these engines is compressor fouling. The fouling can come from a wide variety of sources: hydrocarbons from fuel and lubricating oils, volcanic ash, pollen, marine aerosols, dust, smoke, pollution, etc. These fouling sources act as bonding agents for the solid contaminants, 'gluing' them to the compressor surfaces. The degradation in power output, the increase in fuel consumption and the additional time needed to carry out a typical mission are assessed, and an economic analysis is attempted in order to quantify the additional costs that arise because of this specific form of deterioration. The effect of compressor fouling can be mitigated by frequent cleaning, improving efficiency and hence power output, saving fuel and prolonging engine life. Compressor cleaning is presented in detail, and the implementation of on-wing off-line cleaning of the F110 engine is investigated from a technical and economic standpoint. Finally, according to the results obtained, the optimal frequency of compressor washing for the F110 engine is estimated, in order to eliminate safety compromises, improve performance and reduce the engine's life-cycle cost.
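
    The washing-frequency estimate can be illustrated with a toy cost model (every number below is an invented assumption, not the report's data): if fuel burn rises roughly linearly with fouling between washes, the interval that minimizes total cost per flying hour balances the fuel penalty against the washing cost.

```python
# Toy economics of compressor washing; every constant is an assumption.
WASH_COST = 2000.0      # cost of one on-wing wash
FUEL_COST = 900.0       # clean-engine fuel cost per flying hour
FOULING_RATE = 0.0004   # fractional fuel-burn penalty added per hour

def cost_per_hour(interval_hours):
    # Linear fouling build-up: the average penalty over a wash cycle is
    # half the end-of-cycle penalty.
    avg_penalty = 0.5 * FOULING_RATE * interval_hours
    return FUEL_COST * (1.0 + avg_penalty) + WASH_COST / interval_hours

best = min(range(10, 2000, 10), key=cost_per_hour)
print(best, round(cost_per_hour(best), 2))   # optimal wash interval (hours)
```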