10 research outputs found

    A Pattern Language for High-Performance Computing Resilience

    High-performance computing (HPC) systems provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance of the order of a quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment make the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified the well-known techniques for handling faults, errors and failures that have been devised, applied and improved upon over the past three decades in the form of design patterns. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack. Comment: Proceedings of the 22nd European Conference on Pattern Languages of Program

    Application of elasto-magnetic based stress sensors for measurements of cable tension force in cable-stayed bridge

    Recently, a novel stress sensor, which utilizes the elasto-magnetic (EM) effect of ferromagnetic materials, has been developed to measure stress in steel cables and wires. In this study, the effectiveness of these EM-based stress sensors for monitoring the cable tension force of a real-scale cable-stayed bridge is investigated. The reference forces were used to calibrate and validate cable tension force measurements from the EM sensors. Tension force variations of two test cables during the second tensioning work on Hwa-Myung Bridge were monitored using the EM sensors.

    CLEAN-ECC: High Reliability ECC for Adaptive Granularity Memory System

    Adaptive-granularity memory architectures have been considered mainly because of the main memory bottleneck and power efficiency. Meanwhile, highly reliable protection schemes are becoming popular, especially in large computing systems. Unfortunately, conventional ECC mechanisms, including Chipkill, require a large number of symbols to guarantee strong protection with acceptable overhead. We propose a novel memory protection scheme called CLEAN (Chipkill-LEvel reliable and Access granularity Negotiable), which enables us to balance the conflicting demands of fine-grained (FG) access and strong and efficient ECC. To close a potentially significant detection coverage gap due to CLEAN's detection mechanism coupled with permanent faults, we design a simple mechanism, access granularity enforcement. By enforcing coarse-grained (CG) access, we obtain only the advantage of higher protection, comparable to Chipkill, and give up adaptive access granularity. CLEAN showed Chipkill-level reliability as well as improvement in performance, system and memory power efficiency by up to 11.8%, 10.8% and 64.9% with mixes of SPEC2006 benchmarks.

    Comparative Field Study of Cable Tension Measurement for a Cable-Stayed Bridge

    Cable tension is one of the important indexes of cable integrity as well as bridge stability and can be measured by various tension measurement methods. In this study, three widely used methods (i.e., the lift-off test, electromagnetic sensor method, and vibration method) have been implemented for two multistrand cables of a cable-stayed bridge under construction. The test bridge is Hwamyung Bridge in Korea, which has a prestressed concrete box girder. The field tests are executed during the second tensioning stage just after the installation of the key segment. The tensions are estimated before and after tensioning the cable and 5 days later (i.e., after finishing the tensioning of all the cables). The tensions measured by the three methods are compared with the design tension of the tensioning stage, and all three methods show very good performance in accuracy with minimal difference. Their cost and difficulty are compared based on test experiences. Additionally, an improved vibration method, which ignores the apparent negative bending stiffness identified from measurement errors, is proposed and validated in this test by its improved accuracy.

    Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems

    This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis and show that containment domains are superior to both checkpoint restart and redundant execution approaches.

    Field application of elasto-magnetic stress sensors for monitoring of cable tension force in cable-stayed bridges

    Recently, a novel stress sensor, which utilizes the elasto-magnetic (EM) effect of ferromagnetic materials, has been developed to measure stress in steel cables and wires. In this study, the effectiveness of these EM-based stress sensors for monitoring the cable tension force of a real-scale cable-stayed bridge was investigated. Two EM stress sensors were installed on two selected multi-strand cables in Hwa-Myung Bridge, Busan, South Korea. A conventional lift-off test was conducted to obtain reference cable tension forces of the two test cables. The reference forces were used to calibrate and validate cable tension force measurements from the EM sensors. Tension force variations of the two test cables during the second tensioning work on Hwa-Myung Bridge were monitored using the EM sensors. Numerical simulations were conducted to compare and verify the monitoring results. Based on the results, the effectiveness of EM sensors for accurate field monitoring of the cable tension force of cable-stayed bridges is discussed.