COMET: Communication-optimised multi-threaded error-detection technique
Relentless technology scaling has made transistors more vulnerable to soft, or transient, errors. To keep systems robust against these, current error detection techniques use different types of redundancy at the hardware or the software level. A consequence of these additional protection mechanisms is that such systems tend to become slower. In particular, software error-detection techniques degrade performance considerably, limiting their uptake. This paper focuses on software redundant multi-threading error detection, a compiler-based technique that makes use of redundant cores within a multi-core system to perform error checking. Implementations of this scheme feature two threads that execute almost the same code: the main thread runs the original code and the checker thread executes code to verify the correctness of the original. The main thread communicates the values that require checking to the checker thread, which uses them in its comparisons. We identify a major performance bottleneck in existing schemes: poorly performing inter-core communication and the generated code associated with it. Our study shows this is a major performance impediment within existing techniques, since the two threads require extremely fine-grained communication, on the order of every few instructions. We alleviate this bottleneck with a series of code generation optimisations at the compiler level. We propose COMET (Communication-Optimised Multi-threaded Error-detection Technique), which improves performance across the NAS parallel benchmarks by 31.4% on average compared to the state-of-the-art, without affecting fault coverage.
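To make the baseline scheme concrete, here is a minimal sketch of software redundant multi-threading, assuming a producer/consumer queue for the fine-grained value communication described above; the function names and the computation are illustrative and do not reproduce COMET's compiler-generated code.

    # Illustrative sketch of software redundant multi-threading (not COMET's
    # actual compiler output): the main thread forwards values that need
    # checking to a checker thread, which recomputes and compares them.
    import queue
    import threading

    check_q = queue.Queue()

    def main_thread(data):
        results = []
        for x in data:
            y = x * x + 1          # original computation
            check_q.put((x, y))    # fine-grained communication to the checker
            results.append(y)
        check_q.put(None)          # sentinel: no more values to check
        return results

    def checker_thread():
        while True:
            item = check_q.get()
            if item is None:
                break
            x, y = item
            if x * x + 1 != y:     # redundant recomputation and comparison
                print(f"soft error detected for input {x}")

    t = threading.Thread(target=checker_thread)
    t.start()
    main_thread(range(1000))
    t.join()

The per-value queue operations in this sketch are exactly the kind of fine-grained inter-core communication whose generated code COMET targets with its optimisations.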
Operation Graph Oriented Correlation of ASIC Chip Internal Information for Hardware Debug
This thesis presents a novel approach to operation-centric tracing for hardware debug, based on retrospective analysis of traces that are distributed across a computer system. To this end, the traces record entries about operations at runtime, and a software tool correlates these entries after a problem has occurred. The tool is based on a generic method that uses identifiers recorded for each operation. Because identifiers change along the path of an operation through the system and different traces record different information, the entries are transformed to find matching entries in other traces. After the correlation, the method reconstructs the operation paths with the help of an operation graph, which describes the subtasks of each operation type and their sequence. With these paths, the designer gets a better overview of chip or system activity and can isolate the cause of a problem faster. The TRACE MATCHER implements the described method and is evaluated with an example bridge chip. The evaluation covers the benefit for hardware debug, the correctness of the reconstructed paths, the performance of the implementation, and the configuration effort. Finally, guidelines for trace and system design describe how matching can be improved by carefully designed operation identifiers.
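As a rough illustration of identifier-based correlation, the sketch below normalises per-trace identifiers to a common key and groups entries into per-operation paths; the trace names, fields, and transformation rules are invented for illustration and are not the TRACE MATCHER's actual configuration.

    # Hypothetical sketch of identifier-based trace correlation: entries from
    # different traces are mapped to a common identifier and then grouped to
    # reconstruct per-operation paths (all names and rules are illustrative).
    def normalise(trace_name, entry):
        # Each trace stores a different form of the operation identifier;
        # a per-trace transformation maps it to a common key.
        if trace_name == "host_if":
            return entry["tag"]              # already the common key
        if trace_name == "device_if":
            return entry["req_id"] >> 4      # e.g. strip sub-transaction bits
        return None

    def correlate(traces):
        paths = {}
        for trace_name, entries in traces.items():
            for entry in entries:
                key = normalise(trace_name, entry)
                paths.setdefault(key, []).append((trace_name, entry))
        # Sort each reconstructed path by timestamp to recover operation order.
        return {k: sorted(v, key=lambda e: e[1]["ts"]) for k, v in paths.items()}

    traces = {
        "host_if":   [{"tag": 7, "ts": 10}, {"tag": 8, "ts": 12}],
        "device_if": [{"req_id": 7 << 4, "ts": 11}],
    }
    print(correlate(traces))

In the thesis' method, the reconstructed entry sequence would additionally be checked against the operation graph, which defines the expected subtasks and their order for each operation type.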
High Availability and Scalability of Mainframe Environments using System z and z/OS as example
Mainframe computers are the backbone of industrial and commercial computing, hosting the most relevant and critical data of businesses. One of the most important mainframe environments is IBM System z with the operating system z/OS. This book introduces the mainframe technology of System z and z/OS with respect to high availability and scalability. It highlights how these properties are provided at different levels of the hardware and software stack to satisfy the needs of large IT organizations.
Water borne transport of high level nuclear waste in very deep borehole disposal of high level nuclear waste
Thesis (S.B.)--Massachusetts Institute of Technology, Dept. of Nuclear Science and Engineering, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 52). The purpose of this report is to examine the feasibility of the very deep borehole experiment and to determine if it is a reasonable method of storing high level nuclear waste for an extended period of time. The objective of this thesis is to determine the escape mechanisms of radionuclides and to determine if naturally occurring salinity gradients could counteract this phenomenon. Because of the large dependence on water density, the relationship between water density and salinity was measured; it agreed with literature values to within 1%. The resulting density is linear in molality, with a slope that depends on the number of ions of the dissolved salt (e.g., CaCl₂ dissociates into 3 ions and NaCl into 2). From the data, it was calculated that within a borehole with a host rock permeability of 10⁻⁵ Darcy, it would take approximately 10⁵ years for the radionuclides to escape. As the rock permeability decreases, the escape time scale increases and the escape fraction decreases exponentially. Due to the conservative nature of the calculations, the actual escape timescale would be closer to 10⁶ years and dominated by I-129 in a reducing atmosphere. The expected borehole salinity values can offset the buoyancy effect due to a 50°C temperature increase. by Dion Tunick Cabeche, S.B.
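As a worked illustration of the reported linear density-salinity relationship, the small sketch below computes solution density as a linear function of molality scaled by the number of dissolved ions; the slope and reference density are placeholders, not the coefficients fitted in the thesis.

    # Illustrative only: density modelled as linear in molality, scaled by the
    # number of ions of the dissolved salt; slope_per_ion and rho_water are
    # placeholder values, not the thesis' measured coefficients.
    def solution_density(molality, n_ions, rho_water=998.2, slope_per_ion=18.0):
        """Approximate solution density in kg/m^3."""
        return rho_water + slope_per_ion * n_ions * molality

    print(solution_density(1.0, 2))  # an NaCl-like salt (2 ions)
    print(solution_density(1.0, 3))  # a CaCl2-like salt (3 ions)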
Low-cost duplication for separable error detection in computer arithmetic
Low-cost arithmetic error detection will be necessary in the future to ensure correct and safe system operation. However, current error detection mechanisms for arithmetic either have high area and energy overheads or are complex and offer incomplete protection against errors. Full duplication is simple, strong, and separable, but often prohibitively costly. Alternative techniques such as arithmetic error coding require lower hardware and energy overheads than full duplication, but at the expense of high design effort and error coverage holes. The goal of this research is to mitigate the deficiencies of duplication and arithmetic error coding to form an error detection scheme that may be readily employed in future systems. The techniques described by this work use a general duplication technique that employs an alternate number system in the duplicate arithmetic unit. These novel dual modular redundancy organizations are referred to as low-cost duplication, and they provide compelling efficiency and coverage advantages over prior arithmetic error detection mechanisms.
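To illustrate the general idea of duplicating an operation in an alternate number system, here is a minimal sketch in which a duplicate adder works in a small residue number system and a separable checker compares the two representations; the moduli, the operation, and the error-injection mechanism are illustrative assumptions, and the dissertation's actual hardware organizations are not reproduced here.

    # Illustrative sketch: the primary result comes from a conventional binary
    # adder, the duplicate from an adder operating in a residue number system
    # (moduli chosen arbitrarily), and a separable checker compares the two.
    MODULI = (255, 256)  # co-prime moduli; dynamic range 255 * 256 = 65280

    def to_rns(x):
        return tuple(x % m for m in MODULI)

    def rns_add(a_rns, b_rns):
        return tuple((a + b) % m for a, b, m in zip(a_rns, b_rns, MODULI))

    def checked_add(a, b, inject_error=False):
        primary = (a + b) % (MODULI[0] * MODULI[1])   # conventional adder
        if inject_error:
            primary ^= 0x10                           # emulate a transient fault
        duplicate = rns_add(to_rns(a), to_rns(b))     # duplicate adder in RNS
        return primary, to_rns(primary) != duplicate  # separable comparison

    print(checked_add(1000, 2000))                    # (3000, False)
    print(checked_add(1000, 2000, inject_error=True)) # error flagged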
Fine-grained containment domains for throughput processors
Continued scaling of semiconductor technology has made modern processors rely on large design margins to guarantee correct operation under worst-case conditions. Design margins appear in the form of higher supply voltage or lower clock frequency, leading to inefficiency. In practice, it is rare to observe such worst-case conditions, and the processor can run at a reduced voltage or higher frequency while experiencing only a few infrequent errors. Recent proposals have used hardware error detectors and recovery mechanisms to detect and recover from these rare errors, a technique known as timing speculation. While this is effective for out-of-order processors with an inherent capability to recover from misspeculation, implementing similar hardware for throughput processors such as Graphics Processing Units (GPUs) is prohibitively costly due to the massive amount of thread context that needs to be preserved. Furthermore, recovery overhead is much higher, since the SIMD (Single Instruction Multiple Data) execution model of GPUs requires multiple threads to roll back together in case of an error.
In this dissertation, I develop a hardware/software co-design approach to enable reduced-margin operation on GPUs that overcomes the limitations of existing techniques. The proposed scheme leverages the hierarchical programming model of GPUs to provide hierarchical and uncoordinated local checkpoint-recovery. By decomposing a program into a hierarchically nested tree of code blocks, which I refer to as containment domains (CDs), the program becomes amenable to automatic analysis and tuning, and an optimal trade-off can be made between preservation and recovery overhead. To aid this optimization process, an analytical model is developed to estimate the performance efficiency of a given application setting at a given error rate. With the analytical model, an exhaustive search can be performed to find the optimal solution. The tunability also allows the proposed scheme to adapt easily to a wide range of error rates, making it future proof against emerging uncertainties in semiconductor design.
The proposed scheme combines software and hardware components to achieve the highest efficiency in preservation, restoration, and recovery. The software components include: 1) an API and runtime that let programmers describe the hierarchy of containment domains within an application and preserve the state required for rollback recovery, and 2) a compiler analysis that automatically inserts preservation routines for register variables. The hardware components include: 1) a stack structure to keep track of recovery program counters (PCs), 2) a set of error containment mechanisms to guarantee that no erroneous data is propagated outside of a containment domain, and 3) an error reporting architecture that keeps track of affected threads and initiates their recovery.
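As a rough sketch of the containment-domain idea described above, the code below shows a hierarchical, locally recovered domain: each domain preserves the state it needs on entry, runs its body (possibly containing nested domains), and re-executes locally when an error is detected, escalating to its parent only if local recovery fails. The function names and the error model are invented for illustration and do not reproduce the dissertation's actual CD API or hardware support.

    # Hypothetical containment-domain sketch (illustrative API, not the real one):
    # state is preserved at domain entry and restored on a contained error.
    import copy
    import random

    def run_cd(state, body, max_retries=3):
        preserved = copy.deepcopy(state)                # preservation on entry
        for _ in range(max_retries):
            try:
                body(state)                             # may contain nested CDs
                return state                            # completed without error
            except RuntimeError:                        # error contained here
                state.clear()
                state.update(copy.deepcopy(preserved))  # uncoordinated local rollback
        raise RuntimeError("escalate recovery to the parent domain")

    def child(state):
        if random.random() < 0.2:                       # emulate a rare timing error
            raise RuntimeError("detected error")
        state["sum"] = state.get("sum", 0) + 1

    def parent(state):
        for _ in range(4):
            run_cd(state, child)                        # nested child domains

    print(run_cd({}, parent))

The trade-off the dissertation tunes is visible even in this toy version: preserving less state per domain makes each rollback cheaper, while the nesting depth and error rate determine where preservation pays off.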
Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms
Dense matrix factorizations, such as LU, Cholesky, and QR, are widely used by scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations that handle both hard and soft errors.
For hard errors, we propose methods based on diskless checkpointing and Algorithm Based Fault Tolerance (ABFT) to provide full matrix protection, including the left and right factors that are normally produced by dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum generated before the factorization is continuously updated by the factorization operations to protect the right factor. In addition, when a fault tolerant MPI environment is not available, we have also integrated the Checkpoint-on-Failure (CoF) mechanism into one-sided dense linear algebra operations such as QR factorization to recover the running stack of the failed MPI process.
Soft errors are more challenging because of silent data corruption, which leads to a large area of erroneous data due to error propagation. Full matrix protection is developed where the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating point weighted checksum scheme and a soft error modeling technique. To allow practical use on large scale systems, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead.
Experimental results on a large scale cluster system and a multicore+GPGPU hybrid system confirm that our hard and soft error fault tolerance algorithms exhibit the expected error correcting capability, low space and performance overhead, and compatibility with double precision floating point operations.
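To make the ABFT checksum idea concrete, here is a minimal NumPy sketch, assuming unit checksum weights and LU-style Gaussian elimination without pivoting: a checksum column A·w is appended before the factorization, the elimination updates it together with the matrix, and afterwards it should equal U·w, so a mismatch reveals corruption of the right factor. Real implementations additionally handle pivoting, the distributed data layout, and floating point rounding, which this sketch does not.

    # Illustrative ABFT checksum sketch (weights, sizes and the elimination
    # variant are assumptions; this is not the dissertation's implementation).
    import numpy as np

    n = 6
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant: pivoting not needed
    w = np.ones(n)                                   # unit checksum weights
    Ac = np.hstack([A, (A @ w)[:, None]])            # append checksum column A @ w

    # In-place Gaussian elimination (no pivoting) on the augmented matrix:
    # the checksum column is updated by the same operations as the matrix.
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = Ac[i, k] / Ac[k, k]
            Ac[i, k:] -= m * Ac[k, k:]

    U = np.triu(Ac[:, :n])
    print("checksum consistent:", np.allclose(Ac[:, n], U @ w))  # detects corruption of U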
A SEASAT report. Volume 1: Program summary
The program background and experiment objectives are summarized, and a description of the organization and interfaces of the project is provided. The mission plan and history are also included, as well as user activities and a brief description of the data system. A financial and manpower summary and preliminary results of the mission are also included.