
    Parallel Implementation of Interval Matrix Multiplication

    Two main and not necessarily compatible objectives when implementing the product of two dense matrices with interval coefficients are accuracy and efficiency. In this work, we focus on an implementation for multicore architectures. One direction successfully explored to gain performance is the representation of intervals by their midpoints and radii rather than the classical representation by endpoints. Computing with the midpoint-radius representation enables the use of optimized floating-point BLAS, so the implementation benefits directly from the performance of the BLAS routines. Several variants of interval matrix multiplication have been proposed that correspond to various trade-offs between accuracy and efficiency, including efficient ones proposed by Rump in 2012. However, in order to guarantee that the computed result encloses the exact one, these efficient algorithms rely on an assumption about the order of execution of floating-point operations that is not satisfied by most BLAS implementations. In this paper, an algorithm for the interval matrix product is proposed that satisfies this assumption. Furthermore, several optimizations are proposed, and the implementation on a multicore architecture compares reasonably well with a non-guaranteed implementation based on MKL, Intel's optimized BLAS: the overhead factor is usually below 2 and never exceeds 3. The implementation also exhibits good scalability.
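
    A minimal sketch of the midpoint-radius product idea described above, using NumPy in place of an actual BLAS call; it ignores the directed-rounding bookkeeping needed for a rigorously guaranteed enclosure, which is exactly the issue the paper addresses.

```python
# Midpoint-radius interval matrix product: intervals are stored as <midpoint, radius>
# matrices so that the inner products map onto ordinary floating-point matrix
# multiplies (i.e., onto BLAS gemm calls). This sketch omits the controlled-rounding
# terms required for a rigorous enclosure.
import numpy as np

def interval_matmul(mid_a, rad_a, mid_b, rad_b):
    """Return <mid_c, rad_c> covering {A @ B : |A - mid_a| <= rad_a, |B - mid_b| <= rad_b}."""
    mid_c = mid_a @ mid_b
    # |A @ B - mid_a @ mid_b| <= rad_a @ (|mid_b| + rad_b) + |mid_a| @ rad_b
    rad_c = rad_a @ (np.abs(mid_b) + rad_b) + np.abs(mid_a) @ rad_b
    return mid_c, rad_c

# Example: 2x2 interval matrices with small radii.
m_a = np.array([[1.0, 2.0], [3.0, 4.0]]); r_a = np.full((2, 2), 1e-3)
m_b = np.array([[0.5, -1.0], [2.0, 0.25]]); r_b = np.full((2, 2), 1e-3)
mc, rc = interval_matmul(m_a, r_a, m_b, r_b)
```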

    Chiminey: Reliable Computing and Data Management Platform in the Cloud

    Moving scientific experiments that are embarrassingly parallel, long-running and data-intensive into a cloud-based execution environment is a desirable, though complex, undertaking for many researchers. The management of such virtual environments is cumbersome and not necessarily within the core skill set of scientists and engineers. We present here Chiminey, a software platform that enables researchers to (i) run applications on both traditional high-performance computing and cloud-based computing infrastructures, (ii) handle failure during execution, (iii) curate and visualise execution outputs, (iv) share such data with collaborators or the public, and (v) search for publicly available data. Comment: Preprint, ICSE 201

    Ultra Reliable Computing Systems

    For high-security and safety applications as well as general-purpose applications, it is necessary to have ultra-reliable computing systems. This dissertation describes our system of self-testable and self-repairable digital devices, in particular EPLDs (Electrically Programmable Logic Devices). In addition to significantly improving the reliability of digital systems, our self-healing and reconfigurable system design with added repair capability can also provide higher yields, lower testing costs, and faster time-to-market for the semiconductor industry. The digital system in our approach is composed of blocks that realize combinational and sequential circuits using GALs (Generic Array Logic Devices). We describe three techniques for fault location and fault repair in these devices. To evaluate these methods and compare them with devices that have no self-repair capability, we simulated the self-repair algorithms. Our simulations show that the lifetime of a GAL-based EPLD that uses our multiple self-repairing methods is longer than that of a GAL-based EPLD that uses a single self-repair method or no self-repair method. Specifically, our work demonstrates that the lifetime of a GAL can be increased by adding extra columns to the AND array and extra output ORs. It also indicates how many extra columns and extra ORs a GAL needs and which self-repairing method should be used to guarantee a given lifetime. Thus, we can estimate an ideal point where maximum reliability is reached at minimum cost.
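
    As a rough illustration of the kind of lifetime simulation described above (not the dissertation's simulator), the toy Monte Carlo model below assumes independent, exponentially distributed column lifetimes and a device that survives until its spare columns are exhausted; all rates and counts are invented.

```python
# Toy Monte Carlo lifetime model: every AND-array column (including spares) can fail
# independently with exponential lifetime; the device works until more columns have
# failed than the spares can replace. Parameters are illustrative assumptions only.
import random

def device_lifetime(n_columns=32, n_spares=4, column_mtbf=10_000.0):
    """Hours until the (n_spares + 1)-th column failure occurs."""
    failure_times = sorted(random.expovariate(1.0 / column_mtbf)
                           for _ in range(n_columns + n_spares))
    return failure_times[n_spares]

def mean_lifetime(trials=10_000, **kw):
    return sum(device_lifetime(**kw) for _ in range(trials)) / trials

# Compare no redundancy against four spare columns.
print(mean_lifetime(n_spares=0), mean_lifetime(n_spares=4))
```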

    A small-scale testbed for large-scale reliable computing

    High-performance computing (HPC) systems frequently suffer errors and failures from hardware components that negatively impact the performance of jobs run on these systems. We analyzed system logs from two HPC systems at Purdue University and created statistical models for memory and hard disk errors. We created a small-scale error injection testbed (using a customized QEMU build, libvirt, and Python) for HPC application programmers to test and debug their programs in a faulty environment, so that programmers can write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models used to drive the injection.
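
    A minimal sketch of the kind of statistically driven injection loop such a testbed implements, assuming the libvirt Python bindings for guest access; inject_memory_error is a hypothetical stand-in for the hooks in the customized QEMU build, and the exponential inter-arrival model stands in for the fitted error models.

```python
# Statistically driven fault-injection loop: draw error inter-arrival times from a
# distribution (exponential here, as a placeholder for a fitted model) and inject
# them into a QEMU/KVM guest managed through libvirt.
import random
import time
import libvirt

def inject_memory_error(domain):
    # Hypothetical hook: a real testbed would flip bits in guest memory via the
    # modified QEMU; here we only log the event.
    print(f"injecting memory error into {domain.name()}")

def run_campaign(domain_name="hpc-guest", mean_hours_between_errors=2.0,
                 n_errors=5, seconds_per_simulated_hour=1.0):
    conn = libvirt.open("qemu:///system")   # connect to the local hypervisor
    dom = conn.lookupByName(domain_name)    # guest running the HPC application
    for _ in range(n_errors):
        wait = random.expovariate(1.0 / mean_hours_between_errors)
        time.sleep(wait * seconds_per_simulated_hour)  # compressed time scale
        inject_memory_error(dom)
    conn.close()
```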

    Energy-Efficient and Reliable Computing in Dark Silicon Era

    Dark silicon denotes the phenomenon that, due to thermal and power constraints, the fraction of transistors that can operate at full frequency decreases with each technology generation. Moore's law and Dennard scaling were coupled for five decades to deliver commensurate exponential performance, first via single-core and later via multi-core designs. However, recalculating Dennard scaling for recent small technology nodes shows that the ongoing multi-core growth demands exponentially increasing thermal design power to achieve a linear performance increase. This process hits a power wall, which raises the amount of dark or dim silicon on future multi-/many-core chips more and more. Furthermore, the increasing number of transistors on a single chip and the susceptibility to internal defects and aging phenomena, which are exacerbated by high chip thermal density, make monitoring and managing chip reliability before and after its activation a necessity. The approaches and experimental investigations proposed in this thesis focus on two main tracks: 1) power awareness and 2) reliability awareness in the dark silicon era; the two tracks are later combined. On the first track, the main goal is to increase the returns in terms of key chip design metrics, such as performance and throughput, while the maximum power limit is honored. In fact, we show that by managing power in the presence of dark silicon, all the traditional benefits of following Moore's law can still be achieved in the dark silicon era, albeit to a lesser extent. On the reliability-awareness track, we show that dark silicon can be considered an opportunity to be exploited for several benefits, namely lifetime extension and online testing. We discuss how dark silicon can be exploited to guarantee that the system lifetime stays above a given target value and, furthermore, how it can be exploited to apply low-cost, non-intrusive online testing to the cores. After demonstrating power and reliability awareness in the presence of dark silicon, two approaches are discussed as case studies in which power and reliability awareness are combined. The first approach demonstrates how chip reliability can be used as a supplementary metric for power-reliability management, while the second provides a trade-off between workload performance and system reliability by simultaneously honoring the given power budget and target reliability.
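
    As a loose illustration of power-aware core activation under a fixed budget (not an algorithm from the thesis), the sketch below greedily powers on the most energy-efficient cores until the chip power budget is reached, leaving the rest dark; all power and throughput figures are invented.

```python
# Greedy dark-silicon activation: under a fixed chip power budget, switch on the
# cores with the best throughput-per-watt; the remaining cores stay dark.
def pick_active_cores(cores, power_budget_w):
    """cores: list of (core_id, power_w, throughput); returns ids to power on."""
    active, used = [], 0.0
    for core_id, power, throughput in sorted(cores, key=lambda c: c[2] / c[1], reverse=True):
        if used + power <= power_budget_w:
            active.append(core_id)
            used += power
    return active

# Eight "big" cores (2.5 W, throughput 1.0) and eight "small" cores (1.0 W, 0.5).
cores = [(i, 2.5, 1.0) for i in range(8)] + [(8 + i, 1.0, 0.5) for i in range(8)]
print(pick_active_cores(cores, power_budget_w=12.0))  # efficient cores are chosen first
```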

    Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1

    Systems for Strategic Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process, supported by appropriate automated tools, must be used to assure that the system will meet design objectives. This report describes an investigation of the methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures, using candidate SDI weapons-to-target assignment algorithms as workloads, were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed, and capabilities required for both individual tools and an integrated toolset were identified.

    Distributed, layered and reliable computing nets to represent neuronal receptive fields

    Receptive fields of retinal and other sensory neurons show a large variety of spatiotemporal, linear and nonlinear types of responses to local stimuli. In visual neurons, these responses present either asymmetric sensitive zones or center-surround organization. In most cases, the nature of the responses suggests the existence of a kind of distributed computation prior to the integration by the final cell, which is evidently supported by the anatomy. We describe a new kind of discrete and continuous filters to model the computations taking place in the receptive fields of retinal cells. To show their performance in the analysis of different non-trivial neuron-like structures, we use a computer tool specifically programmed by the authors for that purpose. This tool is also extended to study the effect of lesions on the overall performance of our model nets.
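
    The classical difference-of-Gaussians kernel below is a standard illustration of the center-surround organization mentioned above; it is not the authors' discrete/continuous filter formulation, and the parameters are arbitrary.

```python
# Difference-of-Gaussians (DoG) kernel: an excitatory Gaussian center minus a broader
# inhibitory surround, the textbook model of a center-surround receptive field.
import numpy as np

def dog_kernel(size=15, sigma_center=1.0, sigma_surround=3.0):
    """Square DoG kernel of shape (size, size)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    center = np.exp(-r2 / (2 * sigma_center**2)) / (2 * np.pi * sigma_center**2)
    surround = np.exp(-r2 / (2 * sigma_surround**2)) / (2 * np.pi * sigma_surround**2)
    return center - surround

# A neuron-like response to a local stimulus is the correlation of the stimulus patch
# with this kernel: positive in the center, negative in the surround.
kernel = dog_kernel()
```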

    High level design proof of a reliable computing platform

    The main objectives are: to establish a hardware/software platform for ultra-reliable computing; to use a fault-tolerant computer architecture; to use formal methods to prevent design and implementation errors; and to construct a reliability model to quantify the reliability estimate. The results show that ultra-reliable control systems are hard to achieve; a simple fault-tolerant design is postulated; a formal specification of the design is constructed; and preliminary correctness proofs are obtained.
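
    As a textbook-style illustration of the kind of reliability estimate such a model produces (not the platform's actual model), the sketch below compares a single processor with a triple-modular-redundant configuration under a constant failure rate; the failure rate and mission time are arbitrary.

```python
# Simplex vs. triple-modular-redundant (TMR) reliability under constant failure rate.
import math

def simplex_reliability(failure_rate_per_hour, mission_hours):
    """R(t) = exp(-lambda * t) for a single processor."""
    return math.exp(-failure_rate_per_hour * mission_hours)

def tmr_reliability(failure_rate_per_hour, mission_hours):
    """TMR with a perfect voter survives while at least 2 of 3 modules work:
    R_TMR = 3R^2 - 2R^3."""
    r = simplex_reliability(failure_rate_per_hour, mission_hours)
    return 3 * r**2 - 2 * r**3

lam, t = 1e-4, 10.0  # illustrative failure rate (per hour) and a 10-hour mission
print(simplex_reliability(lam, t), tmr_reliability(lam, t))
```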