
    Statistical Reliability Estimation of Microprocessor-Based Systems

    What is the probability that the execution state of a given microprocessor running a given application is correct, in a certain working environment with a given soft-error rate? Answering this question with fault injection can be very expensive and time consuming. This paper proposes the baseline for a new methodology, based on microprocessor error probability profiling, that aims at estimating fault injection results without the need for a typical fault injection setup. The methodology rests on two main ideas: a one-time fault-injection analysis of the microprocessor architecture to characterize the probability that each of its instructions executes successfully in the presence of a soft error, and a static, very fast analysis of the control and data flow of the target software application to compute its probability of success. The presented work goes beyond the dependability evaluation problem; it also has the potential to become the backbone of new tools that help engineers choose the hardware and software architecture that structurally maximizes the probability of correct execution of the target software.
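
    A minimal sketch of the profiling idea, assuming hypothetical per-opcode success probabilities and independent instruction failures: once each instruction class has been characterized by the one-time fault-injection analysis, the success probability of a straight-line trace is the product of the per-instruction probabilities. The opcode names and probability values below are invented for illustration.

        import math

        # Hypothetical per-opcode success probabilities, as would be
        # obtained from the one-time fault-injection characterization.
        P_SUCCESS = {"add": 0.9999, "ld": 0.9990, "st": 0.9992, "beq": 0.9995}

        def trace_success_probability(trace):
            # Assuming independent failures, the probability that the whole
            # trace executes correctly is the product over its instructions.
            return math.prod(P_SUCCESS[op] for op in trace)

        print(trace_success_probability(["ld", "add", "st", "beq"]))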

    Low-Cost On-Chip Clock Jitter Measurement Scheme

    In this paper, we present a low-cost, on-chip digital clock jitter measurement scheme for high-performance microprocessors. It enables in situ jitter measurement during the test or debug phase. It provides very high measurement resolution and accuracy, despite the possible presence of power supply noise (a major source of clock jitter), at low area and power cost. The achieved resolution scales with the technology node and can in principle be increased as much as desired, at low additional cost in terms of area overhead and power consumption. We show that, for high-performance microprocessors employing ring oscillators (ROs) to measure process parameter variations (PPVs), our jitter measurement scheme can be implemented by reusing part of such ROs, thus allowing clock jitter to be measured at a very limited cost increase compared with PPV measurement alone, and with no impact on parameter variation measurement resolution.
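
    The measurement scheme itself is hardware, but as a reference for the quantity being measured, the sketch below computes cycle-to-cycle jitter as the RMS of consecutive-period differences, one common definition; this is generic post-processing, not the paper's circuit, and the sample values are invented.

        import math

        def cycle_to_cycle_jitter(periods):
            # RMS of the difference between consecutive clock periods,
            # a common definition of cycle-to-cycle jitter.
            diffs = [b - a for a, b in zip(periods, periods[1:])]
            return math.sqrt(sum(d * d for d in diffs) / len(diffs))

        # Hypothetical period samples in picoseconds
        print(cycle_to_cycle_jitter([1000.0, 1002.5, 999.0, 1001.2, 998.7]))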

    A review of High Performance Computing foundations for scientists

    The increase of existing computational capabilities has made simulation emerge as a third discipline of Science, lying midway between the experimental and purely theoretical branches [1, 2]. Simulation enables the evaluation of quantities which would otherwise be inaccessible, helps to improve experiments, and provides new insights into the systems being analysed [3-6]. Knowing the fundamentals of computation can be very useful for scientists, for it can help them improve the performance of their theoretical models and simulations. This review includes some technical essentials useful to this end, and it is devised as a complement for researchers whose education is focused on scientific issues rather than technological aspects. In this document we attempt to discuss the fundamentals of High Performance Computing (HPC) [7] in a way that is easy to understand without much prior background. We sketch the way standard computers and supercomputers work, and discuss distributed computing as well as essential aspects to take into account when running scientific calculations on computers.

    Software-Based Self-Test of Set-Associative Cache Memories

    Embedded microprocessor cache memories suffer from limited observability and controllability, creating problems during in-system tests. This paper presents a procedure to transform traditional march tests into software-based self-test programs for set-associative cache memories with LRU replacement. Among all the cache blocks in a microprocessor, testing instruction caches represents a major challenge due to limitations in two areas: 1) test patterns, which must be composed of valid instruction opcodes, and 2) test result observability, since results can only be observed through the outcomes of executed instructions. For these reasons, the proposed methodology concentrates on the implementation of test programs for instruction caches. The main contribution of this work lies in the possibility of applying state-of-the-art memory test algorithms to embedded cache memories without introducing any hardware or performance overhead, while guaranteeing the detection of the typical faults arising in nanometer CMOS technologies.
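
    As a reference for the kind of march test being transformed, the sketch below runs the classic March C- algorithm over a flat memory model; mapping these accesses onto instruction-cache lines through valid opcodes and observable results is the paper's contribution and is not attempted here.

        # The classic March C- algorithm on a flat memory model.  A real
        # SBST version would map these reads/writes onto cache lines via
        # carefully chosen instruction addresses.
        MARCH_C_MINUS = [
            ("up",   ["w0"]),
            ("up",   ["r0", "w1"]),
            ("up",   ["r1", "w0"]),
            ("down", ["r0", "w1"]),
            ("down", ["r1", "w0"]),
            ("down", ["r0"]),
        ]

        def run_march(memory):
            errors = []
            for direction, ops in MARCH_C_MINUS:
                addresses = range(len(memory))
                if direction == "down":
                    addresses = reversed(addresses)
                for addr in addresses:
                    for op in ops:
                        kind, value = op[0], int(op[1])
                        if kind == "w":
                            memory[addr] = value
                        elif memory[addr] != value:  # read and compare
                            errors.append((addr, op))
            return errors

        print(run_march([0] * 16))  # fault-free memory: no errors reported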

    Faster arithmetic for number-theoretic transforms

    We show how to improve the efficiency of the computation of fast Fourier transforms over F_p, where p is a word-sized prime. Our main technique is optimisation of the basic arithmetic, in effect decreasing the total number of reductions modulo p, by making use of a redundant representation for integers modulo p. We give performance results showing a significant improvement over Shoup's NTL library.
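
    The paper's redundant representation is not reproduced here, but as an illustration of the general idea of trading exact reductions for a wider residue range, the sketch below implements Shoup's well-known trick for multiplying by a fixed constant modulo a word-sized prime: the raw result lands in the redundant range [0, 2p), so the final correction can be deferred. Variable names are ours.

        BITS = 64
        MASK = (1 << BITS) - 1

        def precompute(w, p):
            # One-time cost per constant: w' = floor(w * 2^64 / p)
            return (w << BITS) // p

        def mul_fixed(x, w, w_pre, p):
            q = (x * w_pre) >> BITS        # approximate quotient
            return (x * w - q * p) & MASK  # redundant result in [0, 2p)

        p, w = (1 << 61) - 1, 123456789
        w_pre = precompute(w, p)
        x = 987654321
        assert mul_fixed(x, w, w_pre, p) % p == (x * w) % p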

    Evaluation of Design Tools for Rapid Prototyping of Parallel Signal Processing Algorithms

    Digital signal processing (DSP) has become a popular method for handling not only signal processing but also communications and control system applications. A DSP application of interest to the Air Force is high-speed avionics processing. The real-time computing requirements of avionics processing exceed the capabilities of current single-chip DSP processors, and parallelization of multiple DSP processors is a solution to such requirements. Designing and implementing a parallel DSP algorithm has been a lengthy process, often requiring different design tools and extensive programming experience. Through the use of integrated software development tools, rapid prototyping becomes possible by simulating algorithms, generating code for workstations or DSP microprocessors, and generating hardware description language code for hardware synthesis. This research examines the use of one such tool, the Signal Processing WorkSystem (SPW) by the Alta Group of Cadence Design Systems, Inc., and how SPW supports the rapid prototyping process from avionics algorithm design through simulation and hardware implementation. Throughout this process, SPW is evaluated as an aid to the avionics designer in meeting design objectives and evaluating tradeoffs to find the best blend of efficiency and effectiveness. By designing a two-dimensional fast Fourier transform algorithm as a specific avionics algorithm and exploring implementation options, SPW is shown to be a viable rapid prototyping solution, allowing an avionics designer to focus on design trade-offs instead of implementation details while using parallelization to meet real-time application requirements.
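
    As a reference for why the two-dimensional FFT parallelizes well, it decomposes into independent 1-D transforms over rows and then columns, so each pass can be distributed across DSP processors. A minimal numpy sketch of the row-column decomposition (not SPW-generated code):

        import numpy as np

        def fft2_row_column(x):
            # Row-column decomposition: every 1-D FFT within a pass is
            # independent, so rows (then columns) can be farmed out to
            # parallel processors.
            rows_done = np.fft.fft(x, axis=1)
            return np.fft.fft(rows_done, axis=0)

        a = np.random.rand(8, 8)
        assert np.allclose(fft2_row_column(a), np.fft.fft2(a))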

    Scalable Approach for Power Droop Reduction During Scan-Based Logic BIST

    The generation of significant power droop (PD) during at-speed test performed by Logic Built-In Self-Test (LBIST) is a serious concern for modern ICs. In fact, the PD originated during test may delay signal transitions of the circuit under test (CUT): an effect that may be erroneously recognized as a delay fault, with consequent erroneous generation of test fails and an increase in yield loss. In this paper, we propose a novel scalable approach to reduce the PD during at-speed test of sequential circuits with scan-based LBIST using the launch-on-capture scheme. This is achieved by reducing the activity factor of the CUT through proper modification of the test vectors generated by the LBIST of sequential ICs. Our scalable solution allows us to reduce PD to a value similar to that occurring during the CUT's in-field operation, without increasing the number of test vectors required to achieve a target fault coverage (FC). We present a hardware implementation of our approach that requires limited area overhead. Finally, we show that, compared with recent alternative solutions providing a similar PD reduction, our approach enables a significant reduction (by more than 50%) in the number of test vectors, and thus in test time, to achieve a target FC.
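
    The vector-modification scheme itself is not reproduced here; as a simplified illustration of the quantity being controlled, the sketch below scores a sequence of scan vectors by toggle count, a rough proxy for the switching activity that drives power droop. The vectors are invented.

        def toggles(v1, v2):
            # Bits that differ between consecutive vectors; each differing
            # bit is a potential transition in the circuit under test.
            return bin(v1 ^ v2).count("1")

        def activity_factor(vectors, width):
            # Average fraction of bits toggling per vector pair.
            total = sum(toggles(a, b) for a, b in zip(vectors, vectors[1:]))
            return total / (width * (len(vectors) - 1))

        # Hypothetical 8-bit scan vectors
        print(activity_factor([0b10101010, 0b01010101, 0b01010111], width=8))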

    Hardware-software co-design of an iris recognition algorithm

    This paper describes the implementation of an iris recognition algorithm based on hardware-software co-design. The system architecture consists of a general-purpose 32-bit microprocessor and several slave coprocessors that accelerate the most intensive calculations. The whole iris recognition algorithm has been implemented on a low-cost Spartan 3 FPGA, achieving a significant reduction in execution time when compared to a conventional software-based application. Experimental results show that, at a clock speed of 40 MHz, an IrisCode is obtained in less than 523 ms from a 640x480-pixel image, which is just 20% of the total time needed by a software solution running on the same microprocessor embedded in the architecture, i.e. roughly a fivefold speedup.

    Performance Models for Electronic Structure Methods on Modern Computer Architectures

    Electronic structure codes are computationally intensive scientific applications used to probe and elucidate chemical processes at an atomic level. Maximizing the performance of these applications on any given hardware platform is vital in order to facilitate larger and more accurate computations. An important part of this endeavor is the development of protocols for measuring performance, and models to describe that performance as a function of system architecture. This thesis makes contributions in both areas, with a focus on shared memory parallel computer architectures and the Gaussian electronic structure code.

    Shared memory parallel computer systems are increasingly important as hardware manufacturers are unable to extract performance improvements by increasing clock frequencies; instead the emphasis is on using multi-core processors to provide higher performance. These processor chips generally have complex cache hierarchies, and may be coupled together in multi-socket systems which exhibit highly non-uniform memory access (NUMA) characteristics. This work seeks to understand how cache characteristics and memory/thread placement affect the performance of electronic structure codes, and to develop performance models that can describe and predict code performance by accounting for these effects.

    A protocol for performing memory and thread placement experiments on NUMA systems is presented, and its implementation under both the Solaris and Linux operating systems is discussed. A placement distribution model is proposed and subsequently used both to guide memory/thread placement experiments and as an aid in the analysis of the results obtained from them.

    To describe single-threaded performance as a function of cache blocking, a simple linear performance model is investigated for use when computing the electron repulsion integrals that lie at the heart of virtually all electronic structure methods. A parametric cache variation study is performed by combining parameters obtained for the linear performance model on existing hardware with instruction and cache miss counts obtained by simulation, and predictions are made of performance as a function of cache architecture. Extension of the linear performance model to describe multi-threaded performance on complex NUMA architectures is discussed and investigated experimentally, and the use of dynamic page migration to improve locality is also considered.

    Finally, the use of large-scale electronic structure calculations is demonstrated in a series of calculations studying the charge distribution of a single positive ion solvated within a shell of water molecules of increasing size.
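
    A minimal sketch of fitting the kind of linear performance model described above, with invented numbers standing in for measured cache-miss counts and run times:

        import numpy as np

        # Hypothetical measurements: cache misses (from hardware counters
        # or simulation) versus integral-evaluation run time.
        misses = np.array([1.0e6, 2.5e6, 4.0e6, 6.0e6])
        time_s = np.array([0.42, 0.55, 0.71, 0.90])

        # Least-squares fit of the model T = t_base + t_miss * misses
        t_miss, t_base = np.polyfit(misses, time_s, 1)

        # Predict run time for a simulated cache producing 3.2e6 misses
        print(t_base + t_miss * 3.2e6)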

    Innovative Techniques for Testing and Diagnosing SoCs

    We rely upon the continued functioning of many electronic devices for our everyday welfare, and these devices usually embed integrated circuits that keep getting cheaper and smaller while offering improved features. Nowadays, microelectronics can integrate a working computer with CPU, memories, and even GPUs on a single die, namely a System-on-Chip (SoC). SoCs are also employed in automotive safety-critical applications, but they need to be tested thoroughly to comply with reliability standards, in particular the ISO 26262 functional safety standard for road vehicles. The goal of this PhD thesis is to improve SoC reliability by proposing innovative techniques for testing and diagnosing its internal modules: CPUs, memories, peripherals, and GPUs. The proposed approaches, in the order they appear in this thesis, are as follows:

    1. Embedded Memory Diagnosis: Memories are dense and complex circuits which are susceptible to design and manufacturing errors, so it is important to understand fault occurrence in the memory array. In practice, the logical and physical array representations differ because of design optimizations that add enhancements to the device, namely scrambling. This part proposes an accurate memory diagnosis through a software tool able to analyze test results, unscramble the memory array, map failing syndromes to cell locations, perform cumulative analysis, and formulate a final fault model hypothesis. Several SRAM failing syndromes were analyzed as case studies, gathered on an industrial automotive 32-bit SoC developed by STMicroelectronics. The tool displayed defects virtually, and the results were confirmed by photographs taken through a microscope.

    2. Functional Test Pattern Generation: The key to a successful test is the pattern applied to the device. Patterns can be structural or functional; the former usually rely on embedded test modules targeting manufacturing errors and are only effective before shipping the component to the client, while the latter can be applied in mission mode with minimal impact on performance but are penalized by high generation time. Functional test patterns, however, can serve different goals in functional mission mode. Part III of this thesis proposes three functional test pattern generation methods for CPU cores embedded in SoCs, targeting different test purposes:
        a. Functional Stress Patterns: suitable for maximizing functional stress during operational-life tests and burn-in screening, for an optimal device reliability characterization.
        b. Functional Power-Hungry Patterns: suitable for determining functional peak power, used to strictly limit the power of structural patterns during manufacturing tests, thus reducing premature device over-kill while delivering high test coverage.
        c. Software-Based Self-Test (SBST) Patterns: combine the potential of structural patterns with that of functional ones, allowing periodic execution in mission mode. In addition, an external hardware module communicating with a devised SBST was proposed; it increases fault coverage by 3% by testing critical hardly functionally testable faults not covered by conventional SBST patterns.
    An automatic functional test pattern generator exploiting an evolutionary algorithm that maximizes metrics related to stress, power, and fault coverage was employed in the above approaches to quickly generate the desired patterns. The approaches were evaluated on two industrial cases developed by STMicroelectronics: an 8051-based SoC and a 32-bit Power Architecture SoC. Results show that generation time was reduced by up to 75% compared with older methodologies while significantly increasing the desired metrics.

    3. Fault Injection in GPGPUs: Fault injection mechanisms in semiconductor devices are suitable for generating structural patterns, testing and activating mitigation techniques, and validating robust hardware and software applications. GPGPUs are known for fast parallel computation, used in high performance computing and advanced driver assistance, where reliability is the key point. Moreover, GPGPU manufacturers do not release design description code, so commercial fault injectors cannot use a GPGPU model, leaving costly radiation tests as the only available resource. In the last part of this thesis, we propose a software-implemented fault injector able to inject bit-flips into memory elements of a real GPGPU (a minimal sketch of the basic bit-flip operation follows this abstract). It exploits a software debugger tool and the C-CUDA grammar to wisely determine fault spots and apply bit-flip operations to program variables. The goal is to validate robust parallel algorithms by studying fault propagation or activating the redundancy mechanisms they may embed. The effectiveness of the tool was evaluated on two robust applications: redundant parallel matrix multiplication and a floating-point Fast Fourier Transform.
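
    A minimal sketch of the basic bit-flip operation such an injector performs, assuming a 32-bit floating-point target variable; the debugger integration and C-CUDA parsing of the actual tool are not reproduced.

        import random
        import struct

        def flip_bit(value, bit):
            # Reinterpret a 32-bit float as an integer, flip one bit, and
            # reinterpret back: the core operation of a bit-flip injector.
            raw = struct.unpack("<I", struct.pack("<f", value))[0]
            return struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))[0]

        # Inject a single random bit-flip into a program variable
        print(flip_bit(3.14159, random.randrange(32)))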