
    Performance study of synthetic AER generation on CPUs for Real-Time Video based on Spikes

    Address-Event-Representation (AER) is a neuromorphic inter-chip communication protocol that allows real-time virtual massive connectivity between huge numbers of neurons located on different chips. When building multi-chip, multi-layered AER systems it is absolutely necessary to have a computer interface that allows (a) reading AER inter-chip traffic into the computer and visualizing it on screen, and (b) converting a conventional frame-based video stream in the computer into AER and injecting it at some point of the AER structure. This is necessary for testing and debugging complex AER systems. Previous work presented several software methods for converting digital frames into AER format. Those methods were not feasible for real-time conversion at the time because processor performance was insufficient. Nowadays, multi-core processor architectures and cache hierarchies have evolved, and performance is much better than that of the Pentium 4 Mobile of those years. In this paper we study frame-to-AER methods for real-time video applications (40 ms per frame) using modern processor architectures, compilers, and processors oriented towards stand-alone applications (mini-PC processors).
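    For context, a minimal sketch of one possible frame-to-AER strategy (intensity-proportional event generation over a 40 ms frame slot) is shown below; the function name, the linear intensity-to-rate mapping, and the per-pixel event cap are illustrative assumptions, not the exact software methods benchmarked in the paper.

        # Hypothetical sketch: convert one grayscale frame into a list of AER events.
        # Each event is (address, timestamp_us); brighter pixels emit more spikes
        # within the 40 ms frame slot. Illustrative assumption only.
        import numpy as np

        FRAME_SLOT_US = 40_000          # 40 ms per frame (25 fps real-time video)
        MAX_EVENTS_PER_PIXEL = 16       # assumed cap on spikes per pixel per frame

        def frame_to_aer(frame: np.ndarray) -> list[tuple[int, int]]:
            height, width = frame.shape
            events = []
            for y in range(height):
                for x in range(width):
                    # Linear mapping from 8-bit intensity to spike count (assumption).
                    n_spikes = int(frame[y, x] / 255.0 * MAX_EVENTS_PER_PIXEL)
                    if n_spikes == 0:
                        continue
                    address = y * width + x              # flat AER address
                    # Spread the spikes uniformly across the frame slot.
                    step = FRAME_SLOT_US // (n_spikes + 1)
                    events.extend((address, (k + 1) * step) for k in range(n_spikes))
            # AER traffic is injected in timestamp order.
            events.sort(key=lambda e: e[1])
            return events

        if __name__ == "__main__":
            demo_frame = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
            print(len(frame_to_aer(demo_frame)), "events generated")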

    Parallel Flow-Based Hypergraph Partitioning

    We present a shared-memory parallelization of flow-based refinement, which is currently considered the most powerful iterative improvement technique for hypergraph partitioning. Flow-based refinement works on bipartitions, so current sequential partitioners schedule it on different block pairs to improve k-way partitions. We investigate two different sources of parallelism: a parallel scheduling scheme and a parallel maximum-flow algorithm based on the well-known push-relabel algorithm. In addition to thoroughly engineered implementations, we propose several optimizations that substantially accelerate the algorithm in practice, enabling its use on extremely large hypergraphs (up to 1 billion pins). We integrate our approach into the state-of-the-art parallel multilevel framework Mt-KaHyPar and conduct extensive experiments on a benchmark set of more than 500 real-world hypergraphs to show that the partition quality of our code is on par with the highest-quality sequential code (KaHyPar), while being an order of magnitude faster with 10 threads.
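    As a point of reference for the flow computations the refinement relies on, the sketch below shows a plain sequential, textbook-style push-relabel maximum-flow routine on a small directed graph. It is for illustration only and omits the parallelization, heuristics, and hypergraph flow-network construction engineered in Mt-KaHyPar.

        # Minimal sequential push-relabel max-flow sketch (illustration only).
        def max_flow_push_relabel(n, edges, s, t):
            # edges: list of (u, v, capacity); dense matrices keep the sketch short.
            cap = [[0] * n for _ in range(n)]
            for u, v, c in edges:
                cap[u][v] += c
            flow = [[0] * n for _ in range(n)]
            height = [0] * n
            excess = [0] * n
            height[s] = n                     # source starts at height n
            for v in range(n):                # saturate all edges out of the source
                if cap[s][v] > 0:
                    flow[s][v] = cap[s][v]
                    flow[v][s] = -cap[s][v]
                    excess[v] = cap[s][v]
                    excess[s] -= cap[s][v]

            def push(u, v):
                delta = min(excess[u], cap[u][v] - flow[u][v])
                flow[u][v] += delta
                flow[v][u] -= delta
                excess[u] -= delta
                excess[v] += delta

            def relabel(u):
                # Raise u just above its lowest neighbour with residual capacity.
                height[u] = 1 + min(height[v] for v in range(n)
                                    if cap[u][v] - flow[u][v] > 0)

            active = [v for v in range(n) if v not in (s, t) and excess[v] > 0]
            while active:
                u = active.pop()
                while excess[u] > 0:          # discharge u completely
                    pushed = False
                    for v in range(n):
                        if cap[u][v] - flow[u][v] > 0 and height[u] == height[v] + 1:
                            push(u, v)
                            if v not in (s, t) and excess[v] > 0 and v not in active:
                                active.append(v)
                            pushed = True
                            if excess[u] == 0:
                                break
                    if not pushed:
                        relabel(u)
            return sum(flow[s][v] for v in range(n))

        # Tiny usage example: the maximum s-t flow below is 3.
        print(max_flow_push_relabel(4, [(0, 1, 2), (0, 2, 2), (1, 3, 2), (2, 3, 1)], 0, 3))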

    Testability and redundancy techniques for improved yield and reliability of CMOS VLSI circuits

    The research presented in this thesis is concerned with the design of fault-tolerant integrated circuits as a contribution to the design of fault-tolerant systems. The economical manufacture of very large area ICs will necessitate the incorporation of fault-tolerance features, which are routinely employed in current high-density dynamic random access memories. Furthermore, the growing use of ICs in safety-critical applications and/or hostile environments, in addition to the prospect of single-chip systems, will mandate the use of fault-tolerance for improved reliability. A fault-tolerant IC must be able to detect and correct all possible faults that may affect its operation. The ability of a chip to detect its own faults is not only necessary for fault-tolerance, but is also regarded as the ultimate solution to the problem of testing. Off-line periodic testing is selected for this research because it achieves better coverage of physical faults and requires less extra hardware than on-line error detection techniques. Tests for CMOS stuck-open faults are shown to detect all other faults. Simple test sequence generation procedures for the detection of all faults are derived. The test sequences generated by these procedures produce a trivial output, thereby greatly simplifying the task of test response analysis. A further advantage of the proposed test generation procedures is that they do not require the enumeration of faults. The implementation of built-in self-test is considered, and it is shown that the hardware overhead is comparable to that associated with pseudo-random and pseudo-exhaustive techniques, while achieving a much higher fault coverage through the use of the proposed test generation procedures. Consideration of the problem of testing the test circuitry led to the conclusion that complete test coverage may be achieved if separate chips cooperate in testing each other's untested parts. An alternative approach towards complete test coverage would be to design the test circuitry so that it is as distributed as possible and is tested as it performs its function. Fault correction relies on the provision of spare units and a means of reconfiguring the circuit so that the faulty units are discarded. This raises the question of the optimum size of a unit. A mathematical model linking yield and reliability is therefore developed to answer this question and also to study the effects of parameters such as the amount of redundancy, the size of the additional circuitry required for testing and reconfiguration, and the effect of periodic testing on reliability. The stringent requirement on the size of the reconfiguration logic is illustrated by applying the model to a typical example. Another important result concerns the effect of periodic testing on reliability: it is shown that periodic off-line testing can achieve approximately the same level of reliability as on-line testing, even when the time between tests is many hundreds of hours.
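    As background, one standard way such a yield model with redundancy is set up (a hedged sketch under a Poisson defect assumption, not necessarily the thesis's exact formulation) takes each of the n + r identical units, of area A_u, to yield independently, and requires the testing/reconfiguration overhead of area A_o to be fault-free:

        % Illustrative yield-with-redundancy model (assumed Poisson defect density D).
        % A_u: unit area, A_o: overhead area, n: units required, r: spare units.
        \begin{align}
          Y_u &= e^{-A_u D}, \qquad Y_o = e^{-A_o D} \\
          Y_{\text{chip}} &= Y_o \sum_{k=0}^{r} \binom{n+r}{k}\,(1 - Y_u)^{k}\, Y_u^{\,n+r-k}
        \end{align}

    The chip yields when the overhead is good and at most r of the n + r units are defective; growing A_o quickly erodes the benefit of the spares, which is the kind of trade-off the thesis's model is used to quantify.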

    Video Processing Acceleration using Reconfigurable Logic and Graphics Processors

    A vexing question is 'which architecture will prevail as the core feature of the next state-of-the-art video processing system?' This thesis examines the substitutive and collaborative use of two alternatives: reconfigurable logic and graphics processor architectures. A structured approach to executing an architecture comparison is presented; this includes a proposed 'Three Axes of Algorithm Characterisation' scheme and a formulation of performance drivers. The approach is an appealing platform for clearly defining the problem, assumptions and results of a comparison. In this work it is used to resolve the advantageous factors of the graphics processor and reconfigurable logic for video processing, and the conditions determining which one is superior. The comparison results prompt the exploration of the customisable options for the graphics processor architecture. To clearly define the architectural design space, the graphics processor is first identified as part of a wider scope of homogeneous multi-processing element (HoMPE) architectures. A novel exploration tool is described which is suited to the investigation of the customisable options of HoMPE architectures. The tool adopts a systematic exploration approach and a high-level parameterisable system model, and is used to explore pre- and post-fabrication customisable options for the graphics processor. A positive result of the exploration is the proposal of a reconfigurable engine for data access (REDA) to optimise graphics processor performance for video processing-specific memory access patterns. REDA demonstrates the viability of using reconfigurable logic as collaborative 'glue logic' in the graphics processor architecture.

    Alternative Timing in Digital Logic

    For many decades, using a system clock has been the go-to method of timing circuits. CPUs in particular have been at least partially defined by the speed of their clock. As technology moves forward, this is proving more and more problematic. At first, clock rates increased as transistor sizes were reduced. Now, transistor sizes continue to shrink while clock rates remain stable. As a result, the focus has shifted to trying to do more with each cycle. A greater emphasis has been placed on efficiency, because less power draw in each cycle means either less battery drain for mobile devices or more work that can be done within power limitations for circuits with a less transient power supply. To that end, I propose that alternative timing schemes have as-yet untapped potential and warrant further industry focus and research. To demonstrate this, various methods of timing are discussed and analyzed, and a demonstration is provided for techniques that have no available statistics. What follows is an examination of existing and new ideas in circuit timing, with a focus on microprocessors. The first method discussed involves eliminating the clock entirely. The resulting asynchronous circuits are a well-studied and much-discussed idea, previously dismissed as not worth the cost. The progress of processor design in the last few years indicates that a renewed study of asynchronous circuits is warranted. The other option explored is an aperiodic clock. If this elastic clock is one whose period can change from cycle to cycle, instructions with varying worst-case timing can control the clock to run a system closer to average-case time. This method has not received the same attention as asynchronous circuits, so some new ideas are proposed and demonstrated for generating and utilizing elastic clocks. Tests were run on a custom CPU design to prove the elastic clock design viable. The single-cycle processor was implemented with 45 nm technology and simulated using NanoSim. The results show that while the average power increases, the total energy required to execute the test program decreases. The savings are enough to offset the power overhead of the new components. The area overhead is 3% or less, and smaller still if used in more complex designs. Given the complexity of typical pipelined CPUs, the area and power savings of a single-cycle design, combined with the throughput improvement shown by the test, make this an interesting alternative for low-power applications. Other uses of this technology are discussed and logically analyzed.
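    The intuition behind the elastic clock can be made concrete with a small back-of-the-envelope comparison; the instruction mix and delay figures below are invented for illustration and are not measurements from the thesis. A fixed clock must accommodate the slowest instruction on every cycle, while a per-instruction period only pays that cost when it is actually needed.

        # Hypothetical comparison of a fixed clock vs. an elastic (per-instruction)
        # clock for a single-cycle processor. Delays in nanoseconds are invented.
        worst_case_delay_ns = {"add": 1.2, "load": 2.0, "mul": 3.5, "branch": 1.0}

        program = ["load", "add", "add", "mul", "branch", "add", "load", "add"]

        fixed_period = max(worst_case_delay_ns.values())   # clock set by the slowest op
        fixed_time = fixed_period * len(program)

        # Elastic clock: each cycle is stretched only to that instruction's worst case.
        elastic_time = sum(worst_case_delay_ns[op] for op in program)

        print(f"fixed clock:   {fixed_time:.1f} ns")
        print(f"elastic clock: {elastic_time:.1f} ns "
              f"({100 * (1 - elastic_time / fixed_time):.0f}% faster)")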

    Evaluating the Impact of Transition Delay Faults in GPUs

    This work proposes a method to evaluate the effects of transition delay faults (TDFs) in GPUs. The method takes advantage of low-level (i.e., RT- and gate-level) descriptions of a GPU to evaluate the effects of transition delay faults, thus paving the way to modeling them as errors at the instruction level, which can contribute to resilience evaluations of large and complex applications. For this purpose, the paper describes a setup that efficiently simulates transition delay faults. The results allow us to compare their effects with stuck-at faults (SAFs) and perform an error classification correlating these faults with instruction-level errors. We resort to an open-source model of a GPU (FlexGripPlus) and a set of workloads for the evaluation. The experimental results show that, depending on the application code style, TDFs can compromise the operation of an application 1.3 to 11.63 times less than SAFs. Moreover, for all the analyzed applications, a considerable percentage of fault sites in the Integer (5.4% to 51.7%), Floating-point (0.9% to 2.4%), and Special Function units (17.0% to 35.6%) can become critical if affected by a SAF or TDF. Finally, a correlation between the impact of both fault models and the instructions executed by the applications reveals that, for all units, SAFs in the functional units are more prone (45.6% to 60.4%) to propagate errors to the software level than TDFs (17.9% to 58.8%).
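    A hedged sketch of the kind of outcome classification such a fault-injection campaign produces is given below; the categories (masked, silent data corruption, hang) and the comparison against a fault-free golden run are generic fault-injection practice, not the paper's exact FlexGripPlus flow.

        # Illustrative fault-injection outcome classifier (generic practice only).
        # Each injected fault's run is compared against a fault-free ("golden") run.
        from collections import Counter

        def classify(golden_output, faulty_output, faulty_timed_out):
            if faulty_timed_out:
                return "hang"         # execution never completed
            if faulty_output == golden_output:
                return "masked"       # fault had no visible effect
            return "sdc"              # silent data corruption

        # Toy campaign data: (golden, faulty, timed_out) per injected fault.
        runs = [
            ([1, 2, 3], [1, 2, 3], False),
            ([1, 2, 3], [1, 9, 3], False),
            ([1, 2, 3], None, True),
            ([1, 2, 3], [1, 2, 3], False),
        ]

        print(Counter(classify(*run) for run in runs))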

    Design and Testing Strategies for Modular 3-D-Multiprocessor Systems Using Die-Level Through Silicon Via Technology

    An innovative modular 3-D stacked multi-processor architecture is presented. The platform is composed of completely identical stacked dies connected together by through-silicon vias (TSVs). Each die features four 32-bit embedded processors and associated memory modules, interconnected by a 3-D network-on-chip (NoC), which can route packets in the vertical direction. Superimposing identical planar dies minimizes design effort and manufacturing costs, while ensuring high flexibility and reconfigurability. A single die can be used either as a fully testable standalone chip multi-processor (CMP) or integrated in a 3-D stack, increasing the overall core count and consequently the system performance. To demonstrate the feasibility of this architecture, fully functional samples have been fabricated using a conventional UMC 90 nm complementary metal-oxide-semiconductor process and stacked using an in-house, via-last Cu-TSV process. Initial results show that the proposed 3-D CMP is capable of operating at a target frequency of 400 MHz, supporting a vertical data bandwidth of 3.2 Gb/s.
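    A simple way to picture packet routing in such a vertically stacked NoC is dimension-order (XYZ) routing, sketched below; the mesh coordinates and the routing discipline are assumptions for illustration, not necessarily what the fabricated chip implements.

        # Hypothetical XYZ dimension-order routing step for a 3-D mesh NoC
        # (illustrative; the paper does not specify this exact routing algorithm).
        def next_hop(current, destination):
            """Return the output port for one hop: resolve X, then Y,
            then the vertical (TSV) dimension Z."""
            cx, cy, cz = current
            dx, dy, dz = destination
            if dx != cx:
                return "east" if dx > cx else "west"
            if dy != cy:
                return "north" if dy > cy else "south"
            if dz != cz:
                return "up" if dz > cz else "down"      # vertical hop over a TSV
            return "local"                              # packet has arrived

        # Route a packet from core (0, 1, 0) on the bottom die to (1, 0, 2) two dies up.
        pos, dst = (0, 1, 0), (1, 0, 2)
        moves = {"east": (1, 0, 0), "west": (-1, 0, 0), "north": (0, 1, 0),
                 "south": (0, -1, 0), "up": (0, 0, 1), "down": (0, 0, -1)}
        while (port := next_hop(pos, dst)) != "local":
            pos = tuple(p + d for p, d in zip(pos, moves[port]))
            print(port, "->", pos)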