12 research outputs found

    Analysis and Optimization for Pipelined Asynchronous Systems

    Most microelectronic chips used today--in systems ranging from cell phones to desktop computers to supercomputers--operate in basically the same way: they synchronize the operation of their millions of internal components using a clock that is distributed globally. This global clocking is becoming a critical design challenge in the quest for building chips that offer increasingly greater functionality, higher speed, and better energy efficiency. As an alternative, asynchronous or clockless design obviates the need for global synchronization; instead, components operate concurrently and synchronize locally only when necessary. This dissertation focuses on one class of asynchronous circuits: application-specific stream processing systems (i.e., those that take in a stream of data items and produce a stream of processed results). High-speed stream processors are a natural match for many high-end applications, including 3D graphics rendering, image and video processing, digital filters and DSPs, cryptography, and networking processors. This dissertation aims to make the design, analysis, optimization, and testing of circuits in the chosen domain both fast and efficient. Although much of the groundwork has already been laid by years of past work, my work identifies and addresses four critical missing pieces: i) fast performance analysis for estimating the throughput of a fine-grained pipelined system; ii) automated and versatile design space exploration; iii) a full suite of circuit-level modules that connect together to implement a wide variety of system behaviors; and iv) testing and design-for-testability techniques that identify and target the types of errors found only in high-speed pipelined asynchronous systems. I demonstrate these techniques on a number of examples, ranging from simple applications that allow for easy comparison to hand-designed alternatives to more complex systems, such as a JPEG encoder. I also demonstrate these techniques through the design and test of a fully asynchronous GCD demonstration chip.

    Bottleneck analysis and alleviation in pipelined systems: A fast hierarchical approach

    Fast bottleneck detection and elimination is an important component of any design flow that aims at producing high-throughput systems. Bottlenecks can be difficult to find and correct, because their causes are diverse and often subtle. In this paper, we build on our recent method for performance analysis to develop a method for bottleneck identification and alleviation for pipelined asynchronous systems. More specifically, this paper makes two contributions. First, we introduce a method that, given a throughput goal, identifies which parts of the pipelined system constrain its throughput. Each such bottleneck is categorized based on the type of structural transformation that could potentially alleviate it: increase degree of pipelining (stage splitting, stage duplication, and loop unrolling); decrease forward latency (stage merging and parallelization); and perform slack matching. The second contribution is a method that guides the user to systematically apply these modifications to alleviate the bottlenecks and reach a target throughput goal. We have validated the bottleneck analysis method on several examples and were able to attain the desired throughput goal in each case through iterative application of our bottleneck alleviation method. Runtimes were negligible in all cases (less than 50 ms).

    Automated Microarchitectural Exploration for Achieving Throughput Targets in Pipelined Asynchronous Systems

    This paper presents a systematic approach for microarchitectural exploration in pipelined asynchronous systems, with the goal of achieving a specified throughput target while minimizing a given cost function (based on energy, area, etc.). The method includes a general framework that (i) allows for a rich, extensible set of microarchitectural transformations for improving throughput; and (ii) can handle a variety of cost functions, such as area, energy, Eτ², and the energy-area product. In general, the space of transformations that can be applied to a given circuit is potentially infinite because an arbitrarily long sequence of transformations may be applicable. To compound the challenge, the value of the given cost function can change non-monotonically as successive transformations are applied (e.g., some transformations increase area, while others decrease area), thereby making it difficult to apply a typical branch-and-bound approach to prune the search space. Our method employs simple but effective heuristic search strategies (including greedy, lookahead, and breadth-first). A key contribution is to identify commutativity of certain transformations, thereby pruning the design space significantly. The approach was automated and applied to a number of examples. Various throughput targets were assumed: from 50% to 20x throughput improvement. In each example, the approach was successful in meeting the throughput target.

    Loop pipelining for high-throughput stream computation using self-timed rings

    We present a technique for increasing the throughput of stream processing architectures by removing the bottlenecks caused by loop structures. We implement loops as self-timed pipelined rings that can operate on multiple data sets concurrently. Our contribution includes a transformation algorithm which takes as input a high-level program and gives as output the structure of an optimized pipeline ring. Our technique handles nested loops and is further enhanced by loop unrolling. Simulations run on benchmark examples show a 1.3 to 4.9x speedup without unrolling and a 2.6 to 9.7x speedup with twofold loop unrolling.
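To illustrate why letting several data sets circulate in a ring at once can raise throughput, here is a minimal sketch based on a common simplified model of self-timed rings. It is not the transformation algorithm from the paper. The function name, the parameters, and the assumption that every stage has the same forward and reverse latency are all hypothetical simplifications chosen for illustration.

```python
def ring_throughput(n_stages, n_tokens, fwd_latency, rev_latency):
    """Data items passing a fixed point of the ring per time unit (toy model).

    Assumes identical stages: fwd_latency is the time for a data item to
    advance one stage, rev_latency the time for an empty slot ("hole") to
    move one stage backwards. These are illustrative assumptions.
    """
    if not 0 < n_tokens < n_stages:
        # An empty ring carries no data; a full ring has no hole to move into.
        return 0.0
    # Few data items: each one must travel around the whole ring on its own.
    data_limited = n_tokens / (n_stages * fwd_latency)
    # Few holes: data items stall waiting for free slots to come around.
    hole_limited = (n_stages - n_tokens) / (n_stages * rev_latency)
    # The two limits cross at a peak of 1 / (fwd_latency + rev_latency).
    return min(data_limited, hole_limited)


if __name__ == "__main__":
    for k in (1, 2, 5, 9):
        print(k, ring_throughput(10, k, 1.0, 1.0))
```

In this toy model, a 10-stage ring holding one data set is limited by the time that single item needs to circulate, while the same ring holding five data sets reaches the peak rate. Adding more data sets beyond that point lowers throughput again because too few empty slots remain.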

    Low-Overhead Testing of Delay Faults in High-Speed Asynchronous Pipelines

    We propose a low-overhead method for delay fault testing in high-speed asynchronous pipelines. The key features of our work are: (i) testing strategies can be administered using low-speed testing equipment; (ii) testing is minimally intrusive, i.e., very little testing hardware needs to be added; (iii) testing methods are extended to pipelines with forks and joins, which is an important first step to testing pipelines with arbitrary topologies; (iv) test pattern generation takes into account the likely event that one delay fault causes several bits of data to become corrupted; and (v) test generation can leverage existing stuck-at ATPG tools. In describing our testing strategy, we use examples of faults from three very different high-speed pipeline styles: MOUSETRAP, GasP, and high-capacity (HC) pipelines. In addition, we give an in-depth example, including test pattern generation, for both linear and non-linear MOUSETRAP pipelines.

    PixelFlex2: A Comprehensive, Automatic, Casually-Aligned Multi-Projector Display

    We introduce PixelFlex2, our newest scalable wall-sized, multi-projector display system. For it, we had to solve most of the difficult problems left open by its predecessor, PixelFlex, a proof-of-concept demonstration driven by a large, multi-headed SGI graphics system. PixelFlex2 retains the achievements of PixelFlex (high performance through single-pass rendering, single-pixel accuracy for geometric blending with only casual placement of projectors), while adding a) higher performance and scalability with a Linux PC-cluster, b) application support with either the distributed-rendering framework of Chromium or a performance-oriented, parallel-process framework supported by a proprietary API, c) improved geometric calibration by using a corner finder for feature detection, and d) photometric calibration with a single conventional camera using high dynamic range imaging techniques rather than an expensive photometer.