606 research outputs found

    Automated synthesis of delay-insensitive circuits

    Get PDF

    Randomized cache placement for eliminating conflicts

    Get PDF
    Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where miss ratio depends solely on working set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as from cache simulations of individual loop kernels to illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.Peer ReviewedPostprint (published version

    Correct-by-construction microarchitectural pipelining

    Get PDF
    This paper presents a method for correct-by-construction microarchitectural pipelining that handles cyclic systems with dependencies between iterations. Our method combines previously known bypass and retiming transformations with a few transformations valid only for elastic systems with early evaluation (namely, empty FIFO insertion, FIFO capacity sizing, insertion of anti-tokens, and introducing early evaluation multiplexors). By converting the design to a synchronous elastic form and then applying this extended set of transformations, one can pipeline a functional specification with an automatically generated distributed controller that implements stalling logic resolving data hazards off the critical path of the design. We have developed an interactive toolkit for exploring elastic microarchitectural transformations. The method is illustrated by pipelining a few simple examples of instruction set architecture ISA specifications.Peer ReviewedPostprint (published version

    Peephole optimization of asynchronous macromodule networks

    Get PDF
    Journal ArticleAbstract- Most high-level synthesis tools for asynchronous circuits take descriptions in concurrent hardware description languages and generate networks of macromodules or handshake components. In this paper, we propose a peephole optimizer for these networks. Our peephole optimizer first deduces an equivalent blackbox behavior for the network using Dill's tracetheoretic parallel composition operator. It then applies a new procedure called burst-mode reduction to obtain burst-mode machines from the deduced behavior. In a significant number of examples, our optimizer achieves gate-count improvements by a factor of five, and speed (cycle-time) improvements by a factor of two. Burst-mode reduction can be applied to any macromodule network that is delay insensitive as well as deterministic. A significant number of asynchronous circuits, especially those generated by asynchronous high-level synthesis tools, fall into this class, thus making our procedure widely applicable

    Exploiting robustness in asynchronous circuits to design fine-tunable systems

    Get PDF
    PhD ThesisRobustness property in a circuit defines its tolerance to the effects of process, voltage and temperature variations. The mode signaling and event communication between computing units in a asynchronous circuits makes them inherently robust. The level of robustness depends on the type of delay assumptions used in the design and specification process. In this thesis, two approaches to exploiting robustness in asynchronous circuits to design self-adapting and fine-tunable systems are investigated. In the first investigation, a Digitally Controllable Oscillator (DCO) and a computing unit are integrated such that the operating conditions of the computing unit modulated the operation of the DCO. In this investigation, the computing unit which is a self-timed counter interacts with the DCO in a four-phase handshake protocol. This mode of interaction ensures a DCO and computing unit system that can fine-tune its operation to adapt to the effects of variations. In this investigation, it is shown that such a system will operate correctly in wide range of voltage supply. In the second investigation, a Digital Pulse-Width Modulator (DPWM) with coarse and fine-tune controls is designed using two Kessels counters. The coarse control of the DPWM tuned the pulse ratio and pulse frequency while the fine-tune control exploited the robustness property of asynchronous circuits in an addition-based delay system to add or subtract delay(s) to the pulse width while maintaining a constant pulse frequency. The DPWM realized gave constant duty ratio regardless of the operating voltage. This type of DPWM has practical application in a DC-DC converter circuit to tune the output voltage of the converter in high resolution. The Kessels counter is a loadable self-timed modulo−n counter, which is realized by decomposition using Horner’s method, specified and verified using formal asynchronous design techniques. The decomposition method used introduced parallelism in the system by dividing the counter into a systolic array of cells, with each cell further decomposed into two parts that have distinct defined operations. Specification of the decomposed counter cell parts operation was in three stages. The first stage employed high-level specification using Labelled Petri nets (LPN). In this form, functional correctness of the decomposed counter is modelled and verified. In the second stage, a cell part is specified by combing all possible operations for that cell part in high-level form. With this approach, a combination of inputs from a defined control block activated the correct operation for a cell part. In the final stage, the LPNs were converted to Signal Transition Graphs, from which the logic circuits of the cells were synthesized using the WorkCraft Tool. In this thesis, the Kessels counter was implemented and fabricated in 350 nm CMOS Technology.Niger Delta Development Commission (NDD

    Handshake circuits : an intermediary between communicating processes and VLSI

    Get PDF

    Microarchitectural techniques to reduce interconnect power in clustered processors

    Get PDF
    Journal ArticleThe paper presents a preliminary evaluation of novel techniques that address a growing problem - power dissipation in on-chip interconnects. Recent studies have shown that around 50% of the dynamic power consumption in modern processors is within on-chip interconnects. The contribution of interconnect power to total chip power is expected to be higher in future communication-bound billion-transistor architectures. In this paper, we propose the design of a heterogeneous interconnect, where some wires are optimized for low latency and others are optimized for low power. We show that a large fraction of on-chip communications are latency insensitive. Effecting these non-critical transfers on low-power long-latency interconnects can result in significant power savings without unduly affecting performance. Two primary techniques are evaluated in this paper: (i) a dynamic critical path predictor that identifies results that are not urgently consumed, and (ii) an address prediction mechanism that requires addresses to be transferred off the critical path for verification purposes. Our results demonstrate that 49% of all interconnect transfers can be effected on power-efficient wires, while incurring a performance penalty of only 2.5%

    A hardware implementation of a Viterbi decoder for a (3,2/3) TCM code

    Get PDF
    The report details the design of a dedicated Viterbi decoder chip set for an Ungerboek (3,2/3) Trellis Coded Modulation code. It was the specific intention of the thesis to design a system that could be implemented on standard Field Programmable Gate Arrays (FPGA) yet still be able to cope with high bit rates. The focus of the research was to both evaluate and modify the existing VLSI design techniques and to develop new techniques to make this possible. Trellis Coded Modulation refers to a specific sub-class of convolutional codes that ire an example of coded modulation. In coded modulation there is a direct link between the encoding and modulation processes aimed at improving the performance of the code by introducing redundancy in the signal set used to transmit the code. Ungerboek developed a technique for mapping the encoded words onto points in the signal set, called mapping by set partitioning, that maximises the Euclidian distance between adjacent codewords, and hence maximises the minimum distance between any two output sequences in the code. The Viterbi algorithm is a maximum likelihood decoder for convolutional codes such as TCM. The operation of the Viterbi algorithm is based on using soft decision decoding to produce an estimate of how well the received sequence corresponds with any of the allowed code sequences. The code sequences which most closely matches the received sequence is then decoded to form the output of the decoder. A central problem in implementing systems using TCM with Viterbi decoding is that although the encoder is a relatively simple device, the decoder is not. The complexity of the Viterbi decoder for any given TCM scheme will be the major drawback in implementing the scheme. As such techniques for reducing the complexity of Viterbi decoders are of interest to developers of communication systems. The algorithms describing the implementation and operation of the Viterbi algorithm can be categorised into three main layers. The top layer holds the theoretical algorithm itself, in the second layer are the set of algorithms that describe the broad techniques used to manipulate the theoretical algorithm into a form in which it can be implemented, and the third layer of algorithms describe the implementations themselves. The work contained in this thesis concentrates on the second two layers of algorithms
    • …
    corecore