11 research outputs found

    Implementación de Circuitos Self-Timed de 2 y 4 Fases en FPGAs

    Electronic version of the paper presented at the Jornadas de Computación Reconfigurable y Aplicaciones, held in Madrid in 2003. Although programmable devices such as FPGAs are designed for the efficient implementation of synchronous circuits, they are currently the only available option for rapid prototyping of self-timed circuits. This article presents some ideas for designing these circuits on FPGAs for the two main protocols: 2-phase and 4-phase. Binary multiplication is chosen as the case study. The operation of these circuits is illustrated and the two synchronization options are compared. The main results for area, speed, routing delay, and fanout are also summarized. A Xilinx Virtex II FPGA is used as the technology platform.

    Family of 4-phase latch protocols

    Journal Article. A complete family of untimed asynchronous 4-phase pipeline protocols is derived and characterised. This family contains all untimed protocols where data becomes valid before the request signal rises. Starting with a specification of the most parallel such protocol, rules are provided for concurrency reduction to systematically generate the family of all 137 related protocols that can be pipelined. Graphical and textual nomenclatures are developed to represent protocol properties and behaviours. The protocols are categorised according to their behaviours when composed into linear and structured parallel pipelines. Six basic categories emerge, along with several properties such as a single state that determines whether a protocol is fully or half buffered. When equivalence classes are calculated for parallel pipeline behaviours, they are dominated by 15 shapes (all of which are delay-insensitive) that are related by a simple lattice. Several published circuits are shown to map to 16 of our 137 family members. This work enhances the understanding of handshake protocols, their properties, and relationships between different implementations in terms of concurrency and behavioural properties.
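    As a rough illustration of the kind of handshake this abstract refers to, the sketch below traces one 4-phase (return-to-zero) transfer in which data is driven before the request rises. It shows only the plain, fully sequential ordering of events; it is not one of the 137 protocols derived in the article, and the function name `four_phase_transfer` and the signal names `req`, `ack`, and `data` are assumptions made for this example.

```python
def four_phase_transfer(value):
    """Yield (event, req, ack, data) snapshots for one return-to-zero transfer.

    Hypothetical model: a single sender and receiver, no concurrency.
    """
    req, ack = 0, 0
    data = value                      # data is driven first...
    yield ("data valid", req, ack, data)
    req = 1                           # ...then the sender raises request
    yield ("req rises", req, ack, data)
    ack = 1                           # receiver has captured the data
    yield ("ack rises", req, ack, data)
    req = 0                           # sender returns request to zero
    yield ("req falls", req, ack, data)
    ack = 0                           # receiver returns acknowledge to zero
    yield ("ack falls", req, ack, data)


trace = list(four_phase_transfer(0b1011))
for event, req, ack, data in trace:
    print(f"{event:<10} req={req} ack={ack} data={data:04b}")
```

    Both control signals end where they started, so the next transfer can begin from the same idle state.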

    Concurrency reduction of untimed latch protocols - theory and practice

    Journal Article. A systematic investigation into concurrency reduction of untimed asynchronous 4-phase latch controllers is reported. Starting with a state graph that exhibits maximal concurrency, rules are provided for systematically reducing its states and thereby curtailing its behaviors. The rules predict liveness and occupancy, as well as the regularity and behavior of their pipelines. The rules also reveal the precise extent of the design space and thus provide a secure platform on which to study the implications of concurrency reduction on power, performance and area by implementing and evaluating the complete set of abstracted controllers. This complete characterization enhances the understanding and usage of concurrency and its reduction in handshake protocols. Trade-offs have been observed and reported that will help designers find the best protocols for a required specification. Finally, the best synthesized protocols in this class have been identified.

    Average-case optimized technology mapping of one-hot domino circuits

    Journal Article. This paper presents a technology mapping technique for optimizing the average-case delay of asynchronous combinational circuits implemented using domino logic and one-hot encoded outputs. The technique minimizes the critical path for common input patterns at the possible expense of making less common critical paths longer. To demonstrate the application of this technique, we present a case study of a combinational length decoding block, an integral component of an Asynchronous Instruction Length Decoder (AILD) which can be used in Pentium® processors. The experimental results demonstrate that the average-case delay of our mapped circuits can be dramatically lower than the worst-case delay of the circuits obtained using conventional worst-case mapping techniques.

    Micropipeline controller design and verification with applications in signal processing


    Increasing rendering performance of graphics hardware

    Graphics Processing Unit (GPU) performance is increasing faster than central processing unit (CPU) performance. This growth is driven by performance improvements that can be divided into the following three categories: algorithmic improvements, architectural improvements, and circuit-level improvements. In this dissertation I present techniques that improve the rendering performance of graphics hardware measured in speed, power consumption or image quality in each of these three areas. At the algorithmic level, I introduce a method for using graphics hardware to rapidly and efficiently generate summed-area tables, which are data structures that hold pre-computed two-dimensional integrals of subsets of a given image, and present several novel rendering techniques that take advantage of summed-area tables to produce dynamic, high-quality images at interactive frame rates. These techniques improve the visual quality of images rendered on current commodity GPUs without requiring modifications to the underlying hardware or architecture. At the architectural level, I propose modifications to the architecture of current GPUs that add conditional streaming capabilities. I describe a novel GPU-based ray-tracing algorithm that takes advantage of conditional output streams to reduce the memory bandwidth requirements by over an order of magnitude compared to previous techniques. At the circuit level, I propose a compute-on-demand paradigm for the design of high-speed and energy-efficient graphics components. The goal of the compute-on-demand paradigm is to perform computation at the bit level only when needed. The compute-on-demand paradigm exploits the data-dependent nature of computation, and thereby obtains speed and energy improvements by optimizing designs for the common case. This approach is illustrated with the design of a high-speed Z-comparator that is implemented using asynchronous logic.
    Asynchronous or "clockless" circuits were chosen for my implementations since they allow for data-dependent completion times and reduced power consumption by disabling inactive components. The resulting circuit-level implementation runs over 1.5 times faster while dissipating only 25% of the energy of a comparable synchronous comparator for the average case. Also at the circuit level, I introduce a novel implementation of counterflow pipelining, which allows two streams of data to flow in opposite directions within the same pipeline without the need for complex arbitration. The advantages of this implementation are demonstrated by the design of a high-speed asynchronous Booth multiplier. While both the comparator and the multiplier are useful components of a graphics pipeline, the objective of this work was to propose the new design paradigm as a promising alternative to current graphics hardware design practices.
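    The summed-area table mentioned in this abstract can be sketched in a few lines. The example below is a plain CPU-side illustration of the data structure only, not the GPU generation method the dissertation describes; the function names `summed_area_table` and `box_sum` and the sample image are assumptions made for this example.

```python
def summed_area_table(image):
    """Return sat where sat[y][x] is the sum of image[0..y][0..x], inclusive."""
    height = len(image)
    width = len(image[0]) if height else 0
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            # Running sum of this row plus everything in the rows above.
            sat[y][x] = row_sum + (sat[y - 1][x] if y > 0 else 0)
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum over the inclusive rectangle using at most four table lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]   # added back: it was subtracted twice
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = summed_area_table(image)
print(box_sum(sat, 1, 1, 2, 2))  # sum of 5, 6, 8, 9
```

    Once the table is built, the sum over any axis-aligned rectangle costs a constant number of lookups regardless of the rectangle's size, which is what makes the structure attractive for filtering at interactive rates.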

    Self-timed field programmable gate array architectures


    Der testfreundliche Entwurf asynchroner Schaltungen

    [no abstract]