5 research outputs found

    Generic algorithms and NULL Convention Logic hardware implementation for unsigned and signed quad-rail multiplication

    Get PDF
    This thesis focuses on designing generic quad-rail arithmetic circuits, such as signed and unsigned multipliers and Multiply and Accumulate (MAC) units, using the asynchronous delay-insensitive NULL Convention Logic (NCL) paradigm. This work helps to build a library of reusable components to be used for automated NCL circuit synthesis, which will aid in the integration of asynchronous design paradigms into the semiconductor industry --Abstract, page iii

    Null Convention Multiply And Accumulate Unit With Conditional Rounding, Scaling, And Saturation

    No full text
    Approaches for maximizing throughput of self-timed multiply-accumulate units (MACs) are developed and assessed using the NULL convention logic paradigm. In this class of self-timed circuits, the functional correctness is independent of any delays in circuit elements, through circuit construction, and independent of any wire delays, through the isochronic fork assumption, where wire delays are assumed to be much less than gate delays. Therefore self-timed circuits provide distinct advantages for System-on-a-Chip applications. First, a number of alternative MAC algorithms are compared and contrasted in terms of throughput and area to determine which approach will yield the maximum throughput with the least area. It was determined that two algorithms that meet these criteria well are the Modified Baugh-Wooley and Modified Booth2 algorithms. Dual-rail non-pipelined versions of these algorithms were first designed using the threshold combinational reduction method. The non-pipelined designs were then optimized for throughput using the gate-level pipelining method. Finally, each design was simulated using Synopsys to quantify the advantage of the dual-rail pipelined Modified Baugh-Wooley MAC, which yielded a speedup of 2.5 over its initial non-pipelined version. This design also required 20% fewer gates than the dual-rail pipelined Modified Booth2 MAC that had the same throughput. The resulting design employs a three-stage feed-forward multiply pipeline connected to a four-stage feedback multifunctional loop to perform a 72 + 32 × 32 MAC in 12.7 ns on average using a 0.25 μm CMOS process at 3.3 V, thus outperforming other delay-insensitive/self-timed MACs in the literature. © 2002 Elsevier Science B.V. All rights reserved

    NULL convention multiply and accumulate unit with conditional rounding, scaling, and saturation

    No full text
    Approaches for maximizing throughput of self-timed multiply-accumulate units (MACs) are developed and assessed using the NULL convention logic paradigm. In this class of self-timed circuits, the functional correctness is independent of any delays in circuit elements, through circuit construction, and independent of any wire delays, through the isochronic fork assumption [1,2], where wire delays are assumed to be much less than gate delays. Therefore self-timed circuits provide distinct advantages for System-on-a-Chip applications. First, a number of alternative MAC algorithms are compared and contrasted in terms of throughput and area to determine which approach will yield the maximum throughput with the least area. It was determined that two algorithms that meet these criteria well are the Modified Baugh-Wooley and Modified Booth2 algorithms. Dual-rail non-pipelined versions of these algorithms were first designed using the threshold combinational reduction method [3]. The non-pipelined designs were then optimized for throughput using the gate-level pipelining method [4]. Finally, each design was simulated using Synopsys to quantify the advantage of the dual-rail pipelined Modified Baugh-Wooley MAC, which yielded a speedup of 2.5 over its initial non-pipelined version. This design also required 20% fewer gates than the dual-rail pipelined Modified Booth2 MAC that had the same throughput. The resulting design employs a three-stage feed-forward multiply pipeline connected to a four-stage feedback multifunctional loop to perform a 72 + 32 x 32 MAC in 12.7 ns on average using a 0.25 mum CMOS process at 3.3 V, thus outperforming other delay-insensitive/self-timed MACs in the literature. (C) 2002 Elsevier Science B.V. All rights reserved

    NULL Convention Multiply and Accumulate Unit with Conditional Rounding, Scaling, and Saturation

    No full text
    Approaches for maximizing throughput of self-timed multiply-accumulate units (MACs) are developed and assessed using the NULL Convention Logic (NCL) paradigm. In this class of self-timed circuits, the functional correctness is independent of any delays in circuit elements, through circuit construction, and independent of any wire delays, through the isochronic fork assumption [1, 2], where wire delays are assumed to be much less than gate delays. Therefore self-timed circuits provide distinct advantages for System-on-a-Chip applications. First, a number of alternative MAC algorithms are compared and contrasted in terms of throughput and area to determine which approach will yield the maximum throughput with the least area. It was determined that two algorithms that meet these criteria well are the Modified Baugh-Wooley and Modified Booth2 algorithms. Dual-rail non-pipelined versions of these algorithms were first designed using the Threshold Combinational Reduction (TCR) method [3]. The non-pipelined designs were then optimized for throughput using the Gate-Level Pipelining (GLP) method [4]. Finally, each design was simulated using Synopsys to quantify the advantage of the dual-rail pipelined Modified Baugh-Wooley MAC, which yielded a speedup of 2.5 over it

    A null convention logic based platform for high speed low energy IP packet forwarding

    Get PDF
    By 2020, it is predicted that there will be over 5 billion people and 38.5 billion Internet-ofThings devices on the Internet. The data generated by all these users and devices will have to be transported quickly and efficiently. Routers forming the backbone of this Internet already support multiple 100 Gbps ports meaning that they would have to perform upwards of 200 Million destination addresses lookups per second in the packet forwarding block that lies in the router ‘data-path’. At the same time, there is also a huge demand to make the network infrastructure more energy efficient. The work presented in this thesis is motivated by the observation that traditional synchronous digital systems will have increasing difficulty keeping up with these conflicting demands. Further, with reducing device geometries, extremes in “process, voltage and temperature” (PVT) variability will undermine reliable synchronous operation. It is expected that asynchronous design techniques will be able to overcome many of these problems and offer a means of lowering energy while maintaining high throughput and low latency. This thesis investigates existing address lookup algorithms and investigates the possibility of combining various approaches to improve energy efficiency without affecting lookup performance. A quasi delay-insensitive asynchronous methodology - Null Convention Logic (NCL) - is then applied to this combined design. Techniques that take advantage of the characteristics of the design methodology and the lookup algorithm to further improve the area, energy and latency characteristics are also analysed. The IP address lookup scheme utilised here is a recent algorithmic approach that uses compact binary-tries and was selected for its high memory efficiency and throughput. The design is pipelined, and the prefix information is stored in large RAMs. A Boolean synchronous implementation of the algorithm is simulated to provide an initial performance benchmark. It is observed that during the address lookup process nearly 68% of the trie accesses are to nodes that contained no prefix information. Bloom filter structures that use non-cryptographic hashes and single-bit memory are introduced into the address lookup process to prevent these unnecessary accesses, thereby reducing the energy consumption. Three non-cryptographic hashing algorithms (CRC32, Jenkins and Murmur) are also analysed for their suitability in Bloom filters, and the CRC32 is found to offer the most suitable trade-off between complexity and performance. As a first step to applying the NCL design methodology, NCL implementations of the hashing algorithms are created and evaluated. A significant finding from these experiments is that, unlike Boolean systems, latency and throughput in NCL systems are only loosely coupled. An example Jenkins hash implementation with eight pipeline stages and a cycle time of 3.2 ns exhibits a total latency of 6 ns, whereas an equivalent synchronous implementation with a similar clock period exhibits a latency of 25.6 ns. Further investigations reveal that completion detection circuits within the NCL pipelines impair throughput significantly. Two enhancements to the NCL circuit library aimed particularly at optimising NCL completion detection are proposed and analysed. These are shown to enable completion detection circuits to be built with the same delay but with 30% smaller area and about 75% lower peak current compared to the conventional approach using gates from the standard NCL library. An NCL SRAM structure is also proposed to augment the conventional 6-T cell array with circuits to generate the handshaking signals for managing the NCL data flow. Additionally, a dedicated column of cells called the Null-storage column is added, which indicates if a particular address in the RAM stores no Data, i.e., it is in its Null state. This additional hardware imposes a small area overhead of about 10% but allows accesses to Null locations to be completed in 50% less time and consume 40% less energy than accesses to valid Data locations. An experimental NCL-based address lookup system is then designed that includes all of the developed NCL modules. Statistical delay models derived from circuit-level simulations of individual modules are used to emulate realistic circuit delay variability in the behavioural modules written in Verilog. Simulations of the assembled system demonstrate that unlike what was observed with the synchronous design, with NCL, the design that does not employ Bloom filters, but only the Null-storage column RAMs for prefix storage, exhibits the smallest area on the chip and also consumes the least energy per address lookup. It is concluded that to derive maximum benefit out of an asynchronous design approach; it is necessary to carefully select the architectural blocks that combine the peculiarities of the implemented algorithm with the capabilities of the NCL design methodology
    corecore