14 research outputs found
DUAL-RAIL GATE STRUCTURE FOR A COMPLEX DATA PATH
Dual-rail domino gates are restricted to create a reliable critical data path. According to this critical data path, the handshake circuits are greatly simplified, that provides the pipeline high throughput in addition to low power consumption. This paper presents a higher-throughput and ultralow-power asynchronous domino logic pipeline design method, targeting to latch-free and very fine-grain or gate-level design. The information pathways are comprised of a combination of dual-rail and single-rail domino gates. The 4 phase bundled-data protocol design most carefully resembles the style of synchronous circuits. Furthermore, the stable critical data path enables the adoption of single-rail domino gates within the noncritical data pathways. An 8 × 8 array style multiplier can be used for evaluating the suggested pipeline method. This saves lots of power by reduction of the overhead of logic circuits. In contrast to a bundled-data asynchronous domino logic pipeline, the suggested pipeline saves energy within the best situation and also the worst situation when processing different data patterns
Recommended from our members
Continuous-Time and Companding Digital Signal Processors Using Adaptivity and Asynchronous Techniques
The fully synchronous approach has been the norm for digital signal processors (DSPs) for many decades. Due to its simplicity, the classical DSP structure has been used in many applications. However, due to its rigid discrete-time operation, a classical DSP has limited efficiency or inadequate resolution for some emerging applications, such as processing of multimedia and biological signals. This thesis proposes fundamentally new approaches to designing DSPs, which are different from the classical scheme. The defining characteristic of all new DSPs examined in this thesis is the notion of "adaptivity" or "adaptability." Adaptive DSPs dynamically change their behavior to adjust to some property of their input stream, for example the rate of change of the input. This thesis presents both enhancements to existing adaptive DSPs, as well as new adaptive DSPs. The main class of DSPs that are examined throughout the thesis are continuous-time (CT) DSPs. CT DSPs are clock-less and event-driven; they naturally adapt their activity and power consumption to the rate of their inputs. The absence of a clock also provides a complete avoidance of aliasing in the frequency domain, hence improved signal fidelity. The core of this thesis deals with the complete and systematic design of a truly general-purpose CT DSP. A scalable design methodology for CT DSPs is presented. This leads to the main contribution of this thesis, namely a new CT DSP chip. This chip is the first general-purpose CT DSP chip, able to process many different classes of CT and synchronous signals. The chip has the property of handling various types of signals, i.e. various different digital modulations, both synchronous and asynchronous, without requiring any reconfiguration; such property is presented for the first time CT DSPs and is impossible for classical DSPs. As opposed to previous CT DSPs, which were limited to using only one type of digital format, and whose design was hard to scale for different bandwidths and bit-widths, this chip has a formal, robust and scalable design, due to the systematic usage of asynchronous design techniques. The second contribution of this thesis is a complete methodology to design adaptive delay lines. In particular, it is shown how to make the granularity, i.e. the number of stages, adaptive in a real-time delay line. Adaptive granularity brings about a significant improvement in the line's power consumption, up to 70% as reported by simulations on two design examples. This enhancement can have a direct large power impact on any CT DSP, since a delay line consumes the majority of a CT DSP's power. The robust methodology presented in this thesis allows safe dynamic reconfiguration of the line's granularity, on-the-fly and according to the input traffic. As a final contribution, the thesis also examines two additional DSPs: one operating the CT domain and one using the companding technique. The former operates only on level-crossing samples; the proposed methodology shows a potential for high-quality outputs by using a complex interpolation function. Finally, a companding DSP is presented for MPEG audio. Companding DSPs adapt their dynamic range to the amplitude of their input; the resulting can offer high-quality outputs even for small inputs. By applying companding to MPEG DSPs, it is shown how the DSP distortion can be made almost inaudible, without requiring complex arithmetic hardware
Circuit design and analysis for on-FPGA communication systems
On-chip communication system has emerged as a prominently important subject in Very-Large-
Scale-Integration (VLSI) design, as the trend of technology scaling favours logics more than interconnects.
Interconnects often dictates the system performance, and, therefore, research for new
methodologies and system architectures that deliver high-performance communication services
across the chip is mandatory. The interconnect challenge is exacerbated in Field-Programmable
Gate Array (FPGA), as a type of ASIC where the hardware can be programmed post-fabrication.
Communication across an FPGA will be deteriorating as a result of interconnect scaling. The programmable
fabrics, switches and the specific routing architecture also introduce additional latency
and bandwidth degradation further hindering intra-chip communication performance.
Past research efforts mainly focused on optimizing logic elements and functional units in FPGAs.
Communication with programmable interconnect received little attention and is inadequately understood.
This thesis is among the first to research on-chip communication systems that are built on
top of programmable fabrics and proposes methodologies to maximize the interconnect throughput
performance. There are three major contributions in this thesis: (i) an analysis of on-chip
interconnect fringing, which degrades the bandwidth of communication channels due to routing
congestions in reconfigurable architectures; (ii) a new analogue wave signalling scheme that significantly
improves the interconnect throughput by exploiting the fundamental electrical characteristics
of the reconfigurable interconnect structures. This new scheme can potentially mitigate
the interconnect scaling challenges. (iii) a novel Dynamic Programming (DP)-network to provide
adaptive routing in network-on-chip (NoC) systems. The DP-network architecture performs runtime
optimization for route planning and dynamic routing which, effectively utilizes the in-silicon
bandwidth. This thesis explores a new horizon in reconfigurable system design, in which new
methodologies and concepts are proposed to enhance the on-FPGA communication throughput
performance that is of vital importance in new technology processes
Recommended from our members
Design and performance optimization of asynchronous networks-on-chip
As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems.
In the past two decades, networks-on-chip (NoC’s) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoC’s, there have been bottlenecks of system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combing the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoC’s, therefore, enable a modular and extensible system composition for an ‘object-orient’ design style.
The thesis aims to significantly advance the state-of-art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoC’s are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions.
First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors. Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called ‘monitoring network’, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories – a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains.
Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow.
Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoC’s for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel.
In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement
Recommended from our members
Methods to improve the reliability and resiliency of near/sub-threshold digital circuits
Energy consumption is one of the primary bottlenecks to both large and small scale modern compute platforms. Reducing the operating voltage of digital circuits to voltages where the supply voltage is near or below the threshold of the transistors has recently gained attention as a method to reduce the energy required for computations by as much as 6 times. However, when operating at near/sub-threshold voltages (where the supply voltage is near or below the threshold of the transistors), imperfections in transistor manufacturing, changes in temperature, and other difficult-to-predict factors cause wide variations in the timing of Complementary Metal-Oxide Semiconductor (CMOS) circuits due to an increased sensitivity at lower voltages. These increased variations result in poor aggregate performance and cause increased rates of error occurrence in computation.
This work introduces several new methods to improve the reliability of near/sub-threshold circuits. The first is a design automation technique that is used to aid in low-voltage digital standard cell synthesis. Second, two circuit-level techniques are also introduced that aim to improve the reliability and resiliency of digital circuits by means of completion/error detection. These techniques are shown to improve speed and lower energy consumption at low overheads compared to previous methods. Most importantly, these circuit-level methods are specifically designed to operate at low voltages and can themselves tolerate variations and operation in harsh environments. Finally, a test-chip prototype designed in 65nm-CMOS demonstrates the practicality and feasibility of a proposed current sensing error detector
Increasing rendering performance of graphics hardware
Graphics Processing Unit (GPU) performance is increasing faster than central processing unit (CPU) performance. This growth is driven by performance improvements that can be divided into the following three categories: algorithmic improvements, architectural improvements, and circuit-level improvements. In this dissertation I present techniques that improve the rendering performance of graphics hardware measured in speed, power consumption or image quality in each of these three areas. At the algorithmic level, I introduce a method for using graphics hardware to rapidly and efficiently generate summed-area tables, which are data structures that hold pre-computed two-dimensional integrals of subsets of a given image, and present several novel rendering techniques that take advantage of summed-area tables to produce dynamic, high-quality images at interactive frame rates. These techniques improve the visual quality of images rendered on current commodity GPUs without requiring modifications to the underlying hardware or architecture. At the architectural level, I propose modifications to the architecture of current GPUs that add conditional streaming capabilities. I describe a novel GPU-based ray-tracing algorithm that takes advantage of conditional output streams to reduce the memory bandwidth requirements by over an order of magnitude times when compared to previous techniques. At the circuit level, I propose a compute-on-demand paradigm for the design of high-speed and energy-efficient graphics components. The goal of the compute-on-demand paradigm is to only perform computation at the bit-level when needed. The compute-on-demand paradigm exploits the data-dependent nature of computation, and thereby obtains speed and energy improvements by optimizing designs for the common case. This approach is illustrated with the design of a high-speed Z-comparator that is implemented using asynchronous logic. Asynchronous or "clockless" circuits were chosen for my implementations since they allow for data-dependent completion times and reduced power consumption by disabling inactive components. The resulting circuit-level implementation runs over 1.5 times faster while on dissipating 25% the energy of a comparable synchronous comparator for the average case. Also at the circuit-level, I introduce a novel implementation of counterflow pipelining, which allows two streams of data to flow in opposite directions within the same pipeline without the need for complex arbitration. The advantages of this implementation are demonstrated by the design of a high-speed asynchronous Booth multiplier. While both the comparator and the multiplier are useful components of a graphics pipeline, the objective of this work was to propose the new design paradigm as a promising alternative to current graphics hardware design practices
Recommended from our members
On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation
In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems in the range of 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever increasing complexity, the traditional design approaches are facing key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active, do not require complex clock distribution, are robust to different forms of variability, and provide ease of composability for heterogeneous platforms. Networks-on-chip (NoCs) is an interconnect paradigm that has been introduced to deal with the ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for today’s many-core systems. Moreover, NoCs are a natural match with asynchronous design techniques, as they separate communication infrastructure and timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems that interconnect multiple processing cores, operating at different clock speeds, using an asynchronous NoC, have gained significant interest. While asynchronous NoCs have several advantages, they also face a key challenge of supporting new types of traffic patterns. Once such pattern is multicast communication, where a source sends packets to arbitrary number of destinations. Multicast is not only common in parallel computing, such as for cache coherency, but also for emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs. This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques, and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for a more realistic implementation and analysis, as well as significantly advancing the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there has been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computeraided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs on FPGAs. Overall, this thesis makes the following three contributions. The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches, and has been used in high-performance computing platforms. A novel local speculation technique is introduced, where a subset of the network’s switches are speculative that always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches. The second contribution is a multicast solution for a 2D-mesh topology, which is more complex with higher-radix switches and also is more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each output’s own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination. Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that only handles unicast, and the other that also supports multicast. Both showed significant energy benefits with some performance gains over a state-of-the-art synchronous switch
PHY Link Design and Optimization For High-Speed Low-Power Communication Systems
The ever-growing demands for high-bandwidth data transfer have been pushing towards advancing research efforts in the field of high-performing communication systems. Studies on the performance of single chip, e.g. faster multi-core processors and higher system memory capacity, have been explored. To further enhance the system performance, researches have been focused on the improvement of data-transfer bandwidth for chip-to-chip communication in the high-speed serial link. Many solutions have been addressed to overcome the bottleneck caused by the non-idealties such as bandwidth-limited electrical channel that connects two link devices and varieties of undesired noise in the communication systems. Nevertheless, with these solutions data have run into limitations of the timing margins for high-speed interfaces running at multiple gigabits per second data rates on low-cost Printed Circuit Board (PCB) material with constrained power budget. Therefore, the challenge in designing a physical layer (PHY) link for high-speed communication systems turns out to be power-efficient, reliable and cost-effective. In this context, this dissertation is intended to focus on architectural design, system-level and circuit-level verification of a PHY link as well as system performance optimization in respective of power, reliability and adaptability in high-speed communication systems.
The PHY is mainly composed of clock data recovery (CDR), equalizers (EQs) and high- speed I/O drivers. Symmetrical structure of the PHY link is usually duplicated in both link devices for bidirectional data transmission. By introducing training mechanisms into high-speed communication systems, the timing in one link device is adaptively aligned to the timing condition specified in the other link device despite of different skews or induced jitter resulting from process, voltage and temperature (PVT) variations in the individual link. With reliable timing relationships among the interface signals provided, the total system bandwidth is dramatically improved. On the other hand, interface training offers high flexibility for reuse without further investigation on high demanding components involved in high costs.
In the training mode, a CDR module is essential for reconstructing the transmitted bitstream to achieve the best data eye and to detect the edges of data stream in asynchronous systems or source-synchronous systems. Generally, the CDR works as a feedback control system that aligns its output clock to the center of the received data. In systems that contain multiple data links, the overall CDR power consumption increases linearly with the increase in number of links as one CDR is required for each link. Therefore, a power-efficient CDR plays a significant role in such systems with parallel links. Furthermore, a high performance CDR requires low jitter generation in spite of high input jitter. To minimize the trade-off between power consumption and CDR jitter, a novel CDR architecture is proposed by utilizing the proportional-integral (PI) controller and three times sampling scheme.
Meanwhile, signal integrity (SI) becomes critical as the data rate exceeds several gigabits per second. Distorted data due to the non-idealties in systems are likely to reduce the signal quality aggressively and result in intolerable transmission errors in worst case scenarios, thus affect the system effective bandwidth. Hence, additional trainings such as transmitter (Tx) and receiver (Rx) EQ trainings for SI purpose are inserted into the interface training. Besides, a simplified system architecture with unsymmetrical placement of adaptive Rx and Tx EQs in a single link device is proposed and analyzed by using different coefficient adaptation algorithms. This architecture enables to reduce a large number of EQs through the training, especially in case of parallel links. Meanwhile, considerable power and chip area are saved.
Finally, high-speed I/O driver against PVT variations is discussed. Critical issues such as overshoot and undershoot interfering with the data are primarily accompanied by impedance mismatch between the I/O driver and its transmitting channel. By applying PVT compensation technique I/O driver impedances can be effectively calibrated close to the target value. Different digital impedance calibration algorithms against PVT variations are implemented and compared for achieving fast calibration and low power requirements
A Behavioral Design Flow for Synthesis and Optimization of Asynchronous Systems
Asynchronous or clockless design is believed to hold the promise of alleviating many of the challenges currently facing microelectronic design. Distributing a high-speed clock signal across an entire chip is an increasing challenge, particularly as the number of transistors on chip continues to rise. With increasing heterogeneity in massively multi- core processors, the top-level system integration is already elastic in nature. Future computing technologies (e.g., nano, quantum, etc.) are expected to have unpredictable timing as well. Therefore, asynchronous design techniques are gaining relevance in mainstream design. Unfortunately, the field of asynchronous design lacks mature design tools for creating large-scale, high-performance or energy-efficient systems. This thesis attempts to fill the void by contributing a set of design methods and automated tools for synthesizing asynchronous systems from high-level specifications. In particular, this thesis provides methods and tools for: (i) generating high-speed pipelined implementations from behavioral specifications, (ii) sharing and scheduling resources to conserve area while providing high performance, and (iii) incorporating energy and power considerations into high-level design. These methods are incorporated into a comprehensive design flow that provides a choice of synthesis paths to the designer, and a mechanism to explore the spectrum between them. The first path specifically targets the highest-performance implementations using data-driven pipelined circuits. The second path provides an alternative approach that targets low-area implementations, providing for optimal resource sharing and optimal scheduling techniques to achieve performance targets. Finally, the third path through the design flow allows the entire spectrum between the two extremes to be explored. In particular, it is a hybrid approach that preserves a pipelined architecture but still allows sharing of resources. By varying performance targets, a wide range of designs can be realized. A variety of metrics are incorporated as constraints or cost functions: area, latency, cycle time, energy consumption, and peak power. Experimental results demonstrate the capability of the proposed design flow to quickly produce optimized specifications. By automating synthesis and optimization, this thesis shows that the designer effort necessary to produce a high-quality solution can be significantly reduced. It is hoped that this work provides a path towards more mature automation and design tools for asynchronous design
Analysis and Optimization for Pipelined Asynchronous Systems
Most microelectronic chips used today--in systems ranging from cell phones to desktop computers to supercomputers--operate in basically the same way: they synchronize the operation of their millions of internal components using a clock that is distributed globally. This global clocking is becoming a critical design challenge in the quest for building chips that offer increasingly greater functionality, higher speed, and better energy efficiency. As an alternative, asynchronous or clockless design obviates the need for global synchronization; instead, components operate concurrently and synchronize locally only when necessary. This dissertation focuses on one class of asynchronous circuits: application specific stream processing systems (i.e. those that take in a stream of data items and produce a stream of processed results.) High-speed stream processors are a natural match for many high-end applications, including 3D graphics rendering, image and video processing, digital filters and DSPs, cryptography, and networking processors. This dissertation aims to make the design, analysis, optimization, and testing of circuits in the chosen domain both fast and efficient. Although much of the groundwork has already been laid by years of past work, my work identifies and addresses four critical missing pieces: i) fast performance analysis for estimating the throughput of a fine-grained pipelined system; ii) automated and versatile design space exploration; iii) a full suite of circuit level modules that connect together to implement a wide variety of system behaviors; and iv) testing and design for testability techniques that identify and target the types of errors found only in high-speed pipelined asynchronous systems. I demonstrate these techniques on a number of examples, ranging from simple applications that allow for easy comparison to hand-designed alternatives to more complex systems, such as a JPEG encoder. I also demonstrate these techniques through the design and test of a fully asynchronous GCD demonstration chip