4 research outputs found

    Relative timing

    Get PDF
    Journal ArticleRelative Timing is introduced as an informal method for aggressive asynchronous design. It is demonstrated on three example circuits (C-Element, FIFO, and RAPPID Tag Unit), facilitating transformations from speed-independent circuits to burst-mode, relative timed, and pulse-mode circuits. Relative timing enables improved performance, area, power and testability in all three cases

    Design And Synthesis Of Clockless Pipelines Based On Self-resetting Stage Logic

    Get PDF
    For decades, digital design has been primarily dominated by clocked circuits. With larger scales of integration made possible by improved semiconductor manufacturing techniques, relying on a clock signal to orchestrate logic operations across an entire chip became increasingly difficult. Motivated by this problem, designers are currently considering circuits which can operate without a clock. However, the wide acceptance of these circuits by the digital design community requires two ingredients: (i) a unified design methodology supported by widely available CAD tools, and (ii) a granularity of design techniques suitable for synthesizing large designs. Currently, there is no unified established design methodology to support the design and verification of these circuits. Moreover, the majority of clockless design techniques is conceived at circuit level, and is subsequently so fine-grain, that their application to large designs can have unacceptable area costs. Given these considerations, this dissertation presents a new clockless technique, called self-resetting stage logic (SRSL), in which the computation of a block is reset periodically from within the block itself. SRSL is used as a building block for three coarse-grain pipelining techniques: (i) Stage-controlled self-resetting stage logic (S-SRSL) Pipelines: In these pipelines, the control of the communication between stages is performed locally between each pair of stages. This communication is performed in a uni-directional manner in order to simplify its implementation. (ii) Pipeline-controlled self-resetting stage logic (P-SRSL) Pipelines: In these pipelines, the communication between each pair of stages in the pipeline is driven by the oscillation of the last pipeline stage. Their communication scheme is identical to the one used in S-SRSL pipelines. (iii) Delay-tolerant self-resetting stage logic (D-SRSL) Pipelines: While communication in these pipelines is local in nature in a manner similar to the one used in S-SRL pipelines, this communication is nevertheless extended in both directions. The result of this bi-directional approach is an increase in the capability of the pipeline to handle stages with random delay. Based on these pipelining techniques, a new design methodology is proposed to synthesize clockless designs. The synthesis problem consists of synthesizing an SRSL pipeline from a gate netlist with a minimum area overhead given a specified data rate. A two-phase heuristic algorithm is proposed to solve this problem. The goal of the algorithm is to pipeline a given datapath by minimizing the area occupied by inter-stage latches without violating any timing constraints. Experiments with this synthesis algorithm show that while P-SRSL pipelines can reach high throughputs in shallow pipelines, D-SRSL pipelines can achieve comparable throughputs in deeper pipelines

    Architectural Exploration of KeyRing Self-Timed Processors

    Get PDF
    RÉSUMÉ Les dernières décennies ont vu l’augmentation des performances des processeurs contraintes par les limites imposées par la consommation d’énergie des systèmes électroniques : des très basses consommations requises pour les objets connectés, aux budgets de dépenses électriques des serveurs, en passant par les limitations thermiques et la durée de vie des batteries des appareils mobiles. Cette forte demande en processeurs efficients en énergie, couplée avec les limitations de la réduction d’échelle des transistors—qui ne permet plus d’améliorer les performances à densité de puissance constante—, conduit les concepteurs de circuits intégrés à explorer de nouvelles microarchitectures permettant d’obtenir de meilleures performances pour un budget énergétique donné. Cette thèse s’inscrit dans cette tendance en proposant une nouvelle microarchitecture de processeur, appelée KeyRing, conçue avec l’intention de réduire la consommation d’énergie des processeurs. La fréquence d’opération des transistors dans les circuits intégrés est proportionnelle à leur consommation dynamique d’énergie. Par conséquent, les techniques de conception permettant de réduire dynamiquement le nombre de transistors en opération sont très largement adoptées pour améliorer l’efficience énergétique des processeurs. La technique de clock-gating est particulièrement usitée dans les circuits synchrones, car elle réduit l’impact de l’horloge globale, qui est la principale source d’activité. La microarchitecture KeyRing présentée dans cette thèse utilise une méthode de synchronisation décentralisée et asynchrone pour réduire l’activité des circuits. Elle est dérivée du processeur AnARM, un processeur développé par Octasic sur la base d’une microarchitecture asynchrone ad hoc. Bien qu’il soit plus efficient en énergie que des alternatives synchrones, le AnARM est essentiellement incompatible avec les méthodes de synthèse et d’analyse temporelle statique standards. De plus, sa technique de conception ad hoc ne s’inscrit que partiellement dans les paradigmes de conceptions asynchrones. Cette thèse propose une approche rigoureuse pour définir les principes généraux de cette technique de conception ad hoc, en faisant levier sur la littérature asynchrone. La microarchitecture KeyRing qui en résulte est développée en association avec une méthode de conception automatisée, qui permet de s’affranchir des incompatibilités natives existant entre les outils de conception et les systèmes asynchrones. La méthode proposée permet de pleinement mettre à profit les flots de conception standards de l’industrie microélectronique pour réaliser la synthèse et la vérification des circuits KeyRing. Cette thèse propose également des protocoles expérimentaux, dont le but est de renforcer la relation de causalité entre la microarchitecture KeyRing et une réduction de la consommation énergétique des processeurs, comparativement à des alternatives synchrones équivalentes.----------ABSTRACT Over the last years, microprocessors have had to increase their performances while keeping their power envelope within tight bounds, as dictated by the needs of various markets: from the ultra-low power requirements of the IoT, to the electrical power consumption budget in enterprise servers, by way of passive cooling and day-long battery life in mobile devices. This high demand for power-efficient processors, coupled with the limitations of technology scaling—which no longer provides improved performances at constant power densities—, is leading designers to explore new microarchitectures with the goal of pulling more performances out of a fixed power budget. This work enters into this trend by proposing a new processor microarchitecture, called KeyRing, having a low-power design intent. The switching activity of integrated circuits—i.e. transistors switching on and off—directly affects their dynamic power consumption. Circuit-level design techniques such as clock-gating are widely adopted as they dramatically reduce the impact of the global clock in synchronous circuits, which constitutes the main source of switching activity. The KeyRing microarchitecture presented in this work uses an asynchronous clocking scheme that relies on decentralized synchronization mechanisms to reduce the switching activity of circuits. It is derived from the AnARM, a power-efficient ARM processor developed by Octasic using an ad hoc asynchronous microarchitecture. Although it delivers better power-efficiency than synchronous alternatives, it is for the most part incompatible with standard timing-driven synthesis and Static Timing Analysis (STA). In addition, its design style does not fit well within the existing asynchronous design paradigms. This work lays the foundations for a more rigorous definition of this rather unorthodox design style, using circuits and methods coming from the asynchronous literature. The resulting KeyRing microarchitecture is developed in combination with Electronic Design Automation (EDA) methods that alleviate incompatibility issues related to ad hoc clocking, enabling timing-driven optimizations and verifications of KeyRing circuits using industry-standard design flows. In addition to bridging the gap with standard design practices, this work also proposes comprehensive experimental protocols that aims to strengthen the causal relation between the reported asynchronous microarchitecture and a reduced power consumption compared with synchronous alternatives. The main achievement of this work is a framework that enables the architectural exploration of circuits using the KeyRing microarchitecture

    Null convention logic circuits for asynchronous computer architecture

    Get PDF
    For most of its history, computer architecture has been able to benefit from a rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then to nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average case vs. worse case performance, robustness in the face of process and operating point variability and the ready availability of high performance, fine grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL based asynchronous RISC-V CPU and analyses the problems with applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty in controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but there are some distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combina- tional logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high performance and area-efficient. It was found that the performance of 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32bit x 32bit signed Baugh-Wooley multiplier with Wallace-Tree Carry Save Adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept as it is faster and has fewer pipeline stages compared to the normal array multiplier using Ripple-Carry adders (RCA). It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic for which the delay time depends largely on the number of bits involved in the completion network. The average-case performance of ripple-carry adders was explored using random input vectors and it was observed that it offers little advantage on the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles be- comes increasingly more important as word size grows. Finally, this research has resulted in the development of the first 32-Bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry standard 32-Bit synchronous RISC-V cores (PicoRV32 and Rocket) that are already fabricated and used in industry. While the NCL implementation is larger than both commercial cores it has similar performance and lower power compared to the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC has achieved similar level of throughput and 43% better power and 34% better energy compared to one of the synchronous cores with the same benchmark test and test condition such as input sup- ply voltage. However, it was shown that area is the biggest drawback for NCL CPU design. The core is roughly 2.5× larger than synchronous designs. On the other hand its area is still 2.9× smaller than previous designs using UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library
    corecore