Search CORE

4 research outputs found

Null convention logic circuits for asynchronous computer architecture

Author: Kim M
Publication venue: RMIT University
Publication date
Field of study

For most of its history, computer architecture has been able to benefit from a rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then to nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average case vs. worse case performance, robustness in the face of process and operating point variability and the ready availability of high performance, fine grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL based asynchronous RISC-V CPU and analyses the problems with applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty in controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but there are some distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combina- tional logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high performance and area-efficient. It was found that the performance of 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32bit x 32bit signed Baugh-Wooley multiplier with Wallace-Tree Carry Save Adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept as it is faster and has fewer pipeline stages compared to the normal array multiplier using Ripple-Carry adders (RCA). It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic for which the delay time depends largely on the number of bits involved in the completion network. The average-case performance of ripple-carry adders was explored using random input vectors and it was observed that it offers little advantage on the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles be- comes increasingly more important as word size grows. Finally, this research has resulted in the development of the first 32-Bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry standard 32-Bit synchronous RISC-V cores (PicoRV32 and Rocket) that are already fabricated and used in industry. While the NCL implementation is larger than both commercial cores it has similar performance and lower power compared to the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC has achieved similar level of throughput and 43% better power and 34% better energy compared to one of the synchronous cores with the same benchmark test and test condition such as input sup- ply voltage. However, it was shown that area is the biggest drawback for NCL CPU design. The core is roughly 2.5&times; larger than synchronous designs. On the other hand its area is still 2.9&times; smaller than previous designs using UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library

RMIT Research Repository

Synthesis of variability-tolerant circuits with adaptive clocking

Author: Moreno Vega Alberto
Publication venue: Universitat Politècnica de Catalunya
Publication date: 08/03/2019
Field of study

Improvements in circuit manufacturing have allowed, along the years, increasingly complex designs. This has been enabled by the miniaturization that circuit components have undergone. But, in recent years, this scaling has shown decreasing benefits as we approach fundamental limits. Furthermore, the decrease in size is nowadays producing an increase in variability: unpredictable differences and changes in the behavior of components. Historically, this has been addressed by establishing guardband margins at the design stage. Nonetheless, as variability grows, the amount of pessimism introduced by these margins is taking an ever-increasing cost on performance and power consumption. In recent years, several approaches have been proposed to lower the impact of variability and reduce margins. One such technique is the substitution of a classical PLL clock by a Ring Oscillator Clock. The design of the Ring Oscillator Clock is done in such a way that its variability is highly correlated to that of the circuit. One of the contributions of this thesis is in the automatic design of such circuits. In particular, we propose a novel method to design digital delay lines with variability-tracking properties. Those designs are also suitable for other purposes, such as bundled-data circuits or performance monitors. The advantage of the proposed technique is based on the exclusive use of cells from a standard cell library, which lowers the design cost and complexity. The other focus of this thesis is on state encoding for asynchronous controllers. One of the main properties of asynchronous circuits is their ability to, implicitly, work under variable conditions. In the near future, this advantage might increase the relevance of this class of circuits. One of the hardest stages for the synthesis of these circuits is the state encoding. This thesis presents a SAT-based algorithm for solving the state encoding at the state level. It is shown, by means of a comprehensive benchmark suite, that results obtained by this technique improve significantly compared to results from similar approaches. Nonetheless, the main limitation of techniques at the state level is the state explosion problem, to which the sequential modeling of concurrency is often subject to. The last contribution of this thesis is a method to process asynchronous circuits in order to allow the use of state-based techniques for large instances. In particular, the process is divided into three stages: projection, signal insertion and re-composition. In the projection step, the behavior of the controller is simplified until the signal insertion can be performed by state-based techniques. Afterwards, the re-composition generalizes the insertion of the signal into the original controller. Experimental results show that this process enables the resolution of large controllers, in the order of 10 6 states, by state-based techniques. At the same time, only a minor impact in solution quality is observed, preserving one of the main advantages for state-based approaches.A lo largo de los años, mejoras en la fabricación de circuitos han permitido diseños cada vez más complejos. Esta tendencia, que ha tenido lugar gracias a la miniaturización de los componentes que forman estos circuitos, recientemente está mostrando beneficios decrecientes a medida que nos acercamos a ciertas limitaciones fundamentales. Además de estos beneficios decrecientes, la reducción en tamaño está produciendo un aumento, cada vez mayor, en la variabilidad: diferencias impredecibles y cambios en el comportamiento de los componentes. Esto se ha compensado históricamente con el uso de márgenes de seguridad en la fase de diseño. No obstante, a medida que la variabilidad crece, la cantidad de pesimismo que estos márgenes introducen está afectando significativamente el coste en rendimiento y consumo energético. En los últimos años se han propuesto diferentes técnicas para limitar el impacto de la variabilidad y reducir márgenes de seguridad. Una de estas técnicas consiste en substituir un reloj PLL clásico por un Ring Oscillator Clock. El diseño de un Ring Oscillator Clock se realiza de manera que su variabilidad este altamente correlacionada con la del circuito. Una de las contribuciones de esta tesis consiste en el diseño automático de estos relojes. Concretamente, se propone un nuevo método para diseñar líneas de retardo digitales (digital delay lines) que tengan como propiedad la capacidad de imitar la variabilidad de un circuito dado. Estos diseños son también apropiados para otros propósitos, tal y como circuitos con ?bundled-data? o monitorizadores de rendimiento. La ventaja del método propuesto con respecto a otras técnicas similares radica en el uso exclusivo de celdas provenientes de una librería de celdas estándar, lo que reduce considerablemente el coste de diseño y su complejidad. Por otro lado, esta tesis también se centra en la codificación de estados de circuitos asíncronos. Una de las principales propiedades de estos circuitos reside en su capacidad implícita para trabajar bajo condiciones de variabilidad. Es previsible que, en un futuro próximo, esta ventaja se vuelva aún más relevante. La síntesis de circuitos asíncronos consta de varias etapas, una de las cuales es la codificación de estados. Este trabajo presenta un algoritmo basado en SAT que permite resolver la codificación de estados a nivel de estado. Mediante el uso de un exhaustivo banco de pruebas, esta tesis muestra como resultados obtenidos por esta técnica mejoran significativamente en comparación con otros métodos similares. A pesar de ello, técnicas que trabajan a nivel de estado tienen como principal limitación el problema conocido como "explosión de estados" que aparece habitualmente cuando se modelan elementos concurrentes de manera secuencial. Así pues, la última contribución de esta tesis es la propuesta de un método para procesar circuitos asíncronos de manera que técnicas a nivel de estado sean usables para instancias grandes. En concreto, el proceso está dividido en tres fases: proyección, inserción de señal y re-composición. En la etapa de proyección, el comportamiento del controlador es simplificado suficientemente como para que la inserción de la señal se pueda realizar con técnicas a nivel de estado. A continuación, la re-composición generaliza esta inserción en el controlador original. Resultados experimentales muestran que este proceso permite la resolución de grandes controladores, del orden de 10^6 estados, mediante el uso de técnicas a nivel de estado. Al mismo tiempo, solo se observa un impacto mínimo en la calidad de las soluciones, preservando una de las mayores ventajas de los métodos a nivel de estado

Tesis Doctorals en Xarxa