72 research outputs found

    Null convention logic circuits for asynchronous computer architecture

    For most of its history, computer architecture has been able to benefit from rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average-case rather than worst-case performance, robustness in the face of process and operating-point variability, and the ready availability of high-performance, fine-grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi-delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL-based asynchronous RISC-V CPU and analyses the problems of applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty of controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but that there are distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural-level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combinational logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high-performance and area-efficient. It was found that 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors, so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32-bit × 32-bit signed Baugh-Wooley multiplier with Wallace-tree Carry-Save Adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept, as it is faster and has fewer pipeline stages than the normal array multiplier using Ripple-Carry Adders (RCA).
It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic, whose delay depends largely on the number of bits involved in the completion network. The average-case performance of ripple-carry adders was explored using random input vectors, and it was observed that it offers little advantage in the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles becomes increasingly important as word size grows. Finally, this research has resulted in the development of the first 32-bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow-completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry-standard 32-bit synchronous RISC-V cores (PicoRV32 and Rocket) that have already been fabricated and are used in industry. While the NCL implementation is larger than both commercial cores, it has similar performance to, and lower power than, the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC achieved a similar level of throughput, 43% better power, and 34% better energy compared to one of the synchronous cores under the same benchmark test and test conditions, such as input supply voltage. However, it was shown that area is the biggest drawback for NCL CPU design: the core is roughly 2.5× larger than the synchronous designs. On the other hand, its area is still 2.9× smaller than previous designs using the UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library.
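
Much of both the robustness and the area cost of NCL comes from its dual-rail, four-phase signalling. As a rough illustration only (this sketch is not taken from the thesis or from NELL; the function names are ours), the following Python fragment models the dual-rail encoding, in which NULL is (0, 0) and DATA asserts exactly one rail, together with the completion detection that tells a stage when a whole word has become DATA or has returned to NULL:

```python
# Illustrative sketch of NCL-style dual-rail signalling (not taken from the
# thesis or from NELL).  Each logical bit is carried on two rails:
# DATA0 = (1, 0), DATA1 = (0, 1), NULL = (0, 0).

NULL = (0, 0)

def encode(bit):
    """Encode a Boolean value onto a dual-rail pair."""
    return (0, 1) if bit else (1, 0)

def decode(rails):
    """Decode a dual-rail pair; returns None while the wire is still NULL."""
    if rails == NULL:
        return None
    if rails == (1, 0):
        return 0
    if rails == (0, 1):
        return 1
    raise ValueError("illegal dual-rail state (both rails asserted)")

def completion(word):
    """Completion detection for a word: all-DATA or all-NULL, else still switching."""
    if all(bit != NULL for bit in word):
        return "all DATA"    # downstream stage may consume the word
    if all(bit == NULL for bit in word):
        return "all NULL"    # stage has reset; the next DATA wavefront may enter
    return "incomplete"      # some bits are still switching; keep waiting

# Example: a 4-bit word caught in mid-transition.
word = [encode(1), encode(0), NULL, encode(1)]
print(completion(word))                # -> "incomplete"
print([decode(bit) for bit in word])   # -> [1, 0, None, 1]
```

It is this doubling of wires, plus the per-word completion network, that the abstract identifies as the main source of the roughly 2.5× area overhead relative to synchronous designs.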

    Low power predictable memory and processing architectures

    The great demand for power-optimized devices shows promising economic potential and has drawn considerable attention from both industry and research. With the continuous shrinking of CMOS processes, not only dynamic power but also static power has emerged as a major concern in power reduction. Beyond power optimization, average-case power estimation is significant for power budget allocation, but it is also challenging in terms of time and effort. In this thesis, we introduce a methodology to support modular quantitative analysis for estimating the average power of circuits, based on two concepts named Random Bag Preserving and Linear Compositionality. It shortens simulation time while sustaining high accuracy, increasing the feasibility of power estimation for large systems. For power saving, firstly, we take advantage of the low-power characteristics of adiabatic logic and asynchronous logic to achieve ultra-low dynamic and static power. We propose two memory cells which can run in adiabatic and non-adiabatic modes. About 90% of dynamic power can be saved in adiabatic mode compared to other up-to-date designs, and about 90% of leakage power is saved as well. Secondly, a novel logic family, named Asynchronous Charge Sharing Logic (ACSL), is introduced, in which the realization of completion detection is simplified considerably. Beyond the power reduction, ACSL brings another promising feature for average power estimation, data-independency, which makes power estimation straightforward and meaningful for modular quantitative average-case analysis. Finally, a new asynchronous Arithmetic Logic Unit (ALU) with a ripple-carry adder, implemented using a logically reversible/bidirectional characteristic and exhibiting ultra-low power dissipation at a sub-threshold operating point, is presented. The proposed adder is able to operate multi-functionally.

    An Ultra-Low-Power 75mV 64-Bit Current-Mode Majority-Function Adder

    Ultra-low-power circuits are becoming more desirable due to the growing portable-device market, and they are increasingly relevant and applicable in biomedical, pharmaceutical and sensor-networking applications because of nanometric scaling and CMOS reliability improvements. In this thesis, three main achievements in ultra-low-power adders are presented. First, a new majority-function algorithm for carry and sum generation is presented. Then, with this algorithm and the new architecture it implies, a circuit operating from a 75 mV supply voltage is achieved. Finally, a 64-bit current-mode majority-function adder based on the new architecture and algorithm is successfully tested at a 75 mV supply voltage. The circuit consumes 4.5 nW, or 3.8 pJ, under one of the worst-case conditions.
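
For readers unfamiliar with the majority-function formulation, note that the carry-out of a standard full adder is exactly the three-input majority of its two operand bits and the incoming carry (a textbook identity, not the thesis's current-mode circuit). A minimal Python sketch of a ripple-carry adder built around that identity:

```python
# Minimal sketch: ripple-carry addition where each carry is the 3-input majority
# of (a_i, b_i, carry_in). This is textbook full-adder logic, shown only to
# illustrate why a majority-function primitive maps naturally onto an adder.

def majority(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)

def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists; returns sum bits plus final carry."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)          # sum bit
        carry = majority(a, b, carry)      # carry-out = MAJ(a, b, cin)
    return out + [carry]

# 6 + 3 = 9 with 4-bit operands (little-endian)
print(ripple_add([0, 1, 1, 0], [1, 1, 0, 0]))   # -> [1, 0, 0, 1, 0]
```

The thesis's contribution lies in generating the sum as well through majority functions and realising both in current mode at a 75 mV supply; the sketch above only shows why a majority primitive is a natural fit for addition.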

    The Fifth NASA Symposium on VLSI Design

    The fifth annual NASA Symposium on VLSI Design had 13 sessions, including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other featured presentations. The symposium provides insights into developments in VLSI and digital systems that can be used to increase data-system performance. The presentations share insights into next-generation advances that will serve as a basis for future VLSI design.

    Average-case analysis of power consumption in embedded systems

    Power efficiency is one of the most important constraints in the design of embedded systems, since such systems are generally driven by batteries with a limited energy budget or a restricted power supply. In every embedded system, there are one or more processor cores to run the software and interact with the other hardware components of the system. The power consumption of the processor core(s) has an important impact on the total power dissipated in the system. Hence, processor power optimization is crucial in satisfying the power consumption constraints and developing low-power embedded systems. A key aspect of research in processor power optimization and management is “power estimation”. Having a fast and accurate method for processor power estimation at design time helps the designer to explore a large space of design possibilities and to make the optimal choices for developing a power-efficient processor. Likewise, understanding the power dissipation behaviour of a specific piece of software is key to choosing appropriate algorithms in order to write power-efficient software. Simulation-based methods for measuring processor power achieve very high accuracy, but are available only late in the design process and are often quite slow. Therefore, the need has arisen for faster, higher-level power prediction methods that allow the system designer to explore many alternatives for developing power-efficient hardware and software. The aim of this thesis is to present fast, high-level power models for the prediction of processor power consumption. Power predictability in this work is achieved in two ways: first, by using a design method to develop power-predictable circuits; second, by analysing the power of the functions in the code that repeat during execution and building the power model based on the average number of repetitions. In the first case, a design method called Asynchronous Charge Sharing Logic (ACSL) is used to implement the Arithmetic Logic Unit (ALU) for the 8051 microcontroller. The ACSL circuits are power predictable because their power consumption is independent of the input data. Based on this property, a fast prediction method is presented that estimates the power of the ALU by analysing the software program and extracting the number of ALU-related instructions. This method achieves less than 1% error in power estimation and a more than 100× speedup in comparison to conventional simulation-based methods. In the second case, an average-case processor energy model is developed for the insertion sort algorithm based on the number of comparisons that take place during the execution of the algorithm. The average number of comparisons is calculated using a high-level methodology called MOdular Quantitative Analysis (MOQA). The parameters of the energy model are measured for the LEON3 processor core, but the model is general and can be used for any processor. The model has been validated through power measurement experiments, and it offers high accuracy and orders-of-magnitude speedup over the simulation-based method.
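
As a hedged illustration of the second approach, the sketch below multiplies the average comparison count of insertion sort (about n²/4 for random inputs) by a per-comparison energy cost plus a fixed overhead. The MOQA calculation and the measured LEON3 coefficients belong to the thesis; the coefficient values and function names here are placeholders:

```python
# Sketch of a comparison-count-based average-case energy model for insertion sort.
# e_cmp and e_overhead are hypothetical placeholders, NOT the measured LEON3 values.
import random

def insertion_sort_comparisons(data):
    """Run insertion sort and count the key comparisons it performs."""
    a, comparisons = list(data), 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0:
            comparisons += 1
            if a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            else:
                break
        a[j + 1] = key
    return comparisons

def estimated_energy(n, trials=1000, e_cmp=1.0e-9, e_overhead=5.0e-8):
    """Average-case estimate: E ≈ e_overhead + e_cmp * mean comparisons (≈ n²/4)."""
    mean_cmps = sum(
        insertion_sort_comparisons(random.sample(range(10 * n), n))
        for _ in range(trials)
    ) / trials
    return e_overhead + e_cmp * mean_cmps

print(estimated_energy(64))   # joules, under the placeholder coefficients
```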

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MPSoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With the number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how best to provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIP (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet-switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions:
    • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the on-chip domain.
    • Simulation and verification infrastructure must be put in place to explore, validate and optimize NoC performance.
    • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions.
    • Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to assess their suitability for next-generation designs and their area and power costs.
    This dissertation performs a design space exploration of network-on-chip architectures, in order to point out the trade-offs associated with the design of each individual network building block and with the design of the network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with each other and with early network-on-chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures, and the challenges that lie ahead in making this new interconnect technology a reality. Among the latter, technology-related challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy, in particular leakage power dissipation and the containment of process variations and their effects. The achievement of the above objectives was enabled by a NoC simulation environment for cycle-accurate modelling and simulation, and by a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout.

    NASA SERC 1990 Symposium on VLSI Design

    This document contains papers presented at the first annual NASA Symposium on VLSI Design. NASA's involvement in this event demonstrates a need for research and development in high-performance computing, which addresses problems faced by the scientific and industrial communities. High-performance computing is needed in: (1) real-time manipulation of large data sets; (2) advanced systems control of spacecraft; (3) digital data transmission, error correction, and image compression; and (4) expert system control of spacecraft. Clearly, a valuable technology in meeting these needs is Very Large Scale Integration (VLSI). This conference addresses the following issues in VLSI design: (1) system architectures; (2) electronics; (3) algorithms; and (4) CAD tools.

    Single event upset hardened embedded domain specific reconfigurable architecture


    Emerging Technologies - NanoMagnets Logic (NML)

    In recent decades CMOS technology has ruled the electronics scenario thanks to the constant scaling of transistor sizes. As transistor sizes shrink, circuit area decreases, clock frequency increases, and power consumption decreases accordingly. However, CMOS scaling is now approaching its physical limits, and many believe that CMOS technology will not be able to reach the end of the roadmap. This is mainly due to increasing difficulties in the fabrication process, which is becoming very expensive, and to the unavoidable impact of leakage losses, particularly gate tunnelling current. In this scenario, many alternative technologies are being studied to overcome the limitations of CMOS transistors. Among these possibilities, magnet-based technologies such as NanoMagnet Logic (NML) are among the most interesting. The reason for this interest lies in their magnetic nature, which opens up entirely new possibilities in the design of logic circuits, such as the possibility of mixing logic and memory in the same device. Moreover, they have no standby power consumption and potentially much lower power consumption than CMOS transistors. In the literature, NML logic is well studied, and theoretical and experimental proofs of concept have already been produced. However, two important points are insufficiently considered in the analysis approach followed by most of the work in the literature. First of all, no complex circuits are analyzed. NML logic is very different from CMOS technologies, so to completely understand the potential of this technology it is mandatory to investigate complex architectures. Secondly, most of the proposed solutions do not take into account the constraints imposed by the fabrication process, making them unrealistic and difficult to fabricate experimentally. This thesis therefore focuses on NML logic while taking into account these two important limitations of the research approach followed in the literature. The aim is to obtain a complete and accurate overview of NML logic, finding realistic circuit solutions and trying at the same time to improve their performance. After a brief but complete introduction (Chapter 1), the thesis is divided into two parts, which cover the two fundamental directions followed during these three years of research: a circuit architecture analysis and a technological analysis. In the architecture analysis, an innovative VHDL model is first described in Chapter 2. This model is used extensively in the analysis because it allows fast simulation of complex circuits while at the same time estimating circuit performance metrics such as area and power consumption. In Chapter 3, the problem of signal synchronization in complex NML circuits is analyzed and solved, using as a benchmark a simple but complete NML microprocessor. Different solutions based on asynchronous logic are studied, and a new asynchronous solution, specifically designed to exploit the potential of NML logic, is developed. In Chapter 4, the layout of NML circuits is studied at a more physical level, considering the limitations of fabrication processes, and the layout is changed according to these constraints. Secondly, CMOS circuit architectures are compared to simpler architectures, evaluating which is better suited to NML logic. Finally, the problem of interconnections in NML technology is analyzed and solutions to improve it are found. In Chapter 5, the problem of feedback signals in heavily pipelined technologies, such as NML, is studied.
Solutions to improve performance and synchronize signals are developed. Systolic arrays are then analyzed as possible candidates for exploiting the potential of NML. Finally, in Chapter 6, ToPoliNano, a simulator that we are developing for NML and other emerging technologies, is described. This simulator makes it possible to follow the same top-down approach used for CMOS technology. The layout generator and the simulation engine are described in detail. In the first chapter of the technological analysis (Chapter 7), the performance of NML logic is explored through low-level simulations. The aim is to understand whether these circuits can be fabricated with optical lithography, thereby allowing the commercial development of NML logic. Basic logic gates and the clock system are analyzed there from a low-level perspective. In Chapter 8, an innovative electric clock system for NML technology is shown and the first experimental results are reported. This clock system makes truly low-power operation achievable for NML technology, obtaining a 20-fold reduction in power consumption compared with the best CMOS transistors available. This figure takes all losses into account, including those of the clock system. Moreover, the solution presented can be fabricated with current technological processes. The research work behind this thesis represents an important breakthrough in NML logic. The solutions presented here allow the design and fabrication of complex NML circuits, taking the particular characteristics of this technology into account and considerably improving performance. Moreover, the technological solutions presented here allow the design and fabrication of circuits with available fabrication processes, with a considerable advantage over CMOS in terms of power consumption. This thesis therefore represents a considerable step forward in the study and development of NML technology.

    Strategies for neural networks in ballistocardiography with a view towards hardware implementation

    A thesis submitted for the degree of Doctor of Philosophy at the University of Luton. The work described in this thesis is based on the results of a clinical trial conducted by the research team at the Medical Informatics Unit of the University of Cambridge, which show that the Ballistocardiogram (BCG) has prognostic value in detecting impaired left ventricular function before it becomes clinically overt as myocardial infarction leading to sudden death. The objective of this study is to develop and demonstrate a framework for realising an on-line BCG signal classification model in a portable device that would have the potential to find pathological signs as early as possible for home health care. Two new on-line automatic BCG classification models for time-domain BCG classification are proposed. Both systems are based on a two-stage process: input feature extraction followed by a neural classifier. One system uses a principal component analysis neural network, and the other a discrete wavelet transform, to reduce the input dimensionality. Results of the classification, dimensionality reduction, and comparison are presented. They indicate that the combined wavelet transform and MLP system performs more reliably than the combined neural networks system in situations where the data available to determine the network parameters is limited. Moreover, the wavelet transform requires no prior knowledge of the statistical distribution of the data samples, and the computational complexity and training time are reduced. Overall, a methodology for realising an automatic BCG classification system for a portable instrument is presented. A fully parallel neural network design for a low-cost platform using field-programmable gate arrays (Xilinx's XC4000 series) is explored. This addresses the potential speed requirements in the biomedical signal processing field. It also demonstrates a flexible hardware design approach, so that an instrument's parameters can be updated as data expands over time. To reduce the hardware design complexity and to increase system performance, a hybrid learning algorithm using random optimisation and the backpropagation rule is developed to achieve an efficient weight update mechanism for low weight-precision learning. The simulation results show that the hybrid learning algorithm is effective in solving the network paralysis problem, and convergence is much faster than with the standard backpropagation rule. The hidden and output layer nodes have been mapped onto Xilinx FPGAs with automatic placement and routing tools. The static timing analysis results suggest that the proposed network implementation could achieve a performance of 2.7 billion connections per second.
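
The hybrid weight-update scheme can be pictured as a backpropagation step followed by a random perturbation that is kept only if it lowers the loss. The sketch below is a generic reconstruction under that assumption, in plain Python/NumPy; it does not reproduce the thesis's low-precision weights, network sizes or FPGA mapping:

```python
# Hedged sketch of a hybrid weight update combining backpropagation with random
# optimisation (a random perturbation is kept only if it lowers the loss).
# Generic reconstruction for illustration, not the thesis's exact algorithm.
import numpy as np

rng = np.random.default_rng(1)

def forward(W1, W2, X):
    """One-hidden-layer sigmoid MLP; a bias column is appended to each layer input."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    h = 1.0 / (1.0 + np.exp(-(Xb @ W1)))
    hb = np.hstack([h, np.ones((h.shape[0], 1))])
    out = 1.0 / (1.0 + np.exp(-(hb @ W2)))
    return Xb, h, hb, out

def loss(W1, W2, X, y):
    return float(np.mean((forward(W1, W2, X)[3] - y) ** 2))

def hybrid_step(W1, W2, X, y, lr=0.5, noise=0.02):
    Xb, h, hb, out = forward(W1, W2, X)
    # Backpropagation step (sigmoid activations, mean-squared-error loss).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2[:-1].T) * h * (1 - h)
    W2_bp = W2 - lr * hb.T @ d_out
    W1_bp = W1 - lr * Xb.T @ d_h
    # Random-optimisation step: try a small random perturbation, keep it if it helps.
    W1_ro = W1_bp + noise * rng.standard_normal(W1.shape)
    W2_ro = W2_bp + noise * rng.standard_normal(W2.shape)
    if loss(W1_ro, W2_ro, X, y) < loss(W1_bp, W2_bp, X, y):
        return W1_ro, W2_ro
    return W1_bp, W2_bp

# Toy usage: learn XOR with a 2-4-1 network.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = 0.5 * rng.standard_normal((3, 4)), 0.5 * rng.standard_normal((5, 1))
print("initial loss:", round(loss(W1, W2, X, y), 4))
for _ in range(5000):
    W1, W2 = hybrid_step(W1, W2, X, y)
print("final loss:  ", round(loss(W1, W2, X, y), 4))
```

In this reading, backpropagation does most of the work and the accept-if-better random step mainly helps the network escape flat regions, which is consistent with the abstract's claim that the hybrid rule mitigates the network paralysis problem.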