283 research outputs found

    Bio-inspired learning and hardware acceleration with emerging memories

    Get PDF
    Machine Learning has permeated many aspects of engineering, ranging from the Internet of Things (IoT) applications to big data analytics. While computing resources available to implement these algorithms have become more powerful, both in terms of the complexity of problems that can be solved and the overall computing speed, the huge energy costs involved remains a significant challenge. The human brain, which has evolved over millions of years, is widely accepted as the most efficient control and cognitive processing platform. Neuro-biological studies have established that information processing in the human brain relies on impulse like signals emitted by neurons called action potentials. Motivated by these facts, the Spiking Neural Networks (SNNs), which are a bio-plausible version of neural networks have been proposed as an alternative computing paradigm where the timing of spikes generated by artificial neurons is central to its learning and inference capabilities. This dissertation demonstrates the computational power of the SNNs using conventional CMOS and emerging nanoscale hardware platforms. The first half of this dissertation presents an SNN architecture which is trained using a supervised spike-based learning algorithm for the handwritten digit classification problem. This network achieves an accuracy of 98.17% on the MNIST test data-set, with about 4X fewer parameters compared to the state-of-the-art neural networks achieving over 99% accuracy. In addition, a scheme for parallelizing and speeding up the SNN simulation on a GPU platform is presented. The second half of this dissertation presents an optimal hardware design for accelerating SNN inference and training with SRAM (Static Random Access Memory) and nanoscale non-volatile memory (NVM) crossbar arrays. Three prominent NVM devices are studied for realizing hardware accelerators for SNNs: Phase Change Memory (PCM), Spin Transfer Torque RAM (STT-RAM) and Resistive RAM (RRAM). The analysis shows that a spike-based inference engine with crossbar arrays of STT-RAM bit-cells is 2X and 5X more efficient compared to PCM and RRAM memories, respectively. Furthermore, the STT-RAM design has nearly 6X higher throughput per unit Watt per unit area than that of an equivalent SRAM-based (Static Random Access Memory) design. A hardware accelerator with on-chip learning on an STT-RAM memory array is also designed, requiring 1616 bits of floating-point synaptic weight precision to reach the baseline SNN algorithmic performance on the MNIST dataset. The complete design with STT-RAM crossbar array achieves nearly 20X higher throughput per unit Watt per unit mm^2 than an equivalent design with SRAM memory. In summary, this work demonstrates the potential of spike-based neuromorphic computing algorithms and its efficient realization in hardware based on conventional CMOS as well as emerging technologies. The schemes presented here can be further extended to design spike-based systems that can be ubiquitously deployed for energy and memory constrained edge computing applications

    Digital Simulations of Memristors Towards Integration with Reconfigurable Computing

    Get PDF
    The end of Moore’s Law has been predicted for decades. Demand for increased parallel computational performance has been increased by improvements in machine learning. This past decade has demonstrated the ever-increasing creativity and effort necessary to extract scaling improvements in CMOS fabrication processes. However, CMOS scaling is nearing its fundamental physical limits. A viable path for increasing performance is to break the von Neumann bottleneck. In-memory computing using emerging memory technologies (e.g. ReRam, STT, MRAM) offers a potential path beyond the end of Moore’s Law. However, there is currently very little support from industry tools for designers wishing to incorporate these devices and novel architectures. The primary issue for those using these tools is the lack of support for mixed-signal design, as HDLs such as Verilog were designed to work only with digital components. This work aims to improve the ability for designers to rapidly prototype their designs using these emerging memory devices, specifically memristors, by extending Verilog to support functional simulation of memristors with the Verilog Procedural Interface (VPI). In this work, demonstrations of the ability for the VPI to simulate memristors with the nonlinear ion-drift model and the behavior of a memristive crossbar array are presented

    Extended update plans

    Get PDF
    Formal methods are gaining popularity as a way of increasing the reliability of systems through the use of mathematically based techniques. Their domain is no longer restricted to purely academic environments and examples, as they are slowly moving into industrial settings. The slow rate at which this transition takes place is mainly due to the perceived difficulty of formalising the behaviour of systems. While this is undoubtedly true, it is not the case with all formal methods. Update Plans are a powerful formalism for the description of computer architectures and intermediate to low-level languages. They are a declarative specification language with an underlying imperative machine model. The descriptions using Update Plans are clear, compact, intuitive, unambiguous and simple to read. These characteristics allow for the minimisation of possible errors at early stages of the development process even before a verification takes place. In this thesis an overview of the Update Plans formalism is given and a number of realworld applications is shown. The investigation of the application area focuses on computer architectures for which various specifications already exist. The comparison of Update Plan specifications to other specifications provides a useful insight into the strengths and shortcomings of the formalism. The shortcomings, in particular the lack of synchronisation primitives and modularity, are addressed by the development and evaluation of several syntactic and semantic extensions described in this thesis. The extended formalism is also compared to other specification languages and conclusions are drawn

    Digital Simulations of Memristors Towards Integration with Reconfigurable Computing

    Get PDF
    The end of Moore’s Law has been predicted for decades. Demand for increased parallel computational performance has been increased by improvements in machine learning. This past decade has demonstrated the ever-increasing creativity and effort necessary to extract scaling improvements in CMOS fabrication processes. However, CMOS scaling is nearing its fundamental physical limits. A viable path for increasing performance is to break the von Neumann bottleneck. In-memory computing using emerging memory technologies (e.g. ReRam, STT, MRAM) offers a potential path beyond the end of Moore’s Law. However, there is currently very little support from industry tools for designers wishing to incorporate these devices and novel architectures. The primary issue for those using these tools is the lack of support for mixed-signal design, as HDLs such as Verilog were designed to work only with digital components. This work aims to improve the ability for designers to rapidly prototype their designs using these emerging memory devices, specifically memristors, by extending Verilog to support functional simulation of memristors with the Verilog Procedural Interface (VPI). In this work, demonstrations of the ability for the VPI to simulate memristors with the nonlinear ion-drift model and the behavior of a memristive crossbar array are presented

    Device Modeling and Circuit Design of Neuromorphic Memory Structures

    Get PDF
    The downscaling of CMOS technology and the benefits gleaned thereof have made it the cornerstone of the semiconductor industry for many years. As the technology reaches its fundamental physical limits, however, CMOS is expected to run out of steam instigating the exploration of new nanoelectronic devices. Memristors have emerged as promising candidates for future computing paradigms, specifically, memory arrays and neuromorphic circuits. Towards this end, this dissertation will explore the use of two memristive devices, namely, Transition Metal Oxide (TMO) devices and Insulator Metal Transition (IMT) devices in constructing neuromorphic circuits. A compact model for TMO devices is first proposed and verified against experimental data. The proposed model, unlike most of the other models present in the literature, leverages the instantaneous resistance of the device as the state variable which facilitates parameter extraction. In addition, a model for the forming voltage of TMO devices is developed and verified against experimental data and Monte Carlo simulations. Impact of the device geometry and material characteristics of the TMO device on the forming voltage is investigated and techniques for reducing the forming voltage are proposed. The use of TMOs in syanptic arrays is then explored and a multi-driver write scheme is proposed that improves their performance. The proposed technique enhances voltage delivery across the selected cells via suppressing the effective line resistance and leakage current paths, thus, improving the performance of the crossbar array. An IMT compact model is also developed and verified against experiemntal data and electro-thermal device simulations. The proposed model describes the device as a memristive system with the temperature being the state variable, thus, capturing the temperature dependent resistive switching of the IMT device in a compact form suitable for SPICE implementation. An IMT based Integrate-And-Fire neuron is then proposed. The IMT neuron leverages the temperature dynamics of the device to deliver the functionality of the neuron. The proposed IMT neuron is more compact than its CMOS counterparts as it alleviates the need for complex CMOS circuitry. Impact of the IMT device parameters on the neuron\u27s performance is then studied and design considerations are provided

    A Model for Multilevel Phase-Change Memories Incorporating Resistance Drift Effects

    Get PDF
    Phase change memories are emerging as a most promising technology for future nonvolatile, solid-state, electrical storage. However, to compete effectively in mainstream storage applications, a multilevel cell capability is most desirable. Unfortunately, phase-change memories exhibit a temporal drift in programmed resistance (and in threshold switching voltage) which appears to be a fundamental and universal property of the amorphous or partially amorphous phase. Phase-change device models should therefore include these drift effects in a realistic way so that circuit and systems designers can assess the likely performance of multilevel phase-change memories in a variety of potential applications. In this paper, therefore, we present a comprehensive SPICE-based model for phase-change devices that includes the capability for programming into multiple resistance levels, the prediction of the drift of cell resistance (and threshold voltage) with time, and the capability for modeling the randomness inherent to the resistance drift phenomenon. Simulations of multilevel programming and drift phenomena using the model are presented and compared to experimental results, with which there is very good agreement

    FPGA implementation of a LSTM Neural Network

    Get PDF
    Este trabalho pretende fazer uma implementação customizada, em Hardware, duma Rede Neuronal Long Short-Term Memory. O modelo python, assim como a descrição Verilog, e síntese RTL, encontram-se terminadas. Falta apenas fazer o benchmarking e a integração de um sistema de aprendizagem

    A null convention logic based platform for high speed low energy IP packet forwarding

    Get PDF
    By 2020, it is predicted that there will be over 5 billion people and 38.5 billion Internet-ofThings devices on the Internet. The data generated by all these users and devices will have to be transported quickly and efficiently. Routers forming the backbone of this Internet already support multiple 100 Gbps ports meaning that they would have to perform upwards of 200 Million destination addresses lookups per second in the packet forwarding block that lies in the router ‘data-path’. At the same time, there is also a huge demand to make the network infrastructure more energy efficient. The work presented in this thesis is motivated by the observation that traditional synchronous digital systems will have increasing difficulty keeping up with these conflicting demands. Further, with reducing device geometries, extremes in “process, voltage and temperature” (PVT) variability will undermine reliable synchronous operation. It is expected that asynchronous design techniques will be able to overcome many of these problems and offer a means of lowering energy while maintaining high throughput and low latency. This thesis investigates existing address lookup algorithms and investigates the possibility of combining various approaches to improve energy efficiency without affecting lookup performance. A quasi delay-insensitive asynchronous methodology - Null Convention Logic (NCL) - is then applied to this combined design. Techniques that take advantage of the characteristics of the design methodology and the lookup algorithm to further improve the area, energy and latency characteristics are also analysed. The IP address lookup scheme utilised here is a recent algorithmic approach that uses compact binary-tries and was selected for its high memory efficiency and throughput. The design is pipelined, and the prefix information is stored in large RAMs. A Boolean synchronous implementation of the algorithm is simulated to provide an initial performance benchmark. It is observed that during the address lookup process nearly 68% of the trie accesses are to nodes that contained no prefix information. Bloom filter structures that use non-cryptographic hashes and single-bit memory are introduced into the address lookup process to prevent these unnecessary accesses, thereby reducing the energy consumption. Three non-cryptographic hashing algorithms (CRC32, Jenkins and Murmur) are also analysed for their suitability in Bloom filters, and the CRC32 is found to offer the most suitable trade-off between complexity and performance. As a first step to applying the NCL design methodology, NCL implementations of the hashing algorithms are created and evaluated. A significant finding from these experiments is that, unlike Boolean systems, latency and throughput in NCL systems are only loosely coupled. An example Jenkins hash implementation with eight pipeline stages and a cycle time of 3.2 ns exhibits a total latency of 6 ns, whereas an equivalent synchronous implementation with a similar clock period exhibits a latency of 25.6 ns. Further investigations reveal that completion detection circuits within the NCL pipelines impair throughput significantly. Two enhancements to the NCL circuit library aimed particularly at optimising NCL completion detection are proposed and analysed. These are shown to enable completion detection circuits to be built with the same delay but with 30% smaller area and about 75% lower peak current compared to the conventional approach using gates from the standard NCL library. An NCL SRAM structure is also proposed to augment the conventional 6-T cell array with circuits to generate the handshaking signals for managing the NCL data flow. Additionally, a dedicated column of cells called the Null-storage column is added, which indicates if a particular address in the RAM stores no Data, i.e., it is in its Null state. This additional hardware imposes a small area overhead of about 10% but allows accesses to Null locations to be completed in 50% less time and consume 40% less energy than accesses to valid Data locations. An experimental NCL-based address lookup system is then designed that includes all of the developed NCL modules. Statistical delay models derived from circuit-level simulations of individual modules are used to emulate realistic circuit delay variability in the behavioural modules written in Verilog. Simulations of the assembled system demonstrate that unlike what was observed with the synchronous design, with NCL, the design that does not employ Bloom filters, but only the Null-storage column RAMs for prefix storage, exhibits the smallest area on the chip and also consumes the least energy per address lookup. It is concluded that to derive maximum benefit out of an asynchronous design approach; it is necessary to carefully select the architectural blocks that combine the peculiarities of the implemented algorithm with the capabilities of the NCL design methodology

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Get PDF
    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p
    corecore