
    Artificial neural networks acceleration on field-programmable gate arrays considering model redundancy

    Artificial Neural Networks (ANNs) have developed dramatically over the last ten years and have been successfully applied in many important areas. A natural follow-up topic is to deploy ANNs to a wider range of hardware platforms. However, modern ANN models may target millisecond- or even nanosecond-level latency for processing each input, while commonly requiring millions of operations and gigabyte-scale data access per input. This intrinsically high computational complexity poses hardware challenges for system implementation. Meanwhile, the integration of computing resources on hardware platforms is hampered by the slowing of Moore's Law. It is therefore important to study new design methods for ANN hardware systems that deliver high model accuracy with low resource usage. The Field-Programmable Gate Array (FPGA) is a natural fit for this topic due to its reconfigurability and flexibility: it allows customised data paths and data representations to be implemented in hardware, which makes it the primary platform in this research. The main topics discussed in this thesis are neural network redundancy and its impact on hardware systems. The main goal is to reduce hardware complexity by reducing neural network redundancy while maintaining accuracy. To achieve this, redundancy is first categorised into two types: model-level and data-level. Each type is then studied in isolation before both are combined in a single system design. First, to study model-level redundancy, an algorithm called dropout is implemented as a way to reduce model-level redundancy during training and is used here to reduce hardware cost. Our proposed system achieves a 50% reduction in DSP usage and 33% to 47% lower on-chip memory usage compared to conventional implementations. Second, in terms of data-level redundancy, we study how data precision affects hardware cost and system throughput. Our experiments show that reduced-precision data incur negligible or even no accuracy loss relative to full-precision data on the tested benchmarks. In particular, 4-bit fixed point offers a good trade-off between model accuracy and hardware cost compared to the other tested data representations. Third, we study the combined effect of reducing both model- and data-level redundancy and propose an FPGA accelerator design for a Redundancy-Reduced (RR-) MobileNet [Hea17]. Our proposed RR-MobileNet system achieves a state-of-the-art latency of 7.85 ms for single-image ImageNet inference. Finally, a design guideline is proposed as step-by-step guidance for redundancy-reduced neural network system design.
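    As background for readers unfamiliar with the dropout mechanism that the first contribution builds on, the following is a minimal, generic sketch of inverted dropout applied to a hidden layer (Python with NumPy); the layer size and drop probability are hypothetical, and this illustrates the algorithm itself rather than the thesis's FPGA mapping of it.

        import numpy as np

        def dropout(activations, drop_prob=0.5, rng=None, training=True):
            """Standard inverted dropout: randomly zero a fraction of activations
            during training and rescale survivors so the expected value is preserved."""
            if not training or drop_prob == 0.0:
                return activations
            rng = rng or np.random.default_rng()
            mask = rng.random(activations.shape) >= drop_prob
            return activations * mask / (1.0 - drop_prob)

        rng = np.random.default_rng(0)
        h = rng.normal(size=(4, 8))          # hypothetical hidden-layer activations
        h_drop = dropout(h, drop_prob=0.5, rng=rng)
        print("fraction of activations zeroed:", (h_drop == 0).mean())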

    Algorithm and Architecture Co-design for High-performance Digital Signal Processing.

    CMOS scaling has been the driving force behind the revolution of digital signal processing (DSP) systems, but scaling is slowing down and the CMOS device is approaching its fundamental scaling limit. At the same time, DSP algorithms are continuing to evolve, so there is a growing gap between the increasing complexities of the algorithms and what is practically implementable. The gap can be bridged by exploring the synergy between algorithm and hardware design, using the so-called co-design techniques. In this thesis, algorithm and architecture co-design techniques are applied to X-ray computed tomography (CT) image reconstruction. Analysis of fixed-point quantization and CT geometry identifies an optimal word length and a mismatch between the object and projection grids. A water-filling buffer is designed to resolve the grid mismatch, and is combined with parallel fixed-point arithmetic units to improve the throughput. The analysis eventually leads to an out-of-order scheduling architecture that reduces the off-chip memory access by three orders of magnitude. The co-design techniques are further applied to the design of neural networks for sparse coding. Analysis of the neuron spiking dynamics leads to the optimal tuning of network size, spiking rate, and update step size to keep the spiking sparse. The resulting sparsity enables a bus-ring architecture to achieve both high throughput and scalability. A 65nm CMOS chip implementing the architecture demonstrates feature extraction at a throughput of 1.24G pixel/s at 1.0V and 310MHz. The error tolerance of sparse coding can be exploited to enhance the energy efficiency. As a natural next step after the sparse coding chip, a neural-inspired inference module (IM) is designed for object recognition. The object recognition chip consists of an IM based on sparse coding and an event-driven classifier. A learning co-processor is integrated on chip to enable on-chip learning. The throughput and energy efficiency are further improved using architectural techniques including sub-dividing the IM and classifier into modules and optimal pipelining. The result is a 65nm CMOS chip that performs sparse coding at 10.16G pixel/s at 1.0V and 635MHz. The co-design techniques can be applied to the design of other advanced DSP algorithms for emerging applications.
    PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113344/1/jungkook_1.pd
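    To illustrate the kind of word-length analysis the abstract refers to, the sketch below sweeps the number of fractional bits in a fixed-point accumulation and reports the resulting signal-to-noise ratio; the data, bit-widths, and accumulation are hypothetical stand-ins rather than the CT back-projection studied in the thesis.

        import numpy as np

        def quantise(x, frac_bits):
            """Round x to a fixed-point grid with `frac_bits` fractional bits."""
            scale = 2.0 ** frac_bits
            return np.round(x * scale) / scale

        # Hypothetical projection data; a real study would use CT sinograms.
        rng = np.random.default_rng(1)
        proj = rng.uniform(-1.0, 1.0, size=100_000)
        coeff = rng.uniform(0.0, 1.0, size=100_000)

        ref = np.dot(coeff, proj)  # full-precision accumulation (back-projection-like)
        for frac_bits in (6, 8, 10, 12, 14, 16):
            approx = np.dot(quantise(coeff, frac_bits), quantise(proj, frac_bits))
            snr_db = 20 * np.log10(abs(ref) / (abs(ref - approx) + 1e-12))
            print(f"{frac_bits:2d} fractional bits -> {snr_db:5.1f} dB")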

    Deep Machine Learning with Spatio-Temporal Inference

    Deep Machine Learning (DML) refers to methods that utilize hierarchies of more than one or two layers of computational elements to achieve learning. DML may draw upon biomimetic models, or may be simply biologically inspired. Regardless, these architectures seek to employ hierarchical processing as a means of mimicking the ability of the human brain to process a myriad of sensory data and make meaningful decisions based on this data. In this dissertation we present a novel DML architecture which is biologically inspired in that (1) all processing is performed hierarchically; (2) all processing units are identical; and (3) processing captures both spatial and temporal dependencies in the observations to organize and extract features suitable for supervised learning. We call this architecture the Deep Spatio-Temporal Inference Network (DeSTIN). In this framework, patterns observed in pixel data at the lowest layer of the hierarchy are organized and fit to generalizations using decomposition algorithms. Subsequent spatial layers draw upon previous layers, their own temporal observations and beliefs, and the observations and beliefs of parent nodes to extract features suitable for supervised learning using standard classifiers such as feedforward neural networks. Hence, DeSTIN is viewed as an unsupervised feature extraction scheme in the sense that, rather than relying on human engineering to determine features for a particular problem, DeSTIN naturally constructs features of interest by representing salient regularities in the patterns observed. A detailed discussion and analysis of the DeSTIN framework is provided, with a focus on its key components of generalization through online clustering and temporal inference. We present a variety of implementation details, including static and dynamic learning formulations and function approximation methods. Results on standardized datasets of handwritten digits as well as face and optic nerve detection are presented, illustrating the efficacy of the proposed approach.
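    DeSTIN's per-node generalization relies on online clustering. The toy node below performs winner-take-all online clustering of input patches and returns a belief vector over centroids, in the spirit of that step; the centroid count, learning rate, and input size are illustrative choices, not the dissertation's actual formulation.

        import numpy as np

        class OnlineClusterNode:
            """Toy online-clustering node in the spirit of DeSTIN's per-node
            generalisation step; centroid count and learning rate are illustrative."""

            def __init__(self, n_centroids=4, dim=16, lr=0.05, seed=0):
                rng = np.random.default_rng(seed)
                self.centroids = rng.normal(size=(n_centroids, dim))
                self.lr = lr

            def observe(self, x):
                """Move the closest centroid toward x and return a belief vector
                (soft assignment over centroids) that a parent layer could consume."""
                d = np.linalg.norm(self.centroids - x, axis=1)
                winner = int(np.argmin(d))
                self.centroids[winner] += self.lr * (x - self.centroids[winner])
                belief = np.exp(-d)
                return belief / belief.sum()

        # Feed random 4x4 patches to one node; a real hierarchy stacks many such nodes.
        node = OnlineClusterNode()
        rng = np.random.default_rng(1)
        for _ in range(1000):
            belief = node.observe(rng.normal(size=16))
        print("final belief over centroids:", np.round(belief, 3))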

    Performance of the fixed-point autoencoder

    The autoencoder is one of the most typical deep learning models and has mainly been used for unsupervised feature learning in many applications such as recognition, identification, and mining. Autoencoder algorithms are compute-intensive tasks. Building a large-scale autoencoder model can satisfy the analysis requirements of huge volumes of data, but the training time sometimes becomes unbearable, which naturally leads to investigating hardware acceleration platforms such as FPGAs. Software versions of the autoencoder often use single-precision or double-precision floating point, but floating-point units are very expensive to implement on FPGAs. Fixed-point arithmetic is therefore often used when implementing the autoencoder in hardware, yet the resulting accuracy loss is often ignored and its implications have not been studied in previous works; only a few works have focused on accelerators using fixed bit-widths for other neural network models. Our work gives a comprehensive evaluation of the implications of fixed-point precision for the autoencoder, aiming for the best performance and area efficiency. Data format conversion, matrix blocking methods, and the approximation of complex functions are the main factors considered, in accordance with the hardware implementation. The simulation method for data conversion, matrix blocking with different degrees of parallelism, and a simple PLA approximation method are evaluated in this paper. The results show that the fixed-point bit-width does affect the performance of the autoencoder, and multiple factors may interact: each factor can have a two-sided impact, discarding both "abundant" information and "useful" information at the same time. The representation domain must be carefully selected according to the computation parallelism. The results also show that using fixed-point arithmetic can preserve the precision of the autoencoder algorithm and achieve acceptable convergence speed.
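    One of the factors evaluated is the approximation of complex functions in fixed-point hardware. Assuming the function in question is a sigmoid activation, the sketch below compares a coarse piecewise-linear approximation against the exact function; it is an illustrative stand-in and not necessarily the PLA scheme used in the paper.

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def pwl_sigmoid(x, breakpoints=np.linspace(-6, 6, 9)):
            """Piecewise-linear sigmoid: linear interpolation between a few
            exact samples, which is cheap to realise with fixed-point hardware."""
            return np.interp(x, breakpoints, sigmoid(breakpoints))

        x = np.linspace(-8, 8, 1000)
        err = np.abs(sigmoid(x) - pwl_sigmoid(x))
        print("max approximation error:", err.max())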

    Spintronics-based Architectures for non-von Neumann Computing

    The scaling of transistor technology in the last few decades has significantly impacted our lives. It has given birth to different kinds of computational workloads which are becoming increasingly relevant. Some of the most prominent examples are Machine Learning tasks such as image classification and pattern recognition, which use Deep Neural Networks that are highly computation- and memory-intensive. The traditional, general-purpose architectures that we use today typically exhibit high energy and latency on such computations. This, and the apparent end of Moore's law of scaling, has led researchers to look for devices beyond CMOS and for non-conventional computational paradigms. In this dissertation, we focus on a spintronic device, the Magnetic Tunnel Junction (MTJ), which has demonstrated potential as cache and embedded memory. We look into how the MTJ can be used beyond memory and deployed in various non-conventional and non-von Neumann architectures for accelerating computations or making them more energy efficient. First, we investigate Stochastic Computing (SC) and show how MTJs can be used to build energy-efficient Neural Network (NN) hardware in this domain. SC is primarily bit-serial computing that requires only simple logic gates for arithmetic operations. We explore the use of MTJs as Stochastic Number Generators (SNGs) by exploiting their probabilistic switching characteristics and propose an energy-efficient MTJ-SNG. It is deployed as part of NN hardware implemented in the SC domain. Its characteristics allow further energy savings through NN weight approximation, for which we formulate an optimization problem. Next, we turn our attention to analog computing and propose a method for training analog Neural Network hardware. We consider a resistive MTJ crossbar architecture for representing an NN layer, since it is capable of in-memory computing and performs matrix-vector multiplications with O(1) time complexity. We propose on-chip training of the NN crossbar since, first, it can leverage the parallelism in the crossbar to perform weight updates; second, it takes device variations into account; and third, it avoids the large sneak currents in transistor-less crossbars that can cause undesired weight changes. Lastly, we propose an MTJ-based non-von Neumann hardware platform for solving combinatorial optimization problems, which are NP-hard. We adopt the Ising model for encoding such problems and solve them with simulated annealing. We let MTJs represent Ising units, design a scalable circuit capable of performing Ising computations, and develop a reconfigurable architecture to which any NP-hard problem can be mapped. We also suggest methods to account for the non-idealities present in the proposed hardware.
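    The last contribution maps optimization problems onto the Ising model and solves them with simulated annealing. A minimal software sketch of that procedure is given below for a toy antiferromagnetic ring; the cooling schedule and instance are hypothetical, and real hardware would realise the random flips with MTJ-based units rather than a pseudo-random generator.

        import numpy as np

        def simulated_annealing_ising(J, h, steps=20_000, t_start=4.0, t_end=0.05, seed=0):
            """Minimise the Ising energy E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i
            with single-spin-flip simulated annealing; temperatures are illustrative."""
            rng = np.random.default_rng(seed)
            n = len(h)
            s = rng.choice([-1, 1], size=n)
            for step in range(steps):
                t = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
                i = rng.integers(n)
                # Energy change if spin i were flipped (assumes J symmetric, J_ii = 0).
                delta_e = 2 * s[i] * (J[i] @ s - J[i, i] * s[i] + h[i])
                if delta_e <= 0 or rng.random() < np.exp(-delta_e / t):
                    s[i] = -s[i]
            return s

        # Toy instance: antiferromagnetic ring of 8 spins, no external field;
        # the ground state has alternating spins.
        n = 8
        J = np.zeros((n, n))
        for i in range(n):
            J[i, (i + 1) % n] = J[(i + 1) % n, i] = -1.0
        print(simulated_annealing_ising(J, np.zeros(n)))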

    Towards Lightweight AI: Leveraging Stochasticity, Quantization, and Tensorization for Forecasting

    The deep neural network is an intriguing prognostic model capable of learning meaningful patterns that generalize to new data. The deep learning paradigm has been widely adopted across many domains, including natural language processing, genomics, and automatic music transcription. However, deep neural networks rely on a plethora of underlying computational units and data, collectively demanding a wealth of compute and memory resources for practical tasks. This model complexity prohibits the use of larger deep neural networks for resource-critical applications, such as edge computing. In order to reduce model complexity, several research groups are actively studying compression methods, hardware accelerators, and alternative computing paradigms. These orthogonal research explorations often leave a gap in understanding the interplay of the optimization mechanisms and their overall feasibility for a given task. In this thesis, we address this gap by developing a holistic solution to assess model complexity reduction, theoretically and quantitatively, at both high and low levels of abstraction for training and inference. At the algorithmic level, a novel deep yet lightweight recurrent architecture is proposed that extends the conventional echo state network. The architecture employs random dynamics, brain-inspired plasticity mechanisms, tensor decomposition, and hierarchy as the key features to enrich learning, and its hyperparameters are optimized via a particle swarm optimization algorithm. To deploy these networks efficiently onto low-end edge devices, both ultra-low and mixed-precision numerical formats are studied within our feedforward deep neural network hardware accelerator. More importantly, the tapered-precision posit format with a novel exact-dot-product algorithm is employed in the low-level digital architectures to study its efficacy in resource utilization. The dynamics of the architecture are characterized through neuronal partitioning and Lyapunov stability, and we show that superlative networks emerge beyond the edge of chaos with an agglomeration of weak learners. We also demonstrate that tensorization improves model performance by preserving correlations present in multi-way structures. Low-precision posits are found to consistently outperform other formats on various image classification tasks and, in conjunction with compression, we achieve orders-of-magnitude speedups and memory savings for both training and inference on chaotic time-series forecasting and polyphonic music tasks. This culmination of methods greatly improves the feasibility of deploying rich predictive models on edge devices.
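    The proposed architecture extends the echo state network. As background, the sketch below is a minimal, generic echo state network with a fixed random reservoir and a ridge-regression readout for one-step-ahead forecasting; the reservoir size, spectral radius, and toy sine-wave series are assumptions, and none of the thesis's plasticity, tensorization, or posit-arithmetic features are included.

        import numpy as np

        def esn_forecast(series, n_reservoir=200, spectral_radius=0.9,
                         ridge=1e-6, washout=100, seed=0):
            """Minimal echo state network: fixed random reservoir, ridge-regression
            readout trained for one-step-ahead prediction. Sizes are illustrative."""
            rng = np.random.default_rng(seed)
            w_in = rng.uniform(-0.5, 0.5, size=n_reservoir)
            w = rng.normal(size=(n_reservoir, n_reservoir))
            w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))  # echo state scaling

            states = np.zeros((len(series), n_reservoir))
            x = np.zeros(n_reservoir)
            for t, u in enumerate(series):
                x = np.tanh(w_in * u + w @ x)
                states[t] = x

            # Train readout on (state_t -> series_{t+1}), skipping the washout period.
            X, y = states[washout:-1], series[washout + 1:]
            w_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_reservoir), X.T @ y)
            return states[-1] @ w_out  # one-step-ahead forecast

        t = np.arange(2000)
        series = np.sin(0.07 * t)                     # toy stand-in for a chaotic series
        print("next-step prediction:", esn_forecast(series))
        print("ground truth:        ", np.sin(0.07 * 2000))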

    Algorithms for massively parallel, event-based hardware


    Energy and Area Efficient Machine Learning Architectures using Spin-Based Neurons

    Recently, spintronic devices with low-energy-barrier nanomagnets, such as spin orbit torque Magnetic Tunnel Junctions (SOT-MTJs) and embedded magnetoresistive random access memory (MRAM) devices, have been leveraged as natural building blocks to provide probabilistic sigmoidal activation functions for RBMs. In this dissertation research, we use the Probabilistic Inference Network Simulator (PIN-Sim) to realize a circuit-level implementation of deep belief networks (DBNs) using memristive crossbars as weighted connections and embedded MRAM-based neurons as activation functions. A probabilistic interpolation recoder (PIR) circuit is developed for DBNs with probabilistic spin logic (p-bit)-based neurons to interpolate the probabilistic outputs of the neurons in the last hidden layer, which represent the different output classes. Moreover, the impact of reducing the Magnetic Tunnel Junction's (MTJ's) energy barrier is assessed and optimized for the resulting stochasticity present in the learning system. In p-bit-based DBNs, defects such as variation in nanomagnet thickness can undermine functionality by decreasing the fluctuation speed of the p-bit realized with a nanomagnet. A method is developed and refined to control the fluctuation frequency of a p-bit device's output by employing a feedback mechanism, which alleviates the process-variation sensitivity of p-bit-based DBNs. This compact, low-complexity method, presented as a self-compensating circuit, can mitigate the influence of process variation in fabrication and practical implementation. Furthermore, this research presents an image recognition technique for the MNIST dataset based on p-bit-based DBNs and TSK rule-based fuzzy systems. The proposed DBN-fuzzy system benefits from the low energy and area consumption of p-bit-based DBNs and the high accuracy of TSK rule-based fuzzy systems: it first obtains the top candidates through the p-bit-based DBN, and the fuzzy system is then employed to select the top-1 recognition result from those candidates. Simulation results show that the DBN-Fuzzy neural network not only has lower energy and area consumption than larger DBN topologies but also achieves higher accuracy.
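    The neurons in this work are p-bits whose binary output fluctuates with an input-controlled probability. The sketch below is a behavioural software model of such a neuron, using the m = sgn(tanh(I) - r) form that is common in the p-bit literature; it is an illustration only and does not model the MTJ device physics or the PIN-Sim circuit implementation.

        import numpy as np

        def p_bit(inputs, rng):
            """Behavioural p-bit model: a binary output that fluctuates randomly,
            with its mean set by tanh of the input current
            (the m = sgn(tanh(I) - r) form, r drawn uniformly from [-1, 1])."""
            r = rng.uniform(-1.0, 1.0, size=np.shape(inputs))
            return np.sign(np.tanh(inputs) - r)

        rng = np.random.default_rng(0)
        for current in (-2.0, 0.0, 2.0):
            samples = p_bit(np.full(10_000, current), rng)
            # The empirical average output should track tanh(current).
            print(f"I = {current:+.1f}: mean output {samples.mean():+.3f}, tanh {np.tanh(current):+.3f}")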