116 research outputs found

    Efficient machine learning: models and accelerations

    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers, such as deep neural networks, and contain millions to hundreds of millions of parameters (i.e., weights). Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to significant improvements in overall accuracy. On the other hand, the layered deep structure and large model sizes also increase computational and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest: acceleration and model compression. The underlying goal of both trends is to preserve model quality so that predictions remain accurate. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems. To explore these two domains, this thesis first presents the cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent confabulation based text recognition models. The exploration and optimization of the cogent confabulation based models have been conducted through various comparisons, and the optimized network offers better accuracy for sentence completion. To accelerate the sentence completion problem on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve the model performance. As deep neural networks have proven effective in many real-life applications and are increasingly deployed on low-power devices, we then investigate the acceleration of neural network inference using a hardware-friendly computing paradigm, stochastic computing. It is an approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. The synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption. Modern neural networks usually involve a huge number of parameters that cannot fit into embedded devices, so the compression of deep learning models, together with their acceleration, attracts our attention. We introduce structured-matrix-based neural networks to address this problem. The circulant matrix is one such structured matrix: the entire matrix can be represented by a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, called the block-circulant matrix. It partitions a matrix into several smaller blocks and makes each submatrix circulant, so the compression ratio is controllable.
With the help of Fourier-transform-based equivalent computation, the inference of the deep neural network can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant-matrix-based neural networks to obtain high accuracy after compression.
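
    As a rough illustration of why the block-circulant structure both compresses and accelerates a layer, the NumPy sketch below multiplies a block-circulant weight matrix by a vector using the FFT identity for circular convolution. The block size, layer dimensions, and function names are illustrative assumptions; the sketch does not reproduce the thesis's FPGA implementation or training procedure.

        # Block-circulant matrix-vector product via FFT (illustrative sketch).
        import numpy as np

        def circulant_matvec(c, x):
            # y = C @ x, where C is the circulant matrix whose first column is c;
            # the product is a circular convolution, computed with FFTs.
            return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

        def block_circulant_matvec(block_vectors, x, k):
            # block_vectors[i][j] is the defining vector of the (i, j) circulant block.
            p, q = len(block_vectors), len(block_vectors[0])
            x_blocks = x.reshape(q, k)
            y = np.zeros((p, k))
            for i in range(p):
                for j in range(q):
                    y[i] += circulant_matvec(block_vectors[i][j], x_blocks[j])
            return y.reshape(-1)

        # Hypothetical 512x1024 layer stored as 8x16 circulant blocks of size 64:
        # 8 * 16 * 64 = 8,192 parameters instead of 512 * 1024 = 524,288.
        rng = np.random.default_rng(0)
        k, p, q = 64, 8, 16
        blocks = [[rng.standard_normal(k) for _ in range(q)] for _ in range(p)]
        x = rng.standard_normal(q * k)
        print(block_circulant_matvec(blocks, x, k).shape)   # (512,)

    Storing only the defining vector of each k-by-k block reduces the parameter count by a factor of k, and the FFT turns each block product from O(k^2) into O(k log k) work, which is the source of the energy-efficient acceleration described above.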

    Hardware-aware neural networks with a look-up table decomposition algorithm

    Neural networks have been successfully implemented on various mobile hardware platforms. However, the high computational cost of neural networks, the mismatch between their computational precision and that of the hardware, and the low structural similarity between the two are often obstacles to be overcome in research. This thesis presents the knowledge and research development on the cross-application of neural networks and logic circuits, including an experimental procedure for implementing multiple-look-up-table logic, an approximate decomposition method for decomposing larger look-up tables into smaller ones, and a hardware-aware structured neural network. In summary, the cost of implementing bidirectional interaction between neural networks and logic circuits can be effectively reduced by benefiting from the logic-learning capability of neural networks, the decomposition method for large look-up tables, and the hardware-aware structure of neural networks. (The University of Kitakyushu)
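
    The thesis's approximate decomposition method is not spelled out in this abstract; as a generic point of reference, the sketch below shows the classical (exact) Shannon expansion, which splits a large look-up table into two half-sized tables selected by one input variable. The majority-function example and helper names are illustrative assumptions only.

        # Generic (exact) Shannon expansion of a look-up table, shown only to make
        # "decompose a large LUT into smaller LUTs" concrete; the thesis proposes an
        # approximate decomposition, which is not reproduced here.

        def shannon_decompose(lut, n_inputs):
            # Split a 2**n-entry truth table on its first (most significant) input.
            half = 1 << (n_inputs - 1)
            return lut[:half], lut[half:]          # cofactors for x0 = 0 and x0 = 1

        def evaluate(lut, bits):
            # Look up the output for an input bit tuple (most significant bit first).
            index = 0
            for b in bits:
                index = (index << 1) | b
            return lut[index]

        # 3-input majority function as an 8-entry LUT indexed by (x0, x1, x2).
        maj3 = [0, 0, 0, 1, 0, 1, 1, 1]
        f0, f1 = shannon_decompose(maj3, 3)        # two 4-entry LUTs over (x1, x2)

        for x0 in (0, 1):
            for x1 in (0, 1):
                for x2 in (0, 1):
                    assert evaluate(f1 if x0 else f0, (x1, x2)) == evaluate(maj3, (x0, x1, x2))
        print("two 4-entry LUTs plus a multiplexer reproduce the 8-entry LUT")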

    Stochastic dividers for low latency neural networks

    Due to the low complexity of its arithmetic unit designs, stochastic computing (SC) has attracted considerable interest for implementing Artificial Neural Networks (ANNs) in resource-limited applications, because ANNs must usually perform a large number of arithmetic operations. To attain high computation accuracy in an SC-based ANN, extended stochastic logic is utilized together with standard SC units, and thus a stochastic divider is required to perform the conversion between these logic representations. However, the conventional divider incurs a large computation latency, which limits SC implementations of ANNs in applications needing high performance. Therefore, there is a need to design fast stochastic dividers for SC-based ANNs. Recent works (e.g., a binary searching and triple modular redundancy (BS-TMR) based stochastic divider) target a reduction in computation latency while keeping the same accuracy as the traditional design. However, this divider still requires N iterations to deal with 2^N-bit stochastic sequences, so the latency increases in proportion to the sequence length. In this paper, a decimal searching and TMR (DS-TMR) based stochastic divider is initially proposed to further reduce the computation latency; it requires only two iterations to calculate the quotient, regardless of the sequence length. Moreover, a trade-off design between accuracy and hardware is also presented. An SC-based Multi-Layer Perceptron (MLP) is then considered to show the effectiveness of the proposed dividers over current designs. Results show that when utilizing the proposed dividers, the MLP achieves the lowest computation latency while keeping the same classification accuracy; although they incur an area increase, the overhead due to the proposed dividers is low over the entire MLP. When the product of implementation area, latency, power, and number of clock cycles is used as a combined metric for both hardware design and computation complexity, the proposed designs are also shown to be superior to SC-based MLPs (at the same level of accuracy) employing other dividers found in the technical literature, as well as to the commonly used 32-bit floating-point implementation. The work of Shanshan Liu, Farzad Niknia, and Fabrizio Lombardi was supported by the NSF under Grant CCF-1953961 and Grant 1812467. The work of Pedro Reviriego was supported in part by the Spanish Ministry of Science and Innovation under project ACHILLES (Grant PID2019-104207RB-I00) and the Go2Edge Network (Grant RED2018-102585-T), and in part by the Madrid Community Research Agency under Grant TAPIR-CM P2018/TCS-4496. The work of Weiqiang Liu was supported by the NSFC under Grant 62022041 and Grant 61871216. The work of Ahmed Louri was supported by the NSF under Grant CCF-1812495 and Grant 1953980.
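
    To make the latency argument concrete, the following behavioral sketch models the binary-searching principle behind dividers such as BS-TMR: each iteration resolves one bit of the quotient, so a 2^N-bit sequence needs N iterations. The unipolar encoding, stream length, and input values are illustrative assumptions; this is a software model of the search idea only, not the gate-level BS-TMR or DS-TMR designs.

        # Behavioral sketch of a binary-search stochastic divider (unipolar encoding).
        import numpy as np

        rng = np.random.default_rng(1)

        def to_stream(p, length):
            # Unipolar stochastic stream: each bit is 1 with probability p.
            return rng.random(length) < p

        def binary_search_divide(dividend_stream, divisor_stream, n_iters):
            # Estimate q = p_dividend / p_divisor by searching for the q whose
            # stochastic product with the divisor matches the dividend.
            length = len(dividend_stream)
            lo, hi = 0.0, 1.0
            for _ in range(n_iters):                            # one quotient bit per step
                q = (lo + hi) / 2
                product = to_stream(q, length) & divisor_stream # SC multiply = AND gate
                if product.sum() < dividend_stream.sum():
                    lo = q                                      # quotient guess too small
                else:
                    hi = q                                      # quotient guess too large
            return (lo + hi) / 2

        N = 10                                                  # 2**10 = 1024-bit streams
        a, b = 0.3, 0.8                                         # expected quotient 0.375
        q_hat = binary_search_divide(to_stream(a, 2**N), to_stream(b, 2**N), N)
        print(f"estimated {q_hat:.3f}, exact {a / b:.3f}")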

    Simulation and implementation of novel deep learning hardware architectures for resource constrained devices

    Corey Lammie designed mixed-signal memristive-complementary metal-oxide-semiconductor (CMOS) and field-programmable gate array (FPGA) hardware architectures, which were used to reduce the power and resource requirements of Deep Learning (DL) systems, both during inference and training. Disruptive design methodologies, such as those explored in this thesis, can be used to facilitate the design of next-generation DL systems.

    Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

    Neural networks (NNs) are growing in importance and complexity. A neural network's performance (and energy efficiency) can be bound either by computation or memory resources. The processing-in-memory (PIM) paradigm, where computation is placed near or within memory arrays, is a viable solution to accelerate memory-bound NNs. However, PIM architectures vary in form, where different PIM approaches lead to different trade-offs. Our goal is to analyze, discuss, and contrast DRAM-based PIM architectures for NN performance and energy efficiency. To do so, we analyze three state-of-the-art PIM architectures: (1) UPMEM, which integrates processors and DRAM arrays into a single 2D chip; (2) Mensa, a 3D-stack-based PIM architecture tailored for edge devices; and (3) SIMDRAM, which uses the analog principles of DRAM to execute bit-serial operations. Our analysis reveals that PIM greatly benefits memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when the GPU requires memory oversubscription for a general matrix-vector multiplication kernel; (2) Mensa improves energy efficiency and throughput by 3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3) SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs. We conclude that the ideal PIM architecture for NN models depends on a model's distinct attributes, due to the inherent architectural design choices. Comment: This is an extended and updated version of a paper published in IEEE Micro, pp. 1-14, 29 Aug. 2022. arXiv admin note: text overlap with arXiv:2109.1432
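
    A simple roofline-style estimate helps explain why kernels such as general matrix-vector multiplication are memory-bound and therefore benefit from PIM. The peak-compute and bandwidth figures in the sketch below are assumed values chosen for illustration, not numbers taken from the paper.

        # Back-of-the-envelope roofline check for a GEMV kernel (illustrative only).

        def gemv_arithmetic_intensity(m, n, bytes_per_elem=4):
            flops = 2 * m * n                         # one multiply + one add per element
            # The matrix is streamed once; vector and result traffic are small in comparison.
            bytes_moved = (m * n + m + n) * bytes_per_elem
            return flops / bytes_moved

        peak_flops = 10e12        # assumed 10 TFLOP/s compute peak
        mem_bw = 400e9            # assumed 400 GB/s DRAM bandwidth
        machine_balance = peak_flops / mem_bw         # FLOPs per byte needed to be compute-bound

        ai = gemv_arithmetic_intensity(4096, 4096)
        print(f"GEMV arithmetic intensity: {ai:.2f} FLOP/byte")
        print(f"Machine balance:           {machine_balance:.1f} FLOP/byte")
        print("memory-bound" if ai < machine_balance else "compute-bound")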

    CUTIE: Beyond PetaOp/s/W Ternary DNN Inference Acceleration with Better-than-Binary Energy Efficiency

    We present a 3.1 POp/s/W fully digital hardware accelerator for ternary neural networks. CUTIE, the Completely Unrolled Ternary Inference Engine, focuses on minimizing non-computational energy and switching activity so that dynamic power spent on storing (locally or globally) intermediate results is minimized. This is achieved by 1) a data path architecture completely unrolled in the feature map and filter dimensions to reduce switching activity by favoring silencing over iterative computation and maximizing data re-use, 2) targeting ternary neural networks which, in contrast to binary NNs, allow for sparse weights which reduce switching activity, and 3) introducing an optimized training method for higher sparsity of the filter weights, resulting in a further reduction of the switching activity. Compared with state-of-the-art accelerators, CUTIE achieves greater or equal accuracy while decreasing the overall core inference energy cost by 4.8x-21x.
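
    The sketch below illustrates the ternary-weight sparsity that enables this silencing: zero-valued weights correspond to multiplications that never toggle. The threshold rule used here is the generic ternary-weight-network heuristic, and the tensor shape is hypothetical; neither reflects the optimized training method proposed in the paper.

        # Ternary quantization and the "silencing" effect (illustrative sketch).
        import numpy as np

        def ternarize(w):
            # Map weights to {-1, 0, +1}; weights below the threshold become 0.
            t = 0.7 * np.mean(np.abs(w))           # common generic threshold choice
            q = np.zeros_like(w, dtype=np.int8)
            q[w > t] = 1
            q[w < -t] = -1
            return q

        rng = np.random.default_rng(0)
        w = rng.standard_normal((128, 128))        # hypothetical filter weights
        q = ternarize(w)
        zeros = int(np.sum(q == 0))
        print(f"{zeros} of {q.size} weights are zero "
              f"({100 * zeros / q.size:.1f}% of multiplications silenced)")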

    Towards the implementation of distributed systems in synthetic biology

    The design and construction of engineered biological systems has made great strides over the last few decades and a growing part of this is the application of mathematical and computational techniques to problems in synthetic biology. The use of distributed systems, in which an overall function is divided across multiple populations of cells, has the potential to increase the complexity of the systems we can build and overcome metabolic limitations. However, constructing biological distributed systems comes with its own set of challenges. In this thesis I present new tools for the design and control of distributed systems in synthetic biology. The first part of this thesis focuses on biological computers. I develop novel design algorithms for distributed digital and analogue computers composed of spatial patterns of communicating bacterial colonies. I prove mathematically that we can program arbitrary digital functions and develop an algorithm for the automated design of optimal spatial circuits. Furthermore, I show that bacterial neural networks can be built using our system and develop efficient design tools to do so. I verify these results using computational simulations. This work shows that we can build distributed biological computers using communicating bacterial colonies and that different design tools can be used to program digital and analogue functions. The second part of this thesis utilises a technique from artificial intelligence, reinforcement learning, first in the control and then in the understanding of biological systems. First, I show the potential utility of reinforcement learning to control and optimise interacting communities of microbes that produce a biomolecule. Second, I apply reinforcement learning to the design of optimal characterisation experiments within synthetic biology. This work shows that methods utilising reinforcement learning show promise for complex distributed bioprocessing in industry and the design of optimal experiments throughout biology.

    Techniques of Energy-Efficient VLSI Chip Design for High-Performance Computing

    Implementing high-quality computing within a limited power budget is the key factor in moving very large-scale integration (VLSI) chip design forward. This work introduces various techniques of low-power VLSI design used for state-of-the-art computing. From the viewpoint of power supply, conventional in-chip voltage regulators based on analog blocks bring large power and area overheads to computational chips. Motivated by this, a digital-based switchable-pin method to dynamically regulate power at low circuit cost is proposed so that computation can be executed with a stable voltage supply. For the multiplier, one of the most widely used and time-consuming arithmetic units, operation in the logarithmic domain shows advantageous performance compared to the binary domain in terms of computation latency, power, and area. However, the conversion error introduced reduces the reliability of the subsequent computation (e.g., multiplication and division). In this work, a fast calibration method that suppresses the conversion error, and its VLSI implementation, are proposed. The proposed logarithmic converter can be supplied by DC power to achieve fast conversion and by clocked power to reduce the power dissipated during conversion. Moving beyond traditional computation methods and widely used static logic, a neuron-like cell is also studied in this work. Using multiple-input floating-gate (MIFG) metal-oxide-semiconductor field-effect transistor (MOSFET) based logic, a 32-bit, 16-operation arithmetic logic unit (ALU) with zipped decoding and a feedback loop is designed. The proposed ALU reduces switching power and has a strong driven-in capability due to its coupling capacitors, compared to a static-logic-based ALU. Besides, recent neural computations bring serious challenges to digital VLSI implementation due to the heavy load of matrix multiplications and non-linear functions. An analog VLSI design that is compatible with an external digital environment is proposed for the long short-term memory (LSTM) network. The entire analog-based network computes much faster and has higher energy efficiency than its digital counterpart.
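
    As background for the logarithmic-domain multiplier, the sketch below implements Mitchell's classical approximation, in which a multiplication becomes an addition of approximate logarithms; the systematic conversion error it exhibits is the kind of error that a calibration method must suppress. The thesis's specific calibration circuit is not reproduced here, and the test values are arbitrary.

        # Multiplication in the logarithmic domain with Mitchell's approximation
        # (log2(1 + m) ~= m), which turns a multiply into an add.
        import math

        def mitchell_log2(x):
            # Approximate log2(x) as k + m, where x = 2**k * (1 + m).
            k = math.floor(math.log2(x))     # leading-one position in hardware
            m = x / (1 << k) - 1.0           # mantissa bits, 0 <= m < 1
            return k + m

        def mitchell_antilog2(y):
            # Approximate 2**y as 2**k * (1 + m) for y = k + m.
            k = math.floor(y)
            m = y - k
            return (1 << k) * (1.0 + m) if k >= 0 else (2.0 ** k) * (1.0 + m)

        def mitchell_multiply(a, b):
            return mitchell_antilog2(mitchell_log2(a) + mitchell_log2(b))

        for a, b in [(13, 27), (100, 250), (7, 9)]:
            approx, exact = mitchell_multiply(a, b), a * b
            err = 100 * (exact - approx) / exact
            print(f"{a} x {b}: exact {exact}, approx {approx:.1f}, error {err:.2f}%")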

    Sign-Magnitude Stochastic Computing

    Stochastic computing (SC) is a promising computing paradigm for applications with low precision requirements and stringent cost and power restrictions. One known problem with SC, however, is its low accuracy, especially in multiplication. In this work we propose a simple, yet very effective solution to the low-accuracy SC-multiplication problem, which is critical in many applications such as deep neural networks (DNNs). Unlike previous solutions, our method is scalable, flexible, and does not require a priori knowledge of input values. It is based on the old concept of sign-magnitude representation, which, when applied to SC, has unique advantages. Our experimental results using multiple DNN applications demonstrate that our technique can improve the efficiency of SC-based DNNs by about 32X in terms of latency over using bipolar SC, with very little area overhead (about 1%). Moreover, we analyze the existing SC encodings and propose a mathematical formulation of SC multiplication, which can be used for analysis and to model the behavior of SC multiplication instead of performing bit-level simulation.
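
    A behavioral comparison of the two encodings is sketched below: bipolar SC multiplies with an XNOR gate, while a sign-magnitude scheme multiplies magnitudes with an AND gate and combines signs as an XOR would. The stream length, test values, and Monte-Carlo setup are illustrative assumptions, not the thesis's experimental configuration or hardware design.

        # Bipolar SC multiplication vs. a sign-magnitude scheme (behavioral model).
        import numpy as np

        rng = np.random.default_rng(0)
        LENGTH = 256

        def unipolar(p):             # P(bit = 1) = p,        p in [0, 1]
            return rng.random(LENGTH) < p

        def bipolar(v):              # P(bit = 1) = (v + 1)/2, v in [-1, 1]
            return rng.random(LENGTH) < (v + 1) / 2

        def bipolar_multiply(a, b):
            stream = ~(bipolar(a) ^ bipolar(b))             # XNOR gate
            return 2 * stream.mean() - 1                    # decode bipolar result

        def sign_magnitude_multiply(a, b):
            sign = np.sign(a) * np.sign(b)                  # models XOR of sign bits
            stream = unipolar(abs(a)) & unipolar(abs(b))    # AND gate on magnitudes
            return sign * stream.mean()

        a, b = 0.12, -0.25           # small values: the hard case for bipolar SC
        errors_bp = [abs(bipolar_multiply(a, b) - a * b) for _ in range(1000)]
        errors_sm = [abs(sign_magnitude_multiply(a, b) - a * b) for _ in range(1000)]
        print(f"mean |error|, bipolar:        {np.mean(errors_bp):.4f}")
        print(f"mean |error|, sign-magnitude: {np.mean(errors_sm):.4f}")

    At equal stream length the sign-magnitude product typically shows a much smaller error for small operands, which is consistent with the latency advantage claimed above: the same accuracy can be reached with far shorter streams.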

    Reservoir Computing with Boolean Logic Network Circuits

    To push the frontiers of machine learning, completely new computing architectures must be explored which use hardware resources efficiently. We test an unconventional use of digital logic gate circuits for reservoir computing, a machine learning algorithm that is used for rapid time series processing. In our approach, logic gates are configured into networks that can exhibit complex dynamics. Rather than the gates explicitly computing pre-programmed instructions, they are used collectively as a dynamical system that transforms input data into a higher-dimensional representation. We probe the dynamics of such circuits using discrete components on a circuit board as well as an FPGA implementation. We show favorable machine learning performance, including radiofrequency classification accuracy comparable to a state-of-the-art convolutional neural network with a fraction of the trainable parameters. Finally, we discuss the design and fabrication of a reservoir computing ASIC for high-speed time series processing.
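
    The following software sketch mimics the approach at a behavioral level: a random network of 2-input Boolean gates is driven by a bit stream, and a ridge-regression readout is trained on the recorded gate states for a toy short-term-memory task. Network size, wiring probabilities, and the task are illustrative assumptions, not the circuit-board or FPGA reservoirs studied in the paper.

        # Behavioral sketch of a Boolean-logic-network reservoir with a linear readout.
        import numpy as np

        rng = np.random.default_rng(42)
        N_NODES, T_TRAIN, T_TEST, DELAY = 200, 3000, 1000, 2

        # Each node is a random 2-input Boolean gate; source index N_NODES denotes
        # the external input bit, and ~30% of gates are wired to it directly.
        sources = rng.integers(0, N_NODES + 1, size=(N_NODES, 2))
        sources[rng.random(N_NODES) < 0.3, 0] = N_NODES
        truth_tables = rng.integers(0, 2, size=(N_NODES, 4))

        def run_reservoir(u):
            # Drive the gate network with input bits u and record all node states.
            state = rng.integers(0, 2, size=N_NODES)
            states = np.zeros((len(u), N_NODES))
            for t, bit in enumerate(u):
                full = np.append(state, bit)
                idx = 2 * full[sources[:, 0]] + full[sources[:, 1]]
                state = truth_tables[np.arange(N_NODES), idx]
                states[t] = state
            return states

        u = rng.integers(0, 2, size=T_TRAIN + T_TEST)
        target = np.roll(u, DELAY)                  # recall the input DELAY steps back
        X = run_reservoir(u)

        # Ridge-regression readout trained on the first T_TRAIN steps only.
        lam = 1e-3
        Xtr, ytr = X[DELAY:T_TRAIN], target[DELAY:T_TRAIN]
        W = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(N_NODES), Xtr.T @ ytr)
        pred = (X[T_TRAIN:] @ W) > 0.5
        print("test accuracy on the delayed-recall task:", np.mean(pred == target[T_TRAIN:]))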