
    XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes

    Heavily quantized fixed-point arithmetic is becoming a common approach to deploying Convolutional Neural Networks (CNNs) on limited-memory, low-power IoT end-nodes. However, this trend is hampered by the lack of low-bitwidth support in the arithmetic units of state-of-the-art embedded microcontrollers (MCUs). This work proposes a multi-precision arithmetic unit fully integrated into a RISC-V processor at the micro-architectural and ISA level to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we show near-linear speedup with respect to higher-precision integer computation on the key QNN kernels. We also propose a custom execution paradigm for SIMD sum-of-dot-product operations, which fuses a dot product with a load operation, yielding up to a 1.64× peak MAC/cycle improvement over a standard execution scenario. To further push efficiency, we integrate the extended RISC-V core into a parallel cluster of 8 processors, obtaining near-linear improvement with respect to a single-core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extension run 6× and 8× faster for 4- and 2-bit data operands, respectively, compared to a baseline processing cluster supporting only 8-bit SIMD instructions. With a peak of 2.22 TOPS/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
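    For readers unfamiliar with sub-byte SIMD dot products, the following hedged C sketch emulates, in plain scalar code, what a nibble (4-bit) sum-of-dot-product instruction of the kind described above computes over one pair of 32-bit registers; the packing convention and function name are illustrative assumptions, not the XpulpNN ISA encoding.

```c
#include <stdint.h>

/* Emulate, in plain scalar C, what a nibble (4-bit) SIMD sum-of-dot-product
 * instruction computes: each 32-bit word packs eight signed 4-bit lanes,
 * and the eight lane products are summed into a 32-bit accumulator. */
int32_t sumdotp_nibble_emulated(uint32_t a, uint32_t b, int32_t acc) {
    for (int lane = 0; lane < 8; lane++) {
        int32_t av = (int32_t)((a >> (4 * lane)) & 0xFu);
        int32_t bv = (int32_t)((b >> (4 * lane)) & 0xFu);
        if (av & 0x8) av -= 16;          /* sign-extend each 4-bit lane */
        if (bv & 0x8) bv -= 16;
        acc += av * bv;
    }
    return acc;
}
```

    Packing eight 4-bit (or sixteen 2-bit) operands into a 32-bit word, instead of four 8-bit ones, is what makes the near-linear speedup over 8-bit SIMD plausible when the hardware executes all lanes in a single cycle.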

    Flexible Computing Systems For AI Acceleration At The Extreme Edge Of The IoT

    Embedding intelligence in extreme edge devices allows distilling raw data acquired from sensors into actionable information directly on IoT end-nodes. This computing paradigm, in which end-nodes no longer depend entirely on the Cloud, offers undeniable benefits and drives a large research area (TinyML) aimed at deploying leading Machine Learning (ML) algorithms on microcontroller-class devices. To fit the limited memory storage of these tiny platforms, full-precision Deep Neural Networks (DNNs) are compressed by representing their data down to byte and sub-byte formats in the integer domain, yielding Quantized Neural Networks (QNNs). However, the current generation of microcontroller systems can barely cope with the computing requirements of QNNs. This thesis tackles the challenge from many perspectives, presenting solutions at both the software and hardware levels and exploiting parallelism, heterogeneity, and software programmability to guarantee high flexibility and high energy-performance proportionality. The first contribution, PULP-NN, is an optimized software computing library for QNN inference on parallel ultra-low-power (PULP) clusters of RISC-V processors, showing one order of magnitude improvement in performance and energy efficiency compared to current State-of-the-Art (SoA) STM32 microcontroller systems (MCUs) based on ARM Cortex-M cores. The second contribution is XpulpNN, a set of RISC-V domain-specific instruction set architecture (ISA) extensions for sub-byte integer arithmetic. The solution, including the ISA extensions and the micro-architecture to support them, achieves energy efficiency comparable with dedicated DNN accelerators and surpasses the efficiency of SoA ARM Cortex-M based MCUs, such as the low-end STM32L4 and the high-end STM32H7 devices, by up to three orders of magnitude. To overcome the Von Neumann bottleneck while guaranteeing the highest flexibility, the final contribution integrates an Analog In-Memory Computing accelerator into the PULP cluster, creating a fully programmable heterogeneous fabric that demonstrates end-to-end inference of SoA MobileNetV2 models, with two orders of magnitude performance improvement over current SoA analog/digital solutions.
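    The byte and sub-byte integer formats mentioned above rely on a requantization step between QNN layers (rescale, shift, clamp). As a rough illustration of the kind of arithmetic such kernels reduce to, here is a hedged C sketch that packs 4-bit weights two per byte and requantizes a 32-bit accumulator back to 4 bits; the constants and helper names are illustrative assumptions, not the PULP-NN/XpulpNN API.

```c
#include <stdint.h>

/* Illustrative sub-byte storage: pack two signed 4-bit weights into one byte. */
uint8_t pack_int4_pair(int8_t lo, int8_t hi) {
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

int8_t unpack_int4(uint8_t byte, int high_nibble) {
    int8_t v = (int8_t)((high_nibble ? (byte >> 4) : byte) & 0x0F);
    return (v & 0x08) ? (int8_t)(v - 16) : v;    /* sign-extend the nibble */
}

/* Requantize a 32-bit accumulator back to a 4-bit activation: multiply by a
 * fixed-point scale, shift right, then clamp. scale/shift are hypothetical
 * per-layer parameters, stand-ins for whatever the real library derives. */
int8_t requantize_to_int4(int32_t acc, int32_t scale, int shift) {
    int32_t y = (int32_t)(((int64_t)acc * scale) >> shift);
    if (y >  7) y =  7;
    if (y < -8) y = -8;
    return (int8_t)y;
}
```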

    Finite precision deep learning with theoretical guarantees

    Recent successes of deep learning have been achieved at the expense of very high computational and parameter complexity. Today, deployment of both inference and training of deep neural networks (DNNs) is predominantly in the cloud. A recent alternative trend is to deploy DNNs onto untethered, resource-constrained platforms at the edge. To realize on-device intelligence, the gap between algorithmic requirements and available resources needs to be closed. One popular way of doing so is via implementation in finite precision. While ad-hoc trial-and-error techniques in finite precision deep learning abound, theoretical guarantees on network accuracy are elusive. The work presented in this dissertation builds a theoretical framework for the implementation of deep learning in finite precision. For inference, we theoretically analyze the worst-case accuracy drop in the presence of weight and activation quantization. Furthermore, we derive an optimal clipping criterion (OCC) to minimize the precision of dot-product outputs. For implementations using in-memory computing, OCC lowers ADC precision requirements. We analyze fixed-point training and present a methodology for implementing quantized back-propagation with close-to-minimal per-tensor precision. Finally, we study accumulator precision for reduced-precision floating-point training using variance analysis techniques. We first introduce our work on fixed-point inference with accuracy guarantees. Theoretical bounds on the mismatch between limited- and full-precision networks are derived. Proper precision assignment can be readily obtained employing these bounds, and weight-activation as well as per-layer precision trade-offs are derived. Applied to a variety of networks and datasets, the presented analysis is found to be tight to within 2 bits. Furthermore, it is shown that a minimum-precision network can have up to ∼3.5× lower hardware complexity than a binarized network at iso-accuracy. In general, a minimum-precision network can reduce complexity by up to ∼10× compared to a full-precision baseline while maintaining accuracy. Per-layer precision analysis indicates that precision requirements of common networks vary from 2 bits to 10 bits to guarantee an accuracy close to the floating-point baseline. Then, we study DNN implementation using in-memory computing (IMC), where we propose OCC to minimize the column ADC precision. The signal-to-quantization-noise ratio (SQNR) of OCC is shown to be within 0.8 dB of the well-known optimal Lloyd-Max quantizer. OCC improves the SQNR of the commonly employed full-range quantizer by 14 dB, which translates to a 3-bit ADC precision reduction. The input-serial weight-parallel (ISWP) IMC architecture is studied. Using bit-slicing techniques, significant energy savings can be achieved with minimal accuracy loss. Indeed, we prove that a dot-product can be realized with a single memory access while suffering no more than a 2 dB SQNR drop. Combining the proposed OCC and ISWP noise analysis with our proposed DNN precision analysis, we demonstrate a ∼6× reduction of energy consumption in DNN implementation at iso-accuracy. Furthermore, we study the quantization of the back-propagation training algorithm. We propose a systematic methodology to obtain close-to-minimal per-layer precision requirements for guaranteed statistical similarity between fixed-point and floating-point training.
The challenges of quantization noise, inter-layer and intra-layer precision trade-offs, dynamic range, and stability are jointly addressed. Applied to several benchmarks, fixed-point training is demonstrated to achieve high fidelity to the baseline with an accuracy drop no greater than 0.56%. The derived precision assignment is shown to be within 1 bit per tensor of the minimum. The methodology is found to reduce the representational, computational, and communication costs of training by up to 6×, 8×, and 4×, respectively, compared to the baseline and related works. Finally, we address the problem of reduced-precision floating-point training. In particular, we study accumulation precision requirements. We present the variance retention ratio (VRR), an analytical metric measuring the suitability of accumulation mantissa precision. The analysis expands on concepts employed in variance engineering for weight initialization. An analytical expression for the VRR is derived and used to determine accumulation bit-width for precise tailoring of computation hardware. The VRR also quantifies the benefits of effective summation reduction techniques such as chunked accumulation and sparsification. Experimentally, the validity and tightness of our analysis are verified across multiple deep learning benchmarks.
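    To make the clipping-versus-quantization trade-off behind criteria such as OCC concrete, here is a hedged C sketch (an empirical toy, not the dissertation's actual derivation) that uniformly quantizes samples to a clipped range and measures the resulting SQNR; the Gaussian input model, the 4-bit setting, and all constants are illustrative assumptions.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Uniformly quantize x to `bits` bits over the clipped range [-c, c]. */
double quantize_clipped(double x, double c, int bits) {
    int levels = 1 << bits;                  /* e.g. 16 levels for 4 bits */
    double step = 2.0 * c / levels;
    if (x >  c) x =  c;                      /* clipping error */
    if (x < -c) x = -c;
    long idx = lround(x / step);             /* quantization error */
    if (idx >  levels / 2 - 1) idx =  levels / 2 - 1;
    if (idx < -levels / 2)     idx = -levels / 2;
    return (double)idx * step;
}

int main(void) {
    const int B = 4;                         /* illustrative output precision */
    const int N = 100000;
    srand(0);
    /* Sweep the clipping level: too small and clipping noise dominates,
     * too large and coarse step size dominates; some c maximizes SQNR. */
    for (double c = 0.5; c <= 4.0; c += 0.5) {
        double sig = 0.0, err = 0.0;
        for (int i = 0; i < N; i++) {
            /* crude zero-mean, roughly unit-variance sample (central limit) */
            double x = 0.0;
            for (int k = 0; k < 12; k++) x += (double)rand() / RAND_MAX;
            x -= 6.0;
            double q = quantize_clipped(x, c, B);
            sig += x * x;
            err += (x - q) * (x - q);
        }
        printf("c = %.1f  SQNR = %.2f dB\n", c, 10.0 * log10(sig / err));
    }
    return 0;
}
```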

    High-Performance Computing Unit Design for On-Device Convolutional Neural Network Accelerators

    Ph.D. dissertation, Seoul National University, Department of Electrical and Computer Engineering, August 2020 (advisor: Taewhan Kim). Optimizing the computing units of an on-device neural network accelerator can reduce energy and latency, increase throughput, and may enable unprecedented new applications. This dissertation studies two optimization opportunities for the multiply-accumulate (MAC) unit of on-device neural network accelerators, both stemming from precision quantization. First, we propose an enhanced MAC processing unit structure that efficiently processes mixed-precision models whose operations are mostly in low precision. Specifically, the two essential contributions are: (1) a MAC unit structure supporting two precision modes, designed to fully utilize its computation logic when processing lower-precision data, which brings higher computation efficiency for mixed-precision models dominated by lower-precision operations; (2) for a set of input CNNs, a formulation of the exploration of the internal multiplier size in the MAC unit, in terms of computation and energy cost across all network layers, to derive an economical instance of the MAC unit structure. Experimental results with two well-known CNN models, AlexNet and VGG-16, and two experimental precision settings show that the proposed units reduce computational cost per multiplication by 4.68% to 30.3% and save energy cost by 43.3% on average over conventional units. Second, we propose an acceleration technique for processing multiplication operations using stochastic computing (SC). MUX-FSM based SC, which employs a MUX controlled by an FSM to generate a bit sequence of a binary number to count up for a MAC operation, considerably reduces the hardware cost of implementing MAC operations over traditional stochastic number generator (SNG) based SC. Nevertheless, even though it offers a very economical hardware implementation, existing MUX-FSM based SC still does not meet the multiplication processing time required for wide adoption of on-device neural networks in practice, and previous enhancements are limited by sub-maximal cycle reduction, parameter conversion cost, and other issues. This work proposes a solution that further speeds up conventional MUX-FSM based SC. Specifically, we analyze the bit-counting pattern produced by the MUX-FSM and replace the counting redundancy with shift operations, significantly reducing the length of the required bit sequence and theoretically speeding up the worst-case multiplication processing time by 2× or more. Experiments show that the enhanced SC technique shortens the average processing time by 38.8% over conventional MUX-FSM based SC.
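    As a rough, hedged illustration of the counting-redundancy idea described above (a conceptual sketch of bit-serial multiplication, not the dissertation's actual MUX-FSM hardware): visiting bit i of the multiplier 2^i times and counting the multiplicand on each visit reproduces x*w, while collapsing the repeated visits to the same bit into a single shift-and-add removes those redundant cycles.

```c
#include <stdint.h>
#include <stdio.h>

/* Counting-style multiply: bit i of w is "selected" 2^i times and, whenever
 * that bit is 1, x is accumulated once per selection. This mimics the cycle
 * cost of a purely counting-based scheme. */
uint32_t mul_by_counting(uint16_t x, uint16_t w, int bits, long *cycles) {
    uint32_t acc = 0;
    for (int i = 0; i < bits; i++)
        for (long t = 0; t < (1L << i); t++) {   /* 2^i selections of bit i */
            if ((w >> i) & 1) acc += x;
            (*cycles)++;
        }
    return acc;
}

/* Shift-style multiply: the redundant selections of the same bit are
 * collapsed into one shift-and-add per bit. */
uint32_t mul_by_shifting(uint16_t x, uint16_t w, int bits, long *cycles) {
    uint32_t acc = 0;
    for (int i = 0; i < bits; i++) {
        if ((w >> i) & 1) acc += (uint32_t)x << i;
        (*cycles)++;
    }
    return acc;
}

int main(void) {
    long c1 = 0, c2 = 0;
    uint32_t a = mul_by_counting(200, 181, 8, &c1);
    uint32_t b = mul_by_shifting(200, 181, 8, &c2);
    printf("counting: %u in %ld cycles, shifting: %u in %ld cycles\n",
           (unsigned)a, c1, (unsigned)b, c2);
    return 0;
}
```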

    Full Stack Optimization of Transformer Inference: a Survey

    Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to an 88.7× speedup with minimal performance degradation for Transformer inference.
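    The survey's point about non-linear operations can be made concrete: unlike the matrix multiplications that dominate Transformer FLOPs, LayerNorm, Softmax, and GELU need divisions, square roots, and transcendental functions, which map poorly onto plain MAC arrays. The hedged C sketch below shows textbook floating-point reference forms of these three operators (illustrative only, not Gemmini's or the survey's implementations).

```c
#include <math.h>
#include <stddef.h>

/* Softmax over a row of attention scores: an exp and a division per element. */
void softmax_row(float *x, size_t n) {
    float m = x[0], s = 0.0f;
    for (size_t i = 1; i < n; i++) if (x[i] > m) m = x[i];   /* for stability */
    for (size_t i = 0; i < n; i++) { x[i] = expf(x[i] - m); s += x[i]; }
    for (size_t i = 0; i < n; i++) x[i] /= s;
}

/* LayerNorm: mean/variance reductions, then a square root and a division. */
void layernorm(float *x, size_t n, const float *gamma, const float *beta, float eps) {
    float mean = 0.0f, var = 0.0f;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (float)n;
    for (size_t i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= (float)n;
    float inv = 1.0f / sqrtf(var + eps);
    for (size_t i = 0; i < n; i++) x[i] = gamma[i] * (x[i] - mean) * inv + beta[i];
}

/* GELU (tanh approximation): one transcendental evaluation per element. */
float gelu(float x) {
    const float c = 0.7978845608f;                 /* sqrt(2/pi) */
    return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
}
```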

    Efficient machine learning: models and accelerations

    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers, such as deep neural networks, and comprise at least millions to hundreds of millions of parameters (i.e., weights). Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy. On the other hand, the layered deep structure and large model sizes also increase computational and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest: acceleration and model compression. The underlying goal of both is to preserve model quality so that predictions remain accurate. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems. To explore these two domains, this thesis first presents the cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent confabulation based text recognition models. The exploration and optimization of the cogent confabulation based models have been conducted through various comparisons, and the optimized network offers better accuracy for sentence completion. To accelerate the sentence completion problem on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve model performance. As deep neural networks have proven effective for many real-life applications and are deployed on low-power devices, we then investigate the acceleration of neural network inference using a hardware-friendly computing paradigm, stochastic computing (SC). It is an approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. The synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption. Modern neural networks usually contain a huge number of parameters, which cannot fit into embedded devices, so compression of deep learning models, together with acceleration, attracts our attention. We introduce structured-matrix-based neural networks to address this problem. The circulant matrix is one such structured matrix: the whole matrix can be represented by a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, called the block-circulant matrix, which partitions a matrix into several smaller blocks and makes each submatrix circulant; the compression ratio is controllable.
With the help of Fourier-transform-based equivalent computation, inference of the deep neural network can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant-matrix-based neural networks to obtain high accuracy after compression.
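    To illustrate the compression idea behind circulant structure (a generic sketch under assumed dimensions, not the thesis's FPGA implementation): an n-by-n circulant block is fully defined by its first column c, and multiplying it by a vector is a circular convolution, which is what makes an FFT-based O(n log n) evaluation possible. The direct form is shown below.

```c
#include <stddef.h>

/* Multiply an n x n circulant matrix, defined by its first column c,
 * with a vector x: y[i] = sum_j c[(i - j) mod n] * x[j].
 * Only c (n values) is stored instead of the full n*n block, which is the
 * source of the compression; an FFT-based version would evaluate the same
 * circular convolution in O(n log n) instead of O(n^2). */
void circulant_matvec(const float *c, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++) {
            size_t k = (i + n - j) % n;     /* (i - j) mod n without negatives */
            acc += c[k] * x[j];
        }
        y[i] = acc;
    }
}
```

    A block-circulant weight matrix applies this routine block by block, so the chosen block size directly sets the trade-off between compression ratio and accuracy.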

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing, combined with artificial intelligence, has led to exciting new application scenarios in which embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software, and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately foster the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.