
    Memory Layout and Computing Techniques for High-Performance Artificial Neural Networks

    Thesis (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Taewhan Kim.
    Although the demand for exploiting neural networks is steadily increasing, there are many design challenges, since deep neural networks (DNNs) entail excessive memory and computation costs. This dissertation studies a number of new techniques for effectively processing DNN inference operations. Firstly, we attempt to overcome the limitation that the maximal computation speedup is bounded by the total number of non-zero bits of the weights. Specifically, based on signed-digit encoding, this work (1) proposes a transformation technique which converts the two's complement representation of every weight into a set of signed-digit representations with the minimal number of essential bits, (2) formulates the problem of selecting the signed-digit representations of weights that maximize the parallelism of bit-level multiplication on the weights as a multi-objective shortest path problem, targeting maximal digit-index by digit-index (i.e., column-wise) compression of the weights, and solves it efficiently with an approximation algorithm, and (3) proposes a novel supporting accelerator architecture (DWP) that requires no non-trivial additional hardware. In addition, we (4) propose a variant of DWP to support bit-level parallel multiplication with the capability of predicting a tight worst-case latency of the parallel processing. Through experiments, it is shown that our proposed approach is able to reduce the number of essential bits by 69% on AlexNet, 74% on VGG-16, and 68% on ResNet-152, by which our accelerator is able to reduce the inference computation time by up to 3.57x over the conventional bit-level weight pruning.
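As an illustration of the representation the first contribution builds on, the following is a minimal sketch of standard canonical signed-digit (CSD) conversion (listed as a preliminary in the dissertation's table of contents); the dissertation itself goes further, generating a set of signed-digit representations per weight and selecting among them, which this sketch does not attempt.

```python
# Illustrative sketch (not the dissertation's exact algorithm): standard
# canonical signed-digit (CSD) conversion, which rewrites a two's-complement
# integer with digits in {-1, 0, +1} so that no two adjacent digits are
# non-zero, minimizing the number of non-zero ("essential") bits.

def to_csd(w: int) -> list[int]:
    """Return CSD digits of integer w, least-significant digit first."""
    digits = []
    while w != 0:
        if w & 1:                  # odd: emit +1 or -1 depending on w mod 4
            d = 2 - (w & 3)        # w % 4 == 1 -> +1, w % 4 == 3 -> -1
        else:
            d = 0
        digits.append(d)
        w = (w - d) // 2           # exact division; works for negative w too
    return digits

def essential_bits(w: int) -> int:
    """Number of non-zero digits in the CSD form of w."""
    return sum(1 for d in to_csd(w) if d != 0)

# Example: 7 = 0b0111 needs three non-zero bits in binary,
# but only two in CSD: +1 at index 3 and -1 at index 0 (8 - 1 = 7).
assert to_csd(7) == [-1, 0, 0, 1]
assert essential_bits(7) == 2
```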
Secondly, a new algorithm for extracting common kernels and convolutions to maximally eliminate the redundant operations among the convolutions in binary- and ternary-weight convolutional neural networks is presented. Specifically, we propose (1) a new algorithm of common kernel extraction to overcome the local and limited exploration of common kernel candidates by the existing method, and subsequently apply (2) a new concept of common convolution extraction to maximally eliminate the redundancy in the convolution operations. In addition, our algorithm (3) can be tuned to minimize the number of resulting kernels for convolutions, thereby saving the total memory access latency for kernels. Experimental results on ternary-weight VGG-16 demonstrate that our convolution optimization algorithm is very effective, reducing the total number of operations for all convolutions by 25.8-26.3% and thereby reducing the total number of execution cycles on a hardware platform by 22.4%, while using 2.7-3.8% fewer kernels than convolutions utilizing the common kernels extracted by the state-of-the-art algorithm. Finally, we propose solutions for maintaining accuracy in DNNs with unfitted compression, i.e., when all the distinct weights of the compressed DNN cannot be entirely contained in on-chip memory. Precisely, given an access sequence of weights, (1) the first problem is to arrange the weights in off-chip memory so that the number of memory accesses to the off-chip memory (equivalently, the energy consumed by the accesses) is minimized, and (2) the second problem is to devise a strategy of selecting a weight block in on-chip memory for replacement when a block miss occurs, with the objective of minimizing the total energy consumed by the off-chip memory accesses and the overhead of scanning indexes for block replacement.
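For the block-replacement problem, the weight access sequence of a DNN is known ahead of time, so an offline policy applies. The sketch below shows the classic Belady MIN policy, on which the dissertation's MIN-k strategy (see the table of contents below) appears to build; this is an illustrative assumption, and the index-scanning overhead that MIN-k optimizes is not modeled.

```python
# Minimal sketch (assumption: plain Belady offline MIN policy; the
# dissertation's MIN-k variant also accounts for index-scanning overhead,
# which is not modeled here). Because a DNN's weight access sequence is
# fixed and known ahead of time, the block to evict on a miss can be the
# one whose next use lies farthest in the future.

def count_misses(access_seq: list[int], capacity: int) -> int:
    """Simulate an on-chip block cache over a known block access sequence."""
    cache: set[int] = set()
    misses = 0
    for i, blk in enumerate(access_seq):
        if blk in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            # Evict the cached block reused farthest in the future
            # (or never reused at all).
            def next_use(b: int) -> int:
                try:
                    return access_seq.index(b, i + 1)
                except ValueError:
                    return len(access_seq)  # never used again
            cache.remove(max(cache, key=next_use))
        cache.add(blk)
    return misses

# Toy example: 3-block cache, repeated access pattern.
seq = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]
print(count_misses(seq, capacity=3))  # fewer misses than LRU on this trace
```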
Through experiments with the compressed AlexNet model, it is shown that our solutions are able to reduce the total energy consumption of the off-chip memory accesses, including the scanning overhead, by 34.2% on average over the use of an unoptimized memory layout and the LRU replacement scheme.
Table of contents:
1 Introduction
1.1 Deep Neural Networks and Its Challenges
1.2 Redundant Weight Elimination Methods in DNN
1.3 Redundant Representation Elimination Methods in DNN
1.4 Contributions of This Dissertation
2 Bit-level Weight Pruning Techniques for High-Performance Neural Networks
2.1 Preliminary
2.1.1 Bit-level Weight Pruning in Binary Representation
2.1.2 Bit-level Weight Pruning in Signed-digit Representation
2.1.3 CSD Representation Conversion
2.2 Motivations
2.2.1 Inefficiency in Two's Complement Representation
2.2.2 Inability to Exploit Signed-digit Representation
2.3 Signed-digit Representation-based Deeper Weight Pruning
2.3.1 Generating Signed-digit Representations
2.3.2 Selecting Signed-digit Representations for Maximal Parallelism
2.3.3 Extension to the Low-precision Weights
2.4 Supporting Hardware Architecture
2.4.1 Technique for Using a Single Bit to Encode Ternary Value
2.4.2 Structure of Supporting Architecture
2.4.3 Memory Analysis
2.4.4 Full Utilization of Accumulation Adders
2.4.5 Modification for Hybrid Approach
2.5 Bit-level Intra-weight Pruning
2.5.1 Signed-digit Representation Conversion
2.5.2 Encoding Technique
2.5.3 Supporting Hardware Architecture
2.6 Experimental Results
2.6.1 Essential Bits
2.6.2 Memory Usage
2.6.3 Performance
2.6.4 Area
2.6.5 Energy Efficiency
3 Convolution Computation Techniques for High-Performance Neural Networks
3.1 Motivations
3.1.1 Limited Space Exploration for Common Kernels
3.1.2 Inability to Exploit Common Expressions of Convolution Values
3.2 The Proposed Algorithm
3.2.1 Common Kernel Extraction
3.2.2 Common Convolution Extraction
3.2.3 Memory Access Minimization
3.3 Hardware Implementation
3.4 Experimental Results
3.4.1 Experimental Setup
3.4.2 Assessing Effectiveness of ConvOpt-op and ConvOpt-mem
3.4.3 Measuring Performance through Hardware Implementation
3.4.4 Running Time of ConvOpt
4 Memory Layout and Block Replacement Techniques for High-Performance Neural Networks
4.1 Motivation
4.2 Algorithms for Off-chip Memory Access Optimization for DNNs with Unfitted Compression
4.2.1 Algorithm for Off-chip Memory Layout
4.2.2 Algorithm for On-chip Memory Block Replacement
4.2.3 Exploitation of Parallel Computing
4.3 Experimental Results
4.3.1 Experimental Setup
4.3.2 Assessing the Effectiveness of Mem-layout
4.3.3 Assessing the Effectiveness of MIN-k Combined with Mem-layout
5 Conclusions
5.1 Bit-level Weight Pruning Techniques for High-Performance Neural Networks
5.2 Convolution Computation Techniques for High-Performance Neural Networks
5.3 Memory Layout and Block Replacement Techniques for High-Performance Neural Networks
Abstract (In Korean)

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing, combined with artificial intelligence, has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to fostering the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.

    Machine Learning Classification of Digitally Modulated Signals

    Automatic classification of digitally modulated signals is a challenging problem that has traditionally been approached using signal processing tools such as log-likelihood algorithms for signal classification or cyclostationary signal analysis. These approaches are computationally intensive and cumbersome in general, and in recent years alternative approaches that use machine learning have been presented in the literature for automatic classification of digitally modulated signals. This thesis studies deep learning approaches for classifying digitally modulated signals that use deep artificial neural networks in conjunction with the canonical representation of digitally modulated signals in terms of in-phase and quadrature components. Specifically, capsule networks are trained to recognize common types of PSK and QAM digital modulation schemes, and their classification performance is tested on two distinct, publicly available datasets. Results show that capsule networks outperform the convolutional neural networks and residual networks previously used to classify signals in the same datasets, and indicate that they are a meaningful alternative for machine learning approaches to digitally modulated signal classification. The thesis also includes a discussion of practical implementations of the proposed capsule networks in an FPGA-powered embedded system.
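As a small illustration of the input representation the thesis relies on, the sketch below generates noisy QPSK symbols in the canonical in-phase/quadrature form; the shapes and dataset layout are illustrative assumptions, not those of the specific public datasets used in the thesis.

```python
# Minimal sketch (assumptions: dataset layout and shapes are illustrative,
# not those of the public datasets used in the thesis). It shows the
# canonical I/Q representation: a modulated signal becomes a 2 x N array
# of in-phase and quadrature samples that a network can classify.
import numpy as np

def qpsk_iq(num_symbols: int, snr_db: float, rng=np.random.default_rng(0)):
    """Generate noisy QPSK symbols as a (2, N) in-phase/quadrature array."""
    bits = rng.integers(0, 2, size=(num_symbols, 2))
    # Gray-mapped QPSK constellation points on the unit circle.
    symbols = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)
    noise_power = 10 ** (-snr_db / 10)
    noise = rng.normal(scale=np.sqrt(noise_power / 2), size=(num_symbols, 2))
    noisy = symbols + noise[:, 0] + 1j * noise[:, 1]
    return np.stack([noisy.real, noisy.imag])  # row 0: I, row 1: Q

x = qpsk_iq(1024, snr_db=10.0)
print(x.shape)  # (2, 1024) -- the input format fed to the classifier
```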

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources, and as a result, dedicated hardware accelerators for all use cases are developed. Different accelerators provide solutions ranging from hyper-scaling cloud environments for the training of DNNs to inference devices in embedded systems. They implement intrinsics for complex operations directly in hardware. A common example are intrinsics for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. How DNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often works with transformation-driven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity of this task has not yet been explored, especially for less frequently used operators like Capsule Routing. And not only the implementation of DNN operators with intrinsics is challenging; their optimization on the target device is also difficult. Hardware-in-the-loop tools are often used for this problem. They use latency measurements of implementation candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessarily long optimization times. This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators. It is explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected Layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformations can lead to scaling issues. Building on these insights, this work proposes a novel method to embed hardware intrinsics into DNN operators. It is based on a dataflow analysis. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations. From the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the-art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median time-to-solution is reduced by a factor of ×2.40. At the same time, the possibility of prolonged searches due to a bad initial set of implementations is reduced, improving the optimization's robustness by ×2.35.
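To make the matrix-multiply intrinsic mapping concrete, here is a minimal sketch of the classic im2col lowering, the kind of explicit transformation-driven rewrite the thesis contrasts its dataflow-analysis method against; np.matmul stands in for the hardware intrinsic, and the whole example is illustrative rather than the thesis's method.

```python
# Minimal sketch (assumption: plain im2col lowering, the classic
# transformation-driven way to map a convolution onto a matrix-multiply
# intrinsic; the thesis's dataflow-analysis method derives such mappings
# without explicit rewrites). np.matmul stands in for the intrinsic.
import numpy as np

def conv2d_via_matmul(x, w):
    """x: (C, H, W) input; w: (K, C, R, S) kernels; valid padding, stride 1."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1
    # im2col: every receptive field becomes one column of a matrix.
    cols = np.empty((C * R * S, OH * OW))
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[:, i:i + R, j:j + S].ravel()
    # The convolution is now a single (K, C*R*S) x (C*R*S, OH*OW) matmul,
    # i.e., exactly the shape a matrix-multiply intrinsic accepts.
    out = np.matmul(w.reshape(K, -1), cols)
    return out.reshape(K, OH, OW)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
print(conv2d_via_matmul(x, w).shape)  # (4, 6, 6)
```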

    Reconfigurable Antenna Systems: Platform implementation and low-power matters

    Antennas are a necessary and often critical component of all wireless systems, of which they share the ever-increasing complexity and the challenges of present and emerging trends. 5G, massive low-orbit satellite architectures (e.g. OneWeb), Industry 4.0, Internet of Things (IoT), satcom on-the-move, Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles all call for highly flexible systems, and antenna reconfigurability is an enabling part of these advances. The terminal segment is particularly crucial in this sense, encompassing both very compact and low-profile antennas, all with various adaptability/reconfigurability requirements. This thesis work has dealt with hardware implementation issues of Radio Frequency (RF) antenna reconfigurability, and in particular with low-power General Purpose Platforms (GPP); the work has encompassed Software Defined Radio (SDR) implementation, as well as embedded low-power platforms (in particular the STM32 Nucleo family of microcontrollers). The hardware-software platform work has been complemented with the design and fabrication of reconfigurable antennas in standard technology, and the resulting systems were tested. The selected antenna technology was an antenna array with a continuously steerable beam, controlled by voltage-driven phase-shifting circuits. Notable applications included a Wireless Sensor Network (WSN) deployed in the Italian scientific mission in Antarctica, a traffic-monitoring case study (EU H2020 project), and an innovative Global Navigation Satellite Systems (GNSS) antenna concept (patent application submitted). The SDR implementation focused on a low-cost and low-power software-defined radio open-source platform with IEEE 802.11 a/g/p wireless communication capability. In a second embodiment, the flexibility of the SDR paradigm has been traded off to avoid the power consumption associated with the underlying operating system. The application field of reconfigurable antennas is, however, not limited to better management of energy consumption. The analysis has also been extended to satellite positioning applications. A novel beamforming method is presented, demonstrating improvements in the quality of signals received from satellites. For those working on positioning algorithms, this advancement helps improve the precision of the estimated position.
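As an illustration of what the voltage-driven phase shifters implement, the sketch below computes the per-element phases that steer a uniform linear array to a given angle and verifies the beam direction via the array factor; the array geometry and all values are illustrative assumptions, not the fabricated antenna's parameters.

```python
# Minimal sketch (assumptions: uniform linear array with half-wavelength
# spacing; all values are illustrative). It shows the relation the
# phase-shifting circuits implement: to steer the beam to angle theta,
# element n is delayed by n * 2*pi*(d/lambda) * sin(theta).
import numpy as np

def steering_phases(n_elements: int, d_over_lambda: float, theta_deg: float):
    """Per-element phase shifts (radians) steering the main beam to theta."""
    n = np.arange(n_elements)
    return -2 * np.pi * d_over_lambda * n * np.sin(np.radians(theta_deg))

def array_factor(phases, d_over_lambda, scan_deg):
    """Normalized array response over scan angles, given element phases."""
    n = np.arange(len(phases))
    af = [np.abs(np.sum(np.exp(1j * (2 * np.pi * d_over_lambda * n
                                     * np.sin(np.radians(a)) + phases))))
          for a in scan_deg]
    return np.array(af) / len(phases)

phases = steering_phases(8, 0.5, theta_deg=20.0)
scan = np.linspace(-90, 90, 361)
af = array_factor(phases, 0.5, scan)
print(scan[np.argmax(af)])  # ~20 degrees: the beam points where intended
```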

    Improved 3D MR Image Acquisition and Processing in Congenital Heart Disease

    Congenital heart disease (CHD) is the most common type of birth defect, affecting about 1% of the population. MRI is an essential tool in the assessment of CHD, including diagnosis, intervention planning and follow-up. Three-dimensional MRI can provide particularly rich visualization and information. However, it is often complicated by long scan times, cardiorespiratory motion, injection of contrast agents, and complex and time-consuming postprocessing. This thesis comprises four pieces of work that attempt to respond to some of these challenges. The first piece of work aims to enable fast acquisition of 3D time-resolved cardiac imaging during free breathing. Rapid imaging was achieved using an efficient spiral sequence and a sparse parallel imaging reconstruction. The feasibility of this approach was demonstrated on a population of 10 patients with CHD, and areas of improvement were identified. The second piece of work is an integrated software tool designed to simplify and accelerate the development of machine learning (ML) applications in MRI research. It also exploits the strengths of recently developed ML libraries for efficient MR image reconstruction and processing. The third piece of work aims to reduce contrast dose in contrast-enhanced MR angiography (MRA), which would reduce the risks and costs associated with contrast agents. A deep learning-based contrast enhancement technique was developed and shown to improve image quality in real low-dose MRA in a population of 40 children and adults with CHD. The fourth and final piece of work aims to simplify the creation of computational models for hemodynamic assessment of the great arteries. A deep learning technique for 3D segmentation of the aorta and the pulmonary arteries was developed and shown to enable accurate calculation of clinically relevant biomarkers in a population of 10 patients with CHD.

    Embedded electronic systems driven by run-time reconfigurable hardware

    This doctoral thesis addresses the design of embedded electronic systems based on run-time reconfigurable hardware technology, available through SRAM-based FPGA/SoC devices, aimed at contributing to enhancing people's quality of life. This work investigates the conception of the system architecture and the reconfiguration engine that provide the FPGA with the capability of dynamic partial reconfiguration, in order to synthesize, by means of hardware/software co-design, a given application partitioned into processing tasks that are multiplexed in time and space, thus optimizing its physical implementation (silicon area, processing time, complexity, flexibility, functional density, cost and power consumption) in comparison with other alternatives based on static hardware (MCU, DSP, GPU, ASSP, ASIC, etc.). The design flow of such technology is evaluated through the prototyping of several engineering applications (control systems, mathematical coprocessors, complex image processors, etc.), showing a level of maturity high enough for industrial exploitation.
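As a conceptual illustration of the time/space multiplexing of hardware tasks described above, the sketch below models tasks sharing a few reconfigurable regions and paying a partial-reconfiguration cost on each region swap; all task names and timings are invented for illustration, and real DPR control (vendor reconfiguration ports, bitstream management) is not modeled.

```python
# Conceptual sketch (assumptions: all task and timing values are
# illustrative; real DPR goes through vendor reconfiguration ports, which
# are not modeled here). It shows the scheduling idea: tasks share a few
# reconfigurable regions, paying a reconfiguration cost whenever a region
# must be loaded with a different hardware task.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    exec_time: float       # time the hardware task runs once loaded
    bitstream_time: float  # time to partially reconfigure a region

def run_sequence(tasks: list[Task], n_regions: int) -> float:
    """Total time to run tasks in order on n_regions reconfigurable slots."""
    loaded: list[str] = []   # task names currently configured, LRU order
    total = 0.0
    for t in tasks:
        if t.name in loaded:
            loaded.remove(t.name)        # hit: region already holds the task
        else:
            total += t.bitstream_time    # miss: pay partial reconfiguration
            if len(loaded) >= n_regions:
                loaded.pop(0)            # evict least recently used region
        loaded.append(t.name)
        total += t.exec_time
    return total

fir = Task("fir_filter", exec_time=1.0, bitstream_time=0.4)
fft = Task("fft", exec_time=2.0, bitstream_time=0.6)
pid = Task("pid_control", exec_time=0.5, bitstream_time=0.3)
print(run_sequence([fir, fft, fir, pid, fft], n_regions=2))  # 8.4
```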