
    Temperature Evaluation of NoC Architectures and Dynamically Reconfigurable NoC

    Advancements in the field of chip fabrication have led to the integration of a large number of transistors in a small area, giving rise to the multi-core processor era. Massive multi-core processors facilitate innovation and research in fields such as healthcare, defense, entertainment, and meteorology. Reduction in chip area and increase in the number of on-chip cores are accompanied by power and temperature issues. In high-performance multi-core chips, power and heat are predominant constraints. High-performance massive multi-core systems suffer from thermal hotspots, exacerbating the problem of reliability in deep submicron technologies. High power consumption not only increases the chip temperature but also jeopardizes the integrity of the system. Hence, there is a need to explore holistic power and thermal optimization and management strategies for massive on-chip multi-core environments. In multi-core environments, the communication fabric plays a major role in deciding the efficiency of the system. In multi-core processor chips this communication infrastructure is predominantly a Network-on-Chip (NoC). Traditional NoC designs incorporate planar interconnects; as a result, these NoCs have long, multi-hop wireline links for data exchange. Due to the presence of multi-hop planar links, such NoC architectures fall prey to high latency, significant power dissipation, and temperature hotspots. Networks inspired by nature are envisioned as an enabling technology to achieve highly efficient and low-power NoC designs, and adopting wireless technology in such architectures enhances their performance. Placement of wireless interconnects (WIs) alters the behavior of the network, so a random deployment of WIs may not result in a thermally optimal solution: the WIs, being highly efficient, would attract high traffic densities, resulting in thermal hotspots. Hence, the location and utilization of the wireless links is a key factor in obtaining a thermally optimal, highly efficient Network-on-Chip. Optimization of the NoC framework alone cannot address the effects of the runtime dynamics of the system: minimal paths optimized solely for performance may lead to excessive utilization of certain NoC components, again producing thermal hotspots. Hence, architectural innovation in conjunction with suitable power and thermal management strategies is the key to designing high-performance, energy-efficient multi-core systems. This work explores various wired and wireless NoC architectures that achieve the best trade-offs among temperature, performance, and energy efficiency. It further proposes an adaptive routing scheme that factors in the thermal profile of the chip: the routing mechanism dynamically reacts to the evolving thermal profile and takes measures to avoid thermal hotspots, achieving a thermally efficient, dynamically reconfigurable Network-on-Chip architecture.
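    To make the adaptive idea concrete, below is a minimal sketch of thermally-aware route selection on a 2D mesh, assuming each router exposes a temperature reading; the function names, the hotspot threshold, and the coolest-neighbor heuristic are illustrative assumptions, not the thesis's exact algorithm.

```python
# Minimal sketch of thermally-aware adaptive routing on a 2D mesh NoC.
# The temperature map, hotspot threshold, and coolest-neighbor heuristic
# are illustrative assumptions, not the thesis's exact algorithm.

def candidate_hops(cur, dst):
    """Minimal (shortest-path) next hops from cur toward dst on a mesh."""
    (cx, cy), (dx, dy) = cur, dst
    hops = []
    if dx != cx:
        hops.append((cx + (1 if dx > cx else -1), cy))
    if dy != cy:
        hops.append((cx, cy + (1 if dy > cy else -1)))
    return hops

def next_hop(cur, dst, temperature, t_hot=85.0):
    """Pick the coolest minimal next hop, detouring around routers whose
    temperature (deg C) exceeds t_hot whenever an alternative exists."""
    cands = candidate_hops(cur, dst)
    if not cands:                 # already at the destination router
        return cur
    cool = [h for h in cands if temperature[h] < t_hot] or cands
    return min(cool, key=lambda h: temperature[h])

# Example: route around a hotspot at (1, 0) on a 3x3 mesh.
temps = {(x, y): 60.0 for x in range(3) for y in range(3)}
temps[(1, 0)] = 92.0
print(next_hop((0, 0), (2, 2), temps))    # -> (0, 1), avoiding the hotspot
```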

    Tree-structured small-world connected wireless network-on-chip with adaptive routing

    Traditional Network-on-Chip (NoC) systems composed of many cores suffer from debilitating latency bottlenecks and significant power dissipation due to the overhead inherent in multi-hop communication. In addition, these systems remain vulnerable to malicious circuitry incorporated into the design by untrustworthy vendors, in a world where complex multi-stage design and manufacturing processes require the collective specialized services of a variety of contractors. This thesis proposes a novel small-world tree-based network-on-chip (SWTNoC) structure designed for high throughput, acceptable energy consumption, and resiliency to attacks and node failures resulting from the insertion of hardware Trojans. The tree-based implementation was devised as a means of reducing average network hop count, providing a large degree of local connectivity, and achieving effective long-range connectivity through a novel wireless link approach based on carbon nanotube (CNT) antenna design. Network resiliency is achieved by means of an adaptive routing algorithm devised to work with TRAIN (Tree-based Routing Architecture for Irregular Networks). Comparisons are drawn with benchmark architectures whose wireless link placement is optimized by means of the simulated annealing (SA) metaheuristic. Experimental results demonstrate a 21% throughput improvement and a 23% reduction in dissipated energy per packet over the closest competing architecture, with similar trends observed at increasing system sizes. In addition, the SWTNoC maintains this throughput and energy advantage in the presence of a fault introduced into the system. By designing a hierarchical topology and assigning a higher level of importance to a subset of the nodes, much higher network throughput can be attained while simultaneously guaranteeing deadlock freedom as well as a high degree of resiliency and fault-tolerance.
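    For reference, the SA-based wireless link placement used for the benchmark architectures can be sketched as below; the cost function (average shortest-path hop count on a mesh augmented with single-hop wireless shortcuts), the cooling schedule, and all parameters are illustrative assumptions.

```python
# Illustrative simulated-annealing (SA) search for wireless interface (WI)
# placement on an n x n mesh NoC. The cost function and all SA parameters
# are assumptions for illustration, not the thesis's setup.
import math
import random
from collections import deque

def avg_hop_count(n, wireless_links):
    """Average hop count over all source/destination pairs of the mesh."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    def neighbors(v):
        x, y = v
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < n and 0 <= ny < n:
                yield (nx, ny)
        for a, b in wireless_links:      # a wireless shortcut is one hop
            if v == a:
                yield b
            elif v == b:
                yield a
    total = 0
    for src in nodes:                    # BFS from every source
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in neighbors(v):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total / (len(nodes) * (len(nodes) - 1))

def anneal_wi_placement(n=4, n_links=2, steps=2000, temp=1.0, alpha=0.995):
    """Perturb one wireless link at a time, keeping improvements and
    occasionally accepting worse placements to escape local minima."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = [tuple(random.sample(nodes, 2)) for _ in range(n_links)]
    cost = avg_hop_count(n, links)
    for _ in range(steps):
        cand = list(links)
        cand[random.randrange(n_links)] = tuple(random.sample(nodes, 2))
        c = avg_hop_count(n, cand)
        if c < cost or random.random() < math.exp((cost - c) / temp):
            links, cost = cand, c
        temp *= alpha
    return links, cost
```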

    NEURAghe: Exploiting CPU-FPGA synergies for efficient and flexible CNN inference acceleration on Zynq SoCs

    Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
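    As a rough consistency check on these figures, assuming the commonly cited workload of roughly 30.8 GOp per VGG-16 inference at 224x224 input (counting multiplies and adds separately; this figure is an assumption, not from the abstract):

```latex
5.5\,\mathrm{frames/s} \times 30.8\,\mathrm{GOp/frame} \approx 169\,\mathrm{GOp/s}
```

    That is, the reported end-to-end VGG-16 frame rate corresponds to sustaining close to the platform's stated peak throughput.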

    Improved FPGA architecture and design flow for efficient online hardware reconfiguration

    The self-reconfiguration capabilities of modern FPGA architectures pave the way for dynamic applications able to adapt to transient events. The CAD flows of modern architectures are nowadays mature but limited by the constraints induced by the complexity of FPGA circuits. In this thesis, multiple contributions are developed to propose an FPGA architecture supporting the dynamic placement of hardware tasks. First, an intermediate representation of these tasks' configuration data, independent of their final position, is presented. This representation compresses the task data by up to 11x relative to its conventional raw counterpart. An accompanying CAD flow, based on state-of-the-art tools, is proposed to generate relocatable tasks from a high-level description. Then, the online behavior of this mechanism is studied. Two algorithms that decode this representation and produce the conventional bitstream in real time are described. In addition, an enhancement of the FPGA interconnection network is proposed to increase the placement flexibility of heterogeneous tasks, at the cost of a 10% average increase in critical-path delay. Finally, a configurable substitute for the configuration memory found in FPGAs is studied to ease partial reconfiguration.
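    A minimal sketch of the position-independent representation idea: task configuration entries are stored relative to a task-local origin and rebased onto absolute fabric coordinates at placement time. The data layout and names below are invented for illustration; the thesis's actual encoding (and its compression) is more elaborate.

```python
# Illustrative decoder for a position-independent task representation.
# A task is a list of (dx, dy, frame_bits) entries relative to the task's
# own origin; placing it at (ox, oy) rebases each entry onto absolute
# FPGA coordinates. The format is invented for illustration only.

def place_task(task_entries, origin, fabric_width, fabric_height):
    ox, oy = origin
    bitstream = {}
    for dx, dy, frame_bits in task_entries:
        x, y = ox + dx, oy + dy
        if not (0 <= x < fabric_width and 0 <= y < fabric_height):
            raise ValueError(f"task does not fit at origin {origin}")
        bitstream[(x, y)] = frame_bits   # absolute frame address -> config data
    return bitstream

# The same stored task can be placed at different origins:
task = [(0, 0, 0b1010), (1, 0, 0b0110), (0, 1, 0b1111)]
print(place_task(task, (2, 3), 8, 8))
print(place_task(task, (5, 5), 8, 8))
```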

    Hardware-aware software optimization techniques for multiple convolutional neural networks on embedded systems

    Thesis (Ph.D.) -- Graduate School of Seoul National University, Department of Computer Science and Engineering, February 2021. Advisor: Soonhoi Ha. Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on computational power, memory size, and energy consumption, while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. To cope with the increasing computational complexity, it is common to use an energy-efficient accelerator, such as a mobile GPU or digital signal processor (DSP) array, or to develop a customized neural processor chip called a neural processing unit (NPU). In the application domain, many optimization techniques change the application algorithm to reduce the amount of computation and the memory usage, either by developing new deep learning networks or by exploiting the statistical nature of deep learning algorithms. Another approach is hardware-aware software optimization, which first finds the performance bottleneck and then distributes the workload evenly across the available compute resources. This dissertation covers hardware-aware software optimization based on a given hardware processor or platform.
First, we devise a systematic optimization methodology through the experience of participating in the Low Power Image Recognition Challenge (LPIRC) and build a deep learning framework called C-GOOD (C-code Generation Framework for Optimized On-device Deep Learning) based on the devised methodology. For hardware independence, C-GOOD generates C code that can be compiled for and run on any embedded device. C-GOOD also provides various options for application-domain optimizations that can be applied according to the devised methodology, along with facilities for measuring system performance. By applying the devised methodology to three hardware platforms, the NVIDIA Jetson TX2, the Odroid XU4, and the Samsung Reconfigurable Processor (SRP), we demonstrate that the methodology is independent of the hardware platform and that application-domain optimizations can be performed easily with C-GOOD. Embedded devices are increasingly equipped with heterogeneous processing elements (PEs), and, at the same time, the need to run multiple deep learning applications concurrently on embedded systems such as self-driving cars and smartphones is growing. For such systems, we devise an end-to-end methodology to schedule deep learning applications onto heterogeneous PEs and implement a scheduling framework according to the methodology. It covers the whole flow, from profiling on real embedded devices to verifying the schedule results on those devices. In this methodology, we use a genetic algorithm (GA)-based scheduling technique for mapping deep learning applications onto heterogeneous PEs, and we consider several practical issues in the profiling step, such as DVFS and CPU hot-plug. Furthermore, we schedule multiple applications with different throughput constraints, considering the schedulability of the mapped tasks on each processor. After implementing a deep learning inference engine that can utilize heterogeneous PEs using a low-level library of the ARM Compute Library (ACL), we verify the devised methodology by running two widely used convolutional neural networks (CNNs) on a Galaxy S9 smartphone and a HiKey 970 board. While the previous optimization methods focus on the computation and the processing elements, the performance bottleneck of deep learning accelerators is the communication between off-chip and on-chip memory. Moreover, the off-chip DRAM access volume has a significant effect on the energy consumption of an NPU. To reduce the impact of off-chip DRAM access on the performance and energy of an NPU, we devise compiler techniques for an NPU to manage multi-bank on-chip memory with two different objectives: one is to minimize the off-chip memory access volume, and the other is to minimize the processing delay caused by unhidden DRAM accesses. The main idea is that, by organizing on-chip memory into multiple banks, we may hide the off-chip DRAM access delay by prefetching data into unused banks during computation and reduce the off-chip DRAM access volume by keeping the output feature map data of each layer in on-chip memory. By running CNN benchmarks on a cycle-level NPU simulator, we demonstrate the trade-off relation between the two objectives. The devised multi-bank on-chip memory management (MOMM) techniques are then extended to consider layer fusion, which aims to maximally reuse feature maps between layers.
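The latency-hiding half of the MOMM idea can be sketched as below, assuming a simple timing model in which a DMA fetch into an idle bank proceeds concurrently with computation; the function names and the two-bank round-robin policy are illustrative assumptions rather than the dissertation's exact algorithm.

```python
# Sketch of the DRAM-latency-hiding idea behind multi-bank on-chip memory
# management (MOMM): while the compute engine consumes the data in one
# bank, the next layer's inputs are prefetched from DRAM into another
# bank. Timing model and bank policy are illustrative assumptions.

def run_layers(layers, n_banks=2):
    """layers: list of (compute_cycles, fetch_cycles) per layer.
    Returns total cycles when fetches overlap with computation."""
    time = 0
    bank_ready = [0] * n_banks
    bank_ready[0] = layers[0][1]        # the very first fetch cannot be hidden
    for i, (compute, _fetch) in enumerate(layers):
        bank = i % n_banks
        time = max(time, bank_ready[bank])          # wait for inputs on-chip
        if i + 1 < len(layers):                     # prefetch next layer's
            nxt = (i + 1) % n_banks                 # inputs during compute
            bank_ready[nxt] = time + layers[i + 1][1]
        time += compute
    return time

layers = [(100, 40), (80, 90), (120, 30)]           # (compute, fetch) cycles
print(run_layers(layers))                  # overlapped: 340 cycles
print(sum(c + f for c, f in layers))       # fully serial: 460 cycles
```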
Since pure layer fusion incurs extra computation overhead and additional DRAM accesses for filter weights, a hybrid fusion technique is presented that lies between per-layer processing and pure layer fusion, built on the devised MOMM techniques with the two objectives above. Experimental results confirm the superiority of the hybrid fusion technique over both per-layer processing and pure layer fusion.
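The trade-off that motivates the hybrid scheme can be illustrated with a toy DRAM-traffic model for a two-layer chain: fusion keeps the intermediate feature map on-chip but pays for per-tile weight reloads and halo recomputation. All quantities and the 10% halo overhead below are invented for illustration.

```python
# Toy DRAM-traffic model contrasting per-layer processing with pure layer
# fusion for a two-layer chain. All quantities (in MB) and the 10% halo
# recomputation overhead are invented for illustration; the dissertation
# uses a far more detailed cost model.

def per_layer_traffic(in_fmap, mid_fmap, out_fmap, w1, w2):
    """Layer 1 reads input + weights and writes the intermediate feature
    map to DRAM; layer 2 reads it back with its weights, writes output."""
    return (in_fmap + w1 + mid_fmap) + (mid_fmap + w2 + out_fmap)

def fused_traffic(in_fmap, out_fmap, w1, w2, n_tiles, halo=1.1):
    """Fusion keeps the intermediate feature map on-chip, but re-reads both
    layers' weights for every tile and recomputes overlapping halos."""
    return in_fmap * halo + n_tiles * (w1 + w2) + out_fmap

# Large feature maps with small weights favor fusion ...
print(per_layer_traffic(8.0, 8.0, 8.0, 0.5, 0.5))        # 33.0 MB
print(fused_traffic(8.0, 8.0, 0.5, 0.5, n_tiles=4))      # 20.8 MB
# ... while heavy weights tilt the balance back toward per-layer
# processing, which is why a per-pair hybrid decision can beat both.
print(per_layer_traffic(2.0, 2.0, 2.0, 4.0, 4.0))        # 16.0 MB
print(fused_traffic(2.0, 2.0, 4.0, 4.0, n_tiles=4))      # 36.2 MB
```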

    Using Proportional-Integral-Differential approach for Dynamic Traffic Prediction in Wireless Network-on-Chip

    The massive integration of cores in multi-core systems has enabled chip designers to meet the power and performance demands of applications. Wireless interconnection has emerged as an energy-efficient solution to the challenges of multi-hop communication over the wireline paths in conventional Networks-on-Chip (NoCs). However, to ensure the full benefits of this novel interconnect technology, a simple, fair, and efficient Medium Access Control (MAC) mechanism is needed to grant access to the on-chip wireless communication channel. Moreover, to adapt to the varying traffic demands of the applications running in a multi-core environment, MAC mechanisms should dynamically adjust the transmission slots of the wireless interfaces (WIs). To ensure efficient utilization of the wireless medium in a Wireless NoC (WiNoC), in this work we present the design of a prediction model that is used by two dynamic MAC mechanisms to predict the traffic demand of the WIs and respond accordingly by adjusting their transmission slots. Through system-level simulations, we show that the traffic-aware MAC mechanisms are more energy-efficient and capable of sustaining higher data bandwidth in WiNoCs.
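    A minimal sketch of a PID-style predictor driving slot allocation is shown below; the gains, the observation unit (flits per epoch), and the proportional slot-sharing rule are illustrative assumptions rather than the paper's tuned design.

```python
# Illustrative PID-style predictor for per-WI traffic demand, paired with
# a proportional slot-allocation rule. Gains, the observation unit, and
# the allocation rule are assumptions; the paper's design may differ.

class PIDTrafficPredictor:
    def __init__(self, kp=0.6, ki=0.1, kd=0.2):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.prediction = 0.0

    def update(self, observed_demand):
        """Refine the demand estimate from the latest observation."""
        error = observed_demand - self.prediction
        self.integral += error                     # I term: accumulated error
        derivative = error - self.prev_error       # D term: error trend
        self.prev_error = error
        self.prediction += (self.kp * error
                            + self.ki * self.integral
                            + self.kd * derivative)
        return self.prediction

def allocate_slots(predictors, total_slots):
    """Share the transmission slots among WIs proportionally to demand."""
    demands = [max(p.prediction, 0.0) for p in predictors]
    total = sum(demands) or 1.0                    # avoid division by zero
    return [max(1, round(total_slots * d / total)) for d in demands]

# Example: one bursty WI and one quiet WI observed over three MAC epochs.
wis = [PIDTrafficPredictor(), PIDTrafficPredictor()]
for epoch_obs in [(50, 10), (60, 10), (80, 12)]:
    for wi, obs in zip(wis, epoch_obs):
        wi.update(obs)
    print(allocate_slots(wis, total_slots=16))
```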