Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), and 1.6X and 1.8X improvements for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
Comment: Published as a conference paper at ASPLOS 2020.
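To make the loop-nest framing concrete, below is a plain-Python sketch (not actual Halide code) of the seven nested loops of a dense convolution with one particular blocking and loop order spelled out. In Halide's scheduling language the same choice would be expressed with directives such as split/tile, reorder, and unroll, and different tile sizes and loop orders correspond to different accelerator dataflows; the tile sizes here are arbitrary illustration values.

```python
# Illustrative sketch: the seven nested loops of a dense convolution, with one
# blocking and loop order made explicit. The two innermost tile loops model
# spatial parallelism (a Tk x Tx array of processing elements).
import numpy as np

def conv2d_blocked(inp, wgt, Tk=4, Tx=4):
    # inp: [N, C, IY, IX], wgt: [K, C, FY, FX] -> out: [N, K, OY, OX]
    N, C, IY, IX = inp.shape
    K, _, FY, FX = wgt.shape
    OY, OX = IY - FY + 1, IX - FX + 1
    out = np.zeros((N, K, OY, OX), dtype=inp.dtype)
    for n in range(N):                          # loop 1: batch
        for k0 in range(0, K, Tk):              # loop 2: output-channel tiles
            for x0 in range(0, OX, Tx):         # loop 3: output-column tiles
                for oy in range(OY):            # loop 4: output rows
                    for c in range(C):          # loop 5: input channels
                        for fy in range(FY):    # loop 6: filter rows
                            for fx in range(FX):        # loop 7: filter columns
                                for k in range(k0, min(k0 + Tk, K)):
                                    for x in range(x0, min(x0 + Tx, OX)):
                                        out[n, k, oy, x] += (
                                            inp[n, c, oy + fy, x + fx]
                                            * wgt[k, c, fy, fx])
    return out

# Tiny check: blocking changes the loop structure, not the result.
rng = np.random.default_rng(0)
a = rng.standard_normal((1, 3, 8, 8))
w = rng.standard_normal((8, 3, 3, 3))
assert np.allclose(conv2d_blocked(a, w), conv2d_blocked(a, w, Tk=8, Tx=8))
```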
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, whose programming is conventionally an RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve FPGA
programmability, they still leave programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.
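As an illustration of analytical-model-driven design space exploration, the sketch below enumerates configurations of a toy CPP-style template and keeps the fastest design that fits a resource budget. The cost and resource formulas, parameter ranges, and budget are assumptions made up for the example, not AutoAccel's actual model.

```python
# Toy analytical-model-based DSE: each candidate design is a (parallel factor,
# tile size) pair; an analytical model estimates cycles and resources, and we
# keep the fastest design that fits the (assumed) FPGA budget.
from dataclasses import dataclass
from itertools import product

@dataclass
class Design:
    pe: int     # number of parallel processing engines (compute parallelism)
    tile: int   # on-chip tile size in elements (buffering / pipeline granularity)

def estimate_cycles(d, n_elems, flops_per_elem, bw_elems_per_cycle):
    compute = n_elems * flops_per_elem / d.pe        # ideal parallel compute time
    transfer = n_elems / bw_elems_per_cycle          # off-chip transfer time
    n_tiles = -(-n_elems // d.tile)                  # ceil division
    overhead = n_tiles * 50                          # per-tile pipeline fill/drain (assumed)
    return max(compute, transfer) + overhead         # double buffering overlaps the two

def estimate_resources(d):
    return {"dsp": 5 * d.pe, "bram_kb": 2 * d.tile // 1024}   # assumed per-PE / per-tile costs

def explore(n_elems=1_000_000, flops_per_elem=8, budget=None):
    budget = budget or {"dsp": 1024, "bram_kb": 2048}
    best = None
    for pe, tile in product([8, 16, 32, 64, 128], [4096, 16384, 65536]):
        d = Design(pe, tile)
        res = estimate_resources(d)
        if any(res[k] > budget[k] for k in budget):
            continue                                 # prune designs that do not fit
        cyc = estimate_cycles(d, n_elems, flops_per_elem, bw_elems_per_cycle=16)
        if best is None or cyc < best[0]:
            best = (cyc, d, res)
    return best

print(explore())
```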
A DNN Accelerator and Load Balancing Techniques Tailored for Accelerating Memory-Intensive Operations
Thesis (Ph.D.) -- Graduate School of Convergence Science and Technology, Seoul National University, August 2022.
Deep neural networks (DNNs) are used in various fields such as image classification, natural language processing, and speech recognition, thanks to recognition accuracy approaching that of humans. Driven by the continuous development of DNNs, a large body of accelerators has been introduced to process convolution (CONV) and general matrix multiplication (GEMM) operations, which account for the bulk of the computational demand. However, because this line of accelerator research has focused on accelerating compute-intensive operations, the share of execution time taken by memory-intensive operations has grown.
In convolutional neural network (CNN) inference, recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE) to reduce the computational cost of CONV. However, existing area-efficient CNN accelerators are sub-optimal for these latest CNN models because they were mainly optimized for compute-intensive standard CONV layers with abundant data reuse, which can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. The latter also depends strongly on the nearby CONV layers, making effective pipelining a daunting task. Therefore, although DW-CONV and SE account for only 10% of all operations, they become memory-bandwidth bound and consume more than 60% of the processing time on systolic-array-based accelerators.
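A back-of-the-envelope arithmetic-intensity comparison makes this imbalance concrete. The layer shapes below are illustrative MobileNet-like values chosen for the example, not figures from the thesis, and the traffic model assumes fp16 tensors with no cross-layer reuse.

```python
# MACs per byte of off-chip traffic for a standard CONV layer versus a
# depth-wise CONV layer (illustrative shapes, fp16 tensors, no reuse assumed).
def conv_stats(h, w, cin, cout, k, depthwise=False):
    if depthwise:
        macs = h * w * cin * k * k                   # each channel filtered independently
        weights = cin * k * k
    else:
        macs = h * w * cin * cout * k * k
        weights = cin * cout * k * k
    act_in, act_out = h * w * cin, h * w * (cin if depthwise else cout)
    bytes_moved = 2 * (weights + act_in + act_out)   # fp16 = 2 bytes/element
    return macs, macs / bytes_moved                  # (work, arithmetic intensity)

std_macs, std_ai = conv_stats(14, 14, 256, 256, 3)
dw_macs, dw_ai = conv_stats(14, 14, 256, 256, 3, depthwise=True)
print(f"standard CONV  : {std_macs/1e6:6.1f} MMACs, {std_ai:6.1f} MACs/byte")
print(f"depth-wise CONV: {dw_macs/1e6:6.1f} MMACs, {dw_ai:6.1f} MACs/byte")
# Depth-wise CONV has far lower arithmetic intensity, so a systolic array
# sized for standard CONV ends up starved by memory bandwidth.
```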
During transformer training, the execution times of memory-intensive operations such as softmax, layer normalization, GeLU, context, and attention layers have grown in relative terms because conventional accelerators have dramatically improved their computational performance. In addition, with the latest trend toward increasing sequence lengths, the softmax, context, and attention layers have an even larger influence, as their data sizes grow quadratically with sequence length. Thus, these layers take up to 80% of the execution time.
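A rough estimate with a BERT-Large-like configuration (16 attention heads, hidden size 1024) shows the quadratic growth of the attention score ("context") matrices against the only linear growth of the projection GEMMs; fp16 activations are assumed for the byte counts.

```python
# Per layer, per example: attention score matrices grow quadratically with
# sequence length, while the Q/K/V projection GEMMs grow only linearly.
heads, hidden = 16, 1024
for seq in (128, 512, 2048):
    score_elems = heads * seq * seq              # quadratic in sequence length
    qkv_macs = 3 * seq * hidden * hidden         # linear in sequence length
    print(f"seq={seq:5d}  score matrices: {2*score_elems/2**20:8.1f} MiB  "
          f"QKV GEMM: {qkv_macs/1e9:6.2f} GMACs")
```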
In this thesis, we propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of a baseline systolic-array-based architecture. We design a specialized vector unit tailored for processing DW-CONV, including multipliers, adder trees, and multi-banked buffers to meet its high memory bandwidth requirement. We augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV with standard CONV and thereby achieving the maximum utilization of the arithmetic units. Our evaluation shows that MVP improves performance by 2.6x and reduces energy consumption by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2 with only a 9% area overhead compared to the baseline.
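The benefit of running DW-CONV on a dedicated vector unit concurrently with the surrounding standard/point-wise CONV on the systolic array can be sketched with a toy schedule model; the cycle counts below are made up for illustration and are not measurements from the thesis.

```python
# Toy model: when the two units run concurrently, the per-block time becomes
# max() of the two unit times instead of their sum.
def block_time(pw_cycles, dw_cycles, overlapped):
    return max(pw_cycles, dw_cycles) if overlapped else pw_cycles + dw_cycles

# (PW-CONV cycles on the systolic array, DW-CONV cycles on the vector unit)
blocks = [(900, 400), (1200, 500), (700, 350)]
serial = sum(block_time(pw, dw, overlapped=False) for pw, dw in blocks)
parallel = sum(block_time(pw, dw, overlapped=True) for pw, dw in blocks)
print(f"serial: {serial} cycles, overlapped: {parallel} cycles, "
      f"speedup {serial/parallel:.2f}x")
```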
Then, we propose load balancing techniques that partition the multiple processing element tiles inside a DNN accelerator into clusters for transformer training acceleration. Traffic shaping alleviates temporal fluctuations in DRAM bandwidth demand by handling the processing element tiles within a cluster synchronously while running different clusters asynchronously. Resource sharing reduces the execution time of compute-intensive operations by executing them on the matrix units and vector units of all clusters simultaneously. Our evaluation shows that traffic shaping and resource sharing together improve performance by up to 1.27x for BERT-Large training.
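The effect of traffic shaping can be illustrated with a toy model (not the thesis's simulator): each cluster alternates a memory-heavy and a compute-heavy phase, and staggering the clusters flattens the peak DRAM bandwidth demand that lockstep execution would create.

```python
# Each cluster issues 1 unit of DRAM bandwidth during its memory phase.
# Lockstep clusters pile their memory phases on top of each other; staggered
# (asynchronous) clusters spread them out, lowering the peak demand.
def peak_demand(n_clusters, mem_phase=4, comp_phase=12, staggered=False, steps=64):
    period = mem_phase + comp_phase
    peaks = []
    for t in range(steps):
        demand = 0
        for c in range(n_clusters):
            offset = c * period // n_clusters if staggered else 0
            demand += 1 if (t + offset) % period < mem_phase else 0
        peaks.append(demand)
    return max(peaks)

print("peak BW demand, lockstep :", peak_demand(4))
print("peak BW demand, staggered:", peak_demand(4, staggered=True))
```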
1 Introduction
1.1 Accelerating Depth-wise Convolution on Edge Device
1.2 Accelerating Transformer Models in Training
1.3 Research Contributions
1.4 Outline
2 Background and Motivation
2.1 CNN background and trends
2.1.1 Various types of convolution (CONV) operations
2.1.2 Trends in CNN model architecture
2.1.3 EfficientNet: A state-of-the-art CNN model
2.2 Transformer background and trends
2.2.1 Bidirectional encoder representations from transformers (BERT)
2.2.2 Trends in training transformer models
2.3 Baseline DNN acceleration architecture
2.4 Motivation
2.4.1 Challenges of computing memory-intensive CNN layers
2.4.2 Opportunity for load balancing in BERT training
3 DNN accelerator tailored for accelerating memory-intensive operations
4 MVP: A CNN accelerator with Matrix, Vector, and Processing-near-memory units
4.1 Contribution
4.1.1 MVP organization
4.1.2 How depth-wise processing element (DWPE) operates
4.1.3 How processing-near-memory unit (PNMU) operates
4.1.4 Overlapping the operation of DW-CONV with PW-CONV
4.1.5 Considerations for designing DWIB
4.2 Evaluation
4.2.1 Experimental setup
4.2.2 Performance and energy evaluation
4.2.3 Comparing MVP with NVDLA
4.2.4 Exploring the design space of MVP architecture
4.2.5 Evaluating MVP with various SysAr configurations
4.3 Related Work
5 Load Balancing Techniques for BERT Training
5.1 Contribution
5.1.1 Tiled architecture
5.1.2 DRAM traffic shaping
5.1.3 Resource sharing
5.2 Evaluation
5.2.1 Experimental setup
5.2.2 Performance evaluation
6 Discussion
7 Conclusion
HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
With the rise of artificial intelligence in recent years, Deep Neural
Networks (DNNs) have been widely used in many domains. To achieve high
performance and energy efficiency, hardware acceleration (especially inference)
of DNNs is intensively studied both in academia and industry. However, we still
face two challenges: large DNN models and datasets, which incur frequent
off-chip memory accesses; and the training of DNNs, which is not well-explored
in recent accelerator designs. To truly provide high-throughput and
energy-efficient acceleration for the training of deep and large models, we inevitably
need to use multiple accelerators to explore the coarse-grain parallelism,
compared to the fine-grain parallelism inside a layer considered in most of the
existing architectures. This poses the key research question of how best to
organize computation and dataflow among the accelerators. In this paper, we
propose a solution HyPar to determine layer-wise parallelism for deep neural
network training with an array of DNN accelerators. HyPar partitions the
feature map tensors (input and output), the kernel tensors, the gradient
tensors, and the error tensors for the DNN accelerators. A partition
constitutes the choice of parallelism for the weighted layers. The optimization
target is to find a partition that minimizes the total communication incurred while
training a complete DNN. To solve this problem, we propose a communication
model that explains the sources and amounts of communication. Then, we use a
hierarchical layer-wise dynamic programming method to search for the partition
for each layer.
Comment: To appear in the 25th International Symposium on High-Performance Computer Architecture (HPCA 2019).
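A simplified sketch of a layer-wise dynamic program of this flavor is shown below; the two parallelism choices per layer and all cost numbers are placeholders, and HyPar's actual communication model and its hierarchical recursion over the accelerator array are richer than this.

```python
# Each weighted layer picks data parallelism ("DP") or model parallelism ("MP");
# it pays an intra-layer communication cost for its choice, plus a transition
# cost when adjacent layers need their tensors re-partitioned. The DP keeps the
# cheapest plan ending in each choice as it sweeps over the layers.
def best_partition(layers, trans_cost):
    choices = ("DP", "MP")
    best = {c: (layers[0][c], [c]) for c in choices}      # (total cost, choice sequence)
    for layer in layers[1:]:
        best = {
            c: min((best[p][0] + trans_cost[(p, c)] + layer[c], best[p][1] + [c])
                   for p in choices)
            for c in choices
        }
    return min(best.values())

# Per-layer communication costs (arbitrary units) and re-partitioning costs.
layers = [{"DP": 8, "MP": 3}, {"DP": 2, "MP": 9}, {"DP": 2, "MP": 7}, {"DP": 6, "MP": 1}]
trans_cost = {("DP", "DP"): 0, ("MP", "MP"): 0, ("DP", "MP"): 4, ("MP", "DP"): 4}
total, plan = best_partition(layers, trans_cost)
print(total, plan)   # minimum total communication and the per-layer choices
```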
Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
A Retrieval-Augmented Language Model (RALM) augments a generative language
model by retrieving context-specific knowledge from an external database. This
strategy facilitates impressive text generation quality even with smaller
models, reducing computational demands by orders of magnitude. However,
RALMs introduce unique system design challenges due to (a) the diverse workload
characteristics between LM inference and retrieval and (b) the various system
requirements and bottlenecks for different RALM configurations such as model
sizes, database sizes, and retrieval frequencies. We propose Chameleon, a
heterogeneous accelerator system that integrates both LM and retrieval
accelerators in a disaggregated architecture. The heterogeneity ensures
efficient acceleration of both LM inference and retrieval, while the
accelerator disaggregation enables the system to independently scale both types
of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype
implements retrieval accelerators on FPGAs and assigns LM inference to GPUs,
with a CPU server orchestrating these accelerators over the network. Compared
to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x
speedup and 26.2x better energy efficiency. Evaluated on various RALMs, Chameleon
exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput
compared to the hybrid CPU-GPU architecture. These promising results pave the
way for bringing accelerator heterogeneity and disaggregation into future RALM
systems.
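For orientation, here is a minimal sketch of the RALM serving flow that Chameleon disaggregates. The classes and interfaces are hypothetical stand-ins for the example; Chameleon itself runs retrieval on FPGAs and LM inference on GPUs, coordinated by a CPU server over the network.

```python
# Hypothetical stand-ins: a retrieval node (vector search over a small database)
# and an LM node, driven by a coordinator that dispatches retrieval
# asynchronously and feeds the retrieved context to generation.
from concurrent.futures import ThreadPoolExecutor

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class RetrievalAccelerator:          # stands in for an FPGA vector-search node
    def __init__(self, database):
        self.database = database     # list of (embedding, passage) pairs
    def top_k(self, query_vec, k=2):
        scored = sorted(self.database, key=lambda e: -dot(e[0], query_vec))
        return [passage for _, passage in scored[:k]]

class LMAccelerator:                 # stands in for a GPU inference node
    def generate(self, prompt):
        return f"<answer conditioned on: {prompt[:60]}...>"

def ralm_answer(pool, retriever, lm, query_vec, question):
    future = pool.submit(retriever.top_k, query_vec)          # dispatch retrieval
    context = " ".join(future.result())                       # gather passages
    return lm.generate(f"{context}\nQ: {question}\nA:")       # augmented generation

db = [([1.0, 0.0], "passage about FPGAs"), ([0.0, 1.0], "passage about GPUs"),
      ([0.7, 0.7], "passage about RALM systems")]
with ThreadPoolExecutor(max_workers=2) as pool:
    print(ralm_answer(pool, RetrievalAccelerator(db), LMAccelerator(),
                      [0.9, 0.1], "What is a RALM?"))
```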
- …