Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), and 1.6X and 1.8X improvements for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
Comment: Published as a conference paper at ASPLOS 2020.
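To make the loop-nest framing concrete, below is a plain-Python sketch (not actual Halide code) of the seven nested loops of a dense convolution with one particular blocking and loop order spelled out. In Halide's scheduling language the same choice would be expressed with directives such as split/tile, reorder, and unroll, and different tile sizes and loop orders correspond to different accelerator dataflows; the tile sizes here are arbitrary illustration values.

```python
# Illustrative sketch: the seven nested loops of a dense convolution, with one
# blocking and loop order made explicit. The two innermost tile loops model
# spatial parallelism (a Tk x Tx array of processing elements).
import numpy as np

def conv2d_blocked(inp, wgt, Tk=4, Tx=4):
    # inp: [N, C, IY, IX], wgt: [K, C, FY, FX] -> out: [N, K, OY, OX]
    N, C, IY, IX = inp.shape
    K, _, FY, FX = wgt.shape
    OY, OX = IY - FY + 1, IX - FX + 1
    out = np.zeros((N, K, OY, OX), dtype=inp.dtype)
    for n in range(N):                          # loop 1: batch
        for k0 in range(0, K, Tk):              # loop 2: output-channel tiles
            for x0 in range(0, OX, Tx):         # loop 3: output-column tiles
                for oy in range(OY):            # loop 4: output rows
                    for c in range(C):          # loop 5: input channels
                        for fy in range(FY):    # loop 6: filter rows
                            for fx in range(FX):        # loop 7: filter columns
                                for k in range(k0, min(k0 + Tk, K)):
                                    for x in range(x0, min(x0 + Tx, OX)):
                                        out[n, k, oy, x] += (
                                            inp[n, c, oy + fy, x + fx]
                                            * wgt[k, c, fy, fx])
    return out

# Tiny check: blocking changes the loop structure, not the result.
rng = np.random.default_rng(0)
a = rng.standard_normal((1, 3, 8, 8))
w = rng.standard_normal((8, 3, 3, 3))
assert np.allclose(conv2d_blocked(a, w), conv2d_blocked(a, w, Tk=8, Tx=8))
```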
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, whose programming is conventionally an RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve FPGA
programmability, they still leave programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.
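As an illustration of analytical-model-driven design space exploration, the sketch below enumerates configurations of a toy CPP-style template and keeps the fastest design that fits a resource budget. The cost and resource formulas, parameter ranges, and budget are assumptions made up for the example, not AutoAccel's actual model.

```python
# Toy analytical-model-based DSE: each candidate design is a (parallel factor,
# tile size) pair; an analytical model estimates cycles and resources, and we
# keep the fastest design that fits the (assumed) FPGA budget.
from dataclasses import dataclass
from itertools import product

@dataclass
class Design:
    pe: int     # number of parallel processing engines (compute parallelism)
    tile: int   # on-chip tile size in elements (buffering / pipeline granularity)

def estimate_cycles(d, n_elems, flops_per_elem, bw_elems_per_cycle):
    compute = n_elems * flops_per_elem / d.pe        # ideal parallel compute time
    transfer = n_elems / bw_elems_per_cycle          # off-chip transfer time
    n_tiles = -(-n_elems // d.tile)                  # ceil division
    overhead = n_tiles * 50                          # per-tile pipeline fill/drain (assumed)
    return max(compute, transfer) + overhead         # double buffering overlaps the two

def estimate_resources(d):
    return {"dsp": 5 * d.pe, "bram_kb": 2 * d.tile // 1024}   # assumed per-PE / per-tile costs

def explore(n_elems=1_000_000, flops_per_elem=8, budget=None):
    budget = budget or {"dsp": 1024, "bram_kb": 2048}
    best = None
    for pe, tile in product([8, 16, 32, 64, 128], [4096, 16384, 65536]):
        d = Design(pe, tile)
        res = estimate_resources(d)
        if any(res[k] > budget[k] for k in budget):
            continue                                 # prune designs that do not fit
        cyc = estimate_cycles(d, n_elems, flops_per_elem, bw_elems_per_cycle=16)
        if best is None or cyc < best[0]:
            best = (cyc, d, res)
    return best

print(explore())
```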
A DNN Accelerator and Load Balancing Techniques Tailored for Accelerating Memory-Intensive Operations
Thesis (Ph.D.) -- Graduate School of Convergence Science and Technology, Seoul National University, August 2022.
Deep neural networks (DNNs) are used in various fields such as image classification, natural language processing, and speech recognition, thanks to recognition accuracy approaching that of humans. Driven by the continuous development of DNNs, a large body of accelerators has been introduced to process convolution (CONV) and general matrix multiplication (GEMM) operations, which account for the bulk of the computational demand. However, because this line of accelerator research has focused on accelerating compute-intensive operations, the share of execution time taken by memory-intensive operations has grown.
In convolutional neural network (CNN) inference, recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE) to reduce the computational cost of CONV. However, existing area-efficient CNN accelerators are sub-optimal for these latest CNN models because they were mainly optimized for compute-intensive standard CONV layers with abundant data reuse, which can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. The latter also depends strongly on the nearby CONV layers, making effective pipelining a daunting task. Therefore, although DW-CONV and SE account for only 10% of all operations, they become memory-bandwidth bound and consume more than 60% of the processing time on systolic-array-based accelerators.
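A back-of-the-envelope arithmetic-intensity comparison makes this imbalance concrete. The layer shapes below are illustrative MobileNet-like values chosen for the example, not figures from the thesis, and the traffic model assumes fp16 tensors with no cross-layer reuse.

```python
# MACs per byte of off-chip traffic for a standard CONV layer versus a
# depth-wise CONV layer (illustrative shapes, fp16 tensors, no reuse assumed).
def conv_stats(h, w, cin, cout, k, depthwise=False):
    if depthwise:
        macs = h * w * cin * k * k                   # each channel filtered independently
        weights = cin * k * k
    else:
        macs = h * w * cin * cout * k * k
        weights = cin * cout * k * k
    act_in, act_out = h * w * cin, h * w * (cin if depthwise else cout)
    bytes_moved = 2 * (weights + act_in + act_out)   # fp16 = 2 bytes/element
    return macs, macs / bytes_moved                  # (work, arithmetic intensity)

std_macs, std_ai = conv_stats(14, 14, 256, 256, 3)
dw_macs, dw_ai = conv_stats(14, 14, 256, 256, 3, depthwise=True)
print(f"standard CONV  : {std_macs/1e6:6.1f} MMACs, {std_ai:6.1f} MACs/byte")
print(f"depth-wise CONV: {dw_macs/1e6:6.1f} MMACs, {dw_ai:6.1f} MACs/byte")
# Depth-wise CONV has far lower arithmetic intensity, so a systolic array
# sized for standard CONV ends up starved by memory bandwidth.
```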
During transformer training, the execution times of memory-intensive operations such as softmax, layer normalization, GeLU, context, and attention layers have grown in relative terms because conventional accelerators have dramatically improved their computational performance. In addition, with the latest trend toward increasing sequence lengths, the softmax, context, and attention layers have an even larger influence, as their data sizes grow quadratically with sequence length. Thus, these layers take up to 80% of the execution time.
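A rough estimate with a BERT-Large-like configuration (16 attention heads, hidden size 1024) shows the quadratic growth of the attention score ("context") matrices against the only linear growth of the projection GEMMs; fp16 activations are assumed for the byte counts.

```python
# Per layer, per example: attention score matrices grow quadratically with
# sequence length, while the Q/K/V projection GEMMs grow only linearly.
heads, hidden = 16, 1024
for seq in (128, 512, 2048):
    score_elems = heads * seq * seq              # quadratic in sequence length
    qkv_macs = 3 * seq * hidden * hidden         # linear in sequence length
    print(f"seq={seq:5d}  score matrices: {2*score_elems/2**20:8.1f} MiB  "
          f"QKV GEMM: {qkv_macs/1e9:6.2f} GMACs")
```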
In this thesis, we propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of a baseline systolic-array-based architecture. We design a specialized vector unit tailored for processing DW-CONV, including multipliers, adder trees, and multi-banked buffers to meet its high memory bandwidth requirement. We augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV with standard CONV and thereby achieving the maximum utilization of the arithmetic units. Our evaluation shows that MVP improves performance by 2.6x and reduces energy consumption by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2 with only a 9% area overhead compared to the baseline.
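The benefit of running DW-CONV on a dedicated vector unit concurrently with the surrounding standard/point-wise CONV on the systolic array can be sketched with a toy schedule model; the cycle counts below are made up for illustration and are not measurements from the thesis.

```python
# Toy model: when the two units run concurrently, the per-block time becomes
# max() of the two unit times instead of their sum.
def block_time(pw_cycles, dw_cycles, overlapped):
    return max(pw_cycles, dw_cycles) if overlapped else pw_cycles + dw_cycles

# (PW-CONV cycles on the systolic array, DW-CONV cycles on the vector unit)
blocks = [(900, 400), (1200, 500), (700, 350)]
serial = sum(block_time(pw, dw, overlapped=False) for pw, dw in blocks)
parallel = sum(block_time(pw, dw, overlapped=True) for pw, dw in blocks)
print(f"serial: {serial} cycles, overlapped: {parallel} cycles, "
      f"speedup {serial/parallel:.2f}x")
```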
Then, we propose load balancing techniques that partition the multiple processing element tiles inside a DNN accelerator into clusters for transformer training acceleration. Traffic shaping alleviates temporal fluctuations in DRAM bandwidth demand by handling the processing element tiles within a cluster synchronously while running different clusters asynchronously. Resource sharing reduces the execution time of compute-intensive operations by executing them on the matrix units and vector units of all clusters simultaneously. Our evaluation shows that traffic shaping and resource sharing together improve performance by up to 1.27x for BERT-Large training.
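The effect of traffic shaping can be illustrated with a toy model (not the thesis's simulator): each cluster alternates a memory-heavy and a compute-heavy phase, and staggering the clusters flattens the peak DRAM bandwidth demand that lockstep execution would create.

```python
# Each cluster issues 1 unit of DRAM bandwidth during its memory phase.
# Lockstep clusters pile their memory phases on top of each other; staggered
# (asynchronous) clusters spread them out, lowering the peak demand.
def peak_demand(n_clusters, mem_phase=4, comp_phase=12, staggered=False, steps=64):
    period = mem_phase + comp_phase
    peaks = []
    for t in range(steps):
        demand = 0
        for c in range(n_clusters):
            offset = c * period // n_clusters if staggered else 0
            demand += 1 if (t + offset) % period < mem_phase else 0
        peaks.append(demand)
    return max(peaks)

print("peak BW demand, lockstep :", peak_demand(4))
print("peak BW demand, staggered:", peak_demand(4, staggered=True))
```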
1 Introduction
1.1 Accelerating Depth-wise Convolution on Edge Device
1.2 Accelerating Transformer Models in Training
1.3 Research Contributions
1.4 Outline
2 Background and Motivation
2.1 CNN background and trends
2.1.1 Various types of convolution (CONV) operations
2.1.2 Trends in CNN model architecture
2.1.3 EfficientNet: A state-of-the-art CNN model
2.2 Transformer background and trends
2.2.1 Bidirectional encoder representations from transformers (BERT)
2.2.2 Trends in training transformer models
2.3 Baseline DNN acceleration architecture
2.4 Motivation
2.4.1 Challenges of computing memory-intensive CNN layers
2.4.2 Opportunity for load balancing in BERT training
3 DNN accelerator tailored for accelerating memory-intensive operations
4 MVP: A CNN accelerator with Matrix, Vector, and Processing-near-memory units
4.1 Contribution
4.1.1 MVP organization
4.1.2 How depth-wise processing element (DWPE) operates
4.1.3 How processing-near-memory unit (PNMU) operates
4.1.4 Overlapping the operation of DW-CONV with PW-CONV
4.1.5 Considerations for designing DWIB
4.2 Evaluation
4.2.1 Experimental setup
4.2.2 Performance and energy evaluation
4.2.3 Comparing MVP with NVDLA
4.2.4 Exploring the design space of MVP architecture
4.2.5 Evaluating MVP with various SysAr configurations
4.3 Related Work
5 Load Balancing Techniques for BERT Training
5.1 Contribution
5.1.1 Tiled architecture
5.1.2 DRAM traffic shaping
5.1.3 Resource sharing
5.2 Evaluation
5.2.1 Experimental setup
5.2.2 Performance evaluation
6 Discussion
7 Conclusion
HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
With the rise of artificial intelligence in recent years, Deep Neural
Networks (DNNs) have been widely used in many domains. To achieve high
performance and energy efficiency, hardware acceleration (especially inference)
of DNNs is intensively studied both in academia and industry. However, we still
face two challenges: large DNN models and datasets, which incur frequent
off-chip memory accesses; and the training of DNNs, which is not well-explored
in recent accelerator designs. To truly provide high-throughput and
energy-efficient acceleration for the training of deep and large models, we inevitably
need to use multiple accelerators to explore the coarse-grain parallelism,
compared to the fine-grain parallelism inside a layer considered in most of the
existing architectures. This poses the key research question of how best to
organize computation and dataflow among the accelerators. In this paper, we
propose a solution HyPar to determine layer-wise parallelism for deep neural
network training with an array of DNN accelerators. HyPar partitions the
feature map tensors (input and output), the kernel tensors, the gradient
tensors, and the error tensors for the DNN accelerators. A partition
constitutes the choice of parallelism for the weighted layers. The optimization
target is to find a partition that minimizes the total communication incurred while
training a complete DNN. To solve this problem, we propose a communication
model that explains the sources and amounts of communication. Then, we use a
hierarchical layer-wise dynamic programming method to search for the partition
for each layer.
Comment: To appear in the 25th International Symposium on High-Performance Computer Architecture (HPCA 2019).
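A simplified sketch of a layer-wise dynamic program of this flavor is shown below; the two parallelism choices per layer and all cost numbers are placeholders, and HyPar's actual communication model and its hierarchical recursion over the accelerator array are richer than this.

```python
# Each weighted layer picks data parallelism ("DP") or model parallelism ("MP");
# it pays an intra-layer communication cost for its choice, plus a transition
# cost when adjacent layers need their tensors re-partitioned. The DP keeps the
# cheapest plan ending in each choice as it sweeps over the layers.
def best_partition(layers, trans_cost):
    choices = ("DP", "MP")
    best = {c: (layers[0][c], [c]) for c in choices}      # (total cost, choice sequence)
    for layer in layers[1:]:
        best = {
            c: min((best[p][0] + trans_cost[(p, c)] + layer[c], best[p][1] + [c])
                   for p in choices)
            for c in choices
        }
    return min(best.values())

# Per-layer communication costs (arbitrary units) and re-partitioning costs.
layers = [{"DP": 8, "MP": 3}, {"DP": 2, "MP": 9}, {"DP": 2, "MP": 7}, {"DP": 6, "MP": 1}]
trans_cost = {("DP", "DP"): 0, ("MP", "MP"): 0, ("DP", "MP"): 4, ("MP", "DP"): 4}
total, plan = best_partition(layers, trans_cost)
print(total, plan)   # minimum total communication and the per-layer choices
```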
Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
A Retrieval-Augmented Language Model (RALM) augments a generative language
model by retrieving context-specific knowledge from an external database. This
strategy facilitates impressive text generation quality even with smaller
models, reducing computational demands by orders of magnitude. However,
RALMs introduce unique system design challenges due to (a) the diverse workload
characteristics between LM inference and retrieval and (b) the various system
requirements and bottlenecks for different RALM configurations such as model
sizes, database sizes, and retrieval frequencies. We propose Chameleon, a
heterogeneous accelerator system that integrates both LM and retrieval
accelerators in a disaggregated architecture. The heterogeneity ensures
efficient acceleration of both LM inference and retrieval, while the
accelerator disaggregation enables the system to independently scale both types
of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype
implements retrieval accelerators on FPGAs and assigns LM inference to GPUs,
with a CPU server orchestrating these accelerators over the network. Compared
to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x
speedup and 26.2x better energy efficiency. Evaluated on various RALMs, Chameleon
exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput
compared to the hybrid CPU-GPU architecture. These promising results pave the
way for bringing accelerator heterogeneity and disaggregation into future RALM
systems.
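For orientation, here is a minimal sketch of the RALM serving flow that Chameleon disaggregates. The classes and interfaces are hypothetical stand-ins for the example; Chameleon itself runs retrieval on FPGAs and LM inference on GPUs, coordinated by a CPU server over the network.

```python
# Hypothetical stand-ins: a retrieval node (vector search over a small database)
# and an LM node, driven by a coordinator that dispatches retrieval
# asynchronously and feeds the retrieved context to generation.
from concurrent.futures import ThreadPoolExecutor

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class RetrievalAccelerator:          # stands in for an FPGA vector-search node
    def __init__(self, database):
        self.database = database     # list of (embedding, passage) pairs
    def top_k(self, query_vec, k=2):
        scored = sorted(self.database, key=lambda e: -dot(e[0], query_vec))
        return [passage for _, passage in scored[:k]]

class LMAccelerator:                 # stands in for a GPU inference node
    def generate(self, prompt):
        return f"<answer conditioned on: {prompt[:60]}...>"

def ralm_answer(pool, retriever, lm, query_vec, question):
    future = pool.submit(retriever.top_k, query_vec)          # dispatch retrieval
    context = " ".join(future.result())                       # gather passages
    return lm.generate(f"{context}\nQ: {question}\nA:")       # augmented generation

db = [([1.0, 0.0], "passage about FPGAs"), ([0.0, 1.0], "passage about GPUs"),
      ([0.7, 0.7], "passage about RALM systems")]
with ThreadPoolExecutor(max_workers=2) as pool:
    print(ralm_answer(pool, RetrievalAccelerator(db), LMAccelerator(),
                      [0.9, 0.1], "What is a RALM?"))
```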
- …