33,343 research outputs found
DNN Accelerators and Load Balancing Techniques Tailored for Accelerating Memory-Intensive Operations
Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems Major), August 2022. Advisor: Jung Ho Ahn.
Deep neural networks (DNNs) are used in various fields, such as image classification, natural language processing, and speech recognition, based on recognition accuracy that approximates that of humans. Due to the continuous development of DNNs, many accelerators have been introduced to process convolution (CONV) and general matrix multiplication (GEMM), the operations that account for most of the computational demand. However, because this line of accelerator research has focused on accelerating compute-intensive operations, the share of execution time spent on memory-intensive operations has grown relative to the past.
In convolutional neural network (CNN) inference, recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE) to reduce the computational cost of CONV. However, existing area-efficient CNN accelerators are sub-optimal for these latest models because they were mainly optimized for compute-intensive standard CONV layers, which offer abundant data reuse and can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. The latter also strongly depends on the nearby CONV layers, making effective pipelining a daunting task. Therefore, although DW-CONV and SE account for only 10% of all operations, they become memory-bandwidth bound and consume more than 60% of the processing time in systolic-array-based accelerators.
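A back-of-the-envelope arithmetic-intensity estimate makes the data-reuse gap concrete. The sketch below compares FLOPs per byte moved for a standard CONV and a DW-CONV layer; the layer shape (56x56x128, 3x3 kernels) and the single-pass traffic model are illustrative assumptions, not figures from the thesis.

```python
# Rough arithmetic intensity (FLOPs per byte) of standard vs. depth-wise CONV.
# Layer shapes and the one-touch traffic model are illustrative assumptions.

def conv_intensity(h, w, c_in, c_out, k, depthwise=False, bytes_per_elem=1):
    """FLOPs per byte, assuming inputs, weights, and outputs are touched once."""
    if depthwise:
        flops = 2 * h * w * c_in * k * k           # one k x k filter per channel
        weights = c_in * k * k
        c_out = c_in                               # channel count is preserved
    else:
        flops = 2 * h * w * c_in * c_out * k * k
        weights = c_in * c_out * k * k
    data = (h * w * c_in + weights + h * w * c_out) * bytes_per_elem
    return flops / data

# 56x56 feature map, 128 channels, 3x3 kernels, 1 byte/element (INT8-style).
std = conv_intensity(56, 56, 128, 128, 3)
dw = conv_intensity(56, 56, 128, None, 3, depthwise=True)
print(f"standard CONV: {std:.0f} FLOPs/byte, DW-CONV: {dw:.1f} FLOPs/byte")
# Roughly 970 vs. 9 FLOPs/byte: DW-CONV offers ~100x less data reuse,
# so a compute-rich systolic array stalls on memory bandwidth.
```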
During transformer training, the execution times of memory-intensive operations such as softmax, layer normalization, GeLU, context, and attention layers have grown in relative terms because conventional accelerators have improved their computational performance dramatically. In addition, with the latest trend toward longer sequence lengths, the softmax, context, and attention layers have become even more influential, as their data sizes grow quadratically with the sequence length. Thus, these layers take up to 80% of the execution time.
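The quadratic term comes from the attention score matrix: each head materializes an L x L matrix (the softmax input) and multiplies it with the values (the context computation), so its size grows with the square of the sequence length L. A minimal sketch, assuming BERT-Large's 16 heads, FP16 storage, and a hypothetical batch size of 8:

```python
# Memory footprint of the attention score (softmax input) matrices per layer.
# 16 heads matches BERT-Large; batch size and FP16 are assumptions.

def attn_score_bytes(seq_len, heads=16, batch=8, bytes_per_elem=2):
    # One (seq_len x seq_len) score matrix per head and per batch item.
    return batch * heads * seq_len * seq_len * bytes_per_elem

for L in (128, 512, 2048):
    mib = attn_score_bytes(L) / 2**20
    print(f"seq_len={L:5d}: {mib:8.1f} MiB per layer")
# Quadrupling seq_len multiplies the softmax/context data size by 16x,
# while the weight GEMM FLOPs grow only linearly in seq_len.
```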
In this thesis, we propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of the baseline systolic-array-based architecture. We suggest a specialized vector unit tailored for processing DW-CONV, including multipliers, adder trees, and multi-banked buffers, to meet the high memory bandwidth requirement. We augment the unified buffer with tiny processing elements (a processing-near-memory unit, PNMU) to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV with standard CONV and thereby maximizing the utilization of the arithmetic units. Our evaluation shows that MVP improves performance by 2.6× and reduces energy consumption by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2 with only a 9% area overhead compared to the baseline.
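As a toy illustration of why offloading DW-CONV to a separate unit helps, the sketch below compares a serialized schedule against one in which DW-CONV overlaps the neighboring point-wise CONV on the systolic array; all per-layer times are made-up numbers, not MVP measurements.

```python
# Toy model: serialized vs. overlapped execution of a PW/DW/PW CONV block.
# All per-layer times are made-up illustrative numbers (arbitrary units).

layers = [("pw1", 10), ("dw", 6), ("pw2", 10)]  # hypothetical inverted bottleneck

# Baseline systolic array: every layer runs serially on the same unit.
serial = sum(t for _, t in layers)

# MVP-style: DW-CONV runs on a dedicated vector unit, overlapped with the
# neighboring point-wise CONV, so it adds time only if it is the bottleneck.
overlapped = sum(t for name, t in layers if name.startswith("pw"))
overlapped += max(0, dict(layers)["dw"] - dict(layers)["pw2"])

print(f"serial: {serial}, overlapped: {overlapped}, "
      f"speedup: {serial / overlapped:.2f}x")
```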
Then, we propose load balancing techniques that partition the multiple processing element tiles inside a DNN accelerator into clusters to accelerate transformer training. Traffic shaping alleviates temporal fluctuations in the DRAM bandwidth by handling the processing element tiles within a cluster synchronously while running different clusters asynchronously. Resource sharing reduces the execution time of compute-intensive operations by simultaneously executing the matrix units and vector units of all clusters. Our evaluation shows that traffic shaping and resource sharing improve performance by up to 1.27× for BERT-Large training.
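A minimal simulation sketch of the traffic-shaping idea: if clusters enter their memory-intensive phase at staggered offsets instead of simultaneously, the peak DRAM demand flattens. The phase lengths and bandwidth figures below are illustrative assumptions.

```python
# Toy DRAM traffic model: clusters alternate compute (low BW) and memory
# (high BW) phases. All numbers are illustrative assumptions.

CLUSTERS, PERIOD, MEM_PHASE = 4, 8, 2   # time steps per period
LOW_BW, HIGH_BW = 1, 10                 # arbitrary bandwidth units

def peak_bw(offsets):
    """Worst-case aggregate bandwidth over one period."""
    return max(
        sum(HIGH_BW if (t - off) % PERIOD < MEM_PHASE else LOW_BW
            for off in offsets)
        for t in range(PERIOD)
    )

sync = peak_bw([0] * CLUSTERS)                              # bursts align
shaped = peak_bw([i * MEM_PHASE for i in range(CLUSTERS)])  # staggered
print(f"peak DRAM demand, synchronized: {sync}, traffic-shaped: {shaped}")
# Staggering spreads the bursts, so the required peak bandwidth drops
# (here from 40 to 13 units) without changing the total traffic.
```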
1 Introduction
1.1 Accelerating Depth-wise Convolution on Edge Device
1.2 Accelerating Transformer Models in Training
1.3 Research Contributions
1.4 Outline
2 Background and Motivation
2.1 CNN background and trends
2.1.1 Various types of convolution (CONV) operations
2.1.2 Trends in CNN model architecture
2.1.3 EfficientNet: A state-of-the-art CNN model
2.2 Transformer background and trends
2.2.1 Bidirectional encoder representations from transformers (BERT)
2.2.2 Trends in training transformer models
2.3 Baseline DNN acceleration architecture
2.4 Motivation
2.4.1 Challenges of computing memory-intensive CNN layers
2.4.2 Opportunity for load balancing in BERT training
3 DNN accelerator tailored for accelerating memory-intensive operations
4 MVP: A CNN accelerator with Matrix, Vector, and Processing-near-memory units
4.1 Contribution
4.1.1 MVP organization
4.1.2 How depth-wise processing element (DWPE) operates
4.1.3 How processing-near-memory unit (PNMU) operates
4.1.4 Overlapping the operation of DW-CONV with PW-CONV
4.1.5 Considerations for designing DWIB
4.2 Evaluation
4.2.1 Experimental setup
4.2.2 Performance and energy evaluation
4.2.3 Comparing MVP with NVDLA
4.2.4 Exploring the design space of MVP architecture
4.2.5 Evaluating MVP with various SysAr configurations
4.3 Related Work
5 Load Balancing Techniques for BERT Training
5.1 Contribution
5.1.1 Tiled architecture
5.1.2 DRAM traffic shaping
5.1.3 Resource sharing
5.2 Evaluation
5.2.1 Experimental setup
5.2.2 Performance evaluation
6 Discussion
7 Conclusion
ASCR/HEP Exascale Requirements Review Report
This draft report summarizes and details the findings, results, and
recommendations derived from the ASCR/HEP Exascale Requirements Review meeting
held in June 2015. The main conclusions are as follows. 1) Larger, more
capable computing and data facilities are needed to support HEP science goals
in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of
the demand at the 2025 timescale is at least two orders of magnitude greater
than what is currently available -- and in some cases more. 2) The growth rate
of data produced by simulations is overwhelming the current ability of both
facilities and researchers to store and analyze it. Additional resources and
new techniques for data analysis are urgently needed. 3) Data rates and volumes
from HEP experimental facilities are also straining the ability to store and
analyze these large and complex datasets. Appropriately configured
leadership-class facilities can play a transformational role in enabling
scientific discovery from these datasets. 4) A close integration of HPC
simulation and data analysis will aid greatly in interpreting results from HEP
experiments. Such an integration will minimize data movement and facilitate
interdependent workflows. 5) Long-range planning between HEP and ASCR will be
required to meet HEP's research needs. To best use ASCR HPC resources, the
experimental HEP program needs a) an established long-term plan for access to
ASCR computational and data resources, b) an ability to map workflows onto HPC
resources, c) the ability for ASCR facilities to accommodate workflows run by
collaborations that can have thousands of individual members, d) a path to
transition codes to the next-generation HPC platforms that will be available
at ASCR facilities, and e) a workforce built up and trained to develop and use
simulations and analysis in support of HEP scientific research on
next-generation systems.
Comment: 77 pages, 13 figures; draft report, subject to further revision
Diluting the Scalability Boundaries: Exploring the Use of Disaggregated Architectures for High-Level Network Data Analysis
Traditional data centers are designed with a rigid architecture of
fit-for-purpose servers that provision resources beyond the average workload in
order to deal with occasional peaks of data. Heterogeneous data centers are
pushing towards more cost-efficient architectures with better resource
provisioning. In this paper, we study the feasibility of using disaggregated
architectures for intensive data applications, in contrast to the monolithic
approach of server-oriented architectures. In particular, we tested a
proactive network analysis system in which the workload demands are highly
variable. In the context of the dReDBox disaggregated architecture, the results
show that the overhead caused by using remote memory resources is significant,
between 66% and 80%, but we also observed that memory usage in the stress case
is an order of magnitude higher than under average workloads. Therefore,
dimensioning memory for the worst case in conventional systems will result in
a notable waste of resources. Finally, we found that, for the selected use
case, parallelism is limited by memory. Therefore, using a disaggregated
architecture will allow for increased parallelism, which, at the same time,
will mitigate the overhead caused by remote memory.
Comment: 8 pages, 6 figures, 2 tables, 32 references. Pre-print. The paper
will be presented at the IEEE International Conference on High Performance
Computing and Communications in Bangkok, Thailand, 18-20 December 2017. To be
published in the conference proceedings.
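A back-of-the-envelope illustration of the provisioning argument above, using hypothetical numbers consistent with the reported order-of-magnitude gap between average and stress-case memory usage:

```python
# Worst-case vs. pooled memory provisioning. All numbers are hypothetical,
# chosen only to reflect the ~10x average/stress gap reported above.

SERVERS = 100
AVG_GB, PEAK_GB = 32, 320          # ~10x gap between average and stress case
CONCURRENT_PEAKS = 5               # servers assumed to peak at the same time

monolithic = SERVERS * PEAK_GB     # every server sized for its own worst case
pooled = (SERVERS - CONCURRENT_PEAKS) * AVG_GB + CONCURRENT_PEAKS * PEAK_GB

print(f"monolithic: {monolithic} GB, disaggregated pool: {pooled} GB, "
      f"savings: {1 - pooled / monolithic:.0%}")
# Because peaks rarely coincide, a shared pool sized for a few concurrent
# peaks covers the same workload with a fraction of the memory.
```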
- β¦