Interscience Research Network


**Invited Talks** 

Interscience Research Community

6-28-2019

## State of the Art of Deep Learning Technology and its Next Generation Architecture

Dr. Kuo-Kun Tseng Associate Professor



**Presented by: Dr. Kuo-Kun Tseng**




HARBIN INSTITUTE OF TECHNOLOGY



- Introduction
- State of the Art of Deep Learning Technology
  - Our Applications
  - Other New Applications
- Next Generation Architecture for Deep Learning Technology
  - Demand for New Architecture
  - Deep Learning with FPGA Architecture
  - Object Tracking Example
  - OpenCL on FPGA
  - Translation Tool
- Conclusion

# About the Speaker



- Kuo-Kun Tseng (email: kktseng@hit.edu.cn) was born in 1974 and received his doctoral degree in computer information and engineering from National Chiao Tung University, Taiwan, in 2008.
- He is currently a tenured associate professor at the School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen Campus), and a recipient of the Shenzhen Peacock Talent Award (B level).
- Before joining HITSZ, he worked for many years as a senior software and IC design engineer in the USA and Taiwan. Since 2004, he has been working on intelligent algorithms and architectures.
- He has more than 20 research projects and 30 patents, and has published more than 80 research articles, about half of them in SCI/ACM/IEEE journals with high reputation and impact factor. He is also an associate editor for Enterprise Information System and International Journal of Engineering Business Management, and a reviewer for many distinguished journals, such as IEEE Transactions on Neural Networks and Learning Systems, IEEE Transaction of Internet of Things, IEEE Access, IEEE Sensor, Expert System and Neural Computing.

# About Harbin Institute of Technology

- Harbin Institute of Technology (HIT) is a member of the C9 League, the union of China's top nine universities, and has three campuses: Harbin, Weihai and Shenzhen.
- Undergraduate entrance examination score: No. 1 among colleges in Guangdong Province. 2019 QS ranking: No. 9 among mainland Chinese universities.

Mainland Chinese universities in the 2019 QS World University Rankings

| No. | 2019 rank | 2018 rank | University                                    |
|-----|-----------|-----------|-----------------------------------------------|
| 1   | 17        | 25        | Tsinghua University                           |
| 2   | 30        | 38=       | Peking University                             |
| 3   | 44        | 40        | Fudan University                              |
| 4   | 59        | 62        | Shanghai Jiao Tong University                 |
| 5   | 68        | 87        | Zhejiang University                           |
| 6   | 98        | 97        | University of Science and Technology of China |
| 7   | 122       | 114=      | Nanjing University                            |
| 8   | 257       | 282       | Wuhan University                              |
| 9   | 285       | 325=      | Harbin Institute of Technology                |




## About Shenzhen

- Developed under the "reform and opening-up" policy from 1979.
- Actual population is about 20 million.
- One of the fastest-growing cities in the world.
- Ranked second on Lonely Planet's list of the top 10 cities to visit in 2019.
- A leading global technology hub, dubbed by the media the **next Silicon Valley**.



# Our Lab - Intelligent Architecture Lab

### **Deep Learning Application**

- NLP
  - English to Chinese Translation
  - Specific Domain Q&A Robot
- Signal Processing
  - ECG Abnormal Detection
  - House Price Prediction
- Graphic Processing
  - Image Semantic Segmentation
  - Visual Depth Prediction
  - Medical Image Segmentation

### Technology

- Algorithm Optimization
  - Design algorithms for deep learning applications.
- Hardware Optimization
  - Based on FPGA and other new hardware, optimize performance for deep learning algorithms.
  - For edge and cloud devices





# Modern to Classical Chinese Translation







Semi-Supervised Learning





#### Optimize for accuracy and speed

#### For unmanned driving application



## **Detection and Segmentation**





# Medical Image Segmentation











#### Satellite Map







### Extend deep learning design from single-task to multi-task networks

### End-to-end encoder-decoder models have many applications.

## PizzaGAN-Naturally Layered





https://arxiv.org/abs/1906.02839





(Source: shaoanlu/faceswap-GAN)

https://github.com/Fabsqrt/BitTigerLab/tree/master/DeepFake



# Drive with Reinforcement Learning



https://arxiv.org/pdf/1807.00412.pdf





https://github.com/STVIR/pysot/blob/master/demo/bag.avi

https://github.com/STVIR/pysot





One-Shot Detection

**RPN** feature map

# Inhibitory for Short-Term Memory



- This AI model shows that during the silent period of memory, the brain can use the short-term plasticity of synaptic connections between neurons to memorize information.
- These two forms of short-term memory last from a few seconds to a few minutes.
- Some of the information used in short-term memory may eventually be stored for a long time, but most of it disappears over time.

## Direct Speech-to-Speech Translation







Sparse text information



Many novel architectures

Many new applications

Will deep learning models become as important as programming language models?

## **Comparison for Hardware Architectures**

- CPU: insufficient energy efficiency
- GPU: high efficiency in training, but low efficiency in inference (batch size = 1)
- DSP: low cache hit rate
- ASIC: high NRE (non-recurring engineering) cost, and a large-scale application market has not yet formed
- ASIC: long development cycle, while neural networks are still evolving

- FPGA (Reconfigurable Architecture):
  - Acceptable energy consumption and performance
  - Flexible architecture
  - On-chip storage with high bandwidth
  - Short time to market

## **Demand for Low Power and High Performance Hardware**



|             | UAV                         | Video surveillance                  | Speech recognition                   |
|-------------|-----------------------------|-------------------------------------|--------------------------------------|
| Deployment  | Client                      | Edge                                | Cloud                                |
| Demand      | Real-time scene recognition | Real-time image analysis            | Lower processing delays              |
| Limitations | Limited battery capacity    | Low-cost, high-performance hardware | Higher maintenance and cooling costs |

## Problem of Current Architecture

- High Redundancy in Neural Networks
  - The VGG16 network can be compressed from 550 MB to 11.3 MB
- The limited bandwidth of BRAM and DDR in FPGA
- Different neural networks have different computational models
  - CNN: Frequent data reuse, high density
  - DNN/RNN/LSTM: No data reuse, data sparseness
- Different architectures need to adapt to different neural networks
  - With the rapid development of neural networks, the architecture should be adapted to the new algorithm.
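The contrast in data reuse between the layer types above can be made concrete by counting multiply-accumulate operations (MACs) per weight. This is a minimal sketch: the function names and the 56x56 output size are illustrative assumptions, not figures from the talk. Only the 550 MB and 11.3 MB values come from the slide.

```python
def conv_reuse(out_rows, out_cols):
    """MACs per weight in a convolution layer at batch size 1:
    each kernel weight is applied once at every output position."""
    return out_rows * out_cols


def fc_reuse(batch_size=1):
    """MACs per weight in a fully connected layer: each weight is used
    once per input sample, so there is no reuse at batch size 1."""
    return batch_size


# Hypothetical output size, chosen only for illustration.
conv_macs_per_weight = conv_reuse(out_rows=56, out_cols=56)
fc_macs_per_weight = fc_reuse(batch_size=1)

# Compression ratio implied by the VGG16 figures quoted on the slide.
compression_ratio = 550 / 11.3

print(conv_macs_per_weight, fc_macs_per_weight, round(compression_ratio, 1))
```

A weight that is fetched once and used thousands of times favours compute-dense arrays, while a weight that is used once makes the layer bound by memory bandwidth. That is one reason a single fixed architecture struggles to serve both kinds of network.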








### **Research Trend**

By 2016, the number of FPGA-based neural network accelerator papers published on IEEE Xplore had reached 69, and it has kept increasing, which illustrates the research trend in this direction.

D. S. Reay first implemented a neural network accelerator on an FPGA in 1994, but the area attracted little attention until the growing complexity of neural networks drew researchers to FPGA-based accelerators. The graph shows how the number of relevant papers published in IEEE Xplore has changed over time.



Search results shown on the slide (about 27,000 hits; titles truncated in the screenshot):

- DLAU: A scalable deep learning … (C Wang, L Gong, Q Yu, X Li, Y Xie, …), cited 98 times
- Optimizing FPGA-based accelerat… (C Zhang, P Li, G Sun, Y Guan, B Xiao, …), cited 807 times
- Deep learning with limited n… (S Gupta, A Agrawal, K Gopalakrishnan, …), cited 695 times
- A deep learning prediction proces… (Q Yu, C Wang, X Ma, X Li, …), 2015

## Level of Deep Learning Hardware Design



**Designing Accelerators for Specific Applications** 



**Designing Accelerators for Specific Algorithms** 



**Designing Accelerators for Common Features of Algorithms** 



Designing a Universal Accelerator Framework with Hardware Templates

## **Structure and Complexity of CNN**



Convolution & POOL & ReLU layers

Fully-connected layers

|                                     | CONV        | POOL     | ReLU     | FCN           |
|-------------------------------------|-------------|----------|----------|---------------|
| Computation ($10^7$ ops)            | 3E3 (99.5%) | 0.6 (0%) | 1.4 (0%) | 12.3 (0.4%)   |
| Storage (MB)                        | 113 (19.3%) | 0 (0%)   | 0 (0%)   | 471.6 (80.6%) |
| Time share, pure software           | 96.3%       | 0.0%     | 0.0%     | 3.7%          |
| Time share, after CONV acceleration | 48.7%       | 0.0%     | 0.0%     | 51.2%         |
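The last two rows of the table imply how much the convolution layers were accelerated. The following back-of-the-envelope sketch applies Amdahl's law under two assumptions that are not stated on the slide: the fully connected time is unchanged in absolute terms, and POOL and ReLU time is negligible. The function names are hypothetical.

```python
def conv_speedup_from_profile(conv_before, fc_before, conv_after, fc_after):
    """Infer the CONV speedup from the time shares before and after
    acceleration, assuming the absolute FC time stays the same."""
    # After acceleration: conv_time / fc_time == conv_after / fc_after.
    conv_time_after = fc_before * (conv_after / fc_after)
    return conv_before / conv_time_after


def overall_speedup(conv_before, fc_before, conv_speedup):
    """Amdahl's law with only the CONV share accelerated."""
    total_before = conv_before + fc_before
    total_after = conv_before / conv_speedup + fc_before
    return total_before / total_after


# Time shares (in percent) taken from the table.
s_conv = conv_speedup_from_profile(96.3, 3.7, 48.7, 51.2)
s_total = overall_speedup(96.3, 3.7, s_conv)
print(f"CONV speedup ~{s_conv:.1f}x, overall ~{s_total:.1f}x")
```

Under these assumptions the end-to-end gain is roughly half the convolution gain, because the fully connected layers become the new bottleneck once convolution is fast.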

## Hardware Acceleration for CNN

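<pre>
// Convolution over one on-chip tile; S is the stride, K1 x K2 the kernel size.
for (o = 0; o < To; o++) {            // output feature maps in the tile
  for (i = 0; i < Ti; i++) {          // input feature maps in the tile
    for (r = 0; r < Tr; r++) {        // output rows
      for (c = 0; c < Tc; c++) {      // output columns
        for (p = 0; p < K1; p++) {    // kernel rows
          for (q = 0; q < K2; q++) {  // kernel columns
            cache_output[o][r][c] +=
                cache_weights[o][i][p][q] * cache_input[i][S*r + p][S*c + q];
} } } } } }
</pre>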

### Fig. 5: Pseudo code of original on-chip computation
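The loop nest of Fig. 5 can be checked in software before it is mapped to hardware. The sketch below is a plain-Python reference under assumed conventions: the buffers are nested lists, the output size is derived from the input size, and `tile_conv` is a hypothetical name. It mirrors the six loops one for one and is not the optimized version of Fig. 6.

```python
def tile_conv(cache_input, cache_weights, S=1):
    """Direct convolution over one tile.

    cache_input:   [Ti][rows][cols]
    cache_weights: [To][Ti][K1][K2]
    Returns cache_output as [To][Tr][Tc].
    """
    To, Ti = len(cache_weights), len(cache_weights[0])
    K1, K2 = len(cache_weights[0][0]), len(cache_weights[0][0][0])
    Tr = (len(cache_input[0]) - K1) // S + 1
    Tc = (len(cache_input[0][0]) - K2) // S + 1
    cache_output = [[[0.0] * Tc for _ in range(Tr)] for _ in range(To)]
    for o in range(To):
        for i in range(Ti):
            for r in range(Tr):
                for c in range(Tc):
                    for p in range(K1):
                        for q in range(K2):
                            cache_output[o][r][c] += (
                                cache_weights[o][i][p][q]
                                * cache_input[i][S * r + p][S * c + q]
                            )
    return cache_output


# One 3x3 input map and one 2x2 kernel of ones: each output is a window sum.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
w = [[[[1, 1], [1, 1]]]]
y = tile_conv(x, w, S=1)
print(y)
```

A reference like this gives golden outputs against which an unrolled or pipelined hardware version can be compared.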



Fig. 7: Scalable accelerator architecture design



Fig. 6: Pseudo code of optimized on-chip computation



## Object Tracking FPGA Architecture







**Figure 16.** Frame sequence showing object tracking in scenes that change due to camera movement and presence of other moving objects in the scene.

### OpenCL FPGA Framework



Figure 2: OpenCL FPGA framework:(a) Top level ;(b) Compute unit (CU); (c) Processing element(PE)

# OpenCL FPGA Framework

- Processing element for Convolution:
  - A ping-pong (double-buffering) mechanism, similar to pipelining, is introduced to transfer data while computation proceeds, hiding the latency of external memory access.
  - A computing unit has 256 DSP blocks, which can perform 256 computations in parallel by sharing the storage ports across 16 rows and 16 columns.
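The latency-hiding effect of the ping-pong mechanism described above can be sketched with a simple timing model and a two-buffer loop. All names and numbers here are illustrative assumptions. In hardware the load and the compute run concurrently, whereas this software sketch only shows the buffer alternation.

```python
def serial_time(n_tiles, t_load, t_compute):
    """Single buffer: every tile is loaded, then computed."""
    return n_tiles * (t_load + t_compute)


def ping_pong_time(n_tiles, t_load, t_compute):
    """Two buffers: while one is computed on, the other is being filled.
    Only the first load and the last compute are not overlapped."""
    if n_tiles == 0:
        return 0
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute


def process_tiles(tiles, compute):
    """Alternate between a 'ping' and a 'pong' buffer."""
    results = []
    if not tiles:
        return results
    buffers = [tiles[0], None]           # prefetch the first tile
    for k in range(len(tiles)):
        cur, nxt = k % 2, (k + 1) % 2
        if k + 1 < len(tiles):
            buffers[nxt] = tiles[k + 1]  # would overlap with compute in hardware
        results.append(compute(buffers[cur]))
    return results


print(serial_time(8, 3, 5), ping_pong_time(8, 3, 5))
print(process_tiles([[1, 2], [3, 4], [5]], sum))
```

When loading is faster than computing, the external memory latency disappears from the steady state and the compute units stay busy.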



### DLAU : SLICE TECHNIQUES

- However large the input neuron data is, it can be sliced into several data subsets of the same size.
- The weight matrix is divided into slices of matching size.
- Each data slice is multiplied by its corresponding weight slice to produce a partial sum.
- These operations are repeated for each slice until all the data has been processed.

#### Require:

<pre>
Ni:        the number of the input neurons
No:        the number of the output neurons
Tile_Size: the tile size of the input data
batchsize: the batch size of the input data

for n = 0; n < batchsize; n++ do
  for k = 0; k < Ni; k += Tile_Size do
    for j = 0; j < No; j++ do
      y[n][j] = 0;
      for i = k; i < k + Tile_Size && i < Ni; i++ do
        y[n][j] += w[i][j] * x[n][i]
        if i == Ni - 1 then
          y[n][j] = f(y[n][j]);
        end if
      end for
    end for
  end for
end for
</pre>
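The slicing idea can be tried out in plain Python. This is a minimal sketch, not the DLAU implementation: `tiled_layer` and `dense_layer` are hypothetical names and the sigmoid activation is an assumed choice for `f`. In this sketch the accumulator is cleared once per output neuron and the activation is applied after the last tile, so the partial sums of every tile contribute to the result. That placement is an assumption of the sketch.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def tiled_layer(x, w, tile_size, f=sigmoid):
    """y = f(x @ w), computed one input tile at a time.

    x: [batchsize][Ni], w: [Ni][No].
    """
    batchsize, Ni, No = len(x), len(w), len(w[0])
    y = [[0.0] * No for _ in range(batchsize)]
    for n in range(batchsize):
        for k in range(0, Ni, tile_size):           # one tile of inputs
            for j in range(No):
                for i in range(k, min(k + tile_size, Ni)):
                    y[n][j] += w[i][j] * x[n][i]    # partial sum
        for j in range(No):
            y[n][j] = f(y[n][j])                    # activation after last tile
    return y


def dense_layer(x, w, f=sigmoid):
    """Untiled reference for comparison."""
    return [
        [f(sum(w[i][j] * row[i] for i in range(len(w)))) for j in range(len(w[0]))]
        for row in x
    ]


x = [[1.0, 2.0, 3.0, 4.0, 5.0]]
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8], [0.9, 1.0]]
y_tiled = tiled_layer(x, w, tile_size=2)
y_dense = dense_layer(x, w)
print(y_tiled, y_dense)
```

The tile size does not change the result, only how much data must be resident on chip at once. This is what lets a fixed amount of on-chip memory serve layers of any width.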





#### PSAU for Accumulation





#### **AFAU for Activation Function**




## DeepPhi : Architecture

- Compiler + framework replaces OpenCL
- Algorithm developers do not need to understand the hardware architecture
- Generates instructions instead of RTL code
- Compiles in one minute
- Better performance and lower energy consumption



## **DeepPhi: Workflow and Processor Element**

### Workflow









### DeepPhi: RNN/LSTM Architecture



# Coarse-Grained Reconfigurable Architecture (CGRA)

- The energy efficiency of CGRA computing can reach 1000 times that of CPU architectures, 100-1000 times that of GPU architectures, and more than 100 times that of FPGA architectures.
- Compared with an NPU, a CGRA can improve performance more than 10 times.
- A CGRA is driven by configuration, so its execution efficiency can be comparable to that of an ASIC while its flexibility is much better.



Reconfigurable Array







# Other CNN-TO-FPGA Tools

fpgaConvNet, ALAMO and Snowflake are mainly concerned with the feature extractor part of a CNN.

fpgaConvNet and Snowflake support irregular CNN building blocks such as the Inception module, and fpgaConvNet also supports the dense module. Haddoc2 requires all weights to be stored on chip, so the size of the supported model is limited by the storage resources of the target device.



DeepBurning and FP-DNN support recurrent neural networks (RNN) and long short-term memory (LSTM) networks.





In a paper in Physical Review X, MIT researchers describe a new photonic accelerator that uses optical components and optical signal processing technology to reduce chip size, which will allow the chip to **scale to neural networks several orders of magnitude larger than electrical chips can handle**.



# Near Future Architecture

#### Support Next Generation Network

- Increase in model depth
- Increase in inference workload
- Introduction of new components (e.g. enhancing the CNN layer by introducing complex blocks)

### Support Compression, Sparse

- Post-training
- Training-time methods
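As one example of a post-training compression method, the following sketch applies magnitude pruning to a weight matrix. It is an assumption-laden illustration rather than a method named in the talk: `magnitude_prune` is a hypothetical function and the weights are made up.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights.

    weights:  list of rows (lists of floats)
    sparsity: target fraction of weights to remove, between 0 and 1.
    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be removed.
    """
    magnitudes = sorted(abs(v) for row in weights for v in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in weights]


w = [[0.1, -0.8, 0.05], [0.6, -0.02, 0.3]]
pruned = magnitude_prune(w, sparsity=0.5)
print(pruned)
```

The resulting zeros save storage and computation only if the hardware can skip them, which is why sparse support is listed as an architectural requirement.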

### Support Low Precision

Angel-Eye, ALAMO, DnnWeaver, DeepBurning and AutoCodeGen support dynamic quantization with a fixed, uniform word length and different scaling across layers.
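Dynamic quantization with a uniform word length and per-layer scaling can be sketched as a dynamic fixed-point scheme. Every layer uses the same number of bits, but the position of the binary point is chosen per layer from its value range. The function names and sample values below are hypothetical, and this is not the scheme of any specific toolflow listed above.

```python
import math


def choose_frac_bits(values, word_length):
    """Pick the number of fractional bits for one layer so that the
    largest magnitude still fits in a signed word of `word_length` bits."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return word_length - 1
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return word_length - 1 - int_bits      # one bit is kept for the sign


def quantize(values, word_length, frac_bits):
    """Round to the nearest representable value and saturate."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (word_length - 1)), 2 ** (word_length - 1) - 1
    return [max(lo, min(hi, round(v * scale))) / scale for v in values]


layer_a = [0.5, -1.75, 3.2]      # wide range: fewer fractional bits
layer_b = [0.01, -0.02]          # narrow range: more fractional bits
frac_a = choose_frac_bits(layer_a, word_length=8)
frac_b = choose_frac_bits(layer_b, word_length=8)
print(frac_a, quantize(layer_a, 8, frac_a))
print(frac_b, quantize(layer_b, 8, frac_b))
```

Keeping the word length uniform lets the datapath stay fixed, while the per-layer scale recovers much of the accuracy that a single global scale would lose.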



#### Integrate with popular frameworks

Popular frameworks such as Google's TensorFlow are of interest to academia and industry because of the variety of machine learning models they support and the flexibility to deploy across different heterogeneous systems.

#### Support Hardware Unit

For example, Google deploys Tensor Processing Unit (TPU) ASICs in its servers for the training and inference stages of machine learning models.

### Hardware-Network Codesign

- By using hardware performance and power consumption as metrics during training, the hardware-tunable parameters, model weights and topology are modified together, jointly optimizing application-level accuracy and the required inference time and power consumption.
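One common way to write such a joint objective is as a weighted sum. The formulation below is an illustrative assumption, not taken from the talk: $w$ denotes the model weights, $\alpha$ the network topology, $h$ the hardware-tunable parameters, $\mathcal{L}_{\text{task}}$ the training loss, $T$ the inference time, $P$ the power consumption, and $\lambda_T, \lambda_P \ge 0$ hypothetical trade-off coefficients.

$$
\min_{w,\,\alpha,\,h} \;\; \mathcal{L}_{\text{task}}(w, \alpha) \;+\; \lambda_T \, T(\alpha, h) \;+\; \lambda_P \, P(\alpha, h)
$$

Setting $\lambda_T = \lambda_P = 0$ recovers ordinary accuracy-only training, and larger coefficients push the search toward faster, lower-power designs at some cost in accuracy.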





Artificial intelligence with deep learning architecture is still in its infancy, but it has already brought a lot of help to mankind.