Interscience Research Network

Invited Talks

Interscience Research Community

6-28-2019

State of the Art of Deep Learning Technology and its Next
Generation Architecture
Dr. Kuo-Kun Tseng Associate Professor

Part of the Computer Engineering Commons


Outline
• Introduction
• State of the Art of Deep Learning Technology
• Our Applications
• Other New Applications
• Next Generation Architecture for Deep Learning Technology
  • Demand for New Architecture
  • Deep Learning with FPGA Architecture
  • Object Tracking Example
  • OpenCL on FPGA
  • Translation Tool
• Conclusion

About the Speaker
• Kuo-Kun Tseng (email: kktseng@hit.edu.cn) was born in 1974 and received his doctoral
degree in computer information and engineering from National Chiao Tung University,
Taiwan, in 2008.
• He is currently a tenured associate professor at the School of Computer Science and
Technology, Harbin Institute of Technology (Shenzhen Campus), and has received the
Shenzhen Peacock talent award (B level).
• Before joining HITSZ, he worked for many years as a senior software and IC design
engineer in the USA and Taiwan. Since 2004, he has been researching intelligent
algorithms and architectures.
• He has more than 20 research projects and 30 patents, and has published more than
80 research articles, about half of them in SCI/ACM/IEEE journals with high reputation
and impact factor. He is an associate editor for Enterprise Information System and
International Journal of Engineering Business Management, and a reviewer for many
distinguished journals, such as IEEE Transactions on Neural Networks and Learning
Systems, IEEE Transaction of Internet of Things, IEEE Access, IEEE Sensor, Expert
System, and Neural Computing.

About Harbin Institute of Technology
• Harbin Institute of Technology (HIT) is a member of the C9 union of nine top
universities in China, with three campuses: Harbin, Weihai, and Shenzhen.
• Its undergraduate entrance examination score ranks No. 1 among colleges in
Guangdong Province, and its 2019 QS ranking is No. 9.

About Shenzhen
• Developed from the "reform and opening-up" policy in 1979.
• The actual population is about 20 million.
• Shenzhen has been one of the fastest-growing cities in the world.
• Ranked second on Lonely Planet's list of top 10 cities to visit in 2019.
• The city is a leading global technology hub, dubbed by the media the next
Silicon Valley.

Our Lab - Intelligent Architecture Lab
Deep Learning Application
• NLP
  • English to Chinese Translation
  • Specific Domain Q&A Robot
• Signal Processing
  • ECG Abnormal Detection
  • House Price Prediction
• Graphic Processing
  • Image Semantic Segmentation
  • Visual Depth Prediction
  • Medical Image Segmentation

Technology
• Algorithm Optimization
  • Design algorithms for deep learning applications.
• Hardware Optimization
  • Based on FPGA and other new hardware, optimize performance for deep
    learning algorithms.
  • For edge and cloud devices

ECG/PPG Classification

PPG


Modern to Classical Chinese Translation
• For Learning Classical Chinese
• Mutual Translation
• Small Training Data

Visual Depth Prediction

Semi-Supervised Learning

Semantic Segmentation

[Figure: encoder-decoder segmentation networks. Recoverable labels: Encoder, Decoder, Conv, Context Pooling, Concat, ASPP/LargeFOV]

Optimize for accuracy and speed

For autonomous driving applications

Detection and Segmentation


Medical Image Segmentation
Use case introduction

[Figure: encoder-decoder network. Input -> feature extraction blocks (Conv, Pool, downsampling) -> upsampling blocks with feature fusion operations (OP) -> Output]

• FCN, Unet, ResUnet are all combinations of conv+bn+relu.
• Feature fusion operation (OP): Add/Concatenate
  • FCN: add operation, the number of feature map channels is unchanged
  • U-net: concatenate operation, the number of feature map channels increases
• Residual design: it does not appear in FCN and U-net networks. OP is an add
  operation in ResUnet and a concatenate operation in DenseNet.
• DenseUnet has intensive feature fusion operations within each block.
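
The two fusion operations on this slide (add and concatenate) differ mainly in what happens to the channel count. The following is a minimal sketch, not the speaker's code: it assumes a feature map stored as nested lists in [channel][row][col] order, and the helper names `make_map`, `fuse_add` and `fuse_concat` are hypothetical.

```python
# Assumed layout: feature_map[channel][row][col], plain nested lists.

def make_map(channels, height, width, value):
    """Build a constant feature map (illustrative helper, not from the talk)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

def fuse_add(skip, up):
    """Add-style fusion (FCN, ResUnet): element-wise sum, channels unchanged."""
    if len(skip) != len(up):
        raise ValueError("add fusion needs equal channel counts")
    return [
        [[a + b for a, b in zip(row_s, row_u)] for row_s, row_u in zip(ch_s, ch_u)]
        for ch_s, ch_u in zip(skip, up)
    ]

def fuse_concat(skip, up):
    """Concatenate-style fusion (U-net, DenseNet): stack along the channel axis."""
    return skip + up

skip = make_map(channels=4, height=2, width=2, value=1.0)
up = make_map(channels=4, height=2, width=2, value=0.5)

added = fuse_add(skip, up)
joined = fuse_concat(skip, up)
print(len(added), len(joined))  # prints: 4 8
```

Concatenation doubles the channels that the next convolution must consume, which is one reason the choice of OP affects both accuracy and compute cost.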


House Price Estimation
Satellite Map

Residential Appearance


Short Comment
Design deep learning networks from single-task to multi-task.

End-to-end encoder-decoder models have great applications.

PizzaGAN-Naturally Layered

https://arxiv.org/abs/1906.02839

DeepFakes

(Source: shaoanlu/faceswap-GAN)
https://github.com/Fabsqrt/BitTigerLab/tree/master/DeepFake

Drive with Reinforcement Learning

https://arxiv.org/pdf/1807.00412.pdf


Siamese-RPN - Object Tracking

https://github.com/STVIR/pysot/blob/master/demo/bag.avi

https://github.com/STVIR/pysot

Siamese-RPN - Object Tracking

One-Shot Detection

RPN feature map
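
Siamese trackers locate the target by cross-correlating template features with search-region features and reading the peak of the response map. The sketch below illustrates only that correlation step on a single-channel 2-D grid; the region proposal heads are omitted, and the function names `cross_correlate` and `peak` are hypothetical rather than taken from pysot.

```python
def cross_correlate(search, template):
    """Slide the template over the search grid and score each offset."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            score = sum(
                search[y + dy][x + dx] * template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            row.append(score)
        response.append(row)
    return response

def peak(response):
    """Return the (row, col) offset with the highest response."""
    best = max((v, y, x) for y, row in enumerate(response) for x, v in enumerate(row))
    return best[1], best[2]

template = [[1, 2],
            [3, 4]]
search = [[0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 1, 2, 0, 0],
          [0, 3, 4, 0, 0],
          [0, 0, 0, 0, 0]]

response = cross_correlate(search, template)
print(peak(response))  # prints: (2, 1)
```

Plain correlation favours bright regions, so real trackers correlate learned feature maps rather than raw pixels.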

Inhibitory for Short-term Memory
• This AI model shows that during the silent period of memory, the brain can
use the short-term plasticity of synaptic connections between neurons to
memorize information.
• These two forms of short-term memory last from a few seconds to a few
minutes. Some of the information used in short-term memory may eventually be
stored for a long time, but most of it will disappear over time.

Direct Speech-to-Speech Translation

https://arxiv.org/abs/1904.06037

Clinical BERT - Readmission Prediction

Sparse text
information


Short Comment
Many novel architectures

Many new applications

Will the deep learning model become as important as the programming language
model?

Comparison for Hardware Architectures
• CPU: Insufficient energy efficiency
• GPU: High efficiency in training, but low efficiency in inference (batch
size = 1)
• DSP: Low cache hit rate
• ASIC has high NRE: a large-scale application market has not yet formed
• ASIC has a long investment period while neural networks are still developing
• FPGA (Reconfigurable Architecture):
  • Acceptable energy consumption and performance
  • On-chip storage with high bandwidth
  • Flexible architecture
  • Short market cycle

Demand for Low Power and High Performance Hardware

• UAV (Client)
  • Demand: real-time scene recognition
  • Limitations: limited battery capacity
• Video surveillance (Edge)
  • Demand: real-time image analysis
  • Limitations: low cost and high performance hardware
• Speech recognition (Cloud)
  • Demand: lower processing delays
  • Limitations: higher maintenance and cooling costs

Problems of Current Architecture
• High Redundancy in Neural Networks
  • VGG16 network can be compressed from 550 MB to 11.3 MB
• The limited bandwidth of BRAM and DDR in FPGA
• Different neural networks have different computational models
  • CNN: Frequent data reuse, high density
  • DNN/RNN/LSTM: No data reuse, data sparseness
• Different architectures need to adapt to different neural networks
  • With the rapid development of neural networks, the architecture should be
    adapted to the new algorithm.
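
To put the redundancy figure in perspective, the VGG16 numbers quoted above correspond to a compression ratio of roughly 49x. The sketch below computes that ratio and, as an assumed illustration only, shows magnitude pruning on a toy weight list. The slide does not say which compression method produced the 11.3 MB model, and `prune_by_magnitude` is a hypothetical helper.

```python
def compression_ratio(original_mb, compressed_mb):
    """Ratio between the original and the compressed model size."""
    return original_mb / compressed_mb

def prune_by_magnitude(weights, keep_fraction):
    """Zero every weight below the magnitude threshold (toy illustration).

    Ties at the threshold are all kept, so slightly more than
    keep_fraction of the weights may survive.
    """
    keep = max(1, round(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

ratio = compression_ratio(550.0, 11.3)
print(f"VGG16 compression ratio: {ratio:.1f}x")  # prints: VGG16 compression ratio: 48.7x

pruned = prune_by_magnitude([0.9, -0.05, 0.4, 0.01, -0.7, 0.02], keep_fraction=0.5)
print(pruned)  # prints: [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The zeros left by pruning are what make the data sparse, which ties into the point above that sparse workloads need a different hardware dataflow than dense CNN layers.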


Development of FPGA CNN


Research Trend
By 2016, the number of FPGA-based neural network accelerators published on IEEE
Xplore had reached 69, and it has kept increasing. This illustrates the research
trend in this direction.

Levels of Deep Learning Hardware Design
Designing Accelerators for Specific Applications

Designing Accelerators for Specific Algorithms

Designing Accelerators for Common Features of Algorithms

Designing a Universal Accelerator Framework with Hardware Templates

Structure and Complexity of CNN


Hardware Acceleration for CNN

Object Tracking FPGA Architecture
[Figure: processing pipeline with five stages]
A. Compute Histogram
B. Sum of Absolute Difference
C. Histogram Variance
D. Minimum Summation
E. Compute New Position
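
As a rough software analogue of the stages named on this slide, the sketch below tracks a patch by comparing grey-level histograms with a sum of absolute differences (SAD) and moving to the candidate with the minimum cost. It is an assumption-laden illustration rather than the speaker's FPGA design: the histogram variance stage is omitted, and the bin count, search radius and all function names are hypothetical.

```python
def histogram(patch, bins=4, max_value=256):
    """Count pixels per grey-level bin."""
    counts = [0] * bins
    for row in patch:
        for pixel in row:
            counts[pixel * bins // max_value] += 1
    return counts

def sad(h1, h2):
    """Sum of absolute differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def crop(frame, top, left, size):
    """Cut a size x size window out of the frame."""
    return [row[left:left + size] for row in frame[top:top + size]]

def track(frame, target_hist, prev_pos, size, radius=1):
    """Search around prev_pos and return the window position with minimum SAD."""
    best_pos, best_cost = prev_pos, None
    py, px = prev_pos
    for y in range(max(0, py - radius), min(len(frame) - size, py + radius) + 1):
        for x in range(max(0, px - radius), min(len(frame[0]) - size, px + radius) + 1):
            cost = sad(histogram(crop(frame, y, x, size)), target_hist)
            if best_cost is None or cost < best_cost:
                best_pos, best_cost = (y, x), cost
    return best_pos, best_cost

def make_frame(obj_top, obj_left, size=5, obj=2):
    """Dark 5x5 frame with a bright 2x2 object (synthetic test data)."""
    frame = [[10] * size for _ in range(size)]
    for y in range(obj_top, obj_top + obj):
        for x in range(obj_left, obj_left + obj):
            frame[y][x] = 200
    return frame

frame1 = make_frame(1, 1)
frame2 = make_frame(2, 2)  # the object moved one pixel down and right
target_hist = histogram(crop(frame1, 1, 1, 2))
new_pos, cost = track(frame2, target_hist, prev_pos=(1, 1), size=2)
print(new_pos, cost)  # prints: (2, 2) 0
```

Every candidate window is scored independently, which is why this kind of search maps well onto parallel hardware.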

Object Tracking Result


OpenCL FPGA Framework


OpenCL FPGA Framework
• Processing element for convolution:
  • A ping-pong ("table tennis") buffering mechanism, similar to pipelining,
    overlaps data transfer with computation to hide the latency of external
    memory access.
  • A computing unit has 256 DSP blocks, which can perform 256 computations in
    parallel by reusing the storage ports of 16 rows and 16 columns.
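
The ping-pong (table tennis) mechanism can be pictured as two buffers that swap roles: while one is being computed on, the other is being filled from external memory. The sketch below simulates that schedule sequentially in software and compares cycle counts with and without overlap. It assumes a fixed load and compute cost per tile, and the function names are hypothetical.

```python
def pingpong_sum(tiles):
    """Sum each tile while alternating between two buffers.

    In hardware the load of tile i+1 overlaps the compute of tile i;
    here the overlap is only recorded in the returned schedule.
    """
    results, schedule = [], []
    if not tiles:
        return results, schedule
    buffers = [list(tiles[0]), None]  # prefetch the first tile
    for i in range(len(tiles)):
        cur, nxt = i % 2, 1 - (i % 2)
        step = {"compute": i, "load": None}
        if i + 1 < len(tiles):
            buffers[nxt] = list(tiles[i + 1])  # fill the idle buffer
            step["load"] = i + 1
        results.append(sum(buffers[cur]))
        schedule.append(step)
    return results, schedule

def cycles(n_tiles, load, compute):
    """Total cycles without overlap and with ping-pong overlap."""
    serial = n_tiles * (load + compute)
    overlapped = load + (n_tiles - 1) * max(load, compute) + compute
    return serial, overlapped

results, schedule = pingpong_sum([[1, 2], [3, 4], [5, 6]])
print(results)           # prints: [3, 7, 11]
print(cycles(4, 3, 5))   # prints: (32, 23)
```

When compute time per tile exceeds load time, as in the example, the memory latency is hidden after the first prefetch. The 256 parallel computations on the slide match a 16 x 16 arrangement (16 x 16 = 256).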

DLAU: SLICE TECHNIQUES
• However large the input neuron data is, it can be sliced into several data
subsets of the same size.
• The weight matrix is also divided into slices of the same size, matching the
size of the data slices.
• Each slice is multiplied with its corresponding weight slice to get a
partial sum.
• These operations are repeated for each slice until all the data has been
processed.
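
The slicing steps above amount to a blocked matrix-vector product in which each slice contributes a partial sum. A minimal sketch follows, assuming zero-padding is used to make the last slice the same size as the others. The padding choice and the function names are assumptions, not details from the slide.

```python
def slice_list(values, size):
    """Split values into equal-size slices, zero-padding the last one."""
    padded = list(values) + [0.0] * (-len(values) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]

def sliced_matvec(weights, inputs, size):
    """Compute weights @ inputs one input slice at a time."""
    input_slices = slice_list(inputs, size)
    weight_slices = [slice_list(row, size) for row in weights]
    outputs = [0.0] * len(weights)
    for s, x_slice in enumerate(input_slices):
        for r in range(len(weights)):
            # Partial sum of output neuron r for slice s.
            partial = sum(w * x for w, x in zip(weight_slices[r][s], x_slice))
            outputs[r] += partial
    return outputs

weights = [[1, 2, 3, 4, 5],
           [0, 1, 0, 1, 0]]
inputs = [1, 1, 2, 2, 3]
print(sliced_matvec(weights, inputs, size=2))  # prints: [32.0, 3.0]
```

Because every slice has the same size, the hardware only needs one fixed-size multiplier array regardless of the layer width.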

TILE TECHNIQUES
• TMMU for tile and weight sum
• PSAU for Accumulation
• AFAU for Activation Function
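
The three units can be read as a pipeline: per-tile weighted sums, accumulation of the partial sums, then the activation function. The sketch below models each unit as a plain function. The sigmoid activation and the function bodies are assumptions made for illustration, not the hardware design.

```python
import math

def tmmu(weight_tile, input_tile):
    """Tile stage: one partial weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(row, input_tile)) for row in weight_tile]

def psau(accumulator, partial_sums):
    """Accumulation stage: add the new partial sums to the running totals."""
    return [a + p for a, p in zip(accumulator, partial_sums)]

def afau(sums):
    """Activation stage: sigmoid is assumed here."""
    return [1.0 / (1.0 + math.exp(-s)) for s in sums]

def layer(weight_tiles, input_tiles):
    """Run all tiles through the three stages in order."""
    acc = [0.0] * len(weight_tiles[0])
    for w_tile, x_tile in zip(weight_tiles, input_tiles):
        acc = psau(acc, tmmu(w_tile, x_tile))
    return afau(acc)

weight_tiles = [
    [[1.0, -1.0], [0.0, 2.0]],   # tile 0: weights of 2 neurons for inputs 0..1
    [[0.5, 0.5], [-1.0, 0.0]],   # tile 1: weights of 2 neurons for inputs 2..3
]
input_tiles = [[1.0, 2.0], [2.0, 0.0]]
outputs = layer(weight_tiles, input_tiles)
print(outputs)
```

In this example the accumulated sums are 0 and 2, so the first output is exactly 0.5.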


DeepPhi : Architecture
• Compiler + Framework replaces OpenCL
• Algorithm developers do not need to understand hardware architecture
• Generate instructions instead of RTL code
• Compile in one minute
• Better performance and lower energy consumption

DeepPhi: Workflow and Processor Element

Workflow

Processor Element

Automatic compilation

Hardware acceleration

DeepPhi: RNN/LSTM Architecture

Coarse-Grained Reconfigurable Architecture (CGRA)
• The computing energy efficiency of CGRA can reach 1000 times that of CPU
architectures, 100-1000 times that of GPU architectures, and more than 100
times that of FPGA architectures.
• Compared with NPU, CGRA can improve performance by more than 10 times.
• CGRA is configuration-based: its execution efficiency can be comparable to
ASIC, while its flexibility is much better than ASIC.

Reconfigurable Array


NVIDIA TensorRT

Other CNN-TO-FPGA Tools
• fpgaConvNet, ALAMO and Snowflake are mainly concerned with the feature
extractor part of CNN.
• Haddoc2 requires all weights to be stored on the chip, so the size of the
supported model is limited by the storage resources of the target device.
• The Inception module (fpgaConvNet, Snowflake) and the dense module
(fpgaConvNet) are supported as irregular CNN building modules, among other
modules.
• DeepBurning and FP-DNN support recurrent neural networks (RNN) and long
short-term memory (LSTM) networks.

New Photonic AI Chips

In a paper in Physical Review X, MIT researchers describe a new photonic
accelerator that uses optical components and optical signal processing
technology to reduce chip size, which will allow the chip to scale to neural
networks several orders of magnitude larger than electronic chips can support.

Near Future Architecture

Support Next Generation Network
• Increase in model depth
• Increase in inference workload
• Introduction of new components (e.g. enhancing the CNN layer by introducing
complex blocks)

Support Compression and Sparsity
• Post-training methods
• Training-time methods

Support Low Precision
• Angel-Eye, ALAMO, DnnWeaver, DeepBurning and AutoCodeGen support dynamic
quantization with a fixed, uniform word length and different scaling across
layers

Integrate with Popular Frameworks
• Frameworks such as Google's TensorFlow are of interest to academia and
industry because of the variety of machine learning models supported and the
flexibility to deploy across different heterogeneous systems.

Support Hardware Unit
• For example, Google deploys the Tensor Processing Unit (TPU) ASIC in its
servers for the training and inference stages of machine learning models.

Hardware-Network Codesign
• By taking hardware performance and power consumption as metrics in the
training phase, hardware-adjustable parameters, model weights and topology are
modified together in the optimization process to jointly optimize
application-level accuracy and the required inference execution time and
power consumption.
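
The quantization scheme mentioned on this slide (a fixed, uniform word length with different scaling across layers) can be sketched as fixed-point rounding in which each layer picks its own number of fractional bits. The code below is an illustrative assumption and not the method of any of the named tools. The 8-bit word length, the layer names and the helper functions are hypothetical.

```python
import math

def choose_frac_bits(values, word_length):
    """Pick fractional bits so the largest magnitude fits (one sign bit)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return word_length - 1
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return word_length - 1 - int_bits

def quantize(values, word_length, frac_bits):
    """Round to signed fixed point, clip to the word range, and dequantize."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (word_length - 1)), 2 ** (word_length - 1) - 1
    codes = [min(hi, max(lo, round(v * scale))) for v in values]
    return [c / scale for c in codes]

layers = {
    "conv1": [0.9, -0.31, 0.05],   # small weights: more fractional bits
    "fc": [12.5, -3.2, 0.7],       # large weights: more integer bits
}
word_length = 8  # same word length for every layer
for name, weights in layers.items():
    frac_bits = choose_frac_bits(weights, word_length)
    print(name, frac_bits, quantize(weights, word_length, frac_bits))
```

With these values the sketch selects 7 fractional bits for `conv1` and 3 for `fc`, so both layers use the same 8-bit datapath but a different scale.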


Conclusion
• Programming Language -> Deep Learning Model -> ??? Model
• State Machine -> Neural Machine -> ??? Machine
• CPU -> Parallel & Reconfigurable Hardware -> ?? Hardware

Artificial intelligence with deep learning architecture is still in its
infancy, but it has already brought a lot of help to mankind.


