













November 2, 2018 | GT CRNCH Summit 2018 | Tushar Krishna | Georgia Institute of Technology
Deep Learning Applications
“AI is the new electricity” – Andrew Ng
Object Detection
Speech Recognition
Image Segmentation
Medical Imaging
Games
Text to Speech
Recommendations
What is a Deep Neural Network?























Each synapse has a weight
for neuron activation
Deep Learning Landscape
Training Inference






















































Challenges in Designing and Deploying AI















































Template for Continuous Learning















at one task Learn multiple tasks
Continuous
Neuro-Evolutionary (NE) Algorithm

































Neural Network (NN) expressed as a graph
Gene: Vertex or edge in the graph
Genome: Collection of all genes (i.e., a NN)
[1] Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99-127.
Fitness
Create parent pool










NeuroEvolution of Augmenting Topologies (NEAT) [1]
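The genome-as-graph idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names (`NodeGene`, `ConnGene`, `Genome`), the mutation probability, and the simplified crossover rule are hypothetical and are not taken from NEAT's reference implementation or from GeneSys.

```python
import random
from dataclasses import dataclass, field


@dataclass
class NodeGene:          # a vertex in the NN graph (hypothetical layout)
    node_id: int
    bias: float = 0.0


@dataclass
class ConnGene:          # an edge in the NN graph (hypothetical layout)
    src: int
    dst: int
    weight: float
    enabled: bool = True


@dataclass
class Genome:            # collection of all genes, i.e. one NN
    nodes: dict = field(default_factory=dict)   # node_id -> NodeGene
    conns: dict = field(default_factory=dict)   # (src, dst) -> ConnGene
    fitness: float = 0.0


def copy_genome(g):
    """Deep-copy the genes so that children never alias their parents."""
    return Genome(
        nodes={k: NodeGene(v.node_id, v.bias) for k, v in g.nodes.items()},
        conns={k: ConnGene(v.src, v.dst, v.weight, v.enabled)
               for k, v in g.conns.items()},
    )


def mutate(genome, rng, weight_sigma=0.5, add_node_prob=0.2):
    """Perturb all weights; sometimes split one connection with a new node."""
    child = copy_genome(genome)
    for conn in child.conns.values():
        conn.weight += rng.gauss(0.0, weight_sigma)
    if child.conns and rng.random() < add_node_prob:
        old = rng.choice(list(child.conns.values()))
        old.enabled = False                      # the split edge is disabled
        new_id = max(child.nodes) + 1
        child.nodes[new_id] = NodeGene(new_id)
        child.conns[(old.src, new_id)] = ConnGene(old.src, new_id, 1.0)
        child.conns[(new_id, old.dst)] = ConnGene(new_id, old.dst, old.weight)
    return child


def crossover(fitter, other, rng):
    """Shared genes are picked at random; all others come from the fitter parent."""
    child = copy_genome(fitter)
    for key in child.conns:
        if key in other.conns and rng.random() < 0.5:
            child.conns[key].weight = other.conns[key].weight
    return child
```

A parent pool would be formed by ranking genomes on `fitness` and applying `crossover` and `mutate` to the survivors; that selection loop is omitted here.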
Properties of NE algorithms






















NE is viable for continuous learning
GeneSys SoC






























Genome 2 Genome 1 










































Gene encoding (fields: Gene ID, Attribute 1, Attribute 2, Attribute 3, Reserved)
Node gene: node number, bias, response, activation, aggregation; remaining fields reserved
Connection gene: source node, destination node, weight, enabled; remaining fields reserved
Ananda Samajdar, Parth Mannan, Kartikay Garg, and Tushar Krishna, "GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware," MICRO 2018
GeneSys ASIC: Runtime and Energy































































Why do we need DNN accelerators?
• Millions of parameters (i.e., weights)
• Billions of computations
• Heavy data movement







Need lots of parallel compute: this makes CPUs inefficient
Need to reduce energy: this makes GPUs inefficient
Spatial (or Dataflow) Accelerators
• Millions of parameters (i.e., weights)
• Billions of computations
• Heavy data movement
Spread computations across hundreds of ALUs
Reuse data within the array via direct communication
Examples: MIT Eyeriss, Google TPU, …
Memory Hierarchy
ALU ALU ALU ALU
ALU ALU ALU ALU
ALU ALU ALU ALU







High-Dimensional Compute → HW







































• 7D Computation Space
• R * S * X * Y * C * K * N
• 4D Operand / Result Spaces –
• Weights – R * S * C * K
• Inputs – X * Y * C * N
• Outputs – X’ * Y’ * K * N
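To make the seven loop dimensions concrete, here is a naive convolution written as a seven-deep loop nest in Python. It is a sketch under assumptions the slide does not state: stride 1, no padding (so X' = X - S + 1 and Y' = Y - R + 1), R paired with Y and S paired with X, and nested Python lists as tensors. The function name `conv7d` is hypothetical.

```python
def conv7d(inputs, weights, N, K, C, Y, X, R, S):
    """Naive convolution; returns (outputs, number of multiply-accumulates).

    inputs  is indexed [n][c][y][x]   (4D: N * C * Y * X)
    weights is indexed [k][c][r][s]   (4D: K * C * R * S)
    outputs is indexed [n][k][y][x]   (4D: N * K * Y' * X')
    Assumes stride 1 and no padding.
    """
    Y_out, X_out = Y - R + 1, X - S + 1
    out = [[[[0.0] * X_out for _ in range(Y_out)]
            for _ in range(K)] for _ in range(N)]
    macs = 0
    for n in range(N):                      # batch
        for k in range(K):                  # output channels
            for c in range(C):              # input channels
                for y in range(Y_out):      # output rows
                    for x in range(X_out):  # output columns
                        for r in range(R):      # filter rows
                            for s in range(S):  # filter columns
                                out[n][k][y][x] += (
                                    weights[k][c][r][s] * inputs[n][c][y + r][x + s]
                                )
                                macs += 1
    return out, macs
```

Any reordering, tiling, or parallelization of these seven loops computes the same result, which is where the large mapping space comes from.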





Millions of non-trivial mappings
How do we explore all possible dataflows?
Energy Benefits = f(Dimension Sizes, Hardware Resources, Dataflow)
Accelerator HW (ASIC or FPGA or HPC System)






























MAESTRO: Analytical Cost/Benefit Model
*H. Kwon et al., “An Analytic Model for Cost-Benefit Analysis of Dataflows in DNN Accelerators,” https://arxiv.org/abs/1805.02566
Describing Dataflows in MAESTRO
MAESTRO: Temporal_Map(1, 1) X -> Spatial_Map(2, 2) S
Parameters: 
(Mapping size, Offset)
X Values 0 0 0 0 0
S Values 0 1
PE0 PE1 PE2 PE3 PE4
Time step = 0
Mapping Size = 1




6 7 8 9
+ 2 + 2
* Temporal_Map: to map the same loop variable set to each PE
* Spatial_Map: to map different loop variable sets to each PE
Replication
Block-Cyclic Distribution
for (int x = 0; x < 20; x++)
    for (int s = 0; s < 10; s++)
        Output[x] += Weight[s] * Input[x + s];
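The mapping `Temporal_Map(1, 1) X -> Spatial_Map(2, 2) S` applied to this loop nest can be emulated in Python. This is one reading of the slide, not the behavior of the MAESTRO tool itself: it assumes 5 PEs (PE0 to PE4, as labeled in the figure), that a temporal map hands every PE the same index range at a given time step, and that a spatial map hands PE i the range starting at i * offset. The helper names are hypothetical.

```python
NUM_PES = 5            # PE0..PE4, as labeled on the slide
X_DIM, S_DIM = 20, 10  # loop bounds from the loop nest above


def temporal_map(size, offset, step):
    """Same indices for every PE; the window advances by `offset` each step."""
    start = step * offset
    return range(start, start + size)


def spatial_map(size, offset, pe):
    """Different indices per PE; PE i starts at i * offset."""
    start = pe * offset
    return range(start, start + size)


def schedule(step, pe):
    """(x, s) pairs computed by `pe` at time `step` under the assumed mapping."""
    xs = temporal_map(1, 1, step)   # Temporal_Map(1, 1) X
    ss = spatial_map(2, 2, pe)      # Spatial_Map(2, 2) S
    return [(x, s) for x in xs for s in ss]


def run_mapped(weight, inp):
    """Execute the 1D convolution by walking the schedule step by step."""
    output = [0] * X_DIM
    for step in range(X_DIM):
        for pe in range(NUM_PES):
            for x, s in schedule(step, pe):
                output[x] += weight[s] * inp[x + s]
    return output
```

Under these assumptions, all five PEs work on x = 0 at time step 0 (replication of X), while S is split two indices per PE (a block distribution of S); with 5 PEs and S = 10 no folding over S is needed.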
Input: MAESTRO DSL
//Hardware Resource Description
L1Size 64
L2Size 1024
NoCBW 64
Multicast True
NumPEs 256
...
//DNN Layer Description
Layer CONV VGG16_C1
  K=64;C=3;R=3;S=3;Y=224;X=224
endLayer

//Mapping (Dataflow) Description
Temporal_Map (1,1) K
-> Temporal_Map (1,1) C
-> Temporal_Map (3,1) Y
-> Tile (3) Y
-> Spatial_Map (1,1) X
-> Unroll R
-> Unroll S
Temporally/Spatially maps iteration variables to each PE
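As a rough illustration of what a cost model can derive from the layer description alone, the following Python sketch parses the dimension line and computes tensor sizes and the multiply-accumulate count. It is not MAESTRO's parser or cost model: the function names are hypothetical, and stride 1, no padding, and a batch size of 1 are assumptions the slide does not state.

```python
def parse_layer_dims(text):
    """Parse a 'K=64;C=3;...' dimension line into a dict of ints."""
    dims = {}
    for item in text.strip().strip(";").split(";"):
        name, value = item.split("=")
        dims[name.strip()] = int(value)
    return dims


def layer_footprint(d, batch=1):
    """Element counts per tensor and total MACs (assumes stride 1, no padding)."""
    y_out = d["Y"] - d["R"] + 1
    x_out = d["X"] - d["S"] + 1
    return {
        "weights": d["K"] * d["C"] * d["R"] * d["S"],
        "inputs": batch * d["C"] * d["Y"] * d["X"],
        "outputs": batch * d["K"] * y_out * x_out,
        "macs": batch * d["K"] * d["C"] * d["R"] * d["S"] * y_out * x_out,
    }


dims = parse_layer_dims("K=64;C=3;R=3;S=3;Y=224;X=224")
footprint = layer_footprint(dims)
```

Comparing these counts with the buffer sizes and PE count in the hardware description gives a first sense of how much tiling and reuse a mapping has to provide.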
Output: Cost/Benefit Analysis
No single dataflow is good for every layer
































































































































[Figure: three plots with x-axis "Dataflow Style" (NLR, WS, DLA, Shi, RS)]






















Outline of Talk













Myriad Dataflows in DNN Accelerators
• DNN Topologies
  • Layer size / shape
  • Layer types: Convolution / Pool / FC / LSTM
  • New sub-structure: e.g., Inception in GoogLeNet
• Irregular Networks
  • Weight Pruning during Training
  • Generated by GeneSys
• Compiler/Mapper Optimization (i.e., MAESTRO)
  • Loop reordering
  • Loop tiling size
  • Cross-layer mapping
The current trend for supporting multiple dataflows
• New Dataflow → New Accelerator
• Data reuse: FlexFlow (2017), Eyeriss (2016), ...
• Cross-layer: Fused CNN (2016)
• Sparse CNN: SCNN (2017), EIE (2016), ...
• LSTM: ESE (2017), ...
Can we have one architectural solution that can handle arbitrary dataflows and provide ~100% utilization?
What is the computation in a DNN?



























Compute weighted sum
Independent multiplication
Accumulation of partial products
Our key insight: Each dataflow translates into neurons of different sizes
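The split of a neuron's weighted sum into independent multiplications followed by accumulation can be written out directly. The Python sketch below is illustrative only: the function names are hypothetical, and the pairwise adder tree is one assumed way to accumulate partial products, not a description of any particular accelerator.

```python
def multiply_phase(weights, inputs):
    """Independent multiplications: one partial product per (weight, input) pair."""
    return [w * x for w, x in zip(weights, inputs)]


def tree_reduce(values):
    """Accumulate partial products pairwise; returns (sum, number of adder levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return (level[0] if level else 0), depth


def neuron(weights, inputs):
    """Weighted sum of one neuron, whatever its size."""
    total, _ = tree_reduce(multiply_phase(weights, inputs))
    return total
```

A neuron with 9 inputs (for example a 3x3 filter) and one with 1024 inputs (for example a fully connected layer) run the same two phases; only the number of multipliers and the depth of the reduction change.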
The MAERI Abstraction
Prefetch
Buffer




























Virtual Neuron (VN): Temporary grouping of compute units for an output
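One way to picture virtual neurons is as a partition of a fixed pool of multipliers into groups, one group per output being computed. The Python sketch below assumes contiguous, equally sized groups and a hypothetical function name; it says nothing about how MAERI configures its switches to form them.

```python
def form_virtual_neurons(num_multipliers, vn_size):
    """Group multiplier indices into as many size-`vn_size` VNs as will fit.

    Returns (list of VNs, fraction of multipliers in use).
    Assumes contiguous groups of equal size.
    """
    if vn_size <= 0 or vn_size > num_multipliers:
        return [], 0.0
    count = num_multipliers // vn_size
    vns = [list(range(i * vn_size, (i + 1) * vn_size)) for i in range(count)]
    utilization = count * vn_size / num_multipliers
    return vns, utilization
```

For a pool of 16 multipliers, a VN size of 4 fills the array completely, while a VN size of 9 fits only once and leaves 7 multipliers idle; choosing the VN size per layer or per dataflow is what lets one array serve neurons of different sizes.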




Traffic Patterns in DNN Accelerators*


















* PB: Prefetch buffer (Global buffer)
* NoC: Network-on-Chip (Interconnection network)
* PE: Processing element (Compute units)
Local Forwarding
e.g., input and weight distribution to PEs




The MAERI Implementation















































• Spatial Reuse via Multicasts
• High Bandwidth via fat links
Linear Local Network
• Forwarding of weights
• Spatio-Temporal Reuse
Reduction Network
• High Bandwidth via fat links
• Provably Non-blocking Reductions via forwarding links
Download RTL: http://synergy.ece.gatech.edu/tools/maeri
Micro-Switches
Example Mapping – Dense CNN



















Partial Outputs
















Example Mapping – Sparse DNN



















Partial Outputs
















Example Mapping – LSTM/FC

































Partial Outputs
Performance with Dense Workload
• Total Latency (Runtime) for Convolution
* Normalized to ideal case (100% utilization, Infinite bandwidth)












































LSTMs from Yonghui Wu et al., "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation," arXiv preprint, 2016
MAERI reduces LSTM runtime by up to 63% (57% on average)
Takeaways






















































… DNN Accelerator with Configurable Interconnects can map Irregular Dataflows
















HW-SW Co-Design of NE Algorithms shows promise for continuous learning at the edge
Generations
Up
Right
Left
Environment
Squat
Down
Jump
