Utilization Wall [G. Venkatesh et al., ASPLOS'10]
♦ Assuming an 80W power budget, at the 45 nm TSMC process, less than 7% of a 300mm² die can be switched.
♦ ITRS roadmap and CMOS scaling theory:
Next Big Opportunity - Customization and Specialization
Parallelization
Example of Customizable Platforms: FPGAs

Configurable logic blocks
Island-style configurable mesh routing
Dedicated components
Overall Communication Scheme in ARC
Light-Weight Interrupt Support
CPU  GAM  LCA
TLB Miss
Task Done
Core Sends Logical Addresses to LCA
LCA keeps a small TLB for the addresses that it is working on
Why Logical Address?
1- Accelerators can work on irregular addresses (e.g. indirect addressing)
2- Using a large page size can be a solution but will affect other applications
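The idea of an accelerator translating logical addresses through its own small TLB can be sketched as follows. This is a minimal illustration, not the ARC hardware design: the class name `SmallTLB`, the LRU replacement policy, the 4 KB page size, and the dictionary standing in for the core-side page table are all assumptions made for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class SmallTLB:
    """Tiny LRU TLB kept by an accelerator (hypothetical model)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # stands in for the core-side page table
        self.entries = OrderedDict()  # virtual page number -> physical frame
        self.misses = 0

    def translate(self, logical_addr):
        """Return the physical address for a logical address."""
        vpn, offset = divmod(logical_addr, PAGE_SIZE)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            # Models the "TLB Miss" interrupt: the core resolves the page.
            self.misses += 1
            frame = self.page_table[vpn]
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = frame
        return self.entries[vpn] * PAGE_SIZE + offset
```

With a two-entry TLB, repeated accesses to the same page hit locally, and only a first touch of a new page (for example through an indirectly addressed element) reaches the core.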
[Figure: equations for each processing stage, garbled in extraction. Recoverable panels: a per-voxel weighted sum (weights w, normalization Z, parameter σ); registration (fluid registration); segmentation (level set methods), with surface = {voxels x : ϕ(x, t) = 0}; analysis, with flow equations in v, p, and f(x, t).]
Experimental Results - Energy (N cores, N threads, N accelerators)
[Figure: Energy gain over SW-only version]
Energy improvement over SW-only approaches:
A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED'12]
♦ Motivation
L2 Banks Memory controllers
An Example of ABB Library (for Medical Imaging)
Example of ABB Flow-Graph (Denoise)
Needed ABBs: "x", "y", "z"
With task size of 5x5 block, [figure axis: 1p, 2p, 4p, 8p]
LCA Composition Process
Area Overhead Analysis
Examples of Energy-Efficient Customization
♦ Customization of processor cores
♦ Customization of on-chip memory
♦ Customization of on-chip interconnects
Memory Management for Accelerator-Rich Architectures [ISLPED'2012]
♦ Providing a private buffer for each accelerator is very inefficient.
(5) The accelerator signals to the core when it finishes.
(6) The core sends the free-resource message to ABM.
(7) ABM frees the accelerator and buffer in NUCA.
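Steps (5) to (7) can be sketched as a small message flow. This is only an illustration under stated assumptions: the classes `ABM`, `Core`, and `Accelerator`, their method names, and the `reserve` call that stands in for the earlier steps not shown here are all hypothetical.

```python
class ABM:
    """Hypothetical accelerator and buffer manager."""

    def __init__(self, accelerators, buffer_bytes):
        self.free_accelerators = set(accelerators)
        self.free_buffer_bytes = buffer_bytes
        self.in_use = {}  # accelerator name -> buffer bytes it holds

    def reserve(self, acc, nbytes):
        # Assumed stand-in for the earlier steps, which are not shown here.
        if acc not in self.free_accelerators or nbytes > self.free_buffer_bytes:
            raise RuntimeError("resource unavailable")
        self.free_accelerators.remove(acc)
        self.free_buffer_bytes -= nbytes
        self.in_use[acc] = nbytes

    def free(self, acc):
        # Step (7): release the accelerator and its buffer space.
        self.free_buffer_bytes += self.in_use.pop(acc)
        self.free_accelerators.add(acc)


class Core:
    def __init__(self, abm):
        self.abm = abm

    def on_task_done(self, acc):
        # Step (6): the core sends the free-resource message to the ABM.
        self.abm.free(acc)


class Accelerator:
    def __init__(self, name, core):
        self.name = name
        self.core = core

    def run(self, task):
        result = task()
        # Step (5): the accelerator signals the core when it finishes.
        self.core.on_task_done(self.name)
        return result
```

After `run` returns, the accelerator and its buffer bytes are back in the ABM's free pool and can be handed to another request.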
Dynamic Interval-based Global (DIG) Allocation
♦ Perform global allocation for buffer allocation requests in an interval
♦ BiN manager locally keeps the information of the current contiguous buffer space in each L2 bank, since all of the buffer allocation and free operations are performed by BiN manager
♦ Allocation: starting from the nearest L2 bank to this accelerator, to the farthest
♦ We allow the last page (source of page fragments) of a buffer to be smaller than the other pages of this buffer
No impact on the page table lookup
The max page fragment will be smaller than the min-page
An average latency of 0.6us (1.2K cycles @ 2GHz) to perform the buffer allocations
♦ The total area of the buffer allocation module is less than 0.01% of a medium-size 1cm² chip

Higher harmonics (4th and 6th harmonics) may be substantially underestimated due to excessive water and oxygen absorption and setup losses at these frequencies.
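Returning to the buffer allocation rules listed earlier (nearest L2 bank first, with the last page of a buffer allowed to be shorter), a minimal sketch of that placement rule might look like this. It does not model the interval-based global optimization, and it simplifies the manager's per-bank bookkeeping to a single free-byte count per bank; the function name `allocate_buffer` and the bank list format are assumptions for illustration.

```python
def allocate_buffer(size, page_size, banks):
    """Place a buffer page by page, trying the nearest bank first.

    banks: list of (bank_id, free_bytes) ordered from nearest to farthest
           (an assumed, simplified view of the manager's bookkeeping).
    Returns a list of (bank_id, page_bytes), or None if the buffer
    does not fit. The input list is not modified.
    """
    # All pages are full-size except possibly the last one.
    pages = [page_size] * (size // page_size)
    if size % page_size:
        pages.append(size % page_size)

    free = [free_bytes for _, free_bytes in banks]
    placement = []
    for page in pages:
        for idx, (bank_id, _) in enumerate(banks):
            if free[idx] >= page:
                free[idx] -= page
                placement.append((bank_id, page))
                break
        else:
            return None  # no bank has room for this page
    return placement
```

For a 10000-byte buffer with 4096-byte pages, the last page is only 1808 bytes, so it consumes just that much space in whichever bank receives it.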
Examples of Energy-Efficient Customization
RF-I transmission line bundle
