Field-programmable technology: today’s and tomorrow’s
Wayne Luk

Outline: technology = devices + design
1. overview: motivation and vision
2. field-programmable devices: today
- Xilinx Virtex-4, Virtex-5; Stretch S5 
3. field-programmable design: today
- enhance optimality and re-use
4. field-programmable devices: tomorrow
- hybrid FPGA, die stacking
5. field-programmable design: tomorrow
- guided synthesis, representation, upgradability
6. summary 
Thanks to colleagues, students and collaborators from Imperial College, University of 
British Columbia, Chinese University of Hong Kong, University of Massachusetts 
Amherst, UK Engineering and Physical Sciences Research Council, Stretch, Xilinx
challenge: reduce the design productivity gap
Not so good: design productivity gap
Source: SEMATECH
CPUs + DSPs, FPGAs + sensors
2a. Devices today: Virtex-4 FPGA
[Figure: Virtex-4 block diagram; surviving labels: 500MHz flexible soft logic architecture, 500MHz DCM digital clock management]
• fine-grain fabric + special function units












• all good news?

2c. Software configurable engine: S5
Wide Register File (WRF)









• Instruction Specialization Fabric
• Compute Intensive
• Arbitrary Bit-width Operations
• 3 Inputs and 2 Outputs
• Pipelined, Bypassed, Interlocked
• Random Logic Support



























• Tensilica – Xtensa V
• 32 KB I & D Cache
• On-Chip Memory, MMU
• 24 Channels of DMA, FPU
Source: Stretch
3. Design today: overview
• structural or register-transfer level (RTL)
– e.g. VHDL, Verilog
– low-level, little automation, small designs 
• behavioural, system-level descriptions
– e.g. SystemC (public-domain: systemc.org)
– MARTES: +UML for real-time embedded systems
• general-purpose software languages
– e.g. C, Java; with hardware support: Handel-C
– high-level, large automation, large designs
• special-purpose descriptions
– e.g. System Generator (signal processing)
– high-level, domain-specific optimisations
3a. Enhance optimality and re-use
• design optimality: quality
– select algorithm and devices: meet requirements
– mapping: regular to systolic, rest to processor
– I/O: dictates on-chip parallelism, buffering schemes
– control speed/area/power: pipelining, layout plan
– partitioning: coarse vs fine grain logic and memory
• design re-use: productivity
– separate aspects specific to application/technology
– library of customisable components with trade-offs
– compose and customise to meet requirements
– uniform interface to memory and I/O: hide details
– pre-verified parts: ease system verification
Example: systolic summation tree
• n-input adder
– tree of (n-1) 2-input adders
• each adder
– has k stages
– each stage has an s-bit adder
• figure shows
– k = 3
– s = 3
• high k, low s
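
The arithmetic behind the tree can be sketched in a few lines of C. This is a minimal cost model, assuming a balanced tree, one clock cycle per adder stage, and that an adder built from k stages of s bits handles k*s-bit operands; the type and function names are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Illustrative cost model for a pipelined summation tree (hypothetical names). */
typedef struct {
    unsigned adders;        /* number of 2-input adders: n - 1 */
    unsigned levels;        /* depth of a balanced tree: ceil(log2(n)) */
    unsigned latency;       /* cycles from inputs to sum, assuming 1 cycle per stage */
    unsigned operand_bits;  /* bits handled by each adder: k * s */
} TreeCost;

/* Smallest number of levels such that 2^levels >= n. */
static unsigned ceil_log2(unsigned n) {
    unsigned levels = 0;
    while ((1u << levels) < n) levels++;
    return levels;
}

TreeCost tree_cost(unsigned n, unsigned k, unsigned s) {
    assert(n >= 1 && k >= 1 && s >= 1);
    TreeCost c;
    c.adders = n - 1;
    c.levels = ceil_log2(n);
    c.latency = c.levels * k;
    c.operand_bits = k * s;
    return c;
}
```

With the figure's k = 3 and s = 3 and, for example, n = 8 inputs, tree_cost(8, 3, 3) reports 7 adders, 3 levels, a 9-cycle latency and 9-bit operands. Raising k while lowering s keeps k*s fixed but typically shortens each carry chain at the cost of more pipeline registers and latency.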





Finance application: value-at-risk 
• sampling from multivariate Gaussian distribution
• DSP units: matrix multiplication
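
One common way to sample a multivariate Gaussian is to multiply a vector of independent standard normal draws by a lower-triangular factor of the covariance matrix, which is the kind of matrix multiplication that maps onto DSP units. The sketch below assumes that approach and assumes the factor has been computed beforehand; the slides do not say which factorisation is used, and the function name and argument layout are illustrative.

```c
#include <assert.h>

/*
 * Correlated sample x = mean + L * z.
 *   n    : dimension
 *   L    : lower-triangular factor of the covariance (L * L^T), row-major, n*n entries
 *   mean : mean vector, n entries
 *   z    : independent standard normal draws, n entries (supplied by the caller)
 *   x    : output sample, n entries
 */
void correlate(int n, const double *L, const double *mean,
               const double *z, double *x) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        double acc = mean[i];
        /* Only entries on or below the diagonal contribute. */
        for (int k = 0; k <= i; k++) {
            acc += L[i * n + k] * z[k];
        }
        x[i] = acc;
    }
}
```

Each output element is a multiply-accumulate chain, so the inner loop is the part that a DSP block and an adder tree would implement in hardware; the random draws z would come from a separate Gaussian generator.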


















[Figure: adder tree]
33 times faster than 2.2GHz quad Opteron
(including all IO overheads, PC-FPGA communications, and using AMD 



















• multiple data (WR)
– perform operations in parallel
• efficient data movement




















4. Devices tomorrow: more diverse
[Figure: Virtex-4 block diagram; surviving labels: soft logic architecture, 500MHz DCM digital clock management]
Source: Xilinx
Replaced by other functional units, e.g. floating-point units
4a. Hybrid FPGA: architecture
• most digital circuits
– datapath: regular, word-based logic
– control: irregular, bit-based logic
• hybrid FPGA
– customised coarse-grained block: domain-specific requirements
– fine-grained blocks: existing FPGA architecture
– good match to computing applications for given domain
Coarse-grained fabric library
D=9, M=4, R=3, F=3, 2 add, 2 mul: best density over benchmarks
Evaluation
• 6 benchmark circuits
– digital signal processing kernels: e.g. bfly (for FFT)
– linear algebra: e.g. matrix multiplication
– complete application: e.g. bgm (financial model)
• circuits: partitioned to control + datapath
– control: vendor tools to fine-grained units
– datapath: manually map to coarse-grained units
• comparison 





























4b. On-chip memory bandwidth
• storage hierarchy: registers, LUT RAM, block RAM
• processor cache: address lack of I/O bandwidth 
Source: Xilinx
• die stacking: 3D interchip connections
• customisable system-in-package: productivity gap? 
5a. Design tomorrow: guided synthesis
• guided transformation of design descriptions
– automate tedious and error-prone steps
– applicable to various levels of abstraction
• focus: two timing models
– strict    timing model: cycle-accurate - efficiency
– flexible timing model: behavioural - productivity
• combine cycle-accurate and behavioural models
– rapid development with high quality
– design maintainability and portability
• based on high-level language
– library developer: provide optimised building blocks
– application developer: customise building blocks 
Timing models: strict vs flexible
{
    delta = b*b - ((a*c) << 2);
    if (delta > 0)
        num_sol = 2;
    else if (delta == 0)
        num_sol = 1;
    else
        num_sol = 0;   // assumed: no real solution
}

tmp1 = pipe_mult[1].q << 2;
// ==================[stage 9]
tmp2 = tmp0 - tmp1;
// ==================[stage 10]
if (tmp2 > 0) num_sol = 2;
else if (tmp2 == 0) num_sol = 1;
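
As a plain-C sketch of how the two timing models relate, the first function below evaluates the behavioural description directly, and the second steps through the same computation one value at a time in the manner of the staged fragment above. It assumes that tmp0 holds b*b and that the pipelined multiplier output is a*c, which the fragment suggests but does not show; the final branch returning 0 and all function names are also assumptions.

```c
#include <assert.h>

/* Flexible (behavioural) model: number of real solutions of a*x^2 + b*x + c = 0. */
int num_sol_behavioural(long a, long b, long c) {
    long delta = b * b - (a * c) * 4;   /* same value as (a*c) << 2, barring overflow */
    if (delta > 0) return 2;
    else if (delta == 0) return 1;
    return 0;                           /* assumed final branch */
}

/* Staged model: one named intermediate per step, mirroring tmp0, tmp1, tmp2. */
int num_sol_staged(long a, long b, long c) {
    long tmp0 = b * b;         /* assumed: square produced by an earlier stage */
    long ac   = a * c;         /* stands in for pipe_mult[1].q */
    long tmp1 = ac * 4;        /* written as a left shift by 2 in the fragment */
    long tmp2 = tmp0 - tmp1;   /* stage 9 */
    if (tmp2 > 0) return 2;    /* stage 10 */
    else if (tmp2 == 0) return 1;
    return 0;                  /* assumed final branch */
}

/* Both models must agree on every input; only their timing differs in hardware. */
void check_equivalent(long a, long b, long c) {
    assert(num_sol_behavioural(a, b, c) == num_sol_staged(a, b, c));
}
```

The multiplication by 4 replaces the shift only to keep the C well defined for negative products; in hardware the shift is free wiring.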
































Rapid design: automated scheduling
• support combination of manual and automatic scheduling 
[Listing: par { ... } block ending with the same stage 9 and stage 10 statements as in the listing above]
• ffd: free-form deformation; dct: discrete cosine transform 
[Figure panels: results with respect to smallest; results with respect to software]
5b. Data representation optimisation










– known width
• Out1..Out2
– width determines accuracy
– defined by user
• find representation
– minimise width of nodes, e.g. X, Y
• trade-off in speed, area, power, error
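
Finding a width for an internal node can be illustrated with interval analysis on a small expression. The sketch below propagates integer ranges through addition and multiplication and then derives the smallest two's-complement width that holds the result. The interval type, the function names and the example range are illustrative assumptions, and the sketch covers only the integer range of a node, not the fractional precision that the user-defined output accuracy would determine.

```c
#include <assert.h>

typedef struct { long long lo, hi; } Interval;   /* inclusive integer range */

Interval iv_add(Interval a, Interval b) {
    Interval r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Range of a product: the extremes are among the four corner products. */
Interval iv_mul(Interval a, Interval b) {
    long long p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    Interval r = { p[0], p[0] };
    for (int i = 1; i < 4; i++) {
        if (p[i] < r.lo) r.lo = p[i];
        if (p[i] > r.hi) r.hi = p[i];
    }
    return r;
}

/* Smallest signed width w such that [-2^(w-1), 2^(w-1) - 1] covers the interval. */
int signed_bits(Interval v) {
    assert(v.lo <= v.hi);
    for (int bits = 1; bits < 63; bits++) {
        long long half = 1LL << (bits - 1);
        if (v.lo >= -half && v.hi <= half - 1) return bits;
    }
    return 64;   /* fall back to the full word */
}
```

For a 4-bit signed input X in [-8, 7], iv_mul(X, X) gives [-56, 64], and adding X gives [-64, 71], so a node computing X*X + X would need 8 bits under this analysis. Interval arithmetic treats the two operands of X*X as independent, which is why the lower bound is negative; tighter analyses can reduce the width further.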
BitSize bit-width analysis system
• frontend: design descriptions
– Xilinx System Generator
– C/C++ ASC code
– Handel-C
• analysis
– interval analysis
– precision analysis (automatic differentiation)
• bit-width determination: floating-point design, fixed-point design
• design selection
• output design descriptions
– Xilinx System Generator
– VHDL
– A Stream Compiler (ASC) code

FIR filter and DFT: area vs error
1% more error: 65% less area
FIR filter and DFT: speed vs error














• upgradability: minimise time-to-market, maximise time-in-market
• add new functions, fix bugs
• very rapid upgrade?
Dynamic upgrade: turbo coder
• error correction code: add redundancy
• need: fast, low-power, adapt to noise level















Source: Liang, Tessier, Goeckel
Self-tuning: run-time reconfiguration
• adapt: less channel noise, so lower power 
• larger Nmax: better correction, more area/power  
• sample channel noise every 250K bits
• find Signal to Noise Ratio (SNR), select Nmax
• if Nmax ≠ current Nmax, configure new bitstream
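
The control loop described by these bullets can be sketched as follows. The SNR thresholds, the candidate Nmax values and the function names are placeholders chosen for illustration; the slides give only the 250K-bit sampling interval and the rule that a new bitstream is configured when the selected Nmax differs from the current one.

```c
#include <assert.h>

#define SAMPLE_INTERVAL_BITS 250000L   /* from the slides: sample every 250K bits */

/* Hypothetical mapping from measured SNR (dB) to the decoder parameter Nmax. */
int select_nmax(double snr_db) {
    if (snr_db >= 3.0) return 2;   /* less channel noise: smaller Nmax, lower power */
    if (snr_db >= 1.5) return 4;
    return 8;                      /* larger Nmax: better correction, more area/power */
}

typedef struct {
    int  current_nmax;        /* Nmax of the bitstream currently loaded */
    long bits_since_sample;   /* bits decoded since the last noise sample */
    int  reconfigurations;    /* number of bitstream loads so far */
} Tuner;

/*
 * Call after decoding `bits` bits; `snr_db` is the latest channel estimate.
 * Returns 1 if a new bitstream would be configured, 0 otherwise.
 */
int tuner_step(Tuner *t, long bits, double snr_db) {
    assert(t != 0 && bits >= 0);
    t->bits_since_sample += bits;
    if (t->bits_since_sample < SAMPLE_INTERVAL_BITS) return 0;  /* not time to sample */
    t->bits_since_sample = 0;

    int nmax = select_nmax(snr_db);
    if (nmax == t->current_nmax) return 0;  /* same Nmax: keep the current bitstream */

    t->current_nmax = nmax;   /* stands in for loading the matching bitstream */
    t->reconfigurations++;
    return 1;
}
```

Starting from Nmax = 8 on a channel measured at 5 dB, the tuner does nothing until 250K bits have been decoded, then switches once to the smaller configuration and stays there while the SNR holds.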








Source: Liang, Tessier, Goeckel

















• up to 0.5x power, 2x speed over static decoder
• 100 times faster than processor decoder
Source: Liang, Tessier, Goeckel
Other directions
• domain-specific design automation
– languages + tools: for particle physics systems?
• multi-core, sensor network co-design
– multiple hardware/software: FPGA + CPU + sensors
• extending processor and compiler capabilities
– static and dynamic optimisations, self-tuning
• power-aware, radiation-aware design
– transforms e.g. pipelining, damage monitoring
• rapid and informative design validation
– simulation + FPGA prototype + formal verification
6. Summary
• good: Moore’s Law, bad: productivity gap
• vision: unified design synthesis and analysis
• devices and design today
– growing gap: amount of I/O and amount of logic
– enhance optimality and re-use: I/O driven
• devices tomorrow
– hybrid FPGA: multi-granularity fabric
– 3D FPGA: customisable system-in-package
• design tomorrow
– guided synthesis: optimised and portable design
– data representation optimisation
– upgradable and self-tuned design
