Accelerating numerical applications using ESL methodologies by Yibin Li (1526803)
Accelerating numerical applications 
using ESL methodologies 
By 
Yibin Li, BSc 
A Doctoral Thesis submitted in partial fulfilment of the requirements for the award of Doctor of 
Philosophy of Loughborough University 
November 2008 
ABSTRACT 
This thesis explores the suitability of ESL methodology for numerical applications.
Numerical computing is the foundation of digital signal processing and of much scientific computing.
These applications all place high demands on processing speed and precision. On the EDA side,
ESL has increasingly drawn designers' attention in recent years. Addressing the demands of
numerical applications with ESL methodology is therefore the motivation of this thesis.
Three cases are presented in this thesis: TLM, FFT and Smith-Waterman. In the first case, the
thesis details the development of programmable and hard-wired TLM computing solutions, from
methodology to post-synthesis results. A comparison of these two solutions is also presented.
Motivated by the outcome of this comparison, the suitability of ESL was investigated further.
Subsequently, the two other cases (FFT and Smith-Waterman) are presented.
ACKNOWLEDGEMENTS 
My foremost thanks go to my adviser, Dr. Vassilios Chouliaras. Without his support, it would
have been impossible to finish this work. I thank him for his patience and encouragement that
carried me on through difficult times, and for his insights and suggestions that helped to shape
my knowledge.
I would like to acknowledge my parents for always standing by me and supporting me 
through these years at University. 
I acknowledge all my colleagues at Loughborough University. All the wonderful people I
met during my years in Loughborough have become a cherished part of my life.
Finally, I would like to express my gratitude to Loughborough University for providing
me with financial support.
List of Figures 
List of Tables 
List of Abbreviations 
Table of Contents 
Chapter 1 Introduction
1.1 Numerical computing
1.1.1 TLM
1.1.2 FFT
1.1.3 Smith-Waterman
1.2 Hardware realization of numerical applications and its methodology
1.2.1 Hardware realization of numerical applications
1.2.1.1 Programmable approach
1.2.1.2 Hardwired approach
1.2.2 The trade-off between floating point and fixed point
1.2.3 ESL methodology
1.3 The objectives of thesis
1.4 The organization of thesis
1.5 Author's contribution
Chapter 2 Architectural exploration for programmable approach in TLM case
2.1 Transmission line
2.1.1 Modeling transmission lines with lumped components
2.1.1.1 Modeling a lumped system with discrete components
2.1.1.2 Dispersion in a transmission line
2.2 TLM modeling and its acceleration
2.2.1 TLM modeling
2.2.1.1 Two-dimensional TLM modeling
2.2.1.1.1 Two-dimensional TLM modeling of a homogeneous medium
2.2.1.1.2 Two-dimensional TLM modeling of a non homogeneous medium
2.2.2 Literature for TLM acceleration
2.2.2.1 Distributed computing for TLM
2.2.2.2 Dedicated TLM processor
2.2.2.2.1 Programmable approach
2.2.2.2.2 Hardwired approach
2.2.2.2.3 Discussion
2.3 Simulation infrastructure
2.3.1 CPU simulator
2.3.2 SimpleScalar toolset
2.3.3 Sim-system
2.3.4 Sim-vector
2.4 Thread and data level parallelism exploration and its results
2.4.1 Reference code
2.4.2 Thread level parallelizing
2.4.3 Thread level parallelism simulation result
2.4.4 Data level parallelizing
2.4.5 Data level parallelism simulation result
2.5 Summary
Chapter 3 Hardware realization of programmable approach in TLM case
3.1 Hardware realization of programmable approach
3.1.1 The categorization of processor
3.1.2 Hardware realization of parallelisms
3.1.2.1 Vector processor
3.1.2.2 Multiprocessor
3.1.2.3 Hybrid approaches and research
3.2 Leon3MP and extended compiler environment
3.2.1 Introduction of Leon core
3.2.2 The architecture of Leon3
3.2.3 GRLIB library and software development environment
3.2.4 Extended compiler environment for Leon3MP
3.3 Run time for TLM workload on Leon3MP platform
3.4 The heterogeneous TLM processor and VLSI implementation results
3.4.1 Micro-architecture of heterogeneous TLM SoC
3.4.2 VLSI implementation results
3.5 Summary
Chapter 4 A novel ESL design flow and hardwired approach for TLM
4.1 ESL methodology
4.1.1 ESL
4.1.2 SystemC
4.1.3 Abstraction level of SystemC model
4.2 A novel ESL design flow
4.2.1 IEEE 754 floating point datapath
4.2.2 Synthesizable IEEE 754 function
4.2.3 Replacing array access by pre-defined RAM ports
4.2.4 Optimization techniques toward synthesizable SystemC
4.2.5 SystemC synthesizer
4.2.6 Pre-defined RAM
4.2.7 RTL level validation
4.2.8 Logical synthesis
4.2.9 Place and route
4.2.10 Statistical power analysis stage
4.3 Parallelizing technique
4.4 TLM engine
4.4.1 Interface of TLM engine
4.4.2 SystemC level simulation
4.4.3 RTL level simulation
4.4.4 Implementation space exploration
4.5 Comparison of Leon3 based TLM processor and ESL designed dedicated TLM engine
4.5.1 Performance comparison between CPU based TLM processor and ESL designed TLM engine
4.5.2 VLSI implementation comparison between CPU based TLM processor and ESL designed TLM engine
4.5.3 Conclusion
4.6 Summary
Chapter 5 Hardwired approach for FFT using ESL method
5.1 FFT background
5.2 ESL-based FFT engine
5.2.1 SystemC model
5.2.2 SystemC level simulation result
5.2.3 RTL level simulation
5.3 The implementation of ESL-based FFT engine
5.4 Result and discussion
5.5 Summary
Chapter 6 Hardwired approach for Smith-Waterman using ESL method
6.1 Smith-Waterman background
6.2 ESL-based SW engine
6.3 The implementation of ESL-based SW engine
6.4 Conclusion
6.5 Summary
Chapter 7 Conclusion and future work
List of Figures
Figure 1.1 Trade-off between fixed point and floating point
Figure 2.1 Transmission line
Figure 2.2 The modeling of a transmission line using a series of interconnected lumped systems
Figure 2.3 Modelling a small section of a transmission line by discrete lumped components
Figure 2.4 Part a T model into 3 segments
Figure 2.5 Wave propagation in a two-dimensional TLM mesh
Figure 2.6 Model of node in a mesh
Figure 2.7 Model for a node of homogeneous medium
Figure 2.8 Model for a node in a non homogeneous medium when the medium has some losses
Figure 2.9 The composition of SimpleScalar
Figure 2.10 Sim-system's C definition of the control registers declared for each simulated CPU
Figure 2.11 Assembly macro used to set the processor's run state
Figure 2.12 Assembly macro used to obtain the processor ID number of the calling context
Figure 2.13 C-macro implementing the sleep stage of the barrier instruction
Figure 2.14 Vector structure defining the registers required for vector simulation within sim-vector
Figure 2.15 Vector length definition within sim-vector
Figure 2.16 C model of scalar register
Figure 2.17 Parallelizing loop
Figure 2.18 Pseudo declaration of shared memory array
Figure 2.19 Pseudo serial loop illustrating use of shared memory array
Figure 2.20 Performance of thread level optimization
Figure 2.21 The methodology used to identify data level parallelism in TLM workload
Figure 2.22 Programmer's model for vector and scalar registers
Figure 2.23 Definition of vector multiplication instruction
Figure 2.24 VLD instruction defined both in C and assembly language
Figure 2.25 Benchmarking using a thin and a cubic problem space
Figure 2.26 Reductive dynamic instruction count for 80x100x125, 100x125x80 and 80x125x100 node mesh
Figure 3.1 Flynn's Taxonomy
Figure 3.2 A taxonomy of parallel computers
Figure 3.3 Architecture of Leon3
Figure 3.4 The extended compiler for Leon3MP
Figure 3.5 The barrier mechanism
Figure 3.6 The algorithm of barrier mechanism
Figure 3.7 The implementation of barrier mechanism using SPARC ISA
Figure 3.8 Run time for 4x4x4 TLM configuration
Figure 3.9 Run time for 4x16x4 TLM configuration
Figure 3.10 Run time for 16x4x4 TLM configuration
Figure 3.11 Run time for 4x4x16 TLM configuration
Figure 3.12 Single processor-coprocessor architecture
Figure 3.13 Detailed scalar and vector core microarchitecture
Figure 3.14 Processor-coprocessor I/F communication
Figure 3.15 Heterogeneous SoC architecture
Figure 3.16 Floorplan and layout for heterogeneous TLM SoC
Figure 4.1 A novel ESL design approach
Figure 4.2 Datapath of addition and subtraction
Figure 4.3 Datapath of multiplication
Figure 4.4 Datapath of division
Figure 4.5 Data reformation toward 32-bit IEEE 754
Figure 4.6 Using synthesizable IEEE 754 floating point function
Figure 4.7 Pre-defined single port RAM
Figure 4.8 Convert array index as one dimensional memory address
Figure 4.9 Using the RAM ports instead of array
Figure 4.10 ag_main() used to enable the SystemC synthesis
Figure 4.11 Resolvable pointer
Figure 4.12 wait() statement is used to determine the critical path
Figure 4.13 Loop with wait() statement
Figure 4.14 Loop with varied control condition
Figure 4.15 The functionality of Agility compiler
Figure 4.16 Connecting pre-defined RAM model
Figure 4.17 Parallelizing techniques
Figure 4.18 The interface of SystemC model
Figure 4.19 Parallel TLM engine (quad mode)
Figure 4.20 Cycle level simulation for scalar TLM engine
Figure 4.21 Cycle count with different delay configuration
Figure 4.22 Cycle count for parallel TLM engine
Figure 4.23 Running time for RTL level TLM engine
Figure 4.24 NAND gate count for TLM engine with different delay configurations
Figure 4.25 Flip-flop count for TLM engine with different delay configurations
Figure 4.26 Post synthesis area result for different delay configurations
Figure 4.27 Statistical power results for different configurations
Figure 4.28 Real clock period results for different configurations
Figure 4.29 Real time performance of both CPU based TLM processor and ESL designed TLM engine
Figure 4.30 The final VLSI layouts and physical implementation for the ESL designed TLM engine and CPU based TLM processor
Figure 5.1 Decomposition of an 8-point DFT
Figure 5.2 Butterfly operator
Figure 5.3 8-point FFT
Figure 5.4 Interface of FFT engine
Figure 5.5 Parallelize process of FFT engine
Figure 5.6 SystemC level simulations for FFT engine
Figure 5.7 RTL level simulation result of FFT engine
Figure 5.8 NAND count for FFT implementations
Figure 5.9 Flip-flop count for FFT implementations
Figure 6.1 Example of the Smith-Waterman algorithm
Figure 6.2 Synthesizable Smith-Waterman code for Agility compiling
Figure 6.3 SystemC level simulation for SW processor
Figure 6.4 NAND count for SW engine
Figure 6.5 Flip-flop count for SW engine
List of Tables
Table 1.1 Objectives of thesis
Table 2.1 Vector ISA extension
Table 3.1 VLSI implementation data
Table 4.1 32-bit IEEE 754 data format
Table 5.1 The comparison of floating point FFT implementation
List of Abbreviations
ALG      Algorithmic
ALU      Arithmetic Logic Unit
AHB      Advanced High-performance Bus
AMBA     Advanced Microcontroller Bus Architecture
ASIC     Application-Specific Integrated Circuit
BCC      Bare-C Cross Compilation System
CA       Cycle Accurate
CAD      Computer-aided Design
CDC      Control Data Corporation
CPU      Central Processing Unit
COMA     Cache Only Memory Architecture
COW      Cluster of Workstations
CMOS     Complementary Metal Oxide Semiconductor
CP       Communicating Processes
CP+T     Communicating Processes with Time
DFT      Discrete Fourier Transform
DLP      Data Level Parallelism
DIV      Division
DC       Synopsys Design Compiler
DRC      Design Rule Check
EDA      Electronic Design Automation
EDIF     Electronic Design Interchange Format
DDBJ     DNA Data Bank of Japan
ESL      Electronic System Level
EMBL     European Molecular Biology Laboratory
FFT      Fast Fourier Transform
FP       Floating Point
FPGA     Field-Programmable Gate Array
FPU      Floating Point Unit
IEEE     Institute of Electrical and Electronics Engineers
ILP      Instruction Level Parallelism
MAC      Multiply Accumulate
MISD     Multiple Instruction Single Data
MIMD     Multiple Instruction Multiple Data
MIPS     Million Instructions Per Second
MPPs     Massively Parallel Processors
MUL      Multiplication
NaN      Not a Number
NUMA     Non-Uniform Memory Access
OSCI     Open SystemC Initiative
PE       Processing Element
PISA     Portable Instruction Set Architecture
PV       Programmer's View
PV+T     Programmer's View with Time
RTL      Register Transfer Level
P&R      Place and Routing
QNaN     Quiet NaN
RAM      Random Access Memory
SCN      Symmetrical Condensed Node
SoC      System on Chip
SNaN     Signaling NaN
SDRAM    Synchronous Dynamic RAM
SIMD     Single Instruction Multiple Data
SISD     Single Instruction Single Data
SMP      Symmetric Multiprocessing
SW       Smith-Waterman
Tcl      Tool Command Language
TLM      Transmission Line Matrix
TLP      Thread-Level Parallelism
TSMC     Taiwan Semiconductor Manufacturing Company
UMA      Uniform Memory Access
VHDL     VHSIC Hardware Description Language
VLIW     Very Long Instruction Word
VLSI     Very Large Scale Integration
3D       Three Dimensional
Chapter 1 
Introduction 
1.1 Numerical computing 
Numerical computing is a general concept that applies to any algorithm containing a
large amount of mathematical calculation. It not only forms the foundation of scientific
computing, but also plays an important role in everyday applications such as multimedia and
speech recognition. The common characteristic of these applications is their
demand for high throughput with the necessary precision.
In this thesis, three applications, TLM, FFT and Smith-Waterman, are chosen as the
baseline for the evaluations.
1.1.1 TLM 
Transmission Line Matrix (TLM) modeling is a numerical technique for modeling
wave propagation. It was initially developed for modeling electromagnetic wave propagation
[1, 2, 3].
Because TLM is based on Huygens's principle, it can be used to model any activity that
also follows Huygens's principle. Past publications show that TLM has been applied to
the following problems:
• Diffusion [4]
• Vibration [5]
• Heat transfer [6]
• Radar [7]
• Electromagnetic compatibility [8]
1.1.2 FFT 
The Fast Fourier Transform (FFT) [9] is an efficient algorithm for computing the discrete Fourier
transform (DFT) and its inverse. The FFT is of great importance to a wide variety of applications,
from digital signal processing to solving partial differential equations. Its importance is well
recognized by the engineering community.
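As an illustration only, and not the FFT engine developed later in this thesis, the relationship between the DFT and the FFT can be sketched in a few lines of Python. The function names `dft` and `fft` are chosen here for the example, and the recursive radix-2 decimation-in-time form is an assumption made for brevity; it requires the input length to be a power of two.

```python
import cmath


def dft(x):
    """Direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 decimation-in-time FFT (illustrative sketch).

    Assumption: len(x) is a power of two; other lengths are rejected.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```

Both functions return the same spectrum for a power-of-two input; the FFT reaches it with O(n log n) butterfly operations instead of O(n^2) multiply-accumulates.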
1.1.3 Smith-Waterman 
Smith-Waterman [10] is one of the most important sequence searching algorithms. By
comparing sequences, the Smith-Waterman algorithm is used to search for homology.
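A minimal software sketch of the Smith-Waterman scoring recurrence is given below for illustration; it is not the synthesizable engine described later in the thesis. The function name `smith_waterman_score`, the linear gap penalty, and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Assumptions: linear gap penalty and illustrative scoring values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = h[i - 1][j] + gap      # gap inserted in b
            left = h[i][j - 1] + gap    # gap inserted in a
            # Clamping at zero is what makes the alignment local.
            h[i][j] = max(0, diag, up, left)
            best = max(best, h[i][j])
    return best
```

Each cell depends only on its upper, left and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel, which is one reason the algorithm is a common target for hardware acceleration.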
1.2 Hardware realization of numerical applications and its methodology
To achieve computing efficiency for these numerical applications, much effort has been made
on both the software and the hardware side. This thesis focuses mainly on the hardware
side and its implementation methodology for improving the computation of numerical
applications. Hardware implementations fall primarily into two categories:
programmable and hardwired approaches.
1.2.1 Hardware realization of numerical applications 
1.2.1.1 Programmable approach 
The increasing importance of numerical applications and the properties of modern processors
have led to a resurgence in the development of parallel architectures [11, 12, 13, 14, 15, 16, 17,
18]. All these architectures exploit at least one form of the parallelism inherent in the
workload.
In the programmable approach, whether a General Purpose Processor (GPP) or an Application
Specific Instruction-set Processor (ASIP) is used, higher flexibility is gained. However, computing
efficiency is traded off against this flexibility. The hardwired solution, introduced in the
next section, sits at the opposite end of this trade-off.
1.2.1.2 Hardwired approach 
The hardwired approach is the alternative to the programmable approach for executing numerical
workloads. It is a full hardware realization of a specific algorithm, developed for a specific
application. By removing the control unit found in the programmable approach, higher
throughput can be achieved. However, it relatively lacks
programmability, since it is completely dedicated to a specific algorithm.
There are a few hardwired implementations of the chosen numerical applications [19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29].
1.2.2 The trade-off between floating point and fixed point 
The representation of data is a core issue in numerical applications. Based on the way
they represent numerical values and implement numerical operations internally, hardware
realizations of numerical applications fall into two major categories: fixed point and
floating point.
Fixed point arithmetic represents and manipulates numbers as integers. Floating point
arithmetic primarily represents numbers in floating point format, although floating point units
can also support integer representation and calculations. The floating point format represents a
numerical value as a combination of a fractional part and an exponent part.
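To make the fraction-and-exponent representation concrete, the following sketch unpacks a value into the three fields of the 32-bit IEEE 754 single-precision format (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits). The helper names `float32_fields` and `float32_from_fields` are hypothetical and introduced only for this illustration; the reconstruction helper handles normal numbers only.

```python
import struct


def float32_fields(value):
    """Split a number into IEEE 754 single-precision (sign, exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased exponent, bias = 127
    fraction = bits & 0x7FFFFF       # 23 stored fraction bits
    return sign, exponent, fraction


def float32_from_fields(sign, exponent, fraction):
    """Rebuild the value of a normal number (illustrative; no zero, subnormal, Inf or NaN)."""
    return (-1.0) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)
```

For example, -2.5 equals -1.25 x 2^1, so its fields are sign 1, biased exponent 128 and fraction 0x200000 (the 0.25 after the implicit leading one).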
In floating point arithmetic, dynamic range limitations can practically be ignored in a
design. However, it tends to be more expensive, since floating point processors implement more
functionality (complexity) in silicon and have a wider data path (typically 32 bits). Generally,
for floating point, relative ease of development and a schedule advantage are traded off
against higher cost and hardware complexity.
On the other hand, the lower cost and higher speed of fixed point implementations are traded off
against added design effort for algorithm implementation analysis and for data scaling to avoid
accumulator overflow.
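The data scaling effort mentioned above can be illustrated with a small sketch. It assumes a hypothetical 16-bit fixed point format with 15 fractional bits and a 16-bit wrapping accumulator; the names `to_q15`, `from_q15`, `wrap16` and `accumulate` are invented for this example and do not come from the thesis.

```python
Q = 15  # number of fractional bits (assumed Q1.15 format)


def to_q15(x):
    """Quantize a real value to a 16-bit fixed point integer, saturating at the limits."""
    return max(-32768, min(32767, round(x * (1 << Q))))


def from_q15(q):
    """Convert a fixed point integer back to a real value."""
    return q / (1 << Q)


def wrap16(v):
    """Model a 16-bit two's complement register that wraps on overflow."""
    return ((v + 32768) & 0xFFFF) - 32768


def accumulate(samples, shift=0):
    """Sum samples in a 16-bit accumulator, pre-scaling each one by 2**-shift."""
    acc = 0
    for s in samples:
        acc = wrap16(acc + (to_q15(s) >> shift))
    # Undo the pre-scaling when converting the result back.
    return from_q15(acc) * (1 << shift)
```

Summing 0.5 three times without scaling overflows the accumulator and yields -0.5, whereas pre-scaling the inputs by a shift of 2 gives the expected 1.5 at the cost of two bits of precision. Choosing such shifts is the kind of analysis that a floating point datapath avoids.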
Figure 1.1 Trade-off between fixed point and floating point (axis label: Cost and Power)
In numerical applications, both precision and dynamic range are in demand. The wide
dynamic range of floating point also reduces design complexity and provides a certain degree of
flexibility. Therefore, floating point is chosen for the implementations in this thesis.
1.2.3 ESL methodology 
In the EDA community, increasing attention has been paid to the ESL method in recent years. ESL
was placed as "a level above RTL" by the 2004 International Technology Roadmap for
Semiconductors (ITRS). The main characteristic of ESL is that it allows a design to be built
at a high, selective abstraction level. This allows early verification of
hardware and software choices, long before a complete RTL model exists. The
complexity of development can therefore be considerably reduced. A large number of ESL
tools are available [30]. They can be categorized by platform, abstraction
level or design entry. In terms of design entry, various C/C++ or C-like languages
have been chosen as the input for many ESL tools [31, 32, 33, 34, 35]. Despite the
popularity of ESL in the EDA community, ESL-based designs are rarely reported in the
literature. It is therefore the author's interest to study the suitability of the ESL method for
numerical applications.
As a relatively mature language for system specification and modeling, SystemC is
taken as the design entry for the ESL design flow proposed in this thesis. Owing to its availability
and maturity at the time, the Agility compiler was chosen to establish a novel
design flow from SystemC level to standard cell level.
1.3 The objectives of thesis
Following the analysis of the current state of the art in hardware realizations of numerical
applications and in the ESL method, the research objectives of this thesis are formed as:
Hardware aspect
• How does the performance of the programmable approach compare to that of the hardwired approach, using the TLM case study?
• How will the performance be improved by utilizing the inherent parallelisms?
Methodology aspect
• How could a full ESL design flow be established?
• The suitability of the ESL method for hardware realization of floating point numerical applications.
Table 1.1 Objectives of thesis
1.4 The organization of the thesis
As programmability is particularly desirable for TLM, TLM is investigated first in order to
compare the programmable approach with the ESL-based hardwired approach.
In Chapter 2, the architectural aspect of the programmable solution is studied. It is
demonstrated that both data and thread level parallelism exist in the TLM workload. The
related vector instructions are proposed as well. Subsequently, in Chapter 3, the parallelisms
in TLM are realized by a multiprocessor and vector CPU architecture based on the LEON
platform. In Chapter 4, a novel ESL design flow is presented. By developing the hardwired
TLM engine with this design flow and comparing it with the programmable solution, it is
demonstrated that the ESL method can potentially accelerate numerical applications. To
further validate the ESL method, ESL-designed hardwired engines for FFT and
Smith-Waterman are presented in Chapters 5 and 6 respectively, using the proposed ESL
method. Finally, in Chapter 7, the conclusions are drawn and future work is suggested.
1.5 Author's contribution
As some of the work reported in this thesis is collaborative, the author's contributions
are specified as follows:
• Creating a vector ISA extension on top of the SimpleScalar toolset for TLM, rewriting the
original TLM code to enable TLM simulation in vector mode, and running the vectorized
TLM simulation
• Partitioning the TLM workload using the Sim-system simulator and running the
multi-threaded TLM simulation
• Building a vectorized IEEE 754 floating point datapath for the programmable approach for
TLM
• Studying the LEON3MP system and creating the related compiler directive (Barrier) to run
the parallel TLM workload
• Creating a synthesizable C-level floating point operation library based on SoftFloat
• Building a novel ESL design flow from SystemC to standard cell
• Implementing the ESL design flow in the TLM, FFT and Smith-Waterman cases, down to
the simulated RTL stage
Reference 
[1]. P. B. Johns and R. L. Beurle, "Numerical solution of 2-dimensional scattering problems
using a transmission-line matrix", Proc. of IEE, vol. 118, pp. 1203-1208, Sept 1971.
[2]. W. J. R. Hoefer, "The transmission-line matrix method: Theory and application", IEEE
Trans. Microwave Theory Tech., vol. MTT-33, pp. 882-893, Oct 1985.
[3]. W. J. Hoefer, "Huygens and the computer - a powerful alliance in numerical
electromagnetics", Proceedings of the IEEE, vol. 79, pp. 1459-1471, October 1991.
[4]. P. B. Johns, "A simple explicit and unconditionally stable numerical routine for the
solution of the diffusion equation", Int. J. Num. Meth. Eng., vol. 11, pp. 1307-1328, 1977.
[5]. C. Partridge, C. Christopoulos, and P. B. Johns, "Transmission line modelling of shaft
dynamic systems", Proceedings of the Institute of Mechanical Engineers, no. 201,
pp. 271-278, 1987.
[6]. V. Trenkic, C. Christopoulos, and J. G. P. Binner, "The application of the transmission line
modelling (tlm) method in combined thermal and electromagnetic problem", Proceedings of
the International Conference on Numerical Methods for Thermal Problems, pp. 1263-1274,
1993.
[7]. F. J. German, G. K. Gothard, L. S. Riggs, and P. M. Goggans, "The calculation of radar
cross-section (rcs) using the tlm method", Int. J. Numerical Modelling, vol. 2, pp. 267-278,
1989.
[8]. K. Umashankar and A. Taflove, "A novel method to analyze electromagnetic scattering
of complex objects", IEEE Transactions on Electromagnetic Compatibility, vol. EMC-24,
pp. 397-405, November 1982.
[9]. W. T. Cochran, J. W. Cooley, D. L. Favin, H. D. Helms, R. Kaenel, W. W. Lang, G. C.
Maling, D. E. Nelson, C. M. Rader, and P. D. Welch, "What is the fast Fourier transform?",
IEEE Trans. Audio Electroacoust., vol. AU-15, pp. 45-55, Jun. 1967.
[10]. M. Waterman and M. Eggert, "A new algorithm for best subsequence alignments with
application to tRNA-rRNA comparisons", Journal of Molecular Biology, vol. 197,
pp. 723-728, 1987.
[11]. W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonte, J. Ahn, N. Jayasena, U. J.
Kapasi, A. Das, J. Gummaraju, and I. Buck, "Merrimac: Supercomputing with streams", In
SC'03, Phoenix, Arizona, November 2003.
[12]. R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K.
Asanovic, "The Vector-Thread Architecture", In Proceedings of the 31st International
Symposium on Computer Architecture, pp. 52-63, Munich, Germany, June 2004.
[13]. B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens,
B. Towles, and A. Chang, "Imagine: Media Processing with Streams", IEEE Micro, pp. 35-46,
March/April 2001.
[14]. Christoforos Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, Krste
Asanovic, Neal Cardwell, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly
Keeton, Randi Thomas, Noah Treuhaft, and Katherine Yelick, "Scalable Processors in the
Billion-Transistor Era: IRAM", Computer, 30(9):75-78, 1997.
[15]. E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P.
Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal, "Baring it all to Software: Raw
Machines", IEEE Computer, pp. 86-93, September 1997.
[16]. K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and
C. R. Moore, "Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture", In
Proceedings of the 30th International Symposium on Computer Architecture, pp. 422-433,
San Diego, California, June 2003.
[17]. D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A.
Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J.
Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa, "The Design and
Implementation of a First-Generation CELL processor", In Proceedings of the IEEE
International Solid-State Circuits Conference, pp. 184-185, San Francisco, California,
February 2005.
[18]. K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz, "Smart Memories:
A Modular Reconfigurable Architecture", In Proceedings of the 27th International Symposium
on Computer Architecture, pp. 161-171, Vancouver, Canada, June 2000.
[19]. A. H. Saleh, "A dedicated processor for solving TLM field problems", PhD Thesis,
University of Nottingham, 1982.
[20]. S. Gregory, "Design of a single bit processor for TLM using full custom IC design",
Dissertation (BEng), University of Nottingham, 1989.
[21]. S. David, "The development of an application specific processor for the Transmission
line matrix method", Loughborough University, 2000.
[22]. http://www.altera.com.cn/products/ip/dsp/transforms/m-ham-ffi-float.html
[23]. http://www.4dsp.com/fft.htm
[24]. N. Mahdavi, R. Teymourzadeh, M. Bin Othman, "VLSI Implementation of High Speed
and High Resolution FFT Algorithm Based on Radix 2 for DSP Application", SCOReD 2007,
5th Student Conference on, pp. 1-4, 12-11 Dec. 2007.
[25]. K. Dabbagh-Sadeghipour, M. Eshghi, "A self-timed, pipelined floating point FFT
processor architecture", Signals, Circuits and Systems, 2003. SCS 2003. International
Symposium on, Volume 1, pp. 33-36, 10-11 July 2003.
[26]. D. Lopresti, "P-nac: systolic array for comparing nucleic acid sequences", Computer,
vol. 20, no. 7, pp. 98-99, Jul. 1987.
[27]. Y. Yamaguchi, T. Maruyama, and A. Konagaya, "High speed homology search with
FPGAs", in Pacific Symposium on Biocomputing, pp. 271-282, 2002.
[28]. T. F. Oliver, B. Schmidt, and D. L. Maskell, "Hyper customized processors for
bio-sequence database scanning on FPGA", in FPGA, pp. 229-237, 2005.
[29]. T. Han, S. Parameswaran, "SWASAD: an ASIC design for high speed DNA sequence
matching", Design Automation Conference, 2002. Proceedings of ASP-DAC 2002. 7th Asia
and South Pacific and the 15th International Conference on VLSI Design. Proceedings.
pp. 541-546, 7-11 Jan. 2002.
[30]. D. Densmore, R. Passerone, A. Sangiovanni-Vincentelli, "A Platform-Based Taxonomy
for ESL Design", IEEE Design and Test of Computers, pp. 359-374, September 2006.
[31]. C2R Compiler product brief, www.cebatech.com
[32]. S. Gupta, N. D. Dutt, R. K. Gupta, A. Nicolau, "SPARK: A High-Level Synthesis
Framework For Applying Parallelizing Compiler Transformations", International
Conference on VLSI Design, pp. 461-466, January 2003.
[33]. J. L. Tripp, K. D. Peterson, C. Ahrens, J. D. Poznanovic, M. B. Gokhale, "Trident: an
FPGA compiler framework for floating-point algorithms", Field Programmable Logic and
Applications, 2005. International Conference on, pp. 317-322, 24-26 Aug. 2005.
[34]. Catapult Compiler Product brief,
[35]. http://www.mentor.com/products/esl/high_level_synthesis/catapult_synthesis/index.cfm
Chapter 2
Architectural exploration for programmable approach in 
TLM case 
2.1 Transmission line 
A transmission line is a device for transmitting or guiding energy from one point to another
[1][2]. The energy may be carried by electromagnetic waves or acoustic waves, or it may be
in the form of signal information (speech, pictures, data, music). For the purposes of analysis,
an electrical transmission line can be modelled as a two-port network (also called a
quadripole network), as follows:
[Figure: a transmission line drawn as a two-port block between Port A and Port B]
Figure 2.1 Transmission Line
2.1.1 Modeling Transmission lines with lumped components 
As discussed before, a transmission line is a distributed system. To illustrate how a wave
propagates in a distributed system, it is convenient to model it as a series of lumped systems.
A lumped system is a system whose physical size is taken as zero, so a wave can travel
through it without any delay. A transmission line modeled as a series of lumped systems is
shown in figure 2.2.
Figure 2.2 The modeling of a transmission line using a series of interconnected lumped
systems.
To do this modeling, the whole transmission line is divided into small interconnected
transmission lines of length $\Delta x$. The length of each small transmission line is selected
so that the wavelength of the signal in the medium is much larger than the size of each
section. If the wavelength is $\lambda$, then:
$\lambda \gg \Delta x$    (2.1)
In a typical linear transmission line, each of these lumped systems can be modeled as a
combination of an inductor, a capacitor and resistors. In the case that the
transmission line is lossless, there is no resistor in the model.
Figure 2.3 Modeling a small section of a transmission line by discrete lumped
components
2.1.1.1 Modeling a lumped system with discrete components
When a transmission line is broken into small sections which are much smaller than the
wavelength of the highest frequency in the transmission line, each section can be replaced
by a lumped system. Furthermore, this lumped system can be modeled as described in
figure 2.3.
The values of C and L depend on the transmission line characteristics.
2.1.1.2 Dispersion in a transmission line 
As figure 2.2 and figure 2.3 show, a long lossless transmission line can be modeled by a
series of connected C and L sections. Since the overall dispersion effect is the sum of the
dispersion effects of the small sections, the dispersion of the transmission line can be found
by calculating the dispersion effect of each small section.
The relation between input and output voltages can be found as follows:
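$\begin{bmatrix} V_i \\ I_i \end{bmatrix} = T \begin{bmatrix} V_o \\ I_o \end{bmatrix}$    (2.2)
where the subscripts $i$ and $o$ represent input and output respectively. $T$ is a 2x2 matrix
called the transmission matrix, which has the form:
$T = \begin{bmatrix} \cos\beta l & jZ\sin\beta l \\ \dfrac{j\sin\beta l}{Z} & \cos\beta l \end{bmatrix}$    (2.3)
in which $\beta$ is the phase constant and $l$ is the length of each section.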
The model of figure 2.3 can be divided into 3 segments, as shown in figure 2.4.
[Figure: the T model split into three cascaded segments]
Figure 2.4 Dividing a T model into 3 segments
Since the T section is the cascade of three sections, $T$ can be calculated by the equation
below:
$T = T_1 T_2 T_3$    (2.4)
where $T_1$, $T_2$ and $T_3$ are the transmission matrices of the three sections. In figure 2.4,
sections 1 and 3 are the same, therefore their transmission matrices are the same. The
transmission matrix for these two sections is:
$T_1 = T_3 = \begin{bmatrix} \cos\dfrac{\omega\Delta t}{2} & jZ_L\sin\dfrac{\omega\Delta t}{2} \\ \dfrac{j}{Z_L}\sin\dfrac{\omega\Delta t}{2} & \cos\dfrac{\omega\Delta t}{2} \end{bmatrix}$    (2.5)
where $\omega$ is the angular frequency of the wave in the medium and $\Delta t$ is the time for
the wave to travel through the section.
The matrix used to describe the middle section, $T_2$, is:
$T_2 = \begin{bmatrix} 1 & 0 \\ \dfrac{j}{Z_C}\tan\dfrac{\omega\Delta t}{2} & 1 \end{bmatrix}$    (2.6)
If we substitute the values of $T_1$, $T_2$ and $T_3$ into equation 2.4, find the value of $T$
and compare it with the general form of $T$ described in equation 2.3, the transmission
parameter for the section can be calculated:
$\cos(\beta l) = 1 - 2\sin^2\left(\dfrac{\omega\Delta t}{2}\right)\left[1 + \dfrac{Z_L}{2Z_C}\right]$    (2.7)
As shown in this equation, the phase constant $\beta$ is related to $\omega$, the angular
frequency of the wave.
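The step from the matrix product to equation 2.7 can be checked by expanding only the top-left element of the cascade. The sketch below assumes that the outer sections have the lossless-line form with impedance $Z_L$ and that the middle section is a pure shunt element with lower-left entry $j\tan(\omega\Delta t/2)/Z_C$; the shorthand $\theta$, $c$ and $s$ is introduced here for brevity and is not part of the original text.

```latex
% Shorthand (assumed here): \theta = \omega\Delta t / 2, \; c = \cos\theta, \; s = \sin\theta
\begin{aligned}
T_1 T_2 &=
\begin{bmatrix} c & jZ_L s \\ \dfrac{j s}{Z_L} & c \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \dfrac{j\tan\theta}{Z_C} & 1 \end{bmatrix}
=
\begin{bmatrix}
c - \dfrac{Z_L}{Z_C}\, s\tan\theta & jZ_L s \\[2mm]
\dfrac{j s}{Z_L} + \dfrac{j c\tan\theta}{Z_C} & c
\end{bmatrix} \\[2mm]
(T_1 T_2 T_3)_{11}
&= \Bigl(c - \dfrac{Z_L}{Z_C}\, s\tan\theta\Bigr)\, c + (jZ_L s)\,\dfrac{j s}{Z_L}
 = c^2 - s^2 - \dfrac{Z_L}{Z_C}\, s^2 \\
&= 1 - 2 s^2 \Bigl[\, 1 + \dfrac{Z_L}{2 Z_C} \Bigr]
\end{aligned}
```

Equating this element with the top-left element $\cos\beta l$ of the general transmission matrix gives the dispersion relation quoted above.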
2.2 TLM modeling and its acceleration 
2.2.1 TLM modeling 
In 1971, Johns first used a mesh of passive transmission line components to model
Huygens' principle by sampling space [3]. Wave propagation was modeled as voltages and
currents travelling in this mesh. Time was also sampled. Wave propagation in a
two-dimensional transmission line mesh can be described by formula 2.8, where $\Delta T$ and
$\Delta l$ represent the time and space intervals respectively and $C$ is the wave speed in the
medium:
$\Delta l = \Delta T \, C$    (2.8)
If an impulse is applied to the middle node at time zero, a wave scatters to the
4 neighboring nodes (figure 2.5(a)). The scattered wave reaches the neighboring nodes at
time $\Delta T$ (figure 2.5(b)). Then four waves start to travel to their neighboring nodes
(figure 2.5(c)). At time $2\Delta T$, the wave front can be found from the waves scattered from
the points in figure 2.5(b), as shown in figure 2.5(d). At the end of each time interval, each
node receives incident waves from its neighbors and also scatters waves to its neighbors. By
repeating the above calculation, the wave distribution on the mesh can be calculated.
2.2.1.1 Two-dimensional TLM modeling 
Depending on the required accuracy and the complexity of the medium being modeled, one,
two or three dimensional TLM modeling can be chosen. Since two-dimensional TLM modeling
is the most intuitive, it is explained in depth first. One and three dimensional models
will be briefly explained later on.
[Figure 2.5 panels: (a) At time t=0 the centre point scatters a wave; (b) Wave front at time
t=ΔT; (c) At time t=ΔT, the points on the wave front scatter waves; (d) At time t=2ΔT, the
new wave front forms from waves scattering from points]
Figure 2.5 Wave propagation in a two-dimensional TLM mesh
2.2.1.1.1 Two-dimensional TLM modeling of a homogeneous medium
When all parts of a medium share the same properties, the medium is termed
homogeneous. To model homogeneous media, we can use the simple mesh shown in figure
2.5, since there is no change in the medium properties. The mesh described in figure 2.5
consists of several nodes. The model of one node is shown in figure 2.6. In this figure, $V^i$
represents the incident wave (voltage in the transmission line) and $V^s$ represents the
scattered wave.
[Figure 2.6 panels: representation of a node in a mesh, with ports 1 to 4 and the incident
and scattered voltages on each port; real model]
Figure 2.6 Model of node in a mesh
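The equation which describes the relation between the incident wave and the scattered
wave is:
$\begin{bmatrix} V_1 \\ V_2 \\ V_3 \\ V_4 \end{bmatrix}^{s}_{K+1} = \dfrac{1}{2}\begin{bmatrix} -1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & 1 & 1 & -1 \end{bmatrix}\begin{bmatrix} V_1 \\ V_2 \\ V_3 \\ V_4 \end{bmatrix}^{i}_{K}$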
The superscript i denotes the incident wave voltages and the superscript s denotes the
scattered wave voltages; K and K+1 stand for consecutive time steps separated by the sample
interval $\Delta T$. According to this equation, if the magnitude of the wave is known at time
$K\Delta T$, then the magnitude of the wave in the mesh can be calculated at time
$(K+1)\Delta T$. Wave propagation can be modeled by repeating this for each time step.
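The scattering step for a single node can be sketched in C as follows. This is a minimal illustration that assumes the four-port lossless node described above, with scattering matrix entries of -1/2 on the diagonal and +1/2 elsewhere; the names tlm_scatter_node and TLM_PORTS are hypothetical and do not come from the TLM code used in this thesis.

```c
#include <assert.h>
#include <stddef.h>

#define TLM_PORTS 4  /* ports of one 2D node (hypothetical name) */

/* Scatter one node: v_s = 0.5 * S * v_i, where S has -1 on the diagonal and
 * +1 elsewhere. Summing the incident pulses once avoids the full matrix
 * product, because 0.5 * (sum - 2 * v_i[p]) == 0.5 * sum - v_i[p]. */
void tlm_scatter_node(const double v_i[TLM_PORTS], double v_s[TLM_PORTS])
{
    assert(v_i != NULL && v_s != NULL);

    double sum = 0.0;
    for (size_t p = 0; p < TLM_PORTS; ++p)
        sum += v_i[p];

    for (size_t p = 0; p < TLM_PORTS; ++p)
        v_s[p] = 0.5 * sum - v_i[p];
}
```

For a unit impulse incident on one port, the sketch returns -0.5 on that port and 0.5 on the other three, which agrees with the pulse values drawn in figure 2.5.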
2.2.1.1.2 Two-dimensional TLM modeling of a non-homogeneous medium
When TLM is used to model wave propagation in a heterogeneous medium, the properties of
the medium should be properly reflected. To address this, a new model for a node is
described in figure 2.7 [4] [5].
When there is loss in the medium, a resistor (R) is added in parallel with the capacitor
(C), as shown in figure 2.8, to model the loss [6].
[Figure 2.7 panels: representation in a mesh, with ports 1 to 4; real model]
Figure 2.7 Model for a node of a non-homogeneous medium
Figure 2.8 Model for a node in a non-homogeneous medium when the medium has
some losses
Figures 2.1-2.8 were modified from M. Ahmadian, "Transmission Line Matrix (TLM) modelling
of medical ultrasound", PhD thesis, 2001.
2.2.2 Literature for TLM acceleration 
2.2.2.1 Distributed Computing for TLM 
Building a distributed system for TLM is a relatively easy approach to accelerating the
computation, but the efficiency of this method is theoretically constrained. For example, a
large part of the time would be consumed by message passing. In practice, several efforts
have been made in this direction [7, 8, 9, 10, 11].
2.2.2.2 Dedicated TLM processor 
Existing dedicated processors designed for TLM fall into two categories: the programmable
approach, which utilizes software to perform the related calculations, and the hardwired
approach, onto which the TLM algorithm is mapped directly.
2.2.2.2.1 Programmable approach 
Saleh developed the coprocessor approach [12] in 1982. Two-dimensional TLM code was
used as the benchmark. A host system runs the main TLM code. Each time a scatter operation
is encountered, the data is passed to the TLM coprocessor, which performs the scattering
computation and returns the scattered data to the host. The TLM array is defined in
software, therefore any mesh size may be accommodated, depending on the size of memory.
An instruction register and several control registers provide some control over the
configuration of the processor for each node. All data regarding the composition of the
array and the material properties at each node are stored in the host processor. The
coprocessor only has a stack for holding partial results during calculations. The performance
improvement is achieved by performing the scatter calculations in dedicated hardware,
since there is no need for instruction fetch and decode cycles. Software running on an
LSI-11 computer achieved 62 node iterations per second, whereas the same software on an
LSI-11 with the coprocessor achieved 1670 node iterations per second [12], a speedup of
about 27 times. The bus which connects the host processor and the coprocessor is 16 bits
wide. Therefore, only 16-bit floating point can be implemented in this case. With only an
8-bit mantissa, the accuracy is low.
2.2.2.2.2 Hardwired approach 
Gregory designed a complete application specific, parallel TLM system [13] in 1989. The
system was developed around an array of bit-serial processing elements (PEs), which
performed a basic 2D scattering operation on their input data. A number of PEs were
arranged in an SIMD array onto which the TLM mesh was mapped, with one node per PE. The
most recent work in this direction was done by Stothard [14] in 2000. A dedicated TLM
processor was developed for different variants of TLM (2D shunt node, the stub loaded
shunt node, the symmetrical SCN, SCNN) and implemented on an FPGA (Xilinx XC4000)
platform.
2.2.2.2.3 Discussion 
Both methods improve throughput over the software version, but each has its own
limitations.
In the programmable approach, while the system has the programmability required by
TLM, it does not utilize the parallelism which inherently exists in TLM. The design is also
largely constrained by the technology of its time.
In contrast, while the hardwired approach may have the best theoretical performance, it is
difficult to accommodate programmability in this architecture. It is likewise constrained by
the technology of its time. The trade-off between programmability and throughput is an
important point for study.
2.3 Simulation Infrastructure 
2.3.1 CPU simulator 
To predict the performance of a CPU system, relevant simulators are needed. Simulation can
reveal the dynamic characteristics of the hardware model and of the workload that executes
on it, and allows rapid design space exploration. Such a hardware model can be implemented
in C/C++ or in hardware description languages such as Verilog and VHDL. Additionally, CPU
models allow the software to be developed before the hardware is available [15, 16, 17].
Normally, such software models execute considerably slower than their hardware
equivalents; however, they can be built in a very short time [18]. There is a trade-off among
abstraction level, execution time and flexibility of a CPU simulator. Simulators can be
classified as trace driven [18] or execution driven [19]. In trace driven simulation, a stream
of pre-recorded instructions is used to drive a hardware timing model. Trace driven
simulation is faster than execution driven simulation, but it requires a large amount of
storage and large time overheads, as traces can contain billions of references. Additionally,
it can be less accurate due to the difficulty of characterizing the behavior of real programs.
On the other hand, execution driven simulation has greater accuracy, as the execution of the
program and the simulation of the architecture are closely related. The typical result of this
type of simulation is a large number of statistics that help to understand the behavior of the
simulated system, together with an estimated execution time. The drawback of execution
driven simulation is its high model complexity.
CPU simulators can also be classified as instruction-accurate simulators [16, 18, 20] or
cycle-accurate simulators [21, 22], depending on their abstraction level. An
Instruction-accurate Simulator (ISS) imitates the behavior of a CPU by "executing"
instructions and maintaining internal variables which represent the processor's registers; it
does not contain pipeline detail or timing information [16, 20]. On the other hand, a
Cycle-Accurate Simulator (CAS) can perform timing analysis. It is more complicated to
develop, since it contains a greater amount of detail than an ISS. In terms of flexibility, a
different CAS needs to be developed for every new implementation of an architecture,
whereas an ISS needs only minor changes between implementations of the same
architecture [22].
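The idea of an instruction-accurate simulator can be illustrated with a very small sketch. The three-instruction ISA below is purely hypothetical (it is neither PISA nor MIPS), and all names are introduced only for this example; the point is that the simulator keeps registers as plain variables and counts instructions, with no pipeline or cycle model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA used only for illustration. */
enum toy_opcode { TOY_ADDI, TOY_ADD, TOY_HALT };

struct toy_insn {
    enum toy_opcode op;
    uint8_t rd, rs, rt;   /* destination and source register indices */
    int32_t imm;          /* immediate operand, used by TOY_ADDI */
};

struct toy_cpu {
    int32_t  regs[8];     /* architectural registers held as variables */
    size_t   pc;          /* index of the next instruction */
    uint64_t insn_count;  /* executed instructions; no timing information */
};

/* Execute until TOY_HALT or the end of the program.
 * Returns the number of instructions executed so far. */
uint64_t toy_iss_run(struct toy_cpu *cpu, const struct toy_insn *prog, size_t len)
{
    assert(cpu != NULL && prog != NULL);

    while (cpu->pc < len) {
        const struct toy_insn *in = &prog[cpu->pc++];
        cpu->insn_count++;
        switch (in->op) {
        case TOY_ADDI:
            cpu->regs[in->rd & 7] = cpu->regs[in->rs & 7] + in->imm;
            break;
        case TOY_ADD:
            cpu->regs[in->rd & 7] = cpu->regs[in->rs & 7] + cpu->regs[in->rt & 7];
            break;
        case TOY_HALT:
            return cpu->insn_count;
        }
    }
    return cpu->insn_count;
}
```

A cycle-accurate simulator would additionally have to model pipeline stages, stalls and memory latency for each of these instructions, which is why it is tied more closely to one implementation.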
2.3.2 SimpleScalar toolset 
SimpleScalar [23] (figure 2.9) is a well-known toolkit for CPU simulation at different
abstraction levels. It provides modules that model typical components of CPUs as well as
tools for data collection. Designers can create their own custom simulators by utilizing the
tools existing in the toolset.
[Figure: block diagram with the blocks User Programs, Prog/Sim Interface, Functional Core
and Performance Core]
Figure 2.9 The composition of SimpleScalar [23]
The ISA implemented in SimpleScalar is known as the Portable Instruction Set Architecture
(PISA), which is based on MIPS. Additional features in PISA include the extension of the
opcode from 32 to 64 bits, which provides flexibility for wider ISA studies. In the
SimpleScalar toolset, five different simulators are defined, each with a different trade-off
between simulation speed and detail. The two extremes are named Sim-fast and
Sim-outoforder. As the name indicates, Sim-fast has the fastest simulation speed, roughly
9+ million instructions per second (MIPS) on a typical x86 Linux-based machine [23]. To
achieve this speed, only instruction level information is provided; all timing measurements
have been removed. At the other extreme, Sim-outoforder executes at the cycle-accurate
level. Many features such as out-of-order issue and branch prediction are also enabled.
2.3.3 Sim-system
The original SimpleScalar toolset was developed for modelling a uni-processor system. In
order to model a multiprocessor system, Sim-fast was heavily modified to create Sim-system.
Sim-system not only includes multiple processor contexts but also supports the addition of
extra registers and custom instructions within each processor.
typedef struct {
  sword_t hi, lo;        /* multiplier HI/LO result registers */
  int fcc;               /* floating point condition codes */
  unsigned int PRID;     /* Processor ID register */
  int PSTATER;           /* Processor state */
} md_ctrl_t;
Figure 2.10 Sim-system's C definition of the control registers declared for each simulated
CPU.
Figure 2.10 illustrates the control registers for each simulated processor in Sim-system. The
PISA architecture has been extended with two additional registers, PRID and PSTATER.
These two registers contain the (private) processor ID number and the current processor
state respectively. The processor ID is a unique number which is used to label each
processor. In Sim-system, each processor can be in one of two states, RUN and SLEEP. In the
SLEEP state, the processor does not run until an external arbiter sends it to the RUN state.
The second additional control register implemented in Sim-system is the processor state
register. This register contains the state of the individual processor context, such as the
sleep or run state. Like the processor ID, this register can be accessed by the application
source which is running on the simulator.
#define CSTATE(context, state) \
({ \
  asm volatile ("addu $10,%0,$0" : : "r" (context) : "$10"); \
  asm volatile (".word 0x00010000"); \
  asm volatile (".word \
      4 << 29 |    /* EXT_OPCODE */ \
      2 << 25 |    /* CATEGORY */ \
      15 << 20 |   /* OPCODE */ \
      10 << 15 | \
      "#state" << 10" : : : "$10"); \
});
Figure 2.11 Assembly macro used to set the processor's run state.
Figure 2.11 illustrates the assembly macro which is used to set the state of each processor.
It takes the processor ID and the desired state as inputs. By using this macro, it is possible
for one processor to control the state of any other processor.
#define GET_PRIDR(var) \
({ \
  asm volatile (".word 0x00010000"); \
  asm volatile (".word \
      1 << 29 |    /* EXT_OPCODE */ \
      2 << 25 |    /* CATEGORY */ \
      15 << 20 |   /* OPCODE */ \
      12 << 15" : : : "$12"); \
  asm volatile ("addu %0,$12,$0" : "=r" (var)); \
});
Figure 2.12 Assembly macro used to obtain the processor ID number of the calling
context.
To allow a processor to read its own processor ID register, the macro GET_PRIDR is
implemented as described in figure 2.12. The first two assembly instructions within the
macro specify the extended instruction under the NOP opcode annotation.
#define BARRIER1_SLEEP \
({ \
  extern int gsem; \
  gsem++; \
  if (!context) \
    while (gsem < XC_MAX) \
      ; \
  if (!context) \
    gsem = 0; \
  if (context) \
    CSTATE(context, 1); \
});
Figure 2.13 C-macro implementing the sleep stage of the barrier instruction
The barrier mechanism is used to synchronize all processors in the multiprocessor system. It
can be divided into two stages, the sleep stage (figure 2.13) and the synchronous release
stage. When a processor reaches the barrier, it enters the sleep state. Once all processors
have arrived at the barrier (gsem == XC_MAX), the CSTATE macro is used to release all
processors.
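The two barrier stages can be modelled in ordinary C without any custom instructions. The sketch below is a single-threaded software model, not the Sim-system implementation: the struct, the function names and the value of XC_MAX are assumptions made for illustration, and the release stage is inferred from the description above because only the sleep stage is listed in figure 2.13.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define XC_MAX 4   /* number of simulated contexts (assumed value) */

enum ctx_state { CTX_RUN, CTX_SLEEP };

struct barrier_model {
    int gsem;                      /* arrival counter, as in the sleep stage */
    enum ctx_state state[XC_MAX];  /* stands in for each context's PSTATER */
};

/* Sleep stage: a context arrives at the barrier. Every context except 0
 * goes to sleep; context 0 stays awake so that it can poll the counter. */
void barrier_arrive(struct barrier_model *b, int context)
{
    assert(b != NULL && context >= 0 && context < XC_MAX);
    b->gsem++;
    if (context != 0)
        b->state[context] = CTX_SLEEP;
}

/* Release stage, run by context 0: once every context has arrived, reset
 * the counter and wake all sleepers. Returns true when a release happened. */
bool barrier_try_release(struct barrier_model *b)
{
    assert(b != NULL);
    if (b->gsem < XC_MAX)
        return false;
    b->gsem = 0;
    for (int c = 0; c < XC_MAX; ++c)
        b->state[c] = CTX_RUN;
    return true;
}
```

In the simulator the polling loop and the state changes are carried out by the contexts themselves; here they are plain function calls so that the ordering of arrivals can be stepped through by hand.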
2.3.4 Sim-vector
In order to design and verify new DLP instruction sets, Sim-vector was developed. The
Sim-fast simulator in SimpleScalar was modified to create a simulator capable of simulating
a vector architecture.
typedef struct {
  /* Vector length registers */
  int VLEN;      // 16-bit elements
  int VLEN8;
  int VLEN32;
  /* Main cpu int registers */
  signed int RF[INT_REGS];
  /* Vector register file */
  signed short int VRF[VECTOR_REGS][VLMAX];
  float VFPRF[VECTOR_FP_REGS][VLMAX32];
  /* Vector accumulators */
  signed int VACC[VACCUMULATORS][VACC_ELEMENTS];
  /* Predicate registers */
  unsigned short int PRED_T[PRED_REGS][VLMAX];
  unsigned short int PRED_F[PRED_REGS][VLMAX];
  /* Scalar registers */
  signed int SRF[SCALAR_REGS];
  /* Overflow flag */
  unsigned short int overflow[VLMAX];
  /* Carry flag */
  unsigned short int carry[VLMAX];
  /* VECTOR OVERFLOW */
  unsigned short int VV;
} vstateT;
Figure 2.14 Vector structure defining the registers required for vector simulation within
sim-vector
Figure 2.14 describes the structure vstateT, which contains all the required vector registers.
From this declaration, each vector register holds 16 bits x VLMAX or 32 bits x VLMAX32,
for the integer and floating point registers respectively.
/* Maximum vector length (16-bit elements) */
#define VLMAX 200
// Max FP reg file
#define VLMAX32 200
typedef signed short int VECTOR[VLMAX];
Figure 2.15 Vector length definition within sim-vector
The lengths of both the integer and floating point register files are described using #define
macros in figure 2.15. After the registers are defined, vector instructions are added to the
simulator simply by placing their C or assembly definitions into an empty opcode.
The major architectural features are the VLMAX parameter and the VRF and VLEN registers.
The parameter VLMAX is a constant that represents the maximum width of the vector
registers. In the proposed architecture, each vector element is specified to be 32 bits in
length and therefore a vector register is VLMAX*32 bits wide. The vector register file, VRF,
which contains the working data, is used by the vector ALU. A single 32-bit register named
the vector length register is used to store the current vector length. Since it specifies which
data in the VRF is valid and should be operated on, this system-wide register is of great
importance to the vector instructions.
typedef struct {
  // Scalar integer registers
  signed int RF[INT_REGS];
} sstateT;
Figure 2.16 C model of scalar register
Apart from the vector registers, the C model of the vector architecture also needs a scalar
register file, RF, which contains INT_REGS 32-bit registers, as described in figure 2.16.
Custom vector instructions are modelled by describing their exact functionality in C. Vector
load and store instructions are defined first; they allow vector operands to be loaded
from main memory into the VRF and stored from the VRF back to memory. After the vector
load/store instructions have been modelled, the vector ALU instructions are implemented (since
this process is application dependent, it is described in the next section). The above
method describes how a configurable vector processing engine was created based on the
SimpleScalar toolset. Within this custom simulator, it is possible not only to design different
vector instructions but also to use these instructions within the targeted application and
obtain the instruction count from the simulation to evaluate their suitability.
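As a rough illustration of the modelling style described above, the following C sketch shows how a vector register file, the vector length register and a pair of load/store instructions might be written as plain functions. The field names VFPRF and VLEN32 follow figure 2.23; NUM_VREGS, set_vlen, vload and vstore are hypothetical names chosen for this sketch and are not taken from sim-vector.

```c
#define VLMAX32   200  /* maximum FP vector length, as in figure 2.15 */
#define NUM_VREGS 8    /* hypothetical number of vector registers */

/* Architectural state of the vector unit (sketch only). */
typedef struct {
    float    VFPRF[NUM_VREGS][VLMAX32]; /* floating-point vector register file */
    unsigned VLEN32;                    /* current vector length in elements */
} vstateT;

static vstateT vstate;

/* Write the vector length register, clamped to the register width. */
static void set_vlen(unsigned n)
{
    vstate.VLEN32 = (n > VLMAX32) ? VLMAX32 : n;
}

/* Vector load: copy VLEN32 elements from memory into register rd. */
static void vload(int rd, const float *addr)
{
    for (unsigned i = 0; i < vstate.VLEN32; i++)
        vstate.VFPRF[rd][i] = addr[i];
}

/* Vector store: copy VLEN32 elements from register rs back to memory. */
static void vstore(int rs, float *addr)
{
    for (unsigned i = 0; i < vstate.VLEN32; i++)
        addr[i] = vstate.VFPRF[rs][i];
}
```

Only the first VLEN32 elements are touched by either operation, which is the sense in which the vector length register decides which data in the register file is valid.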
2.4 Thread and data level parallelism exploration and simulation results
2.4.1 Reference Code 
We have utilized a basic implementation of the SCN TLM algorithm [24] in which no external
boundary conditions were used. A 3D space is modelled in the reference code. For testing, a
single output node was used as a diagnostic aid to verify correct operation. The accelerated
scatter method of Naylor and Ait-Sadi as proposed in [25] is implemented. The code used for
benchmarking the performance was a fixed mesh of 1,000,000 nodes [26].
This number is convenient as it has a prime factorization of 2^6 x 5^6, which allows the
aspect ratio of the problem space to be varied over a reasonable range whilst maintaining the
same number of nodes.
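The role of the factorization can be checked with a small C sketch. The helper count_mesh_shapes is a hypothetical name introduced here for illustration; it counts the ordered X x Y x Z meshes whose node count equals a given total.

```c
/* Count ordered triples (x, y, z) of positive integers with x*y*z == n. */
static unsigned long count_mesh_shapes(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long x = 1; x <= n; x++) {
        if (n % x != 0)
            continue;                 /* x must divide the node count */
        unsigned long rest = n / x;   /* nodes left for the Y-Z plane */
        for (unsigned long y = 1; y <= rest; y++) {
            if (rest % y == 0)
                count++;              /* z = rest / y is then an integer */
        }
    }
    return count;
}
```

For 1,000,000 nodes the function reports 784 ordered shapes, which include both the cubic 100 x 100 x 100 mesh and the thin 2 x 2 x 250,000 mesh used later in this chapter.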
2.4.2 Thread level parallelizing 
Initial profiling shows that thread level parallelism is abundant in the TLM algorithm. To quantify
its benefit using Sim-system, the targeted serial workload needs to be parallelized, as figure
2.17 shows.
Serial Execution
Parallel Execution
Figure 2.17 Parallelizing Loop 
Profiling of the TLM code shows that most of the computation is spent in its scatter and
connect operations. The proposed method is to partition the scatter and connect functions into
sub-sections, each of which is executed by a single thread. In this way, the sections of code
execute concurrently on a multi-processor system, and the workload is parallelized.
When partitioning a loop, the first step is to ensure that there is no dependence between the
iterations of the loop. Consider the following code:
for (i=1;i<=1000;i=i+1)
x[i]=x[i]+y[i];
which just adds two 1000-element arrays and stores the sum into one of the input arrays. Every
iteration of the loop can be overlapped with any other iteration; they are independent. However,
if the code is changed to
for (i=1;i<=1000;i=i+1)
x[i+1]=x[i]+y[i];
then a dependence exists between iterations, and there is no longer any thread level
parallelism. In the proposed partition technique, the variable "time" is defined to contain the
size of each piece of the loop. An example is used to explain the partition process.
The original loop is
for (k=0; k<Z; k++)
{l=l+1;}
After partitioning, it is modified to
time=Z/XC_MAX;
for (k_old=0; k_old<time; k_old++)
{k=context+(k_old*XC_MAX);
l=l+1;}
The statement "time=Z/XC_MAX" sets the size of each piece of the loop, where Z is the total
number of iterations of the loop. XC_MAX is defined by the simple-system as the
parametric number of threads of the program. The variable k_old replaces the variable
k in each piece of the loop. The instruction "k=context+(k_old*XC_MAX)" assigns k
the same value that it would take when running sequentially.
In case the division "time=Z/XC_MAX;" leaves a remainder, another piece of code is
required to execute the remaining iterations:
if (context<(Z%XC_MAX))
{k=context+(time*XC_MAX);
l=l+1;}
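The partition scheme above can be checked with a short C sketch that replays it for every thread and confirms that each original iteration is executed exactly once. The function names, the hits array and the bound MAX_Z are hypothetical and exist only for this illustration; the loop body is replaced by a counter.

```c
#define MAX_Z 1024  /* upper bound on the loop length for this sketch */

/* Iterations executed by one thread ("context") of the partitioned loop. */
static void partitioned_loop(int context, int Z, int xc_max, int *hits)
{
    int time = Z / xc_max;                 /* full rounds per thread */
    for (int k_old = 0; k_old < time; k_old++) {
        int k = context + (k_old * xc_max);
        hits[k]++;                         /* stands in for the loop body */
    }
    if (context < (Z % xc_max)) {          /* leftover iterations */
        int k = context + (time * xc_max);
        hits[k]++;
    }
}

/* Returns 1 if every k in [0, Z) is executed exactly once, 0 otherwise. */
static int covers_exactly_once(int Z, int xc_max)
{
    int hits[MAX_Z] = {0};
    if (Z < 0 || Z > MAX_Z || xc_max < 1)
        return 0;
    for (int context = 0; context < xc_max; context++)
        partitioned_loop(context, Z, xc_max, hits);
    for (int k = 0; k < Z; k++)
        if (hits[k] != 1)
            return 0;
    return 1;
}
```

The iterations are dealt out cyclically, so thread 0 takes k = 0, XC_MAX, 2*XC_MAX and so on, and the remainder branch hands the last Z % XC_MAX iterations to the lowest-numbered threads.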
So far, in parallelizing the loop, efforts have concentrated on loop manipulation and not on
the functionality present within the loop. The next stage is to examine the memory activity
within the loop, both the reads and the writes, and to perform any modification required to
ensure exclusivity between nodes.
In parallelized applications, the means of synchronizing the multiple CPUs is the "Barrier"
mechanism. A barrier is a synchronization point. When a barrier is reached, a thread waits
at that point until all other threads have reached the barrier. All threads then resume
executing in parallel the code that follows the barrier. In the TLM code, a barrier is needed at
the end of every parallel section to make sure that all threads resume running in the correct way.
Memory accesses within the loop, both reads and writes, must then be analyzed to ensure that
each variable is accessed exclusively. Several modifications are needed to guarantee
exclusivity. This can be done through the use of private variables, code sections
executed by a specific thread, and semaphores [27]. A private variable is a variable in the shared
memory space that only one thread has the right to access. For example, the variable l in the
example above should be declared as a private variable, which allows each thread to have a
different value for l while still accessing it through a common variable name. Thread selection
code is used to assign a certain section of the code to a specific thread. Finally, by dividing
each iteration into sections and separating them with barrier instructions, all updates to
variables become visible in the next section of code.
shared array[MAX_THREAD];
Figure 2.18 Pseudo declaration of shared memory array 
To maintain exclusive variable access in a given section, shared memory arrays are used.
These are standard arrays declared in shared memory with the number of elements equal to
the number of threads within the system, as shown in figure 2.18. Declaring the array with
one element allocated to each node gives every node a private area of memory
with exclusive access rights. Similarly, because the array is declared in shared (static) memory
space, each node can still access every element of the array.
variable[context ID] + 1            /* parallel computation */
barrier                             /* context synchronisation */
if context ID = 1                   /* serial */
loop x from 0 to MAX_THREAD         /* only context 1 executes */
total += variable[x]                /* this section */
barrier                             /* sync with other contexts */
Figure 2.19 Pseudo serial loop illustrating use of shared memory array 
Figure 2.19 above describes a possible situation in which shared memory arrays
can be useful. A serial operation is performed after the parallel computations, separated from
them by the first barrier. Because barriers separate the parallel and serial sections, all parallel
accesses to the array have completed before the serial reads begin. Therefore, all updates to the
variables are visible in the sequential section.
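The pattern of figure 2.19 can be sketched on a host machine with POSIX threads, assuming a platform that provides pthread_barrier_t (for example Linux). This is not the simulator's own threading interface: worker, run_contexts and the value written by each context are hypothetical choices made for the illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <stdint.h>

#define MAX_THREAD 4

static int variable[MAX_THREAD];   /* shared array, one slot per context */
static int total;                  /* result of the serial section */
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int context = (int)(intptr_t)arg;

    variable[context] = context + 1;     /* parallel: write own slot only */
    pthread_barrier_wait(&barrier);      /* all slots are written after this */

    if (context == 1) {                  /* serial section, one context only */
        for (int x = 0; x < MAX_THREAD; x++)
            total += variable[x];
    }
    pthread_barrier_wait(&barrier);      /* sync with the other contexts */
    return NULL;
}

/* Start MAX_THREAD contexts, wait for them, and return the serial total. */
static int run_contexts(void)
{
    pthread_t tid[MAX_THREAD];

    total = 0;
    pthread_barrier_init(&barrier, NULL, MAX_THREAD);
    for (intptr_t c = 0; c < MAX_THREAD; c++)
        pthread_create(&tid[c], NULL, worker, (void *)c);
    for (int c = 0; c < MAX_THREAD; c++)
        pthread_join(tid[c], NULL);
    pthread_barrier_destroy(&barrier);
    return total;
}
```

Because each context writes only its own element before the first barrier, no lock is needed around the array, and the serial reader is guaranteed to see every element once it has passed that barrier.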
2.4.3 Thread level simulation results 
The performance of the threaded code is measured by its complexity, which is defined as
Complexity = (number of instructions run by the zero context in MT-mode) / (number of instructions of the single-thread code)
The instruction count within the zero context is chosen to evaluate the performance, as this
context also executes all of the non-parallelized workload. Figure 2.20 shows the relative
dynamic instruction count for meshes of 22 x 10 x 10, 22 x 10 x 50, 22 x 10 x 250 and
22 x 10 x 500 nodes respectively [26]. These results show a significant acceleration potential
from exploiting thread level parallelism.
[Plot: relative dynamic instruction count (vertical axis, 0 to 0.6) against thread number (2, 4, 8, 16) for the 22 x 10 x 10, 22 x 10 x 50, 22 x 10 x 250 and 22 x 10 x 500 node meshes]
Figure 2.20 Performance of Thread Level Optimization 
2.4.4 Data level parallelizing 
As TLM is a matrix-based calculation, a large quantity of data can be processed at one
time by a single instruction. The instruction count can therefore be reduced significantly compared
to scalar processors. To quantify the DLP in TLM, the simulation process is carried out using
Sim-vector, as illustrated in figure 2.21.
[Flowchart: Algorithm; Vector and Scalar Instructions; Re-write Code using the Custom Instructions; Run Tests (X86 Mode); Instructions in Inline Assembly; Simulation; Architectural Results]
Figure 2.21 The methodology used to identify data level parallelism in TLM workload 
Figure 2.22 Programmers model for vector and scalar registers 
After profiling the TLM workload, a series of vector instructions [28] was developed on top
of Sim-vector with the pre-defined register file (figure 2.22). The development of the vector ALU
is an intuitive process. By profiling the TLM reference code, the parts of the code that can
potentially be parallelized, and which account for most of the computing complexity, were
identified. Through detailed analysis of the dataflow, a set of vector instructions is proposed,
as table 2.1 shows.
Instruction   Description
MVSR2VLEN     Transfer the data in scalar register to vector length register (VLEN)
MVSR2CSR      Transfer RISC scalar register to coprocessor scalar register
MVCSR2R       Transfer coprocessor scalar register to RISC register
MVSR2CVEL     Move RISC scalar register to coprocessor vector element
MVCVEL2R      Move coprocessor vector element to RISC scalar register
VLDU          Load vector register unaligned under VLEN
VSTU          Store vector register unaligned under VLEN
VPERM         Three-operand bytewise vector permute
VSPLAT        Splat coprocessor scalar register to coprocessor vector register
VFPADD.S      Vector floating-point add (single precision) under VLEN
VFPSUB.S      Vector floating-point sub (single precision) under VLEN
VFPMUL.S      Vector floating-point multiplication (single precision) under VLEN
Table 2.1 Vector ISA extension
To run the simulation at the functional level, C macros were developed to represent the vector
instructions. An example of such an instruction is illustrated in figure 2.23. This method
facilitates the instruction level simulation of the vector coprocessor. To run the simulation
in vector mode, the critical part of the original code needs to be rewritten using the custom
vector instructions and verified to produce the same output as the original code. The remaining
part of the original code was also modified to use the scalar instruction set extensions.
#define VFPMUL(RD, RS1, RS2) \
({int i; \
extern vstateT vstate; \
for (i=0; i<vstate.VLEN32; i++) { \
vstate.VFPRF[RD][i]=vstate.VFPRF[RS1][i] * \
vstate.VFPRF[RS2][i]; \
} });
Figure 2.23 Definition of vector multiplication instruction
After the code has been verified in X86 mode, the custom instructions need to be switched
into inline assembly mode, so that the simulation can run on top of the
SimpleScalar infrastructure and instruction level simulation can be achieved.
Figure 2.24 shows an instruction defined in both the X86 and SimpleScalar modes.
#ifdef X86
#define VLD32(R,ADDR) \
({ \
int i; \
extern vstateT vstate; \
for (i=0; i<vstate.VLEN32; i++) \
{ \
vstate.VFPRF[R][i] = (float) *(ADDR+i); \
} \
/* printf("%f\n", vstate.VFPRF[R][0]); */ \
});
#else
#define VLD32(vrf,addr) \
({ \
asm volatile ("addu $10,%0,$0" : : "r" (addr) : "$10"); \
asm volatile (".word 0x00010000"); \
asm volatile (".word \
2 << 29 | /* EXT OPCODE */ \
2 << 25 | /* CATEGORY */ \
7 << 20 | /* OPCODE */ \
"#vrf" << 15 | /* VRD = VRF */ \
10 << 5"); /* RS2 = HOST REG */ \
});
#endif
Figure 2.24 VLD instruction defined both in C and assembly language
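The instruction word emitted by the inline assembly version can also be assembled in plain C, which is convenient for checking an encoding by hand. The sketch below assumes that the shift amounts (29, 25, 20, 15 and 5) are read correctly from figure 2.24, and that the destination register field is five bits wide because the next field starts at bit 20. The enum names and encode_vld32 are hypothetical.

```c
#include <stdint.h>

/* Field values as printed in figure 2.24. */
enum {
    EXT_OPCODE = 2,   /* shifted to bit 29 */
    CATEGORY   = 2,   /* shifted to bit 25 */
    VLD_OPCODE = 7,   /* shifted to bit 20 */
    HOST_REG   = 10   /* host register $10, shifted to bit 5 */
};

/* Build the 32-bit word for a vector load into vector register vrd. */
static uint32_t encode_vld32(uint32_t vrd)
{
    return ((uint32_t)EXT_OPCODE << 29) |
           ((uint32_t)CATEGORY   << 25) |
           ((uint32_t)VLD_OPCODE << 20) |
           ((vrd & 0x1Fu)        << 15) |   /* assumed 5-bit register field */
           ((uint32_t)HOST_REG   << 5);
}
```

Under these assumptions a load into vector register 3 encodes as 0x44718140.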
2.4.5 Data level parallelism simulation result 
After the vectorized TLM code had been run on Sim-vector, the instruction count was
acquired for different vector lengths. Compared to the scalar code, the result (figure 2.25)
shows a speedup of around 10 times when tested on a mesh of one million nodes with a
vector length of 16 [29].
[Plot: relative dynamic instruction count (vertical axis, 0 to 0.7) against vector length (2 to 16) for the 100 x 100 x 100 and 2 x 2 x 250,000 node meshes]
Figure 2.25 Benchmarking using a thin and a cubic problem space.
In Figure 2.25, it is observed that the optimal configuration is where the simulated problem 
space is thin. 
[Plot: dynamic instruction count (vertical axis, 0 to 0.7) against vector length (2 to 16)]
Figure 2.26 Reductive dynamic instruction count for 80 x 100 x 125, 100 x 125 x 80 and
80 x 125 x 100 node mesh. All of these results show a similar speedup
In figure 2.26, an 80 x 100 x 125 node mesh is compared with meshes of 100 x 125 x 80
and 80 x 125 x 100 nodes. Since the workload was vectorized along the Z direction, some
differences are observed. These mesh dimensions are typical of a realistic scattering situation
(e.g. EM simulation of a vehicle for EMC) [29]. It is observed that the vector alignment
changes the complexity by only a small amount. In all cases, a vector length of 16 gives
about one order of magnitude speedup. A very important observation is that the TLP benefit is
orthogonal to that of DLP. This potentially leads to substantial runtime benefits per
processor-coprocessor.
2.5 Summary 
This chapter focused on the architectural level exploration of a CPU-based TLM processor.
Since a large part of this thesis is about the hardware realization of TLM, the TLM algorithm was
introduced first. Two-dimensional TLM modelling was explained in detail to build a basic
understanding of the TLM algorithm. The background of programmable architectures was also
briefly introduced. To facilitate the architectural level investigation, custom CPU
simulators were developed based on the open-source SimpleScalar toolset. After optimizing the
TLM reference code, the parallelized TLM code was run on the custom simulators. The
results demonstrated that abundant parallelism exists in the TLM workload. By
exploiting this parallelism, the performance of CPU-based TLM could potentially be increased.
Reference:
[1] K. D. John, Electromagnetics. McGraw Hill, 3rd edition, 1984.
[2] K. D. John, Electromagnetics. McGraw Hill, 4th edition, 1992.
[3] P. B. Johns and R. L. Beurle, "Numerical solution of 2-dimensional scattering problems using a transmission-line matrix", Proc. of IEE, vol. 118, pp. 1203-1208, Sept. 1971.
[4] P. B. Johns, "Application of the transmission-line matrix method to homogeneous waveguides of arbitrary cross-section", Proc. Inst. Elec. Eng., vol. 119, pp. 1086-1091, Aug. 1972.
[5] P. B. Johns, "The solution of inhomogeneous waveguide problems using a transmission line matrix", IEEE Trans. Microwave Theory Tech., vol. MTT-22, pp. 209-215, March 1974.
[6] S. Akhtarzad and P. Johns, "Transmission line matrix solution of waveguides with wall losses", Electron. Lett., vol. 9, pp. 335-336, July 1973.
[7] P. So, C. Eswarappa, and W. Hoefer, "Transmission line matrix method on massively parallel processor computers", Progress in Applied Computational Electromagnetics, 9th Annual Review, pp. 467-474, 1993.
[8] P. O. Luthi, B. Chopard, and J. Wagen, "Wave propagation in urban microcells: a massively parallel approach using the TLM method", Applied Parallel Computing in Physics, 2nd International Workshop, pp. 408-418, 1995.
[9] C. Tan and V. Fusco, "TLM modeling using an SIMD computer", Int. Jnl. Num. Mod: Elect. Networks, Devs. and Fields, vol. 6, pp. 299-304, 1993.
[10] P. J. Parsons, S. R. Jaques, S. H. Pulko, and F. A. Rabhi, "TLM modeling using distributed computing", IEEE Microwave and Guided Wave Letters, vol. 6(3), pp. 141-142, 1996.
[11] D. W. Bauer and E. H. Page, "Optimistic parallel discrete event simulation of the event-based transmission line matrix method", Simulation Conference, 2007 Winter, pp. 676-684, 9-12 Dec. 2007.
[12] A. H. Saleh, "A dedicated processor for solving TLM field problems", PhD thesis, University of Nottingham, 1982.
[13] S. Gregory, "Design of a single bit processor for TLM using full custom IC design", Dissertation (BEng), University of Nottingham, 1989.
[14] S. David, "The development of an application specific processor for the Transmission line matrix method", PhD thesis, Loughborough University, 2000.
[15] J. L. Peterson, P. J. Bohrer, et al., "Application of full-system simulation in exploratory system design and development", vol. 50, pp. 321-332, March 2006.
[16] S. Hangal and M. O'Connor, "Performance Analysis and Validation of the picoJava Processor", IEEE Micro, vol. 19, pp. 66-72, May 1999.
[17] T. A. Diep, C. Nelson, and J. P. Shen, "Performance evaluation of the PowerPC 620 microarchitecture", in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 163-174, Italy, 1995.
[18] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An Infrastructure for Computer System Modeling", Computer, vol. 35, no. 2, pp. 59-67, February 2002.
[19] S. Dwarkadas, J. R. Jump, and J. B. Sinclair, "Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis", ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 4, pp. 314-338, October 1994.
[20] W. S. Mong and J. Zhu, "A retargetable micro-architecture simulator", in Proceedings of the 40th ACM/IEEE Conference on Design Automation, Anaheim, CA, USA, pp. 752-757, 2003.
[21] L. Guerra, J. Fitzner, D. Talukdar, C. Schläger, B. Tabbara, and V. Zivojnovic, "Cycle and phase accurate DSP modeling and integration for HW/SW coverification", in Proceedings of the 36th ACM/IEEE Conference on Design Automation, New Orleans, Louisiana, United States, pp. 964-969, 1999.
[22] G. Maturana, J. L. Ball, J. Gee, et al., "Incas: A Cycle Accurate Model of UltraSPARC", in Proceedings of the 1995 International Conference on Computer Design: VLSI in Computers and Processors, Los Alamitos, California, pp. 130-135, October 1995.
[23] http://www.simplescalar.com
[24] P. B. Johns, "A symmetrical condensed node for the TLM method", IEEE Trans. Microw. Theory Tech., vol. 35, no. 4, pp. 370-377, 1987.
[25] P. Naylor and R. Ait-Sadi, "Simple method for determining 3-D TLM nodal scattering in nonscalar problems", Electron. Lett., vol. 28, no. 25, pp. 2353-2354, 1992.
[26] V. A. Chouliaras, J. A. Flint, and Y. Li, "Towards a Custom Vector-Parallel Machine for TLM", Fifth IEE International Conference on Computation in Electromagnetics, IEE Conf. Pub. 505, CEM 2004, Stratford-upon-Avon, UK, pp. 33-34, April 2004.
[27] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, 2003.
[28] V. A. Chouliaras, J. A. Flint, Y. Li, and J. Nunez-Yanez, "A System-on-Chip Vector Multiprocessor for Transmission Line Modeling acceleration", Proceedings of the 2005 IEEE International Workshop on Signal Processing Systems, Athens, Greece, pp. 213-218, November 2005.
[29] V. A. Chouliaras, J. A. Flint, and Y. Li, "Parametric Data-Parallel architectures for TLM acceleration", Proceedings of the 3rd International Conference on Computational Electromagnetics and Its Applications (ICCEA), Beijing, China, pp. 124-127, Nov. 1-4, 2004.
Chapter 3. Hardware realization of programmable approach in TLM case 
Chapter 3 
Hardware realization of programmable approach in TLM case
3.1 Hardware realization of programmable approach 
3.1.1 The categorization of processor 
The idea of using multiple processors both to increase performance and to improve
availability dates back to the earliest electronic computers. Flynn [1] proposed a general
classification in 1972, which is still in use today, as illustrated in figure 3.1. It classifies
all computers into four categories according to the parallelism in their instruction and
data streams.
Single instruction stream, single data stream (SISD) --- This is the uniprocessor, such as the
ordinary PC, which is widely used today.
Single instruction stream, multiple data streams (SIMD) --- A single instruction stream
operates on parallel data streams: the same instruction is executed using multiple data streams.
Vector architectures, which are discussed later, belong to this type.
Multiple instruction streams, single data stream (MISD) --- In an MISD architecture, many
functional units perform different operations on the same data. This architecture is mainly
used in fault-tolerant systems. Pipelined architectures can be considered to belong to this type,
although this is open to argument. So far, no commercial MISD processor is available.
Multiple instruction streams, multiple data streams (MIMD) --- Each processor fetches its own
instructions and operates on different data.
[Figure: four panels labelled SISD, MISD, MIMD and SIMD, each with an instruction pool. PE denotes a processing element such as a CPU.]
Figure 3.1 Flynn's Taxonomy 
The model illustrated in figure 3.1 gives a general categorization of all processor-based
systems. A real processor system may not belong to one specific category, but may instead be
a hybrid of the categories in this classification.
In practice, parallelism is exploited by the MIMD and SIMD architectures, as illustrated in
figure 3.2.
Figure 3.2 A taxonomy of parallel computers 
SIMD machines can be classified into two subgroups: vector processors and array processors [2].
In a vector processor, the same instruction operates on vectors. In an array processor, such as
the ILLIAC IV, a master control unit broadcasts instructions to many independent ALUs.
MIMD machines can be categorized into multiprocessors (shared-memory machines) and
multi-computers (message-passing machines). There are three kinds of
multiprocessor, termed UMA (Uniform Memory Access), NUMA
(NonUniform Memory Access) and COMA (Cache Only Memory Access).
The other main category of MIMD machines is the multi-computers which, unlike the
multiprocessors, do not have shared primary memory at the architectural level.
Multi-computers may be divided into two categories. The first category contains MPPs
(Massively Parallel Processors), which are high-end computers consisting of many CPUs
tightly coupled by a high-speed proprietary interconnection network. The other category is
the cluster of workstations (COW), or simply cluster.
3.1.2 Hardware realization of parallelisms 
3.1.2.1 Vector processor 
Vector processor architectures appeared in the late 1960s and early 1970s to support massive
vector and matrix calculations. The first implementations of vector processors were the
Control Data Corporation (CDC) STAR-100 [3] and the Texas Instruments Advanced
Scientific Computer (TI ASC) [4] in 1964. In these architectures, communication is
memory-to-memory, with high-bandwidth memory systems located in a
vector processing unit. However, due to the long start-up overhead of vector instructions and
the deep pipelining, they were not commercially successful [5]. The first commercial success of
the vector architecture was the CRAY-1 computer system [6], which was introduced in 1976. This
machine was centered on scalar processing, but it used a vector-register architecture and had
significantly lower overhead and lower memory bandwidth requirements. In the 1980s,
smaller-scale vector processors appeared, the most successful being designed by Convex and
Alliant. Also in this period, Japanese supercomputers made their appearance, starting with the
Fujitsu VP100, Hitachi S810 and NEC SX/2, which used vector-register architectures with
similar performance to the CRAY X-MP [5]. These computer systems continued to be
developed, leading to the NEC SX/5, which was the fastest vector supercomputer in 2001 in a
16-processor configuration operating at 312 MHz, and the Fujitsu VPP5000 with a 128-processor
configuration clocked at 300 MHz. Following the development of superscalar architectures in
the early 1990s, there was a tendency to regard vector processing as redundant [7].
However, with multimedia-rich applications becoming dominant, interest in vector
processing has revived [8].
3.1.2.2 Multiprocessor 
Another approach to achieving high execution performance is to exploit the available thread
level parallelism. The architectures which reflect this philosophy belong to the MIMD
category [9]. Such architectures consist of a collection of interconnected single-thread
processors, which can execute independent instruction streams operating on multiple data
items. When the processors run independent workloads (programs), this is a
multi-programmed environment. When multiple processors execute different parts of the
same program and share most of their address space, this is known as multithreading. The
independent parts or processes of the same program are called threads. These threads can
execute concurrently and define another type of parallelism, known as Thread-Level
Parallelism or TLP [9]. TLP is a coarse-grained type of parallelism, since each processor
works on a specific process and communicates with the other processors only when necessary.
The theoretical performance improvement obeys Amdahl's law, which was introduced in
Chapter 2.
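As a reminder of what Amdahl's law implies for a multiprocessor, the bound can be evaluated with a one-line C function. The name amdahl_speedup and its parameters are hypothetical: p is the fraction of the work that can be parallelized and n is the number of processors.

```c
/* Amdahl's law: speedup when a fraction p of the work (0 <= p <= 1)
 * is spread evenly over n processors and the rest stays serial. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, if 90% of the workload is parallelizable, 16 processors give a speedup of 1 / (0.1 + 0.9/16) = 6.4 rather than 16, which is why the remaining serial section matters.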
3.1.2.3 Hybrid approaches and research 
In practice, the various forms of parallelism are often not clearly separated, and they can be
combined to increase performance further. For example, the NEC SX-4 vector supercomputer
is a pipelined superscalar vector microprocessor architecture which can exploit ILP, DLP and TLP
[7]. Simultaneous multithreading processors employ TLP and ILP at the same time [10].
Another example which can potentially combine all of these forms of parallelism is the SS_SPARC [11].
There has also been a trend of adding vector extensions to existing instruction sets.
Examples of general-purpose microprocessors with vector extensions are Intel's MMX [12],
PowerPC's AltiVec [13], Sun UltraSparc's VIS [14] and Tarantula [15]. Other interesting
combinations are the merging of the ILP and DLP paradigms in a single architecture [16] and the
SMV architecture, which combines simultaneous multithreading and DLP [17].
As demonstrated in chapter 2, there is abundant data and thread level parallelism in the
TLM workload. Therefore, a hybrid architecture will be explored to exploit both data and
thread level parallelism.
3.2 Leon3MP and its extended compiler environment 
3.2.1 Introduction of Leon core 
The Leon processor core [18] is a synthesizable 32-bit VHDL model which is compliant
with the SPARC V8 standard [19]. The core is highly configurable and particularly suitable for
system-on-chip design. The designer can use the design template to configure the
processor and optimize the processor core for performance, power consumption, I/O
throughput, silicon area and cost. Peripheral devices interface to the processor core
through the AMBA bus. The core can be efficiently implemented on both FPGA and ASIC
technologies. It uses standard RAM cells for both register and cache implementations. Since
the Leon core is under an open license, all the source code is supplied.
3.2.2 The architecture of Leon3 
The Leon processor is implemented with an advanced 7-stage pipeline with separate data and
instruction caches (Harvard architecture), as illustrated in figure 3.3. The core supports the
SPARC V8 ISA [19], including the MUL, MAC and DIV instructions. It also has an optional
IEEE-754 floating point unit (FPU) [18], which supports both single and double
precision floating point operations. The maximum cache configuration is 4 sets per cache and
256 Kbyte per set. A designer can choose the replacement strategy from LRU,
LRR and random replacement.
The Leon processor can be employed in synchronous multiprocessor mode. It also supports cache
coherency, processor enumeration and SMP interrupt steering. The design template also
includes a debug interface which allows non-intrusive debugging of both single and
multiple-processor systems. It also supplies access to all on-chip registers and memory. Trace
buffers for both instructions and bus traffic are available.
The processor core, including pipeline, cache controllers and AHB interface, costs about 20,000
gates and has been implemented in both ASIC and FPGA technologies. When implemented on
Xilinx Virtex2Pro FPGAs, a frequency of 125 MHz can be achieved. On a
typical 0.13 um standard-cell technology, a frequency of more than 400 MHz can be achieved.
[Block diagram labels: SoC I/F; Processing Unit containing the Coprocessor and Leon; DMA unit; VLSU; icache; dcache; AHB; memory controller; bridge; host; timers; I/O; system registers; SDRAM; SRAM]
Figure 3.3 Architecture of Leon3 
3.2.3 GRLIB library and software development environment 
All IPs associated with the Leon processor are encapsulated in the GRLIB library. To reach
optimum performance and reduce the cost of building a SoC design, an IP library needs to
allow existing IP cores to be reused and configured for a specific
application. The philosophy of GRLIB is to provide a standardised development environment
in which many basic reusable IP cores exist.
Integrating an IP core with a CPU can require significant effort. The
key feature of GRLIB is that it is vendor independent. The library is easy to port to different
CAD tools and target technologies. It does not depend on any specific tool or interface which
needs to be licensed. All IPs in the library are designed to be "bus-centric": it is assumed that
all peripheral devices are connected to the processor through a bus that follows the AMBA standard.
The software environment for the Leon processor is a cross compiler based on many
open source tools. These include:
• GNU C/C++ compiler
• Standalone C-library
• Linker, assembler, archiver etc.
• RTEMS real-time kernel with network support
• Boot-prom utility (mkprom)
• Remote debugger monitor for gdb
• GNU debugger with Tk front-end
• DDD graphical user interface for gdb
The compiling environment supports single-threaded or multi-threaded C/C++ code. The
software developer can debug programs through the gdb debugger.
The Leon port of Linux is a special version of the SnapGear Embedded Linux distribution.
SnapGear Linux is a full package which includes the kernel, libraries, etc. It is suitable for
the rapid development of embedded Linux systems. The Leon port of SnapGear supports both
the MMU and non-MMU configurations of Leon. It can also make use of the MUL/DIV
instructions as well as the floating-point unit (FPU).
A cross-compilation system for C/C++ applications (BCC) is supplied, which is able to compile
C/C++ programs for any configuration.
3.2.4 Extended compiler environment for Leon3MP 
The whole structure of the extended compiler is described in figure 3.4. At the first level is the
Leon3MP soft core, which is written in VHDL. The most notable benefit of the Leon3MP soft core
is that it is highly configurable: nearly any parameter of the soft core can be configured
through a friendly GUI. The second level is the Bare C Compiler (BCC), which is built around
gcc-elf-sparc. Through gcc-elf-sparc, C code is translated into SPARC assembly
code. On top of the existing infrastructure, several compiler directives, such as barrier, have
been developed to compile multi-threaded code.
C code in multi-thread mode
Level 3: Merging BCC to multiprocessor mode
Level 2: BCC
Level 1: Leon3MP soft core (VHDL)
Figure 3.4 The extended compiler for Leon3MP
A barrier is used as the synchronization mechanism in our multiprocessor system. The barrier
forces the processes that arrive first to wait. After all processes have reached the barrier,
it releases all of them, as figure 3.5 shows. A typical implementation of a barrier has two
spin locks: the first protects a counter that counts the processes arriving at the barrier,
and the second holds the processes until all of them have arrived at the barrier.
Figure 3.5 The Barrier mechanism 
The code in figure 3.6 shows the algorithm of the barrier mechanism.
lock(counterlock);          /* ensure exclusive access to the variable count */
if (count == 0)
    { release = 0; }
count = count + 1;
unlock(counterlock);
if (count == total)         /* all arrived */
    { count = 0;
      release = 1; }
else {
    spin(release == 1);     /* threads that arrive first wait here for all the others */
}
Figure 3.6 The algorithm of barrier mechanism 
The lock(counterlock) statement ensures that threads access the variable count exclusively.
The variable count holds the number of processes which have arrived at the barrier. The
variable release holds the processes until all processes have reached the barrier. The
operation spin(release == 1) causes a process to wait until all processes have arrived at
the barrier.
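The algorithm of figure 3.6 can also be sketched in portable C++. The sketch below is an illustration only and is not the implementation used on Leon3MP: the class name SpinBarrier, the helper run_barrier_demo and the use of std::mutex and std::atomic are assumptions, and a sense-reversing release flag is substituted for the single release variable so that the same barrier can be reused inside a loop.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical centralized barrier: a mutex-protected arrival counter plus a
// release flag that waiting threads spin on. Sense reversal (an assumption,
// not part of the thesis listing) makes the barrier safe to reuse.
class SpinBarrier {
public:
    explicit SpinBarrier(int total) : total_(total) { assert(total > 0); }

    void wait() {
        // The flag value that will mean "released" for this episode.
        const bool my_sense = !release_.load(std::memory_order_acquire);
        bool last = false;
        {
            std::lock_guard<std::mutex> guard(counter_lock_);  // lock(counterlock)
            count_ = count_ + 1;
            if (count_ == total_) {  // all arrived
                count_ = 0;
                last = true;
            }
        }                                                      // unlock(counterlock)
        if (last) {
            release_.store(my_sense, std::memory_order_release);
        } else {
            // Threads that arrive first spin here until the last one flips the flag.
            while (release_.load(std::memory_order_acquire) != my_sense) {
                std::this_thread::yield();
            }
        }
    }

private:
    const int total_;
    int count_ = 0;
    std::mutex counter_lock_;
    std::atomic<bool> release_{false};
};

// Runs `threads` workers through `rounds` barrier episodes and returns how
// many times a worker left the barrier before every worker had arrived.
int run_barrier_demo(int threads, int rounds) {
    SpinBarrier barrier(threads);
    std::atomic<int> arrived{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int r = 0; r < rounds; ++r) {
                arrived.fetch_add(1);
                barrier.wait();
                if (arrived.load() < threads * (r + 1)) violations.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Calling run_barrier_demo(4, 100) exercises the barrier with four host threads; a return value of zero indicates that no thread passed the barrier early in that run.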
To run the C code on the Leon3MP platform, a compiler directive which implements the barrier
mechanism needs to be defined. Only with this compiler directive can the TLM code be
modified and run on the Leon3MP platform. The assembly code of the barrier is shown in
figure 3.7; it was written using the SPARC ISA [19].
asm volatile("loop: ld [%0], %%l0" :: "r"(&Count) : "%l0");
asm volatile("cmp %l0, -1");
asm volatile("be loop");
asm volatile("nop");
asm volatile("mov -1, %l0");
asm volatile("swapa [%0] 1, %%l0" :: "r"(&Count));
asm volatile("cmp %l0, -1");
asm volatile("be loop");
asm volatile("nop");
asm volatile("st %%l0, [%0]" :: "r"(&current_value));
current_value = current_value - 1;
/* Lock */
asm volatile("ld [%0], %%l0" :: "r"(&current_value) : "%l0");  /* unlock */
asm volatile("stbar");
asm volatile("st %%l0, [%0]" :: "r"(&Count));
do  /* busy-waiting */
    asm volatile("nop");
while (Count > 0);
Figure 3.7 The implementation of barrier mechanism using SPARC ISA 
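The first half of figure 3.7 follows a common pattern: spin while the shared word reads -1, then atomically swap -1 into it and retry if another processor got there first. A minimal C++ sketch of that pattern is given below, assuming std::atomic<int>::exchange as a stand-in for the SPARC swap instruction. The names kLocked, acquire, release and arrive are hypothetical and do not appear in the thesis code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical convention: the shared word holds kLocked (-1) while one
// thread owns it, and the protected counter value otherwise.
constexpr int kLocked = -1;

// Spin while the word reads -1, then swap -1 in. Returns the counter value
// that was stored in the word at the moment the lock was taken.
int acquire(std::atomic<int>& word) {
    for (;;) {
        while (word.load(std::memory_order_relaxed) == kLocked) {
            std::this_thread::yield();
        }
        const int previous = word.exchange(kLocked, std::memory_order_acquire);
        if (previous != kLocked) return previous;  // we won the swap
    }
}

// Writing a regular value back both publishes it and unlocks the word.
void release(std::atomic<int>& word, int value) {
    assert(value != kLocked);
    word.store(value, std::memory_order_release);
}

// One barrier arrival: lock, decrement the counter, unlock.
// Returns the number of arrivals still outstanding.
int arrive(std::atomic<int>& word) {
    const int remaining = acquire(word) - 1;
    release(word, remaining);
    return remaining;
}
```

Because the lock state and the counter share one word, a waiting thread that polls the word directly has to treat -1 as "locked" rather than as a count, which is why arrive returns the decremented value instead of leaving callers to re-read the word.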
3.3 Run time for TLM workload on Leon3MP platform 
With the same reference code used in the architectural exploration, the TLM code was run on
different processor configurations for various sizes of modelling space. Figures 3.8, 3.9,
3.10 and 3.11 all show the same performance improvement.
[Bar charts: run time for 1 CPU, 2 CPU and 4 CPU configurations at clock periods of 20 ns/cycle, 10 ns/cycle and 5 ns/cycle]
Figure 3.8 Run time for 4X4X4 TLM configuration
Figure 3.9 Run time for 4X16X4 TLM configuration
Figure 3.10 Run time for 16X4X4 TLM configuration
Figure 3.11 Run time for 4X4X16 TLM configuration
3.4 The heterogeneous TLM processor and VLSI implementation result 
3.4.1 Micro-architecture of heterogeneous TLM processor 
In the heterogeneous TLM processor, a vector coprocessor is attached to the Leon processor
through a dedicated interface. The processor and coprocessor communicate with the SDRAM
controller through the AHB on-chip bus [20] (figure 3.12).
Figure 3.12 Single processor-coprocessor architecture 
Figure 3.13 Detailed scalar and vector core microarchitecture 
Figure 3.13 illustrates the microarchitecture of the coprocessor in detail. The coprocessor
described here is for a vector length of two 32-bit (single-precision) elements only. In the
main processor, instructions are fetched from the multi-way set-associative instruction cache
and stored in a single 32-bit register. Source operand addresses are set up on the falling
edge of the clock in the DECODE stage. During the DECODE stage, the register file is accessed
and the two source registers are retrieved. The resolved operands are clocked into the ALU
register. At the same time, the vector opcodes are recognised and dispatched to the
tightly-coupled vector accelerator. The decoding stage in the coprocessor produces a number
of control fields which are processed down the control pipeline. Vector operand accesses are
triggered by the falling edge of the clock during the decoding stage of the coprocessor.
During the EXEC stage, the main processor executes the scalar instruction or computes the
virtual address of a Load/Store operation. In the same stage, the vector coprocessor performs
the first stage of the pipelined floating point computations. In the next step, scalar data
return to the main processor via the data cache return path, whereas the vector accelerator
performs the last stage of execution. Due to timing constraints, floating point results are
stored in an intermediate register prior to being clocked into the vector register file. The
communication between the main processor and the vector coprocessor is illustrated in
figure 3.14.
Figure 3.14 Processor-coprocessor I/F communication
[Block diagram labels: Peripheral Bus, Bridge, SDRAM Channel. "CPU" refers to a Leon3 CPU coupled with a parametric vector coprocessor.]
Figure 3.15 Heterogeneous SoC architecture
The whole system has a configurable number of processor-coprocessor pairs connected by
the high performance AHB, as depicted in figure 3.15. An AHB-to-APB bridge connects the
processor-coprocessor pairs to a number of peripherals. In this system, CPU0 serves as the
control processor: it performs all I/O operations and handles interrupts. The number of
processors and the size of the caches can be specified by configuring the VHDL code.
3.4.2 VLSI implementation result 
The target technology for the VLSI implementation is 0.13 um CMOS [21]. The design was
first synthesized for maximum performance using Synopsys Design Compiler [22]. Floorplanning
and power routing were then implemented in Cadence SoC Encounter. After that, the design was
exported to Synopsys Physical Compiler for placement optimization. Finally, SoC Encounter was
used again for detailed routing. Figure 3.16 depicts the floorplan and final layout of the
two-processor implementation. The implementation data are shown in table 3.1.
Figure 3.16 Floorplan and layout for heterogeneous TLM SoC
Parameter            Value
Std cells            110099
RAMs                 62
Maximum Frequency    158.5
Size                 3424*3426 um (11733569 um^2)
Utilization          83.8%
Table 3.1 VLSI implementation data
3.5 Summary 
In this chapter, the hardware realization of the programmable approach was detailed. A custom
vector architecture was developed to exploit data level parallelism, and a Leon3MP system
was used to exploit thread level parallelism. With a custom C compiler established on the
Leon3MP system, the TLM workload was run on the VHDL model of Leon3MP. Cycle-accurate results
were presented to further demonstrate thread level parallelism. Finally, a VLSI
implementation of the programmable TLM processor was presented.
Reference
[1]. M. Flynn, "Some Computer Organizations and Their Effectiveness", IEEE Trans.
Comput., Vol. C-21, pp. 948, 1972.
[2]. A. S. Tanenbaum, "Structured computer organization", Fifth edition.
[3]. R. G. Hinz and D. P. Tate, "Control data STAR-100 processor design" in IEEE
COMPCON, September 1972.
[4]. W. Watson, "The TI-ASC, A highly modular and flexible super computer architecture" in
American Federation of Information Processing Societies AFIPS, pp. 221-228, 1972.
[5]. John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative
Approach, 2nd ed.: Morgan Kaufmann, 1996.
[6]. R. M. Russell, "The CRAY-1 computer system", Communications of the ACM, vol. 21,
pp. 63-72, 1978.
[7]. K. Asanovic, "Vector Microprocessors", PhD Thesis, University of California at Berkeley,
May 1998.
[8]. K. Diefendorff and P. Dubey, "How Multimedia Workloads Will Change Processor
Design" in IEEE Computer, vol. 30, pp. 43-45, September 1997.
[9]. K. W. Rudd, "VLIW Processors: Efficiently Exploiting Instruction-Level Parallelism"
in Electrical Engineering: PhD Thesis, Stanford University, December 1999.
[10]. S. J. Eggers, J. S. Emer, H. M. Levy, et al., "Simultaneous Multithreading: A Platform
for Next-Generation Processors" in IEEE Micro, vol. 17, pp. 12-19, October 1997.
[11]. V. A. Chouliaras, K. Koutsomyti, T. Jacobs, et al., "SystemC-defined SIMD instructions
for high performance SoC architectures" in 13th IEEE International Conference on
Electronics, Circuits and Systems, Nice, France, pp. 822-825, December 2006.
[12]. A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture" in IEEE
Micro, vol. 16, pp. 42-50, August 1996.
[13]. K. Diefendorff, P. K. Dubey, R. Hochsprung, et al., "AltiVec Extension to PowerPC
Accelerates Media Processing" in IEEE Micro, vol. 20, pp. 85-95, March 2000.
[14]. M. Tremblay, J. Michael O'Connor, Venkatesh Narayanan, et al., "VIS Speeds New
Media Processing" in IEEE Micro, vol. 16, pp. 10-20, August 1996.
[15]. R. Espasa, F. Ardanaz, J. Gago, et al., "Tarantula: A Vector Extension to the Alpha
Architecture" in the Proceedings of the 29th Annual International Symposium on Computer
Architecture (ISCA'02), Anchorage, Alaska, pp. 281-292, 2002.
[16]. F. Quintana, R. Espasa, and M. Valero, "A Case for Merging the ILP and DLP Paradigms"
in 6th Euromicro Workshop on Parallel and Distributed Processing, Madrid, Spain,
pp. 217-224, 1998.
[17]. R. Espasa and M. Valero, "Exploiting Instruction- and Data-Level Parallelism" in IEEE
Micro, vol. 17, pp. 20-27, September 1997.
[18]. http://www.gaisler.com
[19]. http://www.sparc.com
[20]. V. A. Chouliaras, J. A. Flint, Y. Li, J. Nunez-Yanez, "A System-on-Chip Vector
Multiprocessor for Transmission Line Modeling acceleration", the 2005 IEEE International
Workshop on Signal Processing Systems, November 2005, pp. 27-29, Athens, Greece.
[21]. Advanced Logic Technology - 0.13 um, Taiwan Semiconductor Manufacturing Company,
2006.
[22]. Design Compiler 2003.06, Synopsys Inc., 2003.
Chapter 4: 
A novel ESL design flow and hardwired approach for TLM 
4.1 ESL methodology 
4.1.1 ESL 
Electronic System-Level (ESL) design has been around for decades. In recent years, it has
attracted more and more attention from hardware designers because of the increasing complexity
of chips and the narrowing market window. With ESL tools, the design space can be explored in
a short time. The 2004 International Technology Roadmap for Semiconductors (ITRS) defined ESL
as "a level above RTL". ESL design has the following benefits:
• Raising the abstraction level at which designers express systems 
• Enabling new levels of design reuse 
• Providing for design chain integration across tool flows and abstraction 
levels 
4.1.2 SystemC 
In the proposed design flow, SystemC is chosen as the design entry because of its popularity
in the ESL community and the availability of related synthesis tools.
SystemC is a C++ library which facilitates system level design. It supports different
abstraction levels of design. The library is provided by the Open SystemC Initiative (OSCI), a
third party organization composed of numerous companies, universities and individuals [1].
They aim at achieving IEEE standardization for SystemC. The free SystemC simulator can be
downloaded at www.systemc.org. Quite a few EDA companies supply commercial implementations of
the SystemC language and support for mixed-language simulation.
4.1.3 Abstraction level of SystemC model
As discussed before, SystemC can be used to describe models at various levels of abstraction.
It is important to clarify the different levels of abstraction. So far, there is no standard
classification of abstraction levels in SystemC. Models can be differentiated according to
the following levels of abstraction [2].
Algorithmic (ALG)
This level gives the most general description of functionality. There is no architectural
detail at this level. A SystemC model at this level looks almost like plain C++ code.
Communicating Processes (CP)
The functionality of a model at this level is partitioned, and each part is assigned to a
process. Communication among these processes takes place through shared variables, so no
communication mechanism is defined. A model at this level is still architecture independent.
Communicating Processes with Time (CP+T)
This level is functionally the same as the CP level. The only difference is that timing
information has been added to the model, so a cycle count can be obtained through simulation.
At this level, the communication protocol is still not specified and interconnect timing is
not accurate.
Programmer's View (PV)
The PV level is much more detailed than the CP level. The communication mechanism
(bus/network-on-chip) between components should be specified at this level. The model is
register accurate, so a hardware driver sees the programmer's representation of the hardware.
Programmer's View with Time (PV+T)
Just as the CP+T level is based on the CP level, the PV+T level is functionally the same as
the PV level but has timing information added. The timing information at this level is much
more accurate than that available in a CP+T model.
Cycle Accurate (CA)
A CA model captures micro-architectural details and typically has bit-accurate interfaces.
Protocol-compliant arbitration of the communication infrastructure is specified. The model is
clocked and all timing annotations are accurate to the level of individual clock cycles.
Register Transfer Level (RTL)
As with the RTL model in a traditional design flow, a model at this level is described down
to the register level. Cycle timing can be obtained explicitly at this level.
In practice, most SystemC models lie between the Algorithmic and Cycle Accurate levels.
Models in this range are often termed transaction level models.
4.2 A novel ESL design flow 
The proposed ESL-based design flow is shown in figure 4.1.
[Flowchart, top to bottom:]
Algorithm (C/C++/SystemC)
Synthesizable Floating Point Function
Run Tests
Tests OK?
Replace arrays using RAM ports
Optimizations toward Synthesizable SystemC
Synthesise the design into Verilog
Connect the RAMs (HDL)
Run Tests (ModelSim platform)
Tests OK?
Final synthesis, Place & Routing and Power Analysis
Figure 4.1 A novel ESL design approach 
4.2.1 IEEE 754 floating point datapath
As the first step of the design flow is to build a synthesizable floating point datapath, the
IEEE-compatible floating point datapath is introduced in this section. The floating point
datapath implemented in this thesis operates on 32-bit floating point numbers and complies
with the IEEE 754 standard.
The IEEE 754 single precision representation requires a 32-bit word, whose bits are numbered
from 0 to 31, left to right. The first bit is the sign bit 'S', the next eight bits are the
exponent bits 'E', and the final 23 bits are the fraction 'F':
S          EEEEEEEE        FFFFFFFFFFFFFFFFFFFFFFF
Sign bit   Exponent bits   Fraction bits
Table 4.1 32-bit IEEE 754 data format
The arithmetic operations employed in the floating point datapath are Add, Subtract,
Multiply and Divide. Four rounding modes are supported for each operation:
• Round to nearest even 
• Round to zero 
• Round-up 
• Round-down 
In the proposed ESL method, Softfloat [3] is modified to produce synthesizable floating point
functions which implement the datapaths shown in figures 4.2, 4.3 and 4.4.
Figure 4.2 Datapath of Add and Subtraction
Figure 4.3 Datapath of multiplication
Figure 4.4 Datapath of division
4.2.2 Synthesizable IEEE754 Function 
As floating point functions are not included in the synthesizable subset of SystemC, all
floating point operations in the design need to be replaced by custom synthesizable
floating point functions.
Following an analysis of the dataflow in the design, all data involved in floating point
operations need to be reformatted according to the IEEE 754 standard, using the type float32,
which is defined as:
typedef unsigned int float32;
Because the float32 and integer data formats are incompatible, all data need to be
reformatted into one single form. Therefore, the initial values of all variables need to be
rewritten as 32-bit binary patterns which follow the single precision IEEE 754 definition, as
illustrated in figure 4.5.
#define X 2
#define Y 4
#define Z 8
The integer data need to be reformatted as
2 -> 0 10000000 00000000000000000000000
4 -> 0 10000001 00000000000000000000000
8 -> 0 10000010 00000000000000000000000
If the 32-bit binary numbers are written as hexadecimal numbers, the definitions become
#define X 0x40000000
#define Y 0x40800000
#define Z 0x41000000
Figure 4.5 Data reformatting toward 32-bit IEEE 754
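The bit patterns in figure 4.5 can be cross-checked on a host machine. The sketch below is a minimal illustration, assuming a platform whose float type is IEEE 754 single precision; the helper names float_to_bits, Float32Fields and split_fields are hypothetical and are not part of Softfloat.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the raw 32-bit pattern of a host float.
std::uint32_t float_to_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to reinterpret the bytes
    return bits;
}

// The three fields of table 4.1.
struct Float32Fields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits
};

// Splits a 32-bit pattern into sign, exponent and fraction.
Float32Fields split_fields(std::uint32_t bits) {
    return Float32Fields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

For instance, float_to_bits(4.0f) yields 0x40800000, and split_fields on that pattern gives sign 0, exponent 129 (binary 10000001) and fraction 0, matching the second row of the figure.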
Once the data format is uniform, the next step is to replace the floating point operations
using the synthesizable IEEE 754 Softfloat library, which was configured as little-endian.
Figure 4.6 illustrates this process.
The original code:
float A[10], B[10], C[10];
int i;
for (i = 0; i < 10; i++)
    C[i] = A[i] + B[i];
After the non-synthesizable floating point operation is replaced by the synthesizable
floating point function, the code becomes:
float32 A[10], B[10], C[10];
int i;
for (i = 0; i < 10; i++)
    C[i] = float32_add(A[i], B[i]);
Figure 4.6 Using synthesizable IEEE754 floating point function
4.2.3 Replacing array access by pre-defined RAM ports 
A pre-defined RAM model which targets the TSMC library is used in this stage to replace
the arrays in the SystemC code. The ports of the RAM are detailed in figure 4.7.
Port     Description
clk      Clock signal
clk_en   Clock enable signal
wen      Write enable signal
addr     Address input
din      Input data
dout     Output data
[Bit widths annotated in the figure: 1, 1, 4, 32, 32]
Figure 4.7 Pre-defined single port RAM
Unlike a 3D array, the RAM can only be addressed as a one-dimensional space. Therefore,
when the pre-defined RAMs are used to replace arrays in the TLM code, the array index
needs to be converted into a one-dimensional address, as illustrated in figure 4.8.
[Diagram: translate data index from 3D array to 1D memory space]
3D array space indexed as A[x][y][z]
1D memory space addressed as A[x*(Y+1)*(Z+1) + y*(Z+1) + z]
Figure 4.8 Converting the array index into a one-dimensional memory address
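The address calculation of figure 4.8 can be written as a small helper. This is a sketch under the assumption that the y and z indices run from 0 to Y and from 0 to Z inclusive, which is what the (Y+1) and (Z+1) factors in the figure imply; the function name flat_index is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Maps a 3D index (x, y, z) to a 1D RAM address using the formula of
// figure 4.8. Y and Z are the largest valid y and z indices.
std::size_t flat_index(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t Y, std::size_t Z) {
    assert(y <= Y && z <= Z);  // indices outside the plane would alias other cells
    return x * (Y + 1) * (Z + 1) + y * (Z + 1) + z;
}
```

With Y = 3 and Z = 7, one step in z moves the address by 1, one step in y by 8, and one step in x by 32, so every (x, y, z) triple receives a distinct address.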
After the array index is translated into a one-dimensional memory address, each array access
needs to be rewritten as an access to the RAM ports. Figure 4.9 shows this technique. After
this optimization, the pre-defined RAM (HDL) can be connected through these ports at a later
stage.
The original code:
float A[X][Y][Z], B[X][Y][Z], C[X][Y][Z];
int i, j, k;
for (i = 0; i < X; i++)
  for (j = 0; j < Y; j++)
    for (k = 0; k < Z; k++)
      C[i][j][k] = A[i][j][k] + B[i][j][k];
The modified code without array access:
float32 tem1, tem2, tem3;
int i, j, k;
for (i = 0; i < X; i++)
  for (j = 0; j < Y; j++)
    for (k = 0; k < Z; k++)
    { A_addr.write(i*(Y+1)*(Z+1) + j*(Z+1) + k);
      tem1 = A_in.read();
      B_addr.write(i*(Y+1)*(Z+1) + j*(Z+1) + k);
      tem2 = B_in.read();
      tem3 = float32_add(tem1, tem2);
      C_wen.write(15);    // keep the wen port as 1111
      C_out.write(tem3);
      C_addr.write(i*(Y+1)*(Z+1) + j*(Z+1) + k);
      C_wen.write(0);
    }
Figure 4.9 Using the RAM ports instead of array
4.2.4 Optimization techniques toward synthesizable SystemC 
The last step is to modify the top level module according to the requirements of the Agility
compiler. The sc_main function is ignored by the targeted SystemC synthesizer; instead, a
function called ag_main needs to be created, within which the functionality is instantiated,
as figure 4.10 shows.
class TLM_module : public sc_module
{
    while(1)
    {
        "Implemented functionality"
        wait();
    }
};
void ag_main()
{
    TLM_module TopLevel("TopLevel");
}
Figure 4.10 ag_main() used to enable the SystemC synthesis
The statement while(1) ensures that all functionality inside the while loop is triggered by
the clock signal.
Pointer 
A pointer is synthesizable if it points to a synthesizable type and the object to which it
points can be determined at compile time, as in figure 4.11.
void clear(unsigned char *a, unsigned char *b)
{
    wait();
    *a = 255; *b = 255;
}
It is used as:
sc_out<unsigned char> out;
unsigned char x, y;
clear(&x, &y);
out.write(x);
Figure 4.11 Resolvable pointer
The wait() statement is a powerful tool supported by the proposed SystemC synthesizer, as it
determines the critical data path (the longest data path) in the design (figure 4.12). All
statements between two wait() statements (the wait(n) format is not supported) are
constructed as combinational logic.
The longest delay (critical path) can be shortened by inserting a wait() statement in the
reported longest path. Through a series of such iterations, the longest path in the
synthesized code is shortened step by step. For the Softfloat library, a series of longest
delays (expressed in NAND gates) was acquired by this method. After a series of experiments,
the longest delays obtained were 35, 44, 47, 48, 68, 75, 80, 115 and 197.
void testModule::Run()
{   int a, b;
    wait();        /* wait for the first clock edge */
    a = 5;         /* clock cycle 1 */
    wait();
    a = a + 1;     /* clock cycle 2 */
    b = 1;         /* clock cycle 2 */
    wait();
    a = b;         /* clock cycle 3 */
    b = b + 1;     /* clock cycle 3 */
    wait();
}
Figure 4.12 wait() statement is used to determine the critical path
Loop unrolling 
As discussed before, any loop without a wait() statement will be unrolled to produce
combinational logic. If a loop has a wait() statement inside the loop body, it will not be
unrolled. Figure 4.13 shows such a loop.
for (int i = 0; i < 10; i++)
{   sum = sum + A[i];
    wait();
}
Figure 4.13 Loop with wait() statement
The above example shows a loop with a constant number of iterations. When the loop bound is
not constant, a wait() statement needs to be placed inside the loop body, as figure 4.14
illustrates.
X = data_in.read();
for (i = 0; i < X; i++)
{   j++;
    wait();
}
data_out.write(j);
Figure 4.14 Loop with varied control condition
In addition, experiments show that the sc_int<32> data type is compatible with the float32
data type defined in the Softfloat library. Therefore, all ports carrying 32-bit floating
point data are defined with the sc_int<32> data type.
Additionally, in the proposed design flow, when the SystemC code is synthesized using the
targeted tool, it is important to use the compile option -hdl-style ModelSim to output
Verilog code that is suitable for simulation on the ModelSim platform.
67 
Chapter 4: A novel ESL design flow and hardwired approach for TLM 
4.2.5 SystemC synthesis 
To acquire synthesizable RTL, a commercial compiler called Agility is used (figure 4.15) to
translate the SystemC model directly.
Figure 4.15 The functionality of Agility compiler [4]
Agility was one of the most mature SystemC synthesis tools on the EDA market at the time. The
major limitation of Agility is that its synthesizable SystemC subset is not sufficient for
many implementations. In particular, implementing a numerical algorithm requires floating
point arithmetic. To address this issue, the Softfloat source is adopted and modified to add
floating point functions to the synthesizable subset.
4.2.6 Pre-defined RAM 
After the RTL model has been generated automatically by the SystemC synthesizer, the next
step is to integrate it with the pre-defined RAM model on the ModelSim platform. This process
is illustrated for the TLM case in figure 4.16.
[Diagram label: Synthesizable TLM Engine]
Figure 4.16 Connecting pre-defined RAM model
As the ports are automatically generated with the unsigned data type (the data type of ports
cannot be set in the current release of the compiler), the two functions std_logic_vector()
and to_unsigned() are used to convert the signals between the RAM ports and the ports of the
ESL-generated model, according to the direction of data flow.
4.2.7 RTL level validation
When the ESL-based TLM engine has been successfully integrated with the pre-defined RAM
model, the next step is to validate the whole design again on the ModelSim platform to ensure
the consistency of the design.
A top level VHDL file named test_top.vhd was created to test the design (the Verilog model).
Two processes reside inside test_top.vhd. The first is the clock process: as the design is
triggered by the clock signal, it keeps running during the validation. The second process
(the test process) feeds in the test vectors once the design under test is idle. Two
functions, str_read and strhex_to_slv, are used to handle the text I/O. As the data_ready
signal is set whenever the output is ready, this signal is used as the trigger for the test
process.
For each test case, the outputs of the RTL model are compared with the outputs of the SystemC
model. In this way, the design is validated again at RTL level.
4.2.8 Logical Synthesis 
Once the design is verified, the next phase is synthesis. Synthesis is the process which
translates the design from one abstraction level to a more detailed abstraction level. In the
logical synthesis stage, the design is automatically transformed from Register Transfer
Level (RTL) to a gate level netlist. The compiler takes as inputs the RTL HDL description,
the timing constraints and attributes for the design, and the targeted technology library,
and produces a fully-mapped gate level netlist. The first step of synthesis is to define the
constraints for each RTL block and optimize the gate-level netlist for area, timing and
power [5]. The tool used is the industry standard Synopsys Design Compiler (DC) [6]. The
target technology is the Taiwan Semiconductor Manufacturing Company (TSMC) 0.13 um
standard-cell library (1 Poly, 8 Copper) [7]. The design constraints are described in
Design Compiler's Tcl (Tool Command Language) script. Driven by this script, the DC tool
guides the synthesis and optimization of the design to meet the user-specified constraints.
Apart from the netlist which represents the mapped and optimized design, DC also outputs the
design timing constraints in Synopsys Design Constraints (*.sdc) format.
4.2.9 Place and Route 
Place and route follows logical synthesis. In this stage, the files needed for statistical
power analysis are produced. The process starts by running Cadence First Encounter (FE) in
batch mode; the physical implementation of the RAMs and the timing information of the
targeted library are also taken into consideration. The optimized Verilog netlist (in *.v
format) from the previous stage and the Synopsys Design Constraints (*.sdc) file which
describes the timing constraints are taken as inputs. Then, the tool carries out
floorplanning, power grid specification (power/ground ring and stripes), placement of RAM
macros and standard cells, and clock tree synthesis. After that, global and detailed routing,
extraction of RC data and post clock tree synthesis are performed. Finally, filler cells are
added to fill the area between the placed and routed standard cells, and their VDD and VSS
rails are connected to the power ring.
After the layout is ready, it needs to be checked against the Verilog netlist. In addition,
Design Rule Check (DRC) is used to ensure that the technology library design rules are
implemented correctly in the final layout. The results from this stage include area and
maximum frequency reports, interconnect delays in a standard delay format (*.sdf) file, a
standard parasitic exchange format (*.spef) file and an updated netlist which represents the
final place-and-route result. Based on these outputs, the DC tool can perform the statistical
power analysis.
4.2.10 Statistical Power Analysis Stage (Design Compiler) 
After place and route has been carried out, power analysis is performed using the
post-place-and-route Verilog netlist together with the timing constraints (*.sdc) file and
the standard parasitic exchange format (*.spef) file. As a result of the power analysis
stage, several files are created which contain the average power dissipation, area, worst IR
drop, etc.
4.3 Parallelizing Technique 
The basic idea of the parallelizing technique is to partition computing-intensive loops into
smaller pieces with fewer iterations, and then encapsulate each piece of the loop in a
sub-function which is registered as a thread process (SC_THREAD) in SC_CTOR. As defined in
the SystemC library, these threads run in parallel when they are triggered by the same
signal. In this way, the most computing-intensive part of the algorithm is parallelized.
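The partitioning step can be illustrated outside SystemC with ordinary host threads. The sketch below is only an analogy, since std::thread workers are not SystemC thread processes and say nothing about the synthesized hardware; the function names partition_range and parallel_fill are hypothetical.

```cpp
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Splits the iteration range [0, n) into `parts` contiguous chunks of
// near-equal size, returned as [begin, end) pairs.
std::vector<std::pair<int, int>> partition_range(int n, int parts) {
    assert(n >= 0 && parts > 0);
    std::vector<std::pair<int, int>> chunks;
    const int base = n / parts;
    const int extra = n % parts;  // the first `extra` chunks get one more iteration
    int begin = 0;
    for (int p = 0; p < parts; ++p) {
        const int end = begin + base + (p < extra ? 1 : 0);
        chunks.emplace_back(begin, end);
        begin = end;
    }
    return chunks;
}

// Runs one worker per chunk; each worker executes its own shorter loop and
// writes to a disjoint slice of the shared array, so no locking is needed.
std::vector<int> parallel_fill(int n, int parts, int value) {
    std::vector<int> data(n, 0);
    std::vector<std::thread> workers;
    for (const auto& chunk : partition_range(n, parts)) {
        workers.emplace_back([&data, chunk, value] {
            for (int i = chunk.first; i < chunk.second; ++i) data[i] = value;
        });
    }
    for (auto& w : workers) w.join();
    return data;
}
```

For a ten-iteration loop split in two, partition_range(10, 2) produces the ranges [0, 5) and [5, 10), which correspond to the two half loops used in the SystemC example.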
The implemented SystemC model is at the Communicating Processes with Time (CP+T) level [2].
At this level, the functionality of the implemented algorithm is partitioned into parallel
processes that exchange complex, high-level data structures. A model at this level is still
architecture independent. Communication among these threads is done through shared arrays.
Two SystemC programs are presented in figure 4.17 to illustrate the parallelizing
technique. The first SystemC program has only one process, which assigns values to an array
in the loop for (i=0; i<10; i++). This loop can be parallelized by splitting it into two
loops, for (i=0; i<5; i++) and for (i=5; i<10; i++), which are encapsulated in two separate
processes. In this way, the original process is parallelized.
Original code:
#include "systemc.h"
SC_MODULE(example_1) {
    sc_in_clk clock;
    sc_in<bool> reset;
    sc_in<bool> enable;
    sc_out<sc_uint<4> > array[10];

    // Single process: one loop covers all ten elements.
    void assign_array() {
        while (true) {
            wait();
            if (reset.read() == 1) {
                for (int i = 0; i < 10; i++)
                    array[i].write(0);      // clear the outputs on reset
            } else if (enable.read() == 1) {
                for (int i = 0; i < 10; i++)
                    array[i].write(1);
            }
        }
    }

    SC_CTOR(example_1) {
        SC_CTHREAD(assign_array, clock.pos());
    }
};
Parallelized code:
#include "systemc.h"
SC_MODULE(example_2) {
    sc_in_clk clock;
    sc_in<bool> reset;
    sc_in<bool> enable;
    sc_out<sc_uint<4> > array[10];

    // First process: elements 0 to 4.
    void assign_array_1() {
        while (true) {
            wait();
            if (reset.read() == 1) {
                for (int i = 0; i < 5; i++)
                    array[i].write(0);
            } else if (enable.read() == 1) {
                for (int i = 0; i < 5; i++)
                    array[i].write(1);
            }
        }
    }

    // Second process: elements 5 to 9.
    void assign_array_2() {
        while (true) {
            wait();
            if (reset.read() == 1) {
                for (int i = 5; i < 10; i++)
                    array[i].write(0);
            } else if (enable.read() == 1) {
                for (int i = 5; i < 10; i++)
                    array[i].write(1);
            }
        }
    }

    SC_CTOR(example_2) {
        SC_CTHREAD(assign_array_1, clock.pos());
        SC_CTHREAD(assign_array_2, clock.pos());
    }
};
Figure 4.17 Parallelizing Techniques
4.4 TLM engine
4.4.1 Interface of TLM engine
The interface of the SystemC model is shown in figure 4.18. When the model receives the
clock, the Data_request signal is on. Four parameters (X, Y, Z and Interactions) are input
sequentially into the model through the Data_in port. When the computation finishes, the
results are output in pairs through Data_out_1 and Data_out_2. Since the number of
outputs is not fixed, a Data_ready signal is necessary to validate the data output signals.
[Figure 4.18: block diagram of the TLM engine with the ports Data_request, Data_in, Data_valid, CLK,
Data_out_1, Data_out_2, Data_ready and Data_ack]
Figure 4.18 The interface of SystemC Model
The interface of parallel TLM engine is the same as the scalar one. Only the internal 
functionality of TLM algorithm is parallelized in multi-thread TLM engine as illustrated in 
figure 4.19. 
Figure 4.19 Parallel TLM engine (quad mode)
4.4.2 SystemC level simulation 
After the C level description of TLM functionality is encapsulated into the 
SystemC-defined interface, the cycle-accurate simulation can be facilitated. 
[Figure 4.20: bar chart; vertical axis: cycle count (0 to 30000); horizontal axis: Modelling
Configuration (22*10*10, 22*10*50, 22*50*10, 50*22*10)]
Figure 4.20 Cycle level simulation for scalar TLM engine
Figure 4.20 shows the cycle counts over different configurations of the modelling space. In this
figure, the cycle count for the scalar TLM engine increases proportionally with the modelling
space.
[Figure 4.21: bar chart; vertical axis: cycle count (0 to 400000); horizontal axis: Delay (197, 115,
80, 75, 68, 48, 47, 44, 35); series: 4*4*4, 4*16*4, 16*4*4, 4*4*16]
Figure 4.21 Cycle count with different delay configuration
As discussed before, the maximum delay (critical path) information can be acquired through
a series of experiments. Figure 4.21 presents the cycle count over different delay
configurations and modelling configurations. It can be observed that the cycle count increases
when the maximum delay decreases, which means that more wait() statements have been placed
into the SystemC model.
When the Scatter and Connect functions are parallelized by using the technique proposed
before, cycle-accurate simulation is enabled for the multi-thread TLM engine.
[Figure 4.22: bar chart; vertical axis: cycle count (0 to 200000); horizontal axis: Different
Configuration of Modelling Space; series: 1 Thread, 2 Threads, 4 Threads, 8 Threads]
Figure 4.22 Cycle count for parallel TLM engine
As shown in figure 4.22, for each configuration of the modelling space, the cycle count
reduces significantly as the number of threads increases, which indicates the
potential for performance improvement by using the parallel TLM engine.
4.4.3 RTL level simulation 
After SystemC synthesis has taken place and the pre-defined RAM has been integrated, the
TLM engine (synthesizable RTL level) can be simulated on the ModelSim platform in both scalar
and parallel mode. With a clock period of 5 ns/cycle, the running time can be found through
simulation. In this stage, hundreds of test cases are fed into the TLM engine, which produces
the simulation results shown in figure 4.23.
[Figure 4.23: bar chart; vertical axis: running time (0 to 3.50E+07); horizontal axis: Delay (198,
115, 78, 72, 67, 54, 50, 48, 47, 44, 38); series: 1 Thread, 2 Threads, 4 Threads, 8 Threads]
Figure 4.23 Running time for RTL level TLM engine
As figure 4.23 illustrates, in terms of running time, performance can be enhanced by using the
multithread TLM engine. In contrast to figure 4.22, figure 4.23 shows that the difference in
running time across different numbers of threads is less remarkable. This is because a RAM
model replaces the arrays at this abstraction level, and RAM access dominates the
run time of the TLM engine.
4.4.4 Implementation space exploration 
When SystemC synthesis has been done, the required NAND and flip-flop counts are estimated
and reported by the Agility compiler. This makes it an interesting tool for exploring the
implementation space. The NAND and flip-flop counts for the TLM engines are presented in
figure 4.24 and figure 4.25 respectively.
[Figure 4.24: bar chart; vertical axis: NAND count (0 to 6000000); horizontal axis: Delay (198, 115,
78, 72, 67, 54, 50, 48, 47, 44, 38); series: 1 Thread, 2 Threads, 4 Threads, 8 Threads]
Figure 4.24 NAND gate count for TLM engine with different delay configurations
Figure 4.24 shows that the NAND counts are relatively stable across the different
delay configurations for the same number of threads, but approximately double from one thread
to eight threads.
[Figure 4.25: bar chart; vertical axis: flip-flop count; horizontal axis: Delay (198 down to 38);
series: 1 Thread, 2 Threads, 4 Threads, 8 Threads]
Figure 4.25 Flip-Flop count TLM engine with different delay configurations
With reference to figure 4.25, it can be observed that the flip-flop counts increase
significantly when the delay configuration is decreased from 198 to 38. This is because more
flip-flops need to be placed into the TLM engine when the critical path is shortened.
After the RTL level design is validated on ModelSim platform, a final synthesis was carried 
out to investigate the VLSI implementation space of a number of ESL-designed TLM engines. 
The results are presented in figure 4.26 (area), figure 4.27 (statistical power) and figure 4.28 
(minimum period). In all three sets of results, multiple targeted frequency (clock period) 
constraints were set to understand the dependency of the three measured parameters. 
The results in figure 4.27 show that the statistical power was similar over the
different delay configurations. The reason for this is the dominance of the RAM-block
consumption compared to that of the logic itself. In figure 4.28, further investigation showed
that the increasing use of wait() statements in the SystemC code was beneficial in reducing
the post-synthesis critical paths.
[Figure 4.26: line chart titled "TLM processor area (post-route, no wireload)"; horizontal axis:
Delay configuration (DLY35, DLY44, DLY47, DLY48, DLY68, DLY75, DLY80, DLY115, DLY197);
one series per requested clock period (legend partly illegible)]
Figure 4.26 Post synthesis area result for different delay configurations
[Figure 4.27: line chart titled "TLM Processor Power consumption"; horizontal axis: Requested
period (ns); series: DLY35 to DLY197]
Statistical power refers to the power consumption estimated without test vectors.
Figure 4.27 Statistical power results for different configurations
[Figure 4.28: line chart titled "TLM Processor Frequency (period)"; horizontal axis: Requested
period (ns); series: DLY35, DLY44, DLY47, DLY48, DLY68, DLY75, DLY80, DLY115, DLY197]
Figure 4.28 Real clock period results for different configurations
4.5 Comparison of Leon3 based TLM processor and ESL-designed dedicated TLM engine
4.5.1 Performance comparison between CPU based TLM processor and ESL designed 
TLM engine 
Configuration   1CPU    2CPU    4CPU    1CPU    2CPU    4CPU    Delay_44  Delay_115
                (100)   (100)   (100)   (200)   (200)   (200)   (200)     (166)
(illegible)     484.6   276.5   189.6   99.82   66.4    46.06   0.431     0.295
(illegible)     1713    1071    576.8   342.6   214.3   115.8   1.705     1.178
(illegible)     1712    1056    574.1   342.4   212.2   115.1   1.704     1.177
16*4*4          1721    1074    586.2   344.2   214.8   116     1.704     1.179
Designs are synthesized for different frequency configurations and timing information is in
ns.
Figure 4.29 Real time performance of both CPU based TLM processor and ESL
designed TLM engine
Figure 4.29 shows the real-time performance of all configurations of the Leon3 based
multiprocessor and the ESL-designed engine. The vertical axis indicates real time
(microseconds) in the log domain and the horizontal axis shows the performance of the single,
dual, and quad programmable architectures (synthesized for 100 and 200 MHz). The last
columns show the performance of two ESL-designed engines configured for delay 44 and
115, both optimized for 250 MHz. The results show that the ESL-designed
engine can achieve an improvement of three orders of magnitude over the single Leon3 CPU at
100 MHz. If we compare the ESL-designed TLM engine with the best Leon3 based configuration, a
quad Leon3 CPU at 200 MHz, the ESL solution still demonstrates a 153 times performance
improvement.
If we integrate the vector floating point infrastructure with the quad configuration CMP, it is
estimated that the ESL-designed dedicated engine still has a 15x better performance. These
differentials demonstrate the impact on performance of the architectural difference between
the programmable solution and the dedicated engine.
4.5.2 VLSI Implementation comparison between CPU based TLM processor and ESL 
designed TLM engine 
Both RTL models were taken through the full logical-synthesis, place and route flows as
presented in chapter 3 and the final silicon footprints (GDS2 files) were obtained. The target
technology was TSMC's 0.13 micron 1 Poly, 8 Copper layers High-Speed (HS) process. Both
VLSI layouts are implemented with SRAM cells, which replace the arrays in the TLM source code
(ESL) and are used in the individual CPU/AHB sub-systems for the CMP. The results are
compared in figure 4.30.
Parameter            ESL                      CMP
Instances (Macros)   93536 (28)               292816 (92)
Area (µm²)           3264 x 3239 (10572096)   3858 x 3856 (14876488)
Max Freq (MHz)       220.7                    330.5
Figure 4.30 The final VLSI layouts and physical implementation for the ESL designed
TLM engine and CPU based TLM processor
4.5.3 Conclusion 
To conclude, it is observed that the use of ESL methodologies in numerical algorithm
implementation can lead to a very powerful solution in terms of design time and performance,
compared to the CPU based architectures. Driven by this conclusion, other numerical
implementations are investigated further.
4.6 Summary 
A novel ESL design flow was detailed in this chapter. In this method, an IEEE 754 floating
point datapath was implemented by adopting the SoftFloat source code. Within this design flow,
the parallelizing technique was presented as well. Hardwired TLM engines were developed with
this method and compared with the programmable solution. This comparison demonstrates
that the ESL method is a powerful approach for hardware
implementation. Also, compared to the programmable solution, the hardwired approach can
significantly improve the performance in terms of throughput for the TLM application.
References
[1] www.systemc.org
[2] Adam Donlin, "Transaction Level Modeling: Flows and Use Models", CODES+ISSS'04,
Stockholm, Sweden, pp. 124-129, September 8-10, 2004.
[3] http://www.jhauser.us/arithmetic/SoftFloat.html
[4] www.celoxica.com
[5] S. Akella, "Guidelines For Design Synthesis Using Synopsys Design Compiler",
Department of Computer Science Engineering, University of South Carolina, Columbia,
December 2000.
[6] Design Compiler 2003.06, Synopsys Inc., 2003.
[7] Advanced Logic Technology - 0.13 µm, Taiwan Semiconductor Manufacturing Company,
2006.
Chapter 5: Hardwired approach for FFT using ESL method 
Chapter 5
Hardwired approach for FFT using ESL method 
5.1 FFT background 
As one of the most important tools in digital signal processing, the Discrete Fourier
Transform (DFT) is the basis of numerous applications such as spectral estimation,
convolution and compression. The issue with the DFT is that an un-optimised implementation is
very computationally intensive, so that it would hardly be feasible to perform in real time.
Fortunately, the DFT can be implemented using a very efficient algorithm known as the Fast
Fourier Transform (FFT). Although there are a large number of existing algorithms to implement
the FFT, we will take a particular technique called radix-2, which is a special case [1] but
contains all the basic ideas.
The expression for the DFT of a sequence $x[n]$, $n = 0, \ldots, N-1$, can be described by

\[
X_N[k] = \sum_{n=0}^{N-1} x[n]\, w_N^{nk}, \qquad k = 0, \ldots, N-1 \tag{5.1}
\]

with $w_N = e^{-j2\pi/N}$. The $N$ in $X_N[k]$ indicates that this is a DFT of an N-point sequence.
The whole summation can be divided into two terms, one with even indices and one with odd
indices:

\[
X_N[k] = \sum_{n\ \mathrm{even}} x[n]\, w_N^{nk} + \sum_{n\ \mathrm{odd}} x[n]\, w_N^{nk} \tag{5.2}
\]

Assuming the number of data points $N$ to be even, we can write the even indices as $n = 2m$ and
the odd indices as $n = 2m+1$, with $m = 0, \ldots, (N/2)-1$. As a result, the DFT becomes

\[
X_N[k] = \sum_{m=0}^{(N/2)-1} x[2m]\, w_N^{2mk} + w_N^{k} \sum_{m=0}^{(N/2)-1} x[2m+1]\, w_N^{2mk} \tag{5.3}
\]

For the right summation, we used $w_N^{k} w_N^{2mk}$ instead of $w_N^{(2m+1)k}$. In addition,

\[
w_N^{2} = e^{-j2\pi/(N/2)} = w_{N/2} \tag{5.4}
\]

so that $w_N^{2mk} = w_{N/2}^{mk}$. Therefore, the two summations in the DFT are themselves DFTs:
the DFT of the even samples $x[0], x[2], \ldots, x[N-2]$ and the DFT of the odd samples
$x[1], x[3], \ldots, x[N-1]$. As a result, the equation becomes

\[
X_N[k] = X^{even}_{N/2}[k] + w_N^{k}\, X^{odd}_{N/2}[k], \qquad k = 0, \ldots, N-1 \tag{5.5}
\]

where

\[
X^{even}_{N/2}[k] = \mathrm{DFT}\{x[0], x[2], \ldots, x[N-2]\} = \sum_{m=0}^{(N/2)-1} x[2m]\, w_{N/2}^{mk} \tag{5.6}
\]

\[
X^{odd}_{N/2}[k] = \mathrm{DFT}\{x[1], x[3], \ldots, x[N-1]\} = \sum_{m=0}^{(N/2)-1} x[2m+1]\, w_{N/2}^{mk} \tag{5.7}
\]

In other words, the DFT of an N-point sequence is expressed as a combination of the DFTs of
two N/2-point sequences, $x[2m]$ and $x[2m+1]$, with $m = 0, \ldots, (N/2)-1$.
For instance, if $N = 4$, it is necessary to compute $X_4[0], X_4[1], X_4[2], X_4[3]$, which can be
built from $X^{even}_2[0], X^{even}_2[1]$ and $X^{odd}_2[0], X^{odd}_2[1]$. For the indices $k = 0, 1$ the
equation can be described as

\[
X_4[k] = X^{even}_2[k] + w_4^{k}\, X^{odd}_2[k] \tag{5.8}
\]

For the other two indices, $k = 2, 3$, since the DFT of a 2-point sequence is periodic with period
2, that is $X_2[k] = X_2[k+2]$, the equation can be described, still for $k = 0, 1$, as

\[
X_4[k+2] = X^{even}_2[k] + w_4^{2+k}\, X^{odd}_2[k] \tag{5.9}
\]

Because $w_4^{2} = e^{-j2\pi/2} = -1$, the two expressions can be described as

\[
\begin{aligned}
X_4[k] &= X^{even}_2[k] + w_4^{k}\, X^{odd}_2[k] \\
X_4[k+2] &= X^{even}_2[k] - w_4^{k}\, X^{odd}_2[k]
\end{aligned} \tag{5.10}
\]

for $k = 0, 1$. This can be generalized for any sequence of even length $N$ as

\[
\begin{aligned}
X_N[k] &= X^{even}_{N/2}[k] + w_N^{k}\, X^{odd}_{N/2}[k] \\
X_N[k+N/2] &= X^{even}_{N/2}[k] - w_N^{k}\, X^{odd}_{N/2}[k]
\end{aligned} \tag{5.11}
\]

for $k = 0, \ldots, (N/2)-1$. This is shown in figure 5.1 for the case $N = 8$. The operation of
"mixing" the two N/2-point DFTs is written in terms of the "butterfly" operator, shown in figure 5.2.
[Figure 5.1: the 8-point DFT is split into 4-point DFTs, which are in turn built from 2-point DFTs
and combined through twiddle factors]
Figure 5.1 Decomposition of an 8-point DFT
[Figure 5.2: butterfly operator with inputs a and b and outputs a+b and a-b]
Figure 5.2 Butterfly operator
When the data length N is a power of 2, it can be written as $N = 2^L$ (L is an integer). In
this way, N is even and whenever it is divided by 2 the quotient is again even, until a length
of 2 is reached. For instance, when N=8 as shown in figure 5.1, N/2=4 is still even, and it can
be divided again into two DFTs of length 2, as shown in figure 5.2. A 2-point DFT is just a
butterfly. Since $w_2 = e^{-j2\pi/2} = -1$ and $w_2^k = (-1)^k = \pm 1$ when k=0,1, the final result
is shown in figure 5.3 for N=8.
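The even/odd decomposition of equation (5.11) maps directly onto a recursive routine. The following C++ sketch is only an illustration under stated assumptions: it uses double precision std::complex instead of the SoftFloat-based datapath of the thesis, and the names Complex and fft_radix2 are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Recursive radix-2 FFT built from equation (5.11).
// The input length must be a power of two.
std::vector<Complex> fft_radix2(const std::vector<Complex>& x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    if (n == 1) {
        return x;  // a 1-point DFT is the sample itself
    }

    // Split into even-indexed and odd-indexed samples.
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (std::size_t m = 0; m < n / 2; ++m) {
        even[m] = x[2 * m];
        odd[m] = x[2 * m + 1];
    }
    const std::vector<Complex> x_even = fft_radix2(even);
    const std::vector<Complex> x_odd = fft_radix2(odd);

    // Butterfly: combine the two N/2-point DFTs with twiddle factor w_N^k.
    const double pi = std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle =
            -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const Complex w = std::polar(1.0, angle);
        out[k] = x_even[k] + w * x_odd[k];
        out[k + n / 2] = x_even[k] - w * x_odd[k];
    }
    return out;
}
```

For the 4-point input {1, 2, 3, 4} the routine returns {10, -2+2j, -2, -2-2j}, which agrees with evaluating equation (5.1) directly.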
[Figure 5.3: signal flow graph of the 8-point FFT; inputs in the order x[0], x[4], x[2], x[6], x[1],
x[5], x[3], x[7]; outputs X8[0] to X8[7]]
Figure 5.3 8-point FFT
Figures 5.1-5.3 were modified from R. Cristi, "Modern Digital Signal Processing", 2003
5.2 ESL-based FFT engine 
Although a considerable amount of literature has been published on the hardware realization
of the FFT, only a few publications report floating point based FFT hardware. The earliest
interest in floating point FFT implementation can be traced back to 1977 [2]. Though no
hardware implementation is discussed in that publication, it clearly demonstrates the
interest in floating point FFT driven by the demand for higher precision, which is critical in
applications such as military systems. The first publication on a hardware realization of
floating point FFT appeared in 1988 [3]. A vector DSP dedicated to FFT was developed. Its
internal architecture is highly parallel and the ALU is deeply pipelined for the FFT butterfly
function. Moreover, multiple instances of this processor can be connected through a shared bus.
As floating point based hardware needs a larger area, hardware floating point
implementations have only become relatively popular recently [4,5,6,7,8,9,10,11].
Swartzlander anticipated that longer transforms will be needed at higher data rates, with
greater precision and lower power consumption, and expected that single precision
IEEE standard floating point arithmetic will be required [4]. In the commercial area, a number of
vendors sell single precision FFT cores, but they give few technical details [5,6,7].
In Sadeghipour's work, a self-timed approach was adopted for floating point operations to
overcome the global clock overhead and clock distribution problems that arise in synchronous
FFT processors because of the large area of floating point arithmetic units [8]. Although only
an 8-point FFT was implemented and no direct throughput data is given, the approach can
potentially raise FFT performance with less power consumption. Hemmert et al. reported a 64-bit
floating point FFT implementation for scientific applications ranging from
global climate modelling to molecular dynamics [9]. In Mou et al.'s implementation, only a
single butterfly function is used in the processor because of the limited resources available
on FPGAs [10]. IEEE 754 floating point addition and multiplication need 7 and 3 clock cycles
respectively. Four RAMs are utilized in the design, which is implemented on a
Virtex-II XC2V1000-fg256-6 FPGA from Xilinx.
Through optimization of RAM access, this design achieves more than 150 MHz and 34.8 µs for
1024 points [10]. Floating point FFT was also implemented on an FPGA board in Mahdavi's
work [11]. For the 1024-point configuration, it takes 5131 cycles with a mean square error of
0.0001. The Xilinx ISE synthesis report shows a minimum clock period of 9.94 ns
(maximum frequency 100.6 MHz) [11].
Apart from IEEE standard floating point, block floating point has also been adopted in the past
to save memory [12,13].
Another interesting publication shows that an FFT has already been developed using an ESL
method on an FPGA board [14]. In this effort, the Handel-C language (a near relative of
SystemC) was chosen and the design was tested using the Celoxica RC1000-PP prototyping
board. Though only integer arithmetic was used in this design, it
demonstrated that the ESL based implementation outperforms other FFT implementations on
the same series of FPGA.
5.2.1 SystemC model 
A SystemC level model was built within the interface illustrated in figure 5.4. This engine
performs FFT computing on a sequence of complex inputs, using the Cooley-Tukey radix-2
decimation in frequency algorithm [1].
At SystemC level, the FFT engine has been configured for up to 4096 points for verification.
For different numbers of FFT points, the arithmetic and control functionality of the FFT engine
does not change; only the array size needs to be reconfigured, which results in a different
RAM size in the later RTL stage.
Since there are no data dependences between butterfly functions, the scalar FFT engine can be
easily parallelized by having more butterfly blocks, as illustrated in figure 5.5. This method
has been illustrated in chapter 4.
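The independence of the butterflies can be seen in an in-place decimation in frequency loop. The C++ sketch below is an assumption-laden illustration, not the SystemC engine of the thesis: it uses double precision std::complex, and the name fft_dif_inplace is hypothetical. Within one stage every butterfly reads and writes its own pair of elements, so the iterations of the two inner loops could be shared among several butterfly blocks.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Complex = std::complex<double>;

// In-place radix-2 decimation-in-frequency FFT.
// The length of x must be a power of two.
void fft_dif_inplace(std::vector<Complex>& x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    const double pi = std::acos(-1.0);

    // One pass of the outer loop is one FFT stage.
    for (std::size_t len = n; len >= 2; len /= 2) {
        const std::size_t half = len / 2;
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t j = 0; j < half; ++j) {
                // Each butterfly touches only x[start+j] and x[start+j+half].
                const Complex a = x[start + j];
                const Complex b = x[start + j + half];
                const double angle = -2.0 * pi * static_cast<double>(j) /
                                     static_cast<double>(len);
                x[start + j] = a + b;
                x[start + j + half] = (a - b) * std::polar(1.0, angle);
            }
        }
    }

    // The results are in bit-reversed order; permute them back.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(x[i], x[j]);
        }
    }
}
```

Splitting the `start` and `j` iterations of one stage across two, four or eight workers corresponds to the multi-butterfly arrangement described above; the stages themselves still have to run one after another.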
Input        Bits    Output       Bits
in_real      32      out_real     32
in_imag      32      out_imag     32
data_valid           data_ready
data_ack             data_req
Figure 5.4 Interface of FFT engine
The FFT engine initiates the reading of a sample by sending a data_req signal. Then it
waits for the data_valid signal to become high. After data_valid is set, the
data_req signal is lowered. The FFT engine then begins to read the real and imaginary data
samples from its input ports in_real and in_imag respectively.
[Figure 5.5: the inputs in_real, in_imag, data_req and data_valid feed a "parallelize FFT workload"
stage that distributes the work over several butterfly function blocks, which drive the outputs
out_real, out_imag, data_ready and data_ack]
Figure 5.5 Parallelize process of FFT engine
5.2.2 SystemC level simulation result 
[Figure 5.6: bar chart; vertical axis: cycle count (0 to 1800000); horizontal axis: Delay (35, 44,
47, 48, 68, 75, 80, 115, 197); series: 1 Thread, 2 Threads, 4 Threads, 8 Threads]
Figure 5.6 SystemC level simulations for FFT engine
The SystemC level FFT models were simulated with 4K points. The results are presented in
figure 5.6, where the Y axis represents the cycle count for a 4096-point computation, and the X
axis shows the different maximum delay configurations for the various thread configurations.
It can be observed that the cycle count decreases significantly from one thread to eight
threads, and that the cycle count tends to increase as the delay configuration goes from 197
to 35. From these results, it is observed that the cycle count increases as more wait()
statements are placed, and that more butterfly functions can
potentially improve throughput in terms of cycle count.
5.2.3 RTL level simulation 
[Figure 5.7: bar chart; vertical axis: running time in ns (up to 10000000); horizontal axis: Delay
(35 to 197)]
Figure 5.7 RTL level simulation result of FFT engine
After integration with the pre-defined RAM, the FFT engines were simulated in ModelSim
with 50 sets of 4K inputs and the clock period set to 5 ns. The result is shown in figure 5.7.
The vertical axis is the running time in ns. The horizontal axis represents the FFT engines from
single to eight threads with different delay configurations. It is observed that the run time
reduces significantly from one to eight threads.
5.3 The implementation of ESL-based FFT engine 
The NAND gate count and flip-flop gate count are reported by the Agility compiler in each 
SystemC synthesis process. They are shown in figure 5.8 and figure 5.9. 
[Figure 5.8: bar chart; vertical axis: NAND count (0 to 2500000); horizontal axis: Delay (35, 44,
47, 48, 68, 75, 80, 115, 197); series: 1 Thread, 2 Threads, 4 Threads, 8 Threads]
Figure 5.8 NAND count for FFT implementations
In figure 5.8, the vertical axis is the NAND count for the FFT engines. The horizontal axis
represents the FFT engines from one thread to eight threads with different delay
configurations. It is noted that the NAND counts increase notably from one to eight threads
and that there is also a decreasing trend from delay 35 to delay 197. Figure 5.8 reflects
that better performance comes at the cost of a higher NAND count.
[Figure 5.9: bar chart; vertical axis: flip-flop count (0 to 40000); horizontal axis: Delay (35 to
197); series: 1 Thread, 2 Threads, 4 Threads, 8 Threads]
Figure 5.9 Flip flop count for FFT implementations
As discussed before, flip-flops are placed into the design to shorten the critical datapath.
Therefore, with a shorter critical datapath, the number of implemented flip-flops increases.
Figure 5.9 reflects this observation.
5.4 Result and discussion 
Category                Points   Time                     Reference
Academic references     1K       1732 µs                  [2]
                        1K       34.8 µs                  [10]
                        1K       51 µs                    [13]
Commercial references   1K       60 µs                    [5]
                        4K       123 µs                   [6]
ESL based design        4K       160 µs in scalar mode
Table 5.1 The comparison of floating point FFT implementation
Apart from Thong's work, all past literature on floating point FFT was published in
recent years and uses advanced technology. The ESL-based design was configured for 4K points and
achieves 160 µs on the ModelSim platform at 5 ns per cycle. Assuming, as a first approximation,
that the cost of the FFT calculation is proportional to the number of points, it can be deduced
that the performance for 1K points is roughly 40 µs. Compared to both academic and commercial
references, the ESL-based design performs slightly worse than Mou et al. [10] and Altera's
work [6], but outperforms Ochi et al.'s design [13] and the commercial IP [5]. Though there is no
comparison regarding area, as our design was not implemented on an FPGA platform, the result
indicates that the performance of the scalar mode ESL-based FFT design is competitive
compared to past literature. The parallel implementation can raise the performance further.
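The 1K-point estimate can be reproduced with a short calculation. The following C++ sketch is an illustration only; the function names are hypothetical. It scales the measured 4K-point time under the proportional assumption used in the text and, for comparison, under the assumption that the cost grows as N log2 N, which is the usual operation count of a radix-2 FFT.

```cpp
#include <cassert>
#include <cmath>

// Scale a measured time from n_from points to n_to points,
// assuming the cost is proportional to the number of points.
double scale_linear(double time_us, double n_from, double n_to) {
    assert(n_from > 0.0 && n_to > 0.0);
    return time_us * n_to / n_from;
}

// Same scaling, assuming the cost grows as N * log2(N).
double scale_nlogn(double time_us, double n_from, double n_to) {
    assert(n_from > 1.0 && n_to > 1.0);
    return time_us * (n_to * std::log2(n_to)) / (n_from * std::log2(n_from));
}
```

With the measured 160 µs for 4096 points, scale_linear gives 40 µs for 1024 points, while scale_nlogn gives about 33.3 µs, so the proportional estimate quoted in the text is the more conservative of the two.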
These findings suggest that the ESL-based design flow is very useful
for rapid prototyping. It shortens the design cycle drastically. Design verification is
easier and quicker with a high level description as well.
5.5 Summary 
To further validate the ESL method, FFT was chosen in this study. In the case of FFT, 
hardwired FFT engine was developed both in scalar and parallel mode. Compared to related 
literature, the performance of ESL-based FFT engine is competitive. 
References 
[1] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier
series," Math. Comput., pp. 297-301, 1965.
[2] Tran Thong, B. Liu, "Accumulation of roundoff errors in floating point FFT," Circuits and
Systems, IEEE Transactions on, Volume 24, pp. 132-143, Mar. 1977.
[3] A. Genusov, P. Feldman, R. Friedlander, V. Fruchter, R. Jaliff, A. Mohr, R. Shenhav, "A new,
highly parallel, 32 bit floating point DSP vector signal processor," International Conference on
Acoustics, Speech, and Signal Processing, pp. 2116-2119, vol. 4, April 1988.
[4] E. E. Swartzlander, "Systolic FFT Processors: Past, Present and Future," International
Conference on Application-specific Systems, Architectures and Processors, pp. 153-158, Sept.
2006.
[5] http://www.electronicstalk.com/news/nal/naII03.html
[6] http://www.altera.com/literature/wp/wp_fft_radix2.pdf
[7] http://www.transtechdsp.com/datasheets/qx-fft_vl.pdf
[8] K. Dabbagh-Sadeghipour, M. Eshghi, "A self-timed, pipelined floating point FFT processor
architecture," International Symposium on Signals, Circuits and Systems, pp. 33-36, 10-11 July
2003.
[9] K. S. Hemmert, K. D. Underwood, "An analysis of the double-precision floating-point FFT on
FPGAs," IEEE Symposium on Field-Programmable Custom Computing Machines, 2005,
pp. 171-180, 18-20 April 2005.
[10] S. Mou, X. Yang, "Design of a High-speed FPGA-based 32-bit Floating-point FFT
Processor," International Conference on Software Engineering, Artificial Intelligence, Networking,
and Parallel/Distributed Computing, pp. 84-87, July 2007.
[11] N. Mahdavi, R. Teymourzadeh, M. Bin Othman, "VLSI Implementation of High Speed and
High Resolution FFT Algorithm Based on Radix 2 for DSP Application," 5th Student Conference
on Research and Development, pp. 1-4, Dec. 2007.
[12] T. Lenart, V. Owall, "Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT
Cores," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 1286-1290,
Nov. 2006.
[13] Hiroshi Ochi, "RTL design of parallel FFT with block floating point arithmetic,"
IEEE Conference on Soft Computing in Industrial Applications, pp. 273-276, 25-27 June 2008.
[14] S. Sukhsawas, K. Benkrid, "A high-level implementation of a high performance pipeline FFT
on Virtex-E FPGAs," IEEE Computer Society Annual Symposium on VLSI, pp. 229-232, 2004.
Chapter 6: Hardwired approach for Smith-Waterman using ESL method
Chapter 6
Hardwired approach for Smith-Waterman using ESL method 
6.1 Smith-Waterman background 
In bioinformatics, data searching is used to find a gene or protein related to our newly 
determined sequence. The result of data searching is sequence likeness or similarity. If the 
similarity is great enough we are allowed to make the scientific inference that the two 
sequences are homology [I]. 
Smith-Waterman is one of the most important data searching algorithms [2]. Through 
comparing sequences, the Smith-Waterman algorithm is searching for homology by 
comparing sequences. The algorithm belongs to dynamic programming. In dynamic 
programming, the entire problem is divided into sub-problems. The solution for the entire 
problem is built by putting all solutions for sub-problems together. Through implementing 
the idea of dynamic programming, the Smith-Waterman algorithm finds the optimal local 
alignment by considering alignments of any possible length starting and ending at any 
position in the two sequences being compared. The basis of a Smith-Waterman 
implementation is the comparison of two sequences 
A=(a,a,a3···a,) and B = (b,b,b, ... b,) (6.1) 
The Smith-Waterman algorithm is based on individual pair comparisons between characters 
as: 
SW' =max 
'J 
SWH .i_, + s(a i , b), 
SWi_k,i + gi' 
S Wi.i-k + gi' 
O. (6.2) 
$SW_{i,j}$ is the Smith-Waterman score for the partial alignment ending at residue i of
sequence a and residue j of sequence b. In the Smith-Waterman equation, all four terms are
evaluated and the term with the maximum value is selected. The first term,
$SW_{i-1,j-1} + s(a_i, b_j)$, represents extending the alignment by one residue from each
sequence. The second term, $SW_{i-k,j} + g_k$, corresponds to extending the alignment by
including residue j from sequence b and inserting a gap of k residues in length, aligned to end
with residue j of sequence b, into sequence a. The third term, $SW_{i,j-k} + g_k$, describes
inserting a gap into sequence b.
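The code listings in Figure 6.2 use a single constant, DELTA, for gaps. As a hedged reading of how they relate to equation (6.2): assuming a linear gap penalty $g_k = k\delta$ with $\delta < 0$ (the symbol $\delta$ is introduced here only for illustration and stands for DELTA), a gap of length k can be built from k single-residue steps, so only the three nearest neighbours of a cell need to be examined:

```latex
SW_{i,j} = \max \left\{
\begin{array}{l}
SW_{i-1,j-1} + s(a_i, b_j), \\
SW_{i-1,j} + \delta, \\
SW_{i,j-1} + \delta, \\
0
\end{array}
\right.
\qquad \text{with } g_k = k\,\delta,\ \delta < 0 .
```

This holds because $SW_{i-1,j} \ge SW_{i-k,j} + (k-1)\delta$ follows from applying the same recurrence k-1 times along a column (and likewise along a row), so the longer gaps never beat the single-step terms.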
        -   G   C   T   G   G   A   A   G   G   T   A   C
    -   0   0   0   0   0   0   0   0   0   0   0   0   0
    C   0   5   0   0   0   0   0   0   0   0   0   0   5
    G   0   5   0   1   5   5   0   0   5   5   0   0   0
    A   0   0   1   0   0   1  10   5   0   1   1   0   0
    A   0   0   0   0   0   0   6  15   8   1   0   6   0
    G   0   5   0   0   5   5   0   8  20  13   6   0   0
    C   0   0  10   3   0   1   1   1  13  14   9   2   0
    A   0   0   3   6   0   0   6   6   6   9  10  14   7
    T   0   0   0   8   2   0   0   2   2   2  14   7  10
    C   0   0   5   1   4   0   0   0   0   0   7  10  12
Parameters: Match +5, Mismatch -4, Extend gap -7
Figure 6.1 Example of the Smith-Waterman algorithm
As illustrated in Figure 6.1, the example finds the best-scoring alignment between the two
DNA sequences GCTGGAAGGTAC and CGAAGCATC. The scoring table is filled by applying the
Smith-Waterman algorithm. The best-scoring alignment is then found by tracing back from the
highest score to the first non-zero score. In this case, it is GAAG, highlighted in red in the
figure.
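The fill and trace-back steps described above can be sketched in C as follows. This is a minimal illustration, not the thesis implementation: it assumes a linear gap penalty and the match, mismatch and gap values listed with Figure 6.1, and the names `sw_fill`, `sw_traceback`, `sw_best`, `score_pair` and `MAXLEN` are hypothetical.

```c
#include <string.h>

#define MATCH      5   /* Figure 6.1: score for identical residues */
#define MISMATCH (-4)  /* Figure 6.1: score for differing residues */
#define GAP      (-7)  /* Figure 6.1: penalty per gap residue (assumed linear) */
#define MAXLEN    32   /* hypothetical size limit for this sketch */

typedef struct { int score, i, j; } sw_best;  /* best cell, 1-based indices */

static int score_pair(char x, char y) { return x == y ? MATCH : MISMATCH; }

/* Fill the scoring table h for query q (rows) and target t (columns)
   and return the highest-scoring cell. */
sw_best sw_fill(const char *q, const char *t, int h[MAXLEN][MAXLEN])
{
    sw_best best = {0, 0, 0};
    int m = (int)strlen(q), n = (int)strlen(t);
    if (m >= MAXLEN || n >= MAXLEN) return best;  /* too long for this sketch */
    for (int i = 0; i <= m; i++) h[i][0] = 0;
    for (int j = 0; j <= n; j++) h[0][j] = 0;
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int v = h[i-1][j-1] + score_pair(q[i-1], t[j-1]);  /* diagonal */
            if (h[i-1][j] + GAP > v) v = h[i-1][j] + GAP;      /* down */
            if (h[i][j-1] + GAP > v) v = h[i][j-1] + GAP;      /* right */
            if (v < 0) v = 0;                                  /* local alignment floor */
            h[i][j] = v;
            if (v > best.score) { best.score = v; best.i = i; best.j = j; }
        }
    }
    return best;
}

/* Walk back from the best cell until a zero score is reached and copy the
   aligned part of the query into out (NUL-terminated). Returns its length. */
int sw_traceback(const char *q, const char *t, int h[MAXLEN][MAXLEN],
                 sw_best best, char *out)
{
    int i = best.i, j = best.j;
    while (i > 0 && j > 0 && h[i][j] > 0) {
        if (h[i][j] == h[i-1][j-1] + score_pair(q[i-1], t[j-1])) { i--; j--; }
        else if (h[i][j] == h[i-1][j] + GAP) i--;
        else j--;
    }
    int len = best.i - i;
    memcpy(out, q + i, (size_t)len);
    out[len] = '\0';
    return len;
}
```

Called with the query CGAAGCATC and the target GCTGGAAGGTAC, `sw_fill` reports a best score of 20 and `sw_traceback` recovers the segment GAAG, in line with the alignment named in the text. The sketch recomputes the move at each step instead of storing trace-back arrays, which keeps it short at the cost of a few extra additions.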
As the literature shows, dedicated hardware units can compute SW hundreds of times faster than
a software solution [3,4,5]. For example, in [4] the dedicated SW engine runs 170 times
faster than an optimized C program running on a 1.6 GHz Pentium IV processor.
As the sizes of the GenBank/EMBL/DDBJ databases double about every 15 months, faster and
higher-precision implementations of the Smith-Waterman algorithm are needed.
6.2 ESL-based SW engine 
For the SystemC model, Cdc25 [6] and Ste5 [7] are used as the query database and the query
sequence respectively. The PAM250 matrix is used to calculate the score for each matrix cell
[8].
The original Smith-Waterman code needs to be modified for compilation with Agility, as
Figure 6.2 shows.
Original code: 
for (i = 1; i <= a[0]; i++) {
    for (j = 1; j <= b[0]; j++) {
        diag  = h[i-1][j-1] + sim[a[i]][b[j]];
        down  = h[i-1][j] + DELTA;
        right = h[i][j-1] + DELTA;
        max = MAX3(diag, down, right);
        if (max <= 0) {
            h[i][j] = 0;
            xTraceback[i][j] = -1;
            yTraceback[i][j] = -1;    /* these values are already -1 */
        } else if (max == diag) {
            h[i][j] = diag;
            xTraceback[i][j] = i-1;
            yTraceback[i][j] = j-1;
        } else if (max == down) {
            h[i][j] = down;
            xTraceback[i][j] = i-1;
            yTraceback[i][j] = j;
        } else {
            h[i][j] = right;
            xTraceback[i][j] = i;
            yTraceback[i][j] = j-1;
        }
    }
} /* end for loop */
Code optimised for Agility compilation:
for (i = 1; i <= a[0]; i++) {
    for (j = 1; j <= b[0]; j++) {
        diag = float32_add(h[i-1][j-1], sim[a[i]][b[j]]);
        wait();
        down = float32_add(h[i-1][j], DELTA);
        wait();
        right = float32_add(h[i][j-1], DELTA);
        wait();
        max = MAX3(diag, down, right);
        wait();
        if (max <= 0) {
            h[i][j] = 0;
            xTraceback[i][j] = -1;
            yTraceback[i][j] = -1;
        } else if (max == diag) {
            h[i][j] = diag;
            xTraceback[i][j] = i-1;
            yTraceback[i][j] = j-1;
        } else if (max == down) {
            h[i][j] = down;
            xTraceback[i][j] = i-1;
            yTraceback[i][j] = j;
        } else {
            h[i][j] = right;
            xTraceback[i][j] = i;
            yTraceback[i][j] = j-1;
        }
    }
}
Figure 6.2 Synthesizable Smith-Waterman code for Agility compiling 
In Figure 6.2, no wait() statement is placed among the integer operations, because the
datapath of an integer operation is much shorter than that of a floating-point operation.
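MAX3 is used in Figure 6.2 but its definition is not shown. The sketch below gives one plausible integer definition and isolates the selection logic of the if / else-if chain, to make explicit that ties are resolved in the order diagonal, down, right. The macro bodies, the `sw_move` type and the `sw_select` function are assumptions made for illustration, not code from the thesis.

```c
/* Assumed definitions: the thesis listing uses MAX3 without showing it. */
#define MAX2(x, y)    ((x) > (y) ? (x) : (y))
#define MAX3(x, y, z) MAX2(MAX2((x), (y)), (z))

/* Hypothetical tag for the predecessor chosen for a cell. */
typedef enum { SW_NONE, SW_DIAG, SW_DOWN, SW_RIGHT } sw_move;

/* Mirrors the if / else-if chain of Figure 6.2 for integer scores:
   a non-positive maximum resets the cell to 0; otherwise the first
   candidate equal to the maximum wins (diag, then down, then right). */
sw_move sw_select(int diag, int down, int right, int *score)
{
    int max = MAX3(diag, down, right);
    if (max <= 0)    { *score = 0;    return SW_NONE; }
    if (max == diag) { *score = diag; return SW_DIAG; }
    if (max == down) { *score = down; return SW_DOWN; }
    *score = right;
    return SW_RIGHT;
}
```

For example, candidates (5, 5, 3) select the diagonal predecessor, while (1, 8, 8) select the down predecessor. The choice matters only for which of several equally good alignments the trace-back reports, not for the score itself.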
SystemC-level simulation was carried out and produces cycle-accurate results, as Figure 6.3
shows.
[Plot: cycle count of the SW engine (axis range 0 to 12,000,000) for delay configurations
35, 44, 47, 48, 68, 75, 80, 115 and 197]
Figure 6.3 SystemC level simulation for SW processor
In Figure 6.3, the cycle count decreases from delay 35 to delay 197, which reflects that more
wait() statements are needed to shorten the critical datapath.
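The trade-off between cycle count and delay can be illustrated with a rough model. The sketch below assumes that every wait() costs exactly one clock cycle, that each matrix cell executes the same number of wait() statements, and that loop overhead is ignored. The function names and parameters are hypothetical, and the model is not a reproduction of the measurements in Figure 6.3.

```c
#include <stdint.h>

/* Rough cycle estimate: one clock cycle per wait() per matrix cell.
   query_len and db_len are the two sequence lengths; waits_per_cell is the
   number of wait() statements executed in the inner loop body. */
uint64_t sw_estimate_cycles(uint32_t query_len, uint32_t db_len,
                            uint32_t waits_per_cell)
{
    return (uint64_t)query_len * db_len * waits_per_cell;
}

/* Total run time for a given clock period. A shorter critical path allows
   a shorter period, which can offset the larger cycle count. */
double sw_estimate_time(uint64_t cycles, double clock_period)
{
    return (double)cycles * clock_period;
}
```

Under these assumptions, doubling the number of wait() statements per cell doubles the cycle count, so the change only pays off if the achievable clock period falls by more than half.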
6.3 The implementation of ESL-based SW engine
The implementation space was studied using the same method proposed earlier. The relevant
results are shown in Figures 6.4 and 6.5.
[Plot: NAND count (axis range 0 to 1,400,000) against Delay for configurations 35, 44, 47,
48, 68, 75, 80, 115 and 197]
Figure 6.4 NAND count for SW engine
Figure 6.4 shows an increasing trend from delay 197 to delay 35. It appears that more NAND
gates are needed when the longest datapath is shortened using the proposed ESL method.
[Plot: flip-flop count (axis range 0 to 25,000) against Delay for configurations 35, 44, 47,
48, 68, 75, 80, 115 and 197]
Figure 6.5 Flip-flop count for SW engine
Figure 6.5 shows that the flip-flop count depends strongly on the maximum delay
configuration. Flip-flops are the basic component of sequential logic, and more sequential
logic is needed for a shorter datapath.
6.4 Conclusion 
In this study, IEEE-compatible floating-point SW engines were developed using the ESL method.
With this method, the development can be completed easily and in a short time. Compared
to the past study [14], in which 13 person-months were needed to develop a dedicated
Smith-Waterman engine, there is a significant advantage in terms of design efficiency.
From the two studies presented, the ESL method proposed in chapter 4 was demonstrated
further. This methodology can efficiently accelerate the hardwired implementation of
numerical applications. Also, the ESL-based engines are competitive with other reported
designs produced by a conventional design flow.
6.5 Summary 
Smith-Waterman was chosen in this chapter to further validate the ESL method. A scalar
hardwired engine was developed using the ESL method. To the author's knowledge, this is the
first time that a floating-point hardwired engine has been demonstrated for the
Smith-Waterman application.
References:
[1]. A. M. Lesk, Introduction to Bioinformatics. Oxford University Press, 2002.
[2]. M. Waterman and M. Eggert, "A new algorithm for best subsequence alignments with
application to tRNA-rRNA comparisons," Journal of Molecular Biology, vol. 197, pp. 723-728,
1987.
[3]. Y. Yamaguchi, T. Maruyama, and A. Konagaya, "High speed homology search with
FPGAs," in Pacific Symposium on Biocomputing, pp. 271-282, 2002.
[4]. T. F. Oliver, B. Schmidt, and D. L. Maskell, "Hyper customized processors for
bio-sequence database scanning on FPGAs," in FPGA, pp. 229-237, 2005.
[5]. T. Han and S. Parameswaran, "SWASAD: an ASIC design for high speed DNA sequence
matching," Design Automation Conference, 2002. Proceedings of ASP-DAC 2002. 7th Asia
and South Pacific and the 15th International Conference on VLSI Design. Proceedings.
[6]. http://en.wikipedia.org/wiki/Cdc25
[7]. http://db.yeastgenome.org/cgi-bin/locus.pl?locus=ste5
[8]. http://www.icp.be/~opperd/private/pam250.html
Chapter 7: Conclusion and future work 
Chapter 7: 
Conclusion and future work 
The aim of this thesis is to study the feasibility of ESL for accelerating the execution of
numerical codes. TLM was explored first. To support architectural exploration of the
programmable solution, a custom simulator based on SimpleScalar was detailed. Within this
simulation infrastructure, the TLM code was optimised to investigate the data-level and
thread-level parallelism in the TLM workload. The instruction-accurate simulation results
showed that a significant performance improvement can potentially be achieved by implementing
vector and multiprocessor architectures. Consequently, the micro-architecture and VLSI
implementation of these architectures were developed. With the increasing interest in and
availability of ESL methods, a dedicated hardwired TLM engine was developed using a novel ESL
design flow. An IEEE 754 compliant floating-point datapath was implemented in the TLM engine.
It was observed that a dedicated TLM engine can deliver a significant performance improvement
over the programmable solution. Parallelism can be exploited with this method as well. Apart
from these benefits, design effort can also be saved by using the ESL method. To demonstrate
the suitability of the proposed ESL method, ESL-based implementations of FFT and
Smith-Waterman were investigated further.
Contributions 
1. A programmable solution for the TLM algorithm was developed. In this solution, data-level
and thread-level parallelism were realized by the vector coprocessor and multiprocessor
architecture.
2. The comparison between the ESL-based hardwired approach and the programmable approach
was presented for the TLM case. It demonstrates that a novel ESL design flow can raise
design efficiency and that a dedicated TLM engine can significantly improve performance
over a single-processor solution.
3. The suitability of the ESL method was investigated by implementing the FFT and
Smith-Waterman algorithms.
However, there is still significant potential in the proposed ESL method, hence future work
is suggested below. Currently, the floating-point datapath is implemented using the ESL
method at a higher abstraction level. A more structured SystemC model could potentially reduce
implementation area and save power. In the author's view, some improvements to the
proposed ESL method may be obtained by implementing additional levels in the abstraction
hierarchy for a more detailed representation of the architecture.
Due to the limitations of the synthesizable subset, some numerical codes are difficult to
implement using the proposed ESL method. More arithmetic functions, such as log and
trigonometric functions, should be added to the synthesizable subset in the future.
To address the power issue, power awareness could also be brought to the current ESL flow as
future work.
In the programmable approach, to better explore different possible architectures, a
cycle-accurate simulator could be developed on top of SimpleScalar in the future.

