University of Windsor

Scholarship at UWindsor
Electronic Theses and Dissertations

Theses, Dissertations, and Major Papers

2007

A low-cost processor-based logic emulation system using FPGAs
Marwan Kanaan
University of Windsor

Follow this and additional works at: https://scholar.uwindsor.ca/etd

Recommended Citation
Kanaan, Marwan, "A low-cost processor-based logic emulation system using FPGAs" (2007). Electronic
Theses and Dissertations. 4615.
https://scholar.uwindsor.ca/etd/4615

This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor
students from 1954 forward. These documents are made available for personal study and research purposes only,
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution,
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or
thesis from this database. For additional inquiries, please contact the repository administrator via email
(scholarship@uwindsor.ca) or by telephone at 519-253-3000 ext. 3208.

A Low-Cost Processor-Based Logic
Emulation System Using FPGAs

by

Marwan Kanaan

A Thesis
Submitted to the Faculty of Graduate Studies
through Electrical and Computer Engineering
in Partial Fulfillment of the Requirements for the
Degree of Master of Applied Science at the
University of Windsor

Windsor, Ontario, Canada
2007

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Library and
Archives Canada

Bibliothèque et
Archives Canada

Published Heritage
Branch

Direction du
Patrimoine de l'édition

395 Wellington Street
Ottawa ON K1A 0N4
Canada

395, rue Wellington
Ottawa ON K1A 0N4
Canada
Your file Votre référence
ISBN: 978-0-494-34930-4
Our file Notre référence
ISBN: 978-0-494-34930-4

NOTICE:
The author has granted a non
exclusive license allowing Library
and Archives Canada to reproduce,
publish, archive, preserve, conserve,
communicate to the public by
telecommunication or on the Internet,
loan, distribute and sell theses
worldwide, for commercial or non
commercial purposes, in microform,
paper, electronic and/or any other
formats.

AVIS:
L'auteur a accordé une licence non exclusive
permettant à la Bibliothèque et Archives
Canada de reproduire, publier, archiver,
sauvegarder, conserver, transmettre au public
par télécommunication ou par l'Internet, prêter,
distribuer et vendre des thèses partout dans
le monde, à des fins commerciales ou autres,
sur support microforme, papier, électronique
et/ou autres formats.

The author retains copyright
ownership and moral rights in
this thesis. Neither the thesis
nor substantial extracts from it
may be printed or otherwise
reproduced without the author's
permission.

L'auteur conserve la propriété du droit d'auteur
et des droits moraux qui protège cette thèse.
Ni la thèse ni des extraits substantiels de
celle-ci ne doivent être imprimés ou autrement
reproduits sans son autorisation.

In compliance with the Canadian
Privacy Act some supporting
forms may have been removed
from this thesis.

Conformément à la loi canadienne
sur la protection de la vie privée,
quelques formulaires secondaires
ont été enlevés de cette thèse.

While these forms may be included
in the document page count,
their removal does not represent
any loss of content from the
thesis.

Bien que ces formulaires
aient inclus dans la pagination,
il n'y aura aucun contenu manquant.


© 2007 Marwan Kanaan

All Rights Reserved. No part of this document may be reproduced, stored or otherwise
retained in a retrieval system or transmitted in any form, on any medium by
any means without prior written permission of the author.


Abstract

Logic emulation systems are used to verify the functionality of logic designs targeted
for integrated circuit implementation. In this thesis, the design and implementation
of a low-cost processor-based logic emulation system is presented. It contains multiple
interconnected processors packaged in one emulation engine. It is capable of emulating
combinational and sequential logic at relatively high speeds of 187 kHz or more, in
real operating environments and with predictable compile time. The implementation
was done on an FPGA to reduce cost. The proposed system is scalable to a multi-FPGA
system in which several identical FPGAs can be connected together to increase the
logic capacity of the system.

The architecture and operation of the emulator are first described. Architecture
exploration experiments were conducted in order to choose suitable values for different
architecture parameters for implementation on the target FPGA. The design was
implemented on an Altera Stratix FPGA. A four-bit multiplier was emulated to verify
correct operation of the proposed emulation system.


To my family for their unending love and support.


Acknowledgments

I thank God Almighty that this thesis has been completed. I stand here humbly at
the end of this accomplishment confident that I would not have been able to do it
without His support and help. I ask Him, once and again, to continue to shed light
on each and every path I take.

I would like to thank my supervisor, Dr. Mohammed Khalid, for his support,
guidance and determination throughout the course of this work. I am deeply and
forever grateful for all the invaluable efforts he made. I would also like to thank Dr.
Abdel-Raheem and Dr. Zhang for sitting on my committee and reviewing my thesis
and Dr. Kar for sitting in as Chair of Defense.

Thanks to my family for all their love, support and advice. To my mom and
dad, thanks for all the encouragement, prayers, help and patience. I am what I am
today largely because of my parents and for that I am thankful. To my sister and my
grandmother, I am thankful for all the encouragement, prayers and care.

Thanks to all my friends and fellow graduate students at the University of Windsor.
Jay and Ian, I'll never forget all the times we spent together. It was wonderful
to have you as officemates. My thanks also go to the current and former members
of our research group: Amir, Kevin, Raymond, Omar, Junsong, Aws, Hongmei and
Thuan. I would also like to thank my friends Harb, Ali, Bahador, Payman, Ashkan,
Mahzad, and Amr for their friendship over the past two years.

Funding and technical support for this research were provided by the Natural
Sciences and Engineering Research Council (NSERC) of Canada, the University of
Windsor, the Canadian Microelectronics Corporation (CMC) and Altera Corporation.
Their contributions are gratefully acknowledged.


Contents

Abstract
Dedication
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
List of Symbols

1 Introduction
  1.1 Thesis Objectives
  1.2 Thesis Organization

2 Background and Previous Work
  2.1 Design Verification
    2.1.1 Formal Verification
    2.1.2 Software Simulation
    2.1.3 Hardware-Accelerated Simulation
    2.1.4 Rapid Prototyping
    2.1.5 Logic Emulation
  2.2 Logic Emulation Systems
    2.2.1 FPGA-Based Emulation Systems
      2.2.1.1 Field Programmable Gate Arrays
      2.2.1.2 Architecture and CAD for FBEs
    2.2.2 Processor-Based Emulation Systems
      2.2.2.1 Emulation Processors
      2.2.2.2 Architecture and CAD for PBEs
    2.2.3 Commercially Available Logic Emulators

3 System Architecture and Operation
  3.1 Introduction and Motivation
  3.2 Levels of Hierarchy
  3.3 Logic Emulation Processor
    3.3.1 Control Store
    3.3.2 Data Stacks
    3.3.3 Logic Element
    3.3.4 Architecture and Operation
  3.4 Memory Emulation Processor
    3.4.1 Control Store
    3.4.2 Memory Store
    3.4.3 Release Memory Word Unit
    3.4.4 Capture Memory Word Unit
    3.4.5 Architecture and Operation
  3.5 Emulation Module
    3.5.1 Module Level Routing Switch
    3.5.2 Sequential Filler
    3.5.3 Architecture and Operation
  3.6 Emulation Chip
    3.6.1 Chip Level Routing Switch
    3.6.2 Architecture and Operation
  3.7 Emulation Engine
    3.7.1 Multi-FPGA System
    3.7.2 Scalability Issues

4 Architecture Exploration and Implementation Results
  4.1 Implementation Target
    4.1.1 Altera Stratix FPGA
  4.2 Architecture Exploration
    4.2.1 Key Parameters
    4.2.2 Effect of Changing Parameters
      4.2.2.1 Effect of Changing Lookup Table Size
      4.2.2.2 Effect of Changing Number of Emulation Steps
      4.2.2.3 Effect of Changing Total Number of Outputs
      4.2.2.4 Effect of Changing Memory Word Size
    4.2.3 Choice of Parameters
  4.3 Implementation Results
    4.3.1 Logic Processor
    4.3.2 Memory Processor
    4.3.3 Emulation Module
    4.3.4 Emulation Chip
  4.4 Implementation Estimates for Emulation Engine
  4.5 Emulation Example
    4.5.1 Four-Bit Multiplier
    4.5.2 Scheduling and Implementation

5 Conclusion and Future Work
  5.1 Research Contributions
  5.2 Comparisons with Other Systems
  5.3 Future Work

References
Vita Auctoris

List of Figures

2.1   Logic Emulation System
2.2   A Generic FPGA Architecture
2.3   Internal Structure of a Logic Element
2.4   Internal Structure of a Lookup Table
2.5   Multi-FPGA System
2.6   CAD Flow for FBEs
2.7   Processor-Based Emulation System
2.8   CAD Flow for PBEs
3.1   Emulation Design Cycle
3.2   System Hierarchy
3.3   Logic Emulation Processor
3.4   Logic Processor Control Word Fields
3.5   Logic Element
3.6   Operation of the Logic Processor
3.7   Memory Emulation Processor
3.8   Memory Processor Control Word Fields
3.9   Operation of the Memory Processor
3.10  Emulation Module
3.11  Emulation Chip
3.12  8-Way Mesh MFS
3.13  Fully Connected MFS
3.14  Clock Duty Cycle
4.1   LUT Size vs. Area in LP
4.2   LUT Size vs. Memory Bits in LP
4.3   LUT Size vs. Speed in LP
4.4   Number of Emulation Steps vs. Area in LP
4.5   Number of Emulation Steps vs. Memory Bits in LP
4.6   Number of Emulation Steps vs. Speed in LP
4.7   Number of Emulation Steps vs. Area in MP
4.8   Number of Emulation Steps vs. Memory Bits in MP
4.9   Number of Emulation Steps vs. Speed in MP
4.10  Number of Total Outputs vs. Area in LP
4.11  Number of Total Outputs vs. Memory Bits in LP
4.12  Number of Total Outputs vs. Speed in LP
4.13  Number of Total Outputs vs. Area in MP
4.14  Number of Total Outputs vs. Memory Bits in MP
4.15  Number of Total Outputs vs. Speed in MP
4.16  Memory Word Size vs. Area in MP
4.17  Memory Word Size vs. Memory Bits in MP
4.18  Memory Word Size vs. Speed in MP
4.19  Module Level Routing Switch
4.20  Operation of the Four-Bit Multiplier

List of Tables

3.1   Logic Processor Control Word Fields Description
3.2   Memory Processor Control Word Fields Description
4.1   Multiplier Scheduling for Processors 0-3
4.2   Multiplier Scheduling for Processors 4-7

List of Abbreviations

Abbreviation  Definition
ASIC          Application-Specific Integrated Circuit
CAD           Computer Aided Design
DUT           Design Under Test
EDA           Electronic Design Automation
FBE           FPGA-Based Logic Emulation System
FF            Flip-Flop
FPGA          Field Programmable Gate Array
IC            Integrated Circuit
I/O           Input/Output
LE            Logic Element
LP            Logic Emulation Processor
LUT           Lookup Table
MFS           Multi-FPGA System
MP            Memory Emulation Processor
MUX           Multiplexer
PBE           Processor-Based Logic Emulation System
PCB           Printed Circuit Board
VHDL          Very High Speed Integrated Circuit Hardware Description Language
VLSI          Very Large Scale Integration

List of Symbols

Symbol  Definition
M       Lookup table size.
N       Total number of emulation steps in one design cycle.
P       Total number of outputs of all processors in one module.
Q       Memory word size.
R       The number of logic emulation processors in one emulation module.
S       The number of memory emulation processors in one emulation module.
T       The number of emulation modules in one emulation chip.

Chapter 1

Introduction

In this day and age, electronic devices, ranging from cell phones to personal computers,
play an essential role in our daily lives. Designing such devices and verifying
their functionality can be an excruciating task for engineers if the necessary tools
are missing. These tools, known as Computer-Aided Design (CAD) tools, have long
been a vital part of research in chip design, where considerable effort continues to be
made to ensure that designers have the most reliable and efficient tools.

One task of these design tools is design verification, the process by which the
functionality of an electronic device is validated. In the past three decades verification
has become one of the most crucial parts of the design cycle. Its importance is due
to the fact that designers must make sure that their design is correct prior to
fabrication. A simple error discovered after production is very expensive to fix,
potentially costing the manufacturing company millions of dollars in losses [12].
In addition, the increase in chip size [26] and the need to reduce time-to-market
require more capable and robust design verification tools to cope with the growing
industry.
Several design verification tools are available on the market today. The most
effective of these tools are logic emulators. A logic emulator is a design verification
tool in which a reprogrammable system imitates the functional behavior of a logic design.
The system can be programmed to act exactly as a desired chip, and thus gives
the user the ability to check the logic design in real-time circuit environments and
conditions before manufacturing [9, 17]. The designer can thereby verify the
Design under Test (DUT) by running tests that the real chip would have to pass.
This process can be repeated several times, and in this manner most errors can be
identified and corrected. Logic emulators give the user the ability to catch almost all
functional errors in a logic design; however, timing requirements and constraints
cannot be verified using this tool.
Currently there are two types of logic emulators: FPGA-Based Logic Emulation
Systems (FBEs) and Processor-Based Logic Emulation Systems (PBEs). In FBEs,
several reprogrammable chips known as Field Programmable Gate Arrays (FPGAs)
are connected together to emulate the functional behavior of a logic design. While
FBEs are considered to be low-cost and efficient emulators, they face a major problem
when it comes to their CAD tools. In the second type, processor-based emulators,
multiple emulation processors are packaged together in an emulation engine capable
of emulating a logic design of significant size and complexity. PBE systems are
considered to be efficient verification tools and do not suffer from problematic CAD
tools; however, they are implemented on custom-made chips and are therefore very
expensive.
The motivation behind this thesis is to design a logic emulation system that
combines the advantages of FBEs and PBEs: a system that is as efficient as
a PBE and as inexpensive as an FBE. To achieve this goal, one solution is
to implement the processor-based emulation system on an FPGA, a reprogrammable
chip known for its low cost. In this thesis we explore the architecture of such a system.
The proposed emulation system would run at relatively high speeds and be capable
of emulating designs of significant logic capacity and complexity.

1.1 Thesis Objectives

The main objective of this thesis is to explore the FPGA design and implementation of
a low-cost processor-based emulation system. To achieve this goal, an architecture of
a logic emulation system is explored and implemented. This thesis has the following
objectives:
1. Explore the architecture of a processor-based logic emulation system that can
be implemented on an FPGA.
2. Implement the emulator by targeting a specific FPGA.
3. Ensure that the PBE is scalable. Several FPGAs should be able to connect
together in a multi-FPGA system to increase logic capacity.
4. Verify the system by emulating a logic design.

To satisfy the first objective, an architecture of a processor-based emulator was
explored in terms of cost, functionality, area and speed, and key design parameters
were chosen accordingly. To satisfy the second objective, an Altera Stratix
FPGA was targeted. The implementation was tuned specifically for this FPGA. To
address the third objective, a scalability study was done and results are presented.
For the fourth objective, a four-bit multiplier was designed, scheduled and emulated
on the designed system to verify its correctness.
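
As an illustration of the kind of design used for the fourth objective, the following
sketch models a four-bit multiplier at the gate level in Python and checks it against
integer multiplication for all 256 input pairs. The structure (AND-gate partial products
accumulated with full adders) and the names `full_adder` and `multiply_4bit` are
assumptions made for this example; the multiplier actually emulated in this thesis is
described in Chapter 4 and may be organized differently.

```python
def full_adder(a, b, cin):
    """1-bit full adder built from XOR, AND and OR gates. Returns (sum, carry)."""
    p = a ^ b
    return p ^ cin, (a & b) | (p & cin)


def multiply_4bit(x, y):
    """Multiply two 4-bit unsigned integers using only 1-bit gate operations."""
    xb = [(x >> i) & 1 for i in range(4)]  # bits of x, least significant first
    yb = [(y >> j) & 1 for j in range(4)]  # bits of y, least significant first
    acc = [0] * 8                          # 8-bit running sum of partial products
    for j in range(4):
        carry = 0
        for i in range(4):
            partial = xb[i] & yb[j]        # one AND gate per partial-product bit
            acc[i + j], carry = full_adder(acc[i + j], partial, carry)
        acc[j + 4] = carry                 # this bit position has not been written yet
    return sum(bit << k for k, bit in enumerate(acc))


if __name__ == "__main__":
    # Exhaustive check: feasible here because there are only 16 * 16 input pairs.
    assert all(multiply_4bit(x, y) == x * y for x in range(16) for y in range(16))
    print("all 256 input pairs match")
```

A design this small can be checked exhaustively, which is one reason it is a convenient
first test case for a new verification platform.
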

1.2 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 discusses the background
and previous work done on the subject. Chapter 3 presents the architecture and
operation of the proposed system and its various components. Chapter 4 presents
the architectural exploration and implementation results. Lastly, Chapter 5 concludes
with some discussion of possible future work.


Chapter 2

Background and Previous Work

This chapter presents the background for the research done in this thesis and briefly
describes related previous work. The first section begins by defining design verification
and its significance in today's industry. It then briefly discusses the five major types
of verification tools available on the market today, along with their main advantages
and disadvantages. The second section of this chapter focuses on logic emulation
systems. A brief introduction is given with a discussion of the two main types of
logic emulators. The chapter concludes by presenting some examples of commercially
available logic emulation systems.

2.1 Design Verification

Design verification is the process whereby a logic design is checked for functional
errors. In this part of the design cycle, the functional behavior of a logic design is
validated. As chips increased in size and complexity, design verification tools became
more complicated and required more effort. Today the design verification process
alone may consume up to 60% of the whole design cycle in terms of time, resources
and manpower, making it the bottleneck for design development [14, 29, 16].
Over the past few decades design verification has evolved from simple mathematical
techniques, carried out by manual calculation to test the validity of
small designs, to multi-million dollar machines capable of verifying a design consisting
of millions of gates.

There are several types of design verification tools available on the market today,
each with advantages and disadvantages, and in general they can be categorized into
five major groups:
1. Formal verification
2. Software simulation
3. Hardware-accelerated simulation
4. Rapid prototyping
5. Logic emulation

In the following sections we introduce each of these types of tools and describe
its capabilities and weaknesses.

2.1.1 Formal Verification

In formal verification the designers prove the validity of a logic design using formal
methods: all or part of the design is modeled in a mathematical framework, after
which the designer solves the resulting mathematical equations to verify the correctness
of the design [20, 23].
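
To make the idea concrete, the sketch below shows one very simple formal technique,
combinational equivalence checking by exhaustive enumeration: a specification and a
gate-level implementation are treated as Boolean functions and compared on every
possible input assignment. The 2-to-1 multiplexer and all function names are assumptions
chosen for this example and are not taken from the cited works. Enumeration is only
feasible for a handful of inputs, which is why practical tools rely on symbolic methods
instead.

```python
from itertools import product


def spec_mux(sel, a, b):
    """Specification: 2-to-1 multiplexer, returns b when sel is 1, else a."""
    return b if sel else a


def impl_mux(sel, a, b):
    """Gate-level implementation: (NOT sel AND a) OR (sel AND b)."""
    return ((1 - sel) & a) | (sel & b)


def equivalent(f, g, num_inputs):
    """Compare f and g on every input assignment.

    Returns (True, None) if they always agree, otherwise (False, counterexample),
    where counterexample is the first differing input tuple found.
    """
    for bits in product((0, 1), repeat=num_inputs):
        if f(*bits) != g(*bits):
            return False, bits
    return True, None


if __name__ == "__main__":
    print(equivalent(spec_mux, impl_mux, 3))  # prints (True, None)
```

Unlike a simulation run, a `True` result here is a proof over the whole input space,
and a `False` result comes with a concrete counterexample.
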
The main advantage of formal verification is that it is highly effective in catching
design errors. Since it relies on a mathematical approach, formal verification is almost
completely guaranteed to find any functional error. The main disadvantage, however,
is that it is very time consuming. Proving the validity of a design using formal
verification requires extended periods of time from expert designers and hence it is
impractical to use in large IC designs [20].

Formal verification is thus probably the most comprehensive of all verification
tools, but due to its heavy cost in terms of time it can be used only on small
circuits or on specific parts of large designs.

2.1.2 Software Simulation

Software simulation is without doubt the most popular and widely used verification
tool [29]. It is widely available, inexpensive and above all user friendly. In simulation
the Design under Test (DUT) is represented by software models, which the designer
tests for correctness by applying input test vectors to them and then reading the
outputs to check for errors [27].
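
The flow just described can be sketched in a few lines of Python. The DUT here is a
1-bit full adder modeled with gate-level operations, and the test bench applies a list
of (inputs, expected outputs) pairs and reports any mismatch. The choice of DUT, the
function names and the vectors are assumptions for illustration only; production
simulators work on HDL descriptions rather than hand-written Python models.

```python
def full_adder(a, b, cin):
    """Software model of the DUT: a 1-bit full adder. Returns (sum, carry_out)."""
    p = a ^ b                    # XOR gate
    s = p ^ cin                  # sum output
    cout = (a & b) | (p & cin)   # carry output
    return s, cout


def simulate(dut, vectors):
    """Apply each (inputs, expected) pair to the DUT and collect mismatches."""
    failures = []
    for inputs, expected in vectors:
        actual = dut(*inputs)
        if actual != expected:
            failures.append((inputs, expected, actual))
    return failures


# Hand-picked input test vectors with their expected (sum, carry_out) outputs.
TEST_VECTORS = [
    ((0, 0, 0), (0, 0)),
    ((0, 1, 0), (1, 0)),
    ((1, 1, 0), (0, 1)),
    ((1, 1, 1), (1, 1)),
]

if __name__ == "__main__":
    print(simulate(full_adder, TEST_VECTORS))  # prints [] when every vector passes
```
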
Simulation has many advantages over other verification tools. It is generally easy
to use; the user's task is only to choose the input vectors, after which he or she has
to wait for outputs. Simulation is also inexpensive since it requires only a software
platform. It provides the user with high visibility and flexible debugging; the user can
observe each signal traversing the design to check for errors. But probably the most
important advantage is the flexibility that simulation provides. Since it is software-based,
changing and modifying part or all of the design is relatively easy to do.
Software sim ulation also has several disadvantages. The degree of accuracy of the
verification process depends heavily on the user’s choice of input test vectors. The
choice of these vectors should be comprehensive enough to cover all aspects of the de
sign or else some functional behavior of the design might be missed and go unchecked
for errors. A second m ajor disadvantage, and perhaps the most im portant, is th a t
sim ulation is relatively slow [27]. Because of the sequential nature of software pro

R eproduced with perm ission of the copyright owner. Further reproduction prohibited without permission.

2. BACKGROUND AN D PREVIO U S W O R K

cessing, sim ulating a large design, especially in its real world operating environment,
could literarily take days or even weeks.

2.1.3 Hardware-Accelerated Simulation

Hardware-accelerated simulation shares the same basic principles with software simulation. The motivation behind this method was simply to overcome the slow speed of software simulation. A logic design is still modeled in software; however, this time the simulation is executed on custom-made hardware rather than on a software platform running on a single processor. The processing power achieved by hardware accelerates the simulation and gives the advantage of faster simulation speed [27, 22, 31].

Despite the speed acceleration that this method provides, it still suffers from the same problem as software simulation: the degree of accuracy of the verification process still depends on the designer’s choice of inputs. In addition, the speedup provided by hardware accelerators is still limited by the communication media between the host computer and the hardware accelerator itself [29]. The time needed for the input vectors to be generated and the output signals to be read is still restricted by the connective devices.

2.1.4 Rapid Prototyping

As the name suggests, in rapid prototyping, a custom-made prototype of the logic design is built by the designer to verify the functionality of a design [19, 8]. Usually a custom multi-FPGA system is built for each prototype. In such systems, the FPGAs are programmed to imitate the functional behavior of the design and permanent connections are established between them to ensure design connectivity.

The main advantage of prototyping is speed. Since the whole design is implemented in hardware, rapid prototyping achieves the fastest verification speeds of all verification tools. In addition to speed, another advantage of rapid prototyping is that it provides the user with the capability of testing the prototype in its real operating environment. Rather than using input test vectors to test the system, real inputs are supplied from the surrounding target system, thus giving the user higher confidence in the validity of the design.

Nevertheless, rapid prototyping has a major disadvantage when it comes to cost. Once a prototype is built for a specific design it cannot be modified to implement another design; in other words it is a throw-away effort after the user is done with only one design. This basically means that the system is not reusable, making its cost very high.

2.1.5 Logic Emulation

The newest type of design verification and the most efficient one is logic emulation. A logic emulator is a reprogrammable system that can be programmed and reprogrammed to emulate logic designs at relatively high speeds. Once programmed, an emulator functions exactly as the desired hardware would, without the need for fabrication. In doing so, an emulator combines the advantages of software and hardware; because it is reprogrammable it is as flexible as software, and since it utilizes hardware it achieves very high speeds. However, it is important to note that although an emulator is programmable it is still quite different from software simulation. The hardware here is not being modeled in software; it is actually implemented on reprogrammable hardware.

Compared to other verification tools, logic emulation has many advantages. It is as flexible as software simulation yet much faster, and although it is not as fast as rapid prototyping it is not as costly, because it is reprogrammable. But the main advantage of logic emulation would have to be in-circuit emulation, the capability to function like an actual IC chip in real world operating environments. After being programmed, an emulator can be connected to a target system and tested, giving the user the opportunity to verify the operation before fabrication. This removes the need to generate input test vectors and, like rapid prototyping, gives the user higher confidence in his or her design by relying on real inputs supplied by the target system.

Logic emulation still has some disadvantages, mainly its cost. Logic emulation systems are still very expensive and can only be afforded by big companies. Designing and manufacturing such a system is still a costly process.

Since it is the main focus of this research, in the following sections of this chapter we describe the main types of logic emulators and discuss their advantages and disadvantages in detail.

2.2 Logic Emulation Systems

A typical logic emulation system, shown in Figure 2.1, contains three main elements:
1. Emulation engine (or emulator for short)
2. Emulation support facilities
3. Interface circuitry

Figure 2.1: Logic Emulation System (blocks labeled DUT, Host Computer, Logic Emulator, Interface Circuitry and System)

An emulator is basically a reprogrammable hardware system that can implement any logic design. This reprogrammable system could be a set of FPGAs or emulation processors connected together. Some details about the architecture of the emulator are discussed in later sections of this chapter.

Emulation support facilities include a host computer along with an emulation compiler [15]. The task of the host computer is to act as an interface between the user and the emulator. The compiler is responsible for converting the DUT supplied by the user into a bit stream to be downloaded to the programmable hardware. The emulation support facilities might also include some other components, such as a Data Capture Unit used to read the outputs from the emulator and relay them to the user.

The interface circuitry of the logic emulation system is used to connect the emulator to a target system to perform in-circuit emulation.
To use the system, the user supplies the compiler with a logic design (e.g. written in a hardware description language). The compiler compiles the design and generates a bit stream which can then be downloaded onto the emulation engine. At this point the emulator works exactly as the desired chip would work. Using the interface circuitry, the user can connect the emulator to a target system and test the design. This is the main advantage of emulation: the ability to test a design in its typical operating environment with real inputs.

To illustrate this, consider the example where the designers are verifying the functionality of a video card for a personal computer. In this case the DUT is the logic design for the video card and the target system is the personal computer. To perform in-circuit emulation, the designers would program the emulator with the design of the video card and connect it to the personal computer using the interface circuitry. The personal computer would then be powered up with the emulator acting as its video card. In this way the emulated design could be checked thoroughly for errors.

Logic emulation systems are currently considered to be the most effective and fastest method for design verification. They are used by most top semiconductor vendors to test IC designs before fabrication. The price for such systems varies from tens of thousands to millions of dollars depending on the type of the system, capacity and speed. Currently there are two main types of logic emulation systems available on the market:
1. FPGA-Based Emulators (FBEs)
2. Processor-Based Emulators (PBEs)

In what follows we present each of those types, their architectures, design tools and operation.

2.2.1 FPGA-Based Emulation Systems

The basic building block of an FBE system is an FPGA. In this emulator, several FPGAs are connected together to emulate (imitate) the functional behavior of a logic design. Before discussing the architecture of this system we first introduce FPGAs in detail.

2.2.1.1 Field Programmable Gate Arrays

A field programmable gate array is a reprogrammable chip that was first introduced in the 1980s [17]. By means of reprogrammable logic embedded inside, an FPGA can implement virtually any logic design. The main advantages of FPGAs can be summed up in two points. The first advantage is that they are inexpensive; the price for a single FPGA starts from a few dollars. In addition, the reprogrammable capability of the FPGA makes it reusable for many designs, which lowers its cost even more. The second main advantage of FPGAs is that they have a fast time-to-market. Unlike custom-made chips, where every single design has to be handled individually, FPGAs, because they are not custom made, are available off the shelf.

These advantages have given great importance to FPGAs in the industry. More and more designs are being implemented on FPGAs to save money and time. Companies can use FPGAs for their designs instead of Application-Specific Integrated Circuit (ASIC) chips to avoid the lengthy and costly process of designing and building custom-made chips. However, these gains do not come without a price; FPGAs are still bigger and slower than their counterpart ASIC chips. Because of their programmable nature, and since they are not built to suit a specific design but rather any design, FPGAs still suffer from a decrease in logic utilization, i.e. bigger area, and slower speed. FPGA manufacturers are addressing this problem now more than ever, and with the emergence of modern, more sophisticated FPGAs these problems are becoming less important and the advantages of FPGAs are outweighing their disadvantages.
The programmable ability of an FPGA is derived from the use of programmable logic elements able to emulate or imitate the functional behavior of any logic function. Several architectures for FPGAs have been proposed; however, they all share some basic components. Figure 2.2 is a simplified illustration of a typical FPGA architecture [30].

Figure 2.2: A Generic FPGA Architecture (legend: L = Logic Block, C = Connection Block, S = Switching Block, I/O = Input/Output Pad)

FPGAs are made up of several major components. The two most important components are logic elements and routing resources. A logic element in the FPGA is responsible for emulating the behavior of a logical function. In other words, a logic element could imitate the function of any logic gate. A typical logic element, shown in Figure 2.3, contains three main elements: a lookup table, a flip-flop and a 2-to-1 multiplexer. It is the task of the lookup table to operate as a logic gate.

Figure 2.3: Internal Structure of a Logic Element (4-input LUT, clocked flip-flop and output multiplexer controlled by a configuration bit)

Figure 2.4: Internal Structure of a Lookup Table (memory array with elements 0 to 15 feeding a multiplexer selected by the inputs)

A typical lookup table is shown in Figure 2.4. The lookup table shown in the figure has four inputs. It contains a memory array and each array element is connected to an input of a 16-to-1 multiplexer. The selection bits for this multiplexer are the inputs of the lookup table, i.e. the presumed inputs of the logic gate. To program the table the compiler sets the bits of the memory array. Based on the selection bits (inputs) of the multiplexer one of those array elements is chosen. The lookup table shown in the figure is an example of a four-input AND gate; only when all the inputs are 1’s is the last element of the array chosen and the output is 1.
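
The lookup-table mechanism described above can be illustrated with a short Python sketch. This is not part of the original design; the names `make_lut` and `lut_read` are hypothetical, and treating the first input as the most significant select bit is an assumed ordering.

```python
from itertools import product

def make_lut(func, num_inputs=4):
    """Build the memory array: one stored bit per input combination."""
    return [int(func(*bits)) for bits in product((0, 1), repeat=num_inputs)]

def lut_read(memory, inputs):
    """Act as the multiplexer: the inputs select one element of the array."""
    index = 0
    for bit in inputs:              # first input is the most significant bit (assumed)
        index = (index << 1) | bit
    return memory[index]

# "Programming" the table as a four-input AND gate sets only the last element to 1.
and4 = make_lut(lambda a, b, c, d: a and b and c and d)
print(and4)
print(lut_read(and4, (1, 1, 1, 1)))
print(lut_read(and4, (1, 0, 1, 1)))
```

Changing the function passed to `make_lut` changes only the stored bits, which mirrors how the compiler programs the table without altering the multiplexer.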
The lookup table only handles the combinational part of the logic element. To accommodate sequential logic, the logic element contains a flip-flop whose input is the output of the lookup table. The output of this flip-flop is fed into a 2-to-1 multiplexer whose selection bit is configured by the compiler. This selection bit decides the output of the logic element.
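
As a rough behavioral sketch of such a logic element, the Python class below models the lookup table, the flip-flop and the output multiplexer. The class and method names are hypothetical, and it assumes that the second multiplexer input is the unregistered lookup table output, which the text above does not state explicitly.

```python
class LogicElement:
    """Hypothetical model: 4-input LUT, D flip-flop and 2-to-1 output multiplexer."""

    def __init__(self, lut_bits, registered):
        if len(lut_bits) != 16:
            raise ValueError("a 4-input LUT needs 16 stored bits")
        self.lut_bits = list(lut_bits)  # memory array set by the compiler
        self.registered = registered    # multiplexer selection bit set by the compiler
        self.ff = 0                     # flip-flop state, assumed to reset to 0

    def comb(self, inputs):
        """Combinational part: the inputs select one stored bit."""
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.lut_bits[index]

    def clock(self, inputs):
        """Clock edge: the flip-flop captures the LUT output."""
        self.ff = self.comb(inputs)

    def output(self, inputs):
        """Multiplexer: flip-flop value if registered, LUT output otherwise (assumed)."""
        return self.ff if self.registered else self.comb(inputs)

# Four-input XOR (parity) used in combinational and registered mode.
xor_bits = [bin(i).count("1") % 2 for i in range(16)]
comb_le = LogicElement(xor_bits, registered=False)
seq_le = LogicElement(xor_bits, registered=True)
print(comb_le.output((1, 0, 0, 0)))   # follows the inputs immediately
print(seq_le.output((1, 0, 0, 0)))    # still the reset value
seq_le.clock((1, 0, 0, 0))
print(seq_le.output((0, 0, 0, 0)))    # value captured at the clock edge
```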
Besides the logic elements, FPGAs contain routing resources to connect these elements together. The routing resources are basically made up of connection blocks, switching blocks and a set of wires that run vertically and horizontally across the FPGA. Connection blocks situated between the logic elements can be programmed to connect the outputs of these logic elements to any vertical or horizontal wire. Switching blocks situated between the connection blocks can in turn be programmed to connect the wires together [13].

By programming the logic elements, connection blocks and switching blocks a user can implement any logic design on the FPGA. Nevertheless, mapping a logic design onto an FPGA is not an easy task and is the major challenge in FPGA research.

In addition to the logic elements and routing resources, FPGAs contain two other important components: embedded memory blocks and input/output pads. Typically the FPGA has several memory blocks of different sizes to store data; these are used to implement memory arrays or registers in a logic design. I/O pads, on the other hand, are used to connect the FPGA to the outside world. Both memory blocks and I/O pads are connected to other elements of the FPGA via the routing resources mentioned above.

Currently there are two major FPGA vendors: Altera Corporation and Xilinx Incorporated [4, 34]. The latest FPGAs produced by these companies contain hundreds of thousands of logic elements capable of emulating the equivalent of millions of ASIC logic gates [6, 35]. The role of FPGAs in the industry is growing and significant research is being carried out to enhance their performance and capabilities.

2.2.1.2 Architecture and CAD for FBEs

In an FPGA-based emulation system several FPGAs are connected together to be able to emulate a design of significant size. Several architectures have been proposed to create the Multi-FPGA System (MFS) [21, 32, 7]. A typical architecture is shown in Figure 2.5. Here eight FPGAs are connected to each other by means of a programmable interconnection network. Such a system is highly flexible since all inter-FPGA connections are programmable.

Figure 2.5: Multi-FPGA System (eight FPGAs connected through an interconnection network)

The CAD flow for a multi-FPGA system, shown in Figure 2.6, is as follows. The user supplies the compiler with a logic design. After performing logic synthesis and technology mapping, the compiler partitions the design into several parts such that each part can fit on one FPGA. Each part of the design is then assigned to a specific FPGA inside the MFS and the compiler routes the signals or connections between all the FPGAs; this is known as inter-FPGA routing. After inter-FPGA routing is done, the compiler places and routes each part of the design in its specific FPGA; this is known as intra-FPGA placement and routing. The final step is to generate a bit stream of the design and download it to the FPGAs [13].
Figure 2.6: CAD Flow for FBEs (Logic Synthesis and Technology Mapping; Partitioning; Board-Level Placement; Inter-FPGA Routing; Intra-FPGA Placement and Routing; Generate Bit Stream)

FBEs are an efficient verification tool; they can emulate any design and can accommodate any logic capacity by simply increasing the number of FPGAs. FBEs also have a relatively high emulation speed because they exploit parallelism in hardware. Another major advantage is that they have a relatively low price, starting from only several thousand dollars.

Nonetheless, FBEs still face a major problem: CAD tools. Mapping a logic design to a multi-FPGA system by programming the FPGAs and the inter-FPGA routing resources is very problematic. Partitioning a design, placing it on FPGAs and then routing the signals has been one of the main focuses of FPGA research. Although many algorithms have been proposed and architectures suggested, CAD tools for FPGAs are still very complex. Compiling a design for a multi-FPGA system has an unpredictable compile time and may never even succeed. In addition, FBEs have very limited visibility and debugging support, which makes it very difficult for the designer to catch errors. Also, if an error is discovered and fixed, the change might trigger a chain reaction in the whole system and the design would need to be compiled and downloaded again.

2.2.2 Processor-Based Emulation Systems

The second major type of logic emulation system is the processor-based emulator. The basic building block of a PBE is what is known as an emulation processor, which can emulate a large number of logic gates and memory functions. Several of these emulation processors are connected together and run in parallel to emulate the functional behavior of a logic design [15]. Before we discuss the architecture of this system in detail, it is useful to briefly describe the emulation processors and their operation.

2.2.2.1 Emulation Processors

Similar to a logic element in an FPGA, an emulation processor can perform the logical operation for any given function. Although it is made from custom hardware it is still programmable; this is achieved because embedded inside the processor is a reconfigurable lookup table. The structure of this lookup table is exactly the same as that of the one inside the FPGA. The main difference between this processor and the logic element of the FPGA is that a logic element of the FPGA is programmed only once, before emulation starts, and therefore can only implement one logic function during the whole emulation cycle. The emulation processor, in contrast, can reprogram its lookup table during emulation to emulate different logical functions. The array elements for the lookup table are stored inside the processor and then loaded into its lookup table during emulation to change the logic function at any time. The ability to change its operation type during emulation is what gives the processor its advantage over the logic element of the FPGA. More on the processor’s architecture and operation will be described in later chapters.
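
The difference from a fixed logic element can be sketched in a few lines of Python. This is an illustrative model only: the class name `EmulationProcessor`, its `step` method and the use of two-input tables (chosen for brevity) are assumptions, not the architecture described in later chapters.

```python
class EmulationProcessor:
    """Hypothetical model of a processor that reloads its lookup table each step."""

    def __init__(self, stored_tables):
        self.stored_tables = stored_tables  # lookup-table contents kept inside the processor
        self.lut = None                     # the table currently loaded

    def step(self, step_index, inputs):
        """Load the table for this step, then evaluate it on the given inputs."""
        self.lut = self.stored_tables[step_index]
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.lut[index]

AND2 = [0, 0, 0, 1]
OR2 = [0, 1, 1, 1]
proc = EmulationProcessor([AND2, OR2])
print(proc.step(0, (1, 0)))   # behaves as an AND gate in step 0
print(proc.step(1, (1, 0)))   # the same hardware behaves as an OR gate in step 1
```

One processor therefore stands in for as many gates as it has stored tables, at the cost of evaluating them one after another.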
It is worth noting that inside a PBE there might be several kinds of emulation processors. A PBE could have all homogeneous processors, in which case each processor would have to be able to perform any function of the logic design. Alternatively, a PBE could have heterogeneous processors, in which case specific processors would perform specific tasks (e.g. several processors would perform logic operations while others would perform memory functions) [15, 10].

2.2.2.2 Architecture and CAD for PBEs

A typical architecture of a processor-based emulator is shown in Figure 2.7. The emulation processors are connected together via a programmable interconnection network to ensure that a signal can travel from one processor to another. It should be noted that unlike FBEs, where the interconnections are fixed during emulation, this interconnection can be reprogrammed during emulation. The reader should keep in mind that reprogramming the processors during emulation is quite different from programming them prior to emulation. Programming prior to emulation is done by the emulation support facilities, while reprogramming during emulation is done by the emulator itself with no connection to the facilities.

Figure 2.7: Processor-Based Emulation System (processors connected through an interconnection network)
The CAD flow for PBEs, shown in Figure 2.8, is described as follows. The user supplies the compiler with the logic design. The compiler performs logic synthesis and technology mapping, then partitions the design into several parts such that each part is able to fit in one emulation processor. After each of those parts is assigned to a specific processor, a process called scheduling starts. During scheduling, the different logic functions which have been assigned to each processor are allotted different time slots throughout the emulation period. For example, an emulation processor would perform a logical AND during a specific time slot and a logical OR during another time slot. After scheduling is done, a bit stream is generated and downloaded onto the processors before emulation starts [15].
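
To give a feel for the scheduling step, the sketch below assigns time slots with a deliberately simple greedy rule. It is not the scheduler of any actual emulation compiler: the function `schedule`, the gate names and the rule itself (a gate runs one slot after its latest predecessor, and never in a slot its processor is already using) are assumptions made for illustration.

```python
def schedule(gates, assignment):
    """Greedy slot assignment.

    gates:      dict mapping gate name -> list of predecessor gate names,
                listed in topological order (assumed)
    assignment: dict mapping gate name -> processor id
    Returns a dict mapping gate name -> time slot.
    """
    slot_of = {}
    next_free = {}  # processor id -> first slot not yet used on that processor
    for gate, preds in gates.items():
        ready = max((slot_of[p] + 1 for p in preds), default=0)
        proc = assignment[gate]
        slot = max(ready, next_free.get(proc, 0))
        slot_of[gate] = slot
        next_free[proc] = slot + 1
    return slot_of

# Hypothetical four-gate design spread over two processors.
gates = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g3"]}
assignment = {"g1": 0, "g2": 0, "g3": 1, "g4": 0}
slots = schedule(gates, assignment)
print(slots)
```

Here `g1` and `g2` share processor 0 and so occupy consecutive slots, while `g3` must wait for both of them even though processor 1 is idle.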
Figure 2.8: CAD Flow for PBEs (Logic Synthesis and Technology Mapping; Partitioning, Assignment and Scheduling; Generate Bit Stream)

PBEs have several advantages. They have very efficient and fast CAD tools compared to FBEs. In addition, they have much better visibility and debugging support. Finding an error and fixing it in a PBE is a standard procedure and usually does not trigger a chain reaction in the whole emulator. When an error is found, the designer only has to fix the affected processor and not the whole design, unlike in FBEs. CAD tools for PBEs are much less complicated than those for FBEs and have a predictable compile time.

PBEs also have some disadvantages. They are comparatively slower than FBEs. Because processors in PBEs have to reprogram themselves periodically, emulation speed is reduced. Nonetheless, current PBEs are becoming faster and faster and are able to compete with their FBE counterparts. The major disadvantage of PBEs is their price, which is due to the fact that the whole system is built on custom hardware. Their prices are currently in the order of millions of dollars.

2.2.3 Commercially Available Logic Emulators

To give the reader an idea of the emulation technology on the market today, we present two examples of logic emulators manufactured by leading Electronic Design Automation (EDA) companies, Cadence Design Systems and Mentor Graphics [11, 25].

The Incisive Palladium II is a PBE system supplied by Cadence Design Systems [18]. This machine is capable of simulation acceleration and in-circuit emulation and can reach speeds of up to 1.5 MHz. This emulator can compile up to 30 million gates per hour on a single workstation and has a maximum capacity of 256 million gates.

The VStationPRO is an example of an FPGA-based emulation system [33]. This product is manufactured by Mentor Graphics. It has a scalable capacity from 1.6 to 120 million gates and can reach speeds of up to 1 MHz. This emulator can compile at a rate of 5 million gates per hour.

Chapter 3

System Architecture and Operation

This chapter presents the architecture and operation of the proposed logic emulation system. The first and second sections include an introduction and a general view of the emulator. Sections 3 and 4 discuss the two basic components, logic and memory emulation processors, in detail. Sections 5, 6 and 7 discuss the emulation module, emulation chip and emulation engine respectively.

3.1 Introduction and Motivation

Chapter 2 introduced the two major types of logic emulation systems, FPGA-based emulators and processor-based emulators, along with the advantages and disadvantages of each one of them. Keeping that in mind, the motivation behind this work is to design an emulator that combines the most important advantage of each system: the low cost of FBEs and the high efficiency of PBEs. To achieve that we have to design a PBE that can be implemented on an FPGA.

It is important to note that this research only deals with the hardware part of this proposed system. The main goal is to design an efficient architecture for a PBE. The CAD tools necessary to operate this emulator are not the focus of this research and are beyond the scope of this work.

Before delving into the details of the system architecture, it is important to highlight one important aspect of a processor-based emulator. The main clock in an emulator, known as the design clock, is shown at the top of Figure 3.1. It is the frequency of this clock that determines the speed of a PBE. During each clock period of this design clock a number, known as the emulation step, increments from zero to a specific number (127 in Figure 3.1). Shown at the bottom of the figure is the emulation clock, whose clock period corresponds to a single emulation step.

During each emulation step, an emulation processor can perform a different operation type, which in effect means that a single processor could perform a maximum of 128 different operations, given that the number of emulation steps in a single design cycle is 128.
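
The relationship between the two clocks can be worked through numerically. The sketch below is illustrative only: the function names and the 128 MHz emulation clock are assumptions chosen for the example, and it assumes that one design clock period is exactly the number of emulation steps times the emulation clock period, with no overhead between design cycles.

```python
def design_clock_hz(emulation_clock_hz, steps_per_cycle):
    """Design clock frequency when every design cycle takes a fixed number of steps."""
    return emulation_clock_hz / steps_per_cycle

def step_sequence(steps_per_cycle, design_cycles):
    """Yield (design cycle, emulation step) pairs in the order they occur."""
    for cycle in range(design_cycles):
        for step in range(steps_per_cycle):
            yield cycle, step

# Assumed numbers: a 128 MHz emulation clock and 128 steps per design cycle.
print(design_clock_hz(128e6, 128))
pairs = list(step_sequence(128, 2))
print(pairs[127], pairs[128])   # the step counter wraps from 127 back to 0
```

Under these assumptions, halving the number of steps per design cycle doubles the design clock frequency, but also halves the number of operations each processor can perform per design cycle.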
We now discuss the details of the architecture and operation of the logic emulation system, starting with the basic components. We should note that the architecture proposed for this design is based on the architectures of [15] and [10] but differs from them substantially.

Figure 3.1: Emulation Design Cycle (waveforms: Design Clock, Emulation Steps, Emulation Clock)

3.2 Levels of Hierarchy

To enhance scalability, the design contains three levels of hierarchy connected together using different topologies. These levels are:
1. Emulation module
2. Emulation chip
3. Emulation engine
The building blocks of this emulation system are the logic emulation processor and the memory emulation processor. A specific number of each type of these two processors are connected together by an interconnection network to form an emulation module, making it the first level of hierarchy.

The second level of hierarchy is the emulation chip, which contains a certain number of identical emulation modules. All the modules inside one emulation chip are connected by an interconnection network similar to the one inside the emulation module itself. Each emulation chip would fit on one FPGA, hence the name chip.

The third level of hierarchy is the emulation engine. To increase logic capacity, several emulation chips would be implemented in a specially designed multi-FPGA system. This multi-FPGA system is known as the emulation engine, which is capable of emulating a design of significant size.

Figure 3.2 gives an overview of the hierarchy of the system. Here, the emulation engine is made up of 8 emulation chips and each of those chips contains 8 emulation modules. Inside each of those modules is a number of logic and memory processors. Note that the interconnections between the chips and between the modules are not shown.
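
The way the hierarchy multiplies capacity can be tallied with a small Python sketch. The 8 chips, 8 modules per chip and 128 emulation steps come from the text; the numbers of logic and memory processors per module are hypothetical placeholders, as are the class and method names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineConfig:
    chips: int = 8                     # emulation chips per engine (from the text)
    modules_per_chip: int = 8          # emulation modules per chip (from the text)
    logic_procs_per_module: int = 4    # hypothetical value
    memory_procs_per_module: int = 1   # hypothetical value
    steps_per_cycle: int = 128         # emulation steps per design cycle (from the text)

    def total_modules(self):
        return self.chips * self.modules_per_chip

    def total_logic_processors(self):
        return self.total_modules() * self.logic_procs_per_module

    def max_logic_operations_per_design_cycle(self):
        """Upper bound: every logic processor performs one operation in every step."""
        return self.total_logic_processors() * self.steps_per_cycle

cfg = EngineConfig()
print(cfg.total_modules())
print(cfg.total_logic_processors())
print(cfg.max_logic_operations_per_design_cycle())
```

The last figure is only an upper bound; a real schedule would leave some steps idle because of data dependencies.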

Figure 3.2: System Hierarchy (emulation engine containing emulation chips, each containing emulation modules)

3.3 Logic Emulation Processor

The most basic component of the system is the Logic Emulation Processor (LP). The sole purpose of this processor is to emulate the functional behavior of logic gates. Each gate is represented as a lookup table that can be programmed to imitate any desired logic function. The total number of logic gates that a single processor can emulate depends on its lookup table size and the number of emulation steps executed in a single design cycle. The proposed logic processor has three main elements:
1. Control store
2. Data stacks
3. Logic element

An architectural overview of this processor is shown in Figure 3.3.

Figure 3.3: Logic Emulation Processor (blocks: Step, Control Store, External Stack, Internal Stack, Logic Element; signals: External Input, Logic Output)

3.3.1 Control Store

The control store is used to store a unique control program for each processor to
determine the operation type during each emulation step. The control store contains
several instructions of predetermined width that are generated by an emulation
compiler whose task is to partition a logic design given by the user into several clusters.


Figure 3.4: Logic Processor Control Word Fields (ChooseInput, LUT, SelA, RAA, SelB, RAB, SelM, RAM)
Table 3.1: Logic Processor Control Word Fields Description

Control Word Field    Description
------------------    -----------
ChooseInput           Picks an external input from the interconnection network.
LUT                   The array elements of the lookup table.
SelA                  Selects the source of the first input to the lookup table
                      (internal or external stack).
RAA                   Read Address A: the address for the first input of
                      the lookup table.
SelM                  Selects the source of the Mth input to the lookup table
                      (internal or external stack).
RAM                   Read Address M: the address for the Mth input of
                      the lookup table.

These clusters are formed such that each one can fit into a single emulation processor.
The emulation compiler then converts these clusters into a set of control words. The
control store is filled up with these words prior to emulation. During emulation, these
control words are read to instruct the processor on what to do during a specific step
[15].
The number of these instructions (i.e. the depth of the control store) is equal to
the maximum number of emulation steps needed in a single design clock cycle. The
fields of these instructions are shown in Figure 3.4 and described in Table 3.1, where
M is the size of the lookup table.
The number of bits dedicated to each field of the control word depends on two
factors: the size of the internal and external stacks and the lookup table size. The
size of the internal and external stacks is equal to the maximum number of emulation
steps in a single design cycle, as discussed later.
Since the inputs of the logic element are located in the stacks, the size of the
address of each of these inputs is $\log_2(N)$ bits, where N is the number of
emulation steps. The size of the LUT field depends on the size of the lookup table
inside the logic element: it is equal to $2^M$ bits, where M is the lookup table
size. The size of the ChooseInput field depends on the interconnection network, more
precisely on the number of processors (both logic and memory) and the number of
their outputs in the interconnection network. The size of the ChooseInput field is
$\log_2(P)$ bits, where P is the total number of outputs of all the processors sharing
one interconnection network. The only field that is independent of any external
factors is the Sel field. The size of each Sel field is only one bit, since it is used
to choose between only two types of stacks, either the external or the internal stack.
Therefore, the size of a single control word in bits is
$\log_2(P) + M + 2^M + M \times \log_2(N)$, where the four terms correspond to the
ChooseInput field, the M one-bit Sel fields, the LUT field and the M read addresses.
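As a rough illustration of the formula above, the following sketch computes the field widths for given values of M, N and P. The function name and the example values are hypothetical, and rounding up to whole bits when N or P is not a power of two is an assumption that the text does not state.

```python
def logic_control_word_bits(M, N, P):
    """Field widths (in bits) of one logic processor control word.

    M: number of lookup table inputs
    N: emulation steps per design clock cycle (stack depth)
    P: total processor outputs on the shared interconnection network
    Assumption: log2 values are rounded up to whole bits.
    """
    def ceil_log2(n):
        # Smallest number of bits able to address n distinct entries.
        return (n - 1).bit_length()

    fields = {
        "ChooseInput": ceil_log2(P),   # selects one of P network outputs
        "Sel": M,                      # one select bit per LUT input
        "LUT": 2 ** M,                 # truth table contents
        "RA": M * ceil_log2(N),        # one read address per LUT input
    }
    fields["total"] = sum(fields.values())
    return fields


# Hypothetical configuration: 4-input LUT, 64 steps, 16 outputs.
print(logic_control_word_bits(M=4, N=64, P=16))
```

For this hypothetical configuration the fields are 4, 4, 16 and 24 bits wide, giving a 48-bit control word; the LUT term grows exponentially with M while the address term grows only linearly.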

3.3.2 Data Stacks

The data stacks are used to store one-bit values provided as inputs to the logic element.
The proposed design has two stacks: an internal stack and an external stack. Ideally
the two stacks are of the same width (one bit) and the same depth. The depth of the
stacks is typically equal to the maximum number of emulation steps executed in a
single design clock cycle.
The internal stack is used to store values generated internally to the processor,
specifically values from previous operations done during different emulation steps.
The external stack is used to store values generated externally to the processor,
specifically values from other logic or memory emulation processors. Both stacks
have one write port and M read ports which are provided as inputs to the lookup
table of the logic element.
At each emulation step, the internal stack provides output values on its read
ports using the addresses (RAA ... RAM) supplied to it from the control word. These
outputs are used as inputs to the logic element. Also during the same emulation step,
the internal stack stores the value of the current operation (i.e. the output of the
logic element) in the address derived from the step value.
Equivalently, during each emulation step, the external stack provides output values
on its read ports using the addresses (RAA ... RAM) supplied to it from the control word.
These outputs are used as inputs to the logic element. Also during the same emulation
step, the external stack stores an input value external to the processor in the address
derived from the step value.
Note that both internal and external stacks supply the logic element with the
same number of inputs (M) at the same time. It is the task of the logic element to
choose between these inputs using the select values (SelA ... SelM) in the control word.

3.3.3 Logic Element

The logic element is used to calculate the logic output for the processor. The logic
element contains several multiplexers and a lookup table. The number of these
multiplexers is equal to the number of inputs to the lookup table, or lookup table size.
The Sel fields in the control word are used as selectors in these multiplexers to choose
the sources of inputs for the lookup table (either internal or external stack). The logic
element also contains an M-input lookup table implemented as a $2^M \times 1$ memory
array and a $2^M$-to-1 multiplexer. The elements of the array are filled by the LUT field
in the control word, which defines the logical function to be emulated
during a specific emulation step. Figure 3.5 gives an overview of the logic element.
Inputs shown in black are received from the internal stack, while inputs shown in grey
are received from the external stack. Select values and LUT are supplied from the
control word. The lookup table of the logic element is identical to the one described
in chapter 2 and shown in Figure 2.4.

Figure 3.5: Logic Element (inputs A to M from the internal and external stacks, selected by SelA to SelM, feeding an M-input lookup table loaded from the LUT field, with a single output)
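The behavior of the logic element can be sketched in a few lines of software. This is a minimal model rather than the hardware description used in the thesis: the function name is hypothetical, a Sel value of 1 is assumed to mean "external stack", and input A is assumed to be the least significant bit of the lookup table index.

```python
def logic_element(lut_bits, sel, internal_in, external_in):
    """Evaluate one M-input logic element.

    lut_bits:    list of 2**M bits taken from the LUT field
    sel:         list of M select bits (ASSUMED: 0 = internal, 1 = external)
    internal_in: M bits read from the internal stack
    external_in: M bits read from the external stack
    """
    M = len(sel)
    if len(lut_bits) != 2 ** M:
        raise ValueError("LUT field must hold 2**M bits")

    # One 2-to-1 multiplexer per lookup table input.
    chosen = [external_in[i] if sel[i] else internal_in[i] for i in range(M)]

    # The chosen inputs form the address into the 2**M x 1 memory array
    # (ASSUMED ordering: input A is the least significant address bit).
    index = 0
    for i, bit in enumerate(chosen):
        index |= bit << i
    return lut_bits[index]


# A 2-input AND gate: input A from the internal stack, input B from the external stack.
and_lut = [0, 0, 0, 1]
print(logic_element(and_lut, sel=[0, 1], internal_in=[1, 0], external_in=[0, 1]))
```

Changing only `lut_bits` turns the same element into any other M-input function, which is what lets one processor emulate a different gate at every emulation step.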

3.3.4 Architecture and Operation

The basic architecture of the logic processor is shown in Figure 3.3. The control store
is filled up with the control words prior to emulation through dedicated wires (not
shown in the figure). The processor has two external inputs and two external outputs.
The first external input is the step value, which is identical for all processors in the
emulation engine, and the second external input is used as an input for the external
stack, where it is stored for subsequent operations. The first external output is
the ChooseInput field of the control word, which is supplied to the interconnection
network to choose an input for the external stack, as described later in
more detail. The second external output is the output of the logic operation, which
is supplied directly to the interconnection network to be used by other processors.
Preferably the depths of the control store, internal stack and external stack are
the same and equal to the maximum number of emulation steps in a single design
clock cycle. This ensures an entry for every operation output in the internal
stack, making this output available to any subsequent operation. As for the external
stack, its usage enables the processor to make use of other logic or memory outputs
supplied to it through the interconnection network. Another advantage of having the
same number of entries in all three memories is that the step value is used both
as a read address for the control store and as a write address for the stacks at the same
time.
The operation of the logic processor is as follows. A step value is supplied to the
processor. The step value is used as an address to the control store where a control
word is read. Address fields, RAA ... RAM, are sent to the stacks and M bits are read
from each stack at the same time then sent to the logic element. Using the Sel fields
of the control word the logic element selects the sources of its inputs, either internal
stack or external stack. The lookup table, which is filled up using the LUT field from
the control word, performs the logic emulation and supplies the output. The output
is then written to the internal stack where the step value is used as a write address.
Also at the same time, an external input is written in the external stack. This input
is chosen among the outputs of the other processors in the interconnection network
using the ChooseInput fields of the control words. Again the step value here is used
as a write address.
Figure 3.6: Operation of the Logic Processor (timing of the design clock, emulation steps and emulation clock: read word from the control store; read bits from the external and internal stacks; write external input to the external stack and internal input to the internal stack)

In other words, the logic processor performs three operations in one emulation
step. It first executes a logical function using its lookup table then writes the output
of this function in the internal stack. In addition to that the processor picks an
external input from the interconnection network and writes it to the external stack.
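The three operations of one emulation step can be summarised with a small behavioral model. This is only a sketch under stated assumptions: the class, the dictionary keys used for the control word fields, and the convention that a Sel value of 1 selects the external stack are all hypothetical, and timing within the step is ignored.

```python
class LogicProcessor:
    """Behavioral model of one logic emulation processor (illustrative only)."""

    def __init__(self, control_store):
        # One control word per emulation step; the stacks have the same depth.
        self.control_store = control_store
        depth = len(control_store)
        self.internal = [0] * depth
        self.external = [0] * depth

    def step(self, n, network_outputs):
        """Execute emulation step n and return the logic output."""
        word = self.control_store[n]          # read control word at address n

        # Read one bit per LUT input from the stack chosen by its Sel bit.
        inputs = []
        for sel, addr in zip(word["sel"], word["ra"]):
            stack = self.external if sel else self.internal
            inputs.append(stack[addr])

        # Lookup table evaluation (input A assumed least significant).
        index = sum(bit << i for i, bit in enumerate(inputs))
        output = word["lut"][index]

        # Both stacks are written at the address given by the step value.
        self.internal[n] = output
        self.external[n] = network_outputs[word["choose_input"]]
        return output


program = [
    # Step 0: emit constant 1 and capture network output 1 in the external stack.
    {"choose_input": 1, "lut": [1, 1, 1, 1], "sel": [0, 0], "ra": [0, 0]},
    # Step 1: AND of internal[0] and external[0].
    {"choose_input": 0, "lut": [0, 0, 0, 1], "sel": [0, 1], "ra": [0, 0]},
]
lp = LogicProcessor(program)
lp.step(0, network_outputs=[0, 1])
print(lp.step(1, network_outputs=[0, 0]))
```

Because the step value addresses the control store and both stacks, a result produced at step n is available to any later step simply by using n as a read address.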
The operation described above requires three memory accesses which have to occur
in a single emulation step but not simultaneously. These accesses are:
1. Reading a control word from the control store.
2. Reading inputs from the data stacks.
3. Writing values to the data stacks.
These accesses cannot be executed at the same time because a certain
delay is needed between each of them. To read inputs from the data stacks
one has to wait for the address fields from the control word, and to write values to
the data stacks one has to wait for the output of the logical operation to be ready.
To accommodate that, both edges of the clock are used, since each emulation step
corresponds to only one clock period. As shown in Figure 3.6, at the first falling edge
of the clock a control word is read using step value n. At the rising edge of the clock,
after sufficient time has been given to fetch the control word, inputs to the lookup table
are read from the stacks using addresses derived from the control word. At the second
falling edge of the clock, after sufficient time has been given for the inputs to be read
from the stacks and for the logic function to be executed, the output of this logic function
and an external input are written to the stacks at address n. Also at the second falling edge
another control word is read from the control store using address n + 1. This
scheme ensures that all the memory accesses required for a logical operation are done
in a single emulation step at different times.
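The edge-by-edge schedule described above can be listed explicitly. The sketch below is a hypothetical helper, not part of the original design; it simplifies the last falling edge by omitting the control word read for the next design cycle.

```python
def edge_schedule(num_steps):
    """List the memory accesses performed at each clock edge.

    Returns a list of (edge, [accesses]) tuples for steps 0..num_steps-1.
    Simplification: the final falling edge only shows the last write.
    """
    if num_steps < 1:
        raise ValueError("at least one emulation step is required")

    schedule = []
    for n in range(num_steps):
        falling = []
        if n > 0:
            # The write of step n-1 shares an edge with the fetch of step n.
            falling.append(f"write stacks at address {n - 1}")
        falling.append(f"read control word {n}")
        schedule.append(("falling", falling))
        schedule.append(("rising", [f"read stack inputs for step {n}"]))
    schedule.append(("falling", [f"write stacks at address {num_steps - 1}"]))
    return schedule


for edge, accesses in edge_schedule(2):
    print(edge, accesses)
```

The listing makes the overlap visible: every falling edge after the first both completes step n and starts step n + 1, so each emulation step costs exactly one clock period.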

3.4 Memory Emulation Processor

In this section we present the memory processor, the second basic component of
a complete processor-based emulation system implemented on an FPGA. The sole
purpose of the memory processor is to emulate memory registers and their functions.
The total number of memory bits that this processor can emulate depends on the
size of the embedded memory arrays and the number of emulation steps that are
completed during a design cycle. The proposed memory processor has four main
elements:
1. Control store
2. Memory store
3. Capture memory word unit
4. Release memory word unit
Figure 3.7 gives an architectural overview of the memory processor.

Figure 3.7: Memory Emulation Processor (Step; Inputs; Control Store; Choose External Inputs; Capture Memory Word Unit; Memory Store; Release Memory Word Unit; Outputs)

3.4.1 Control Store

The control store is used to store a unique control program for each processor to
instruct the processor what to do during each emulation step. The control store
contains several instructions of predetermined width. Similar to the instructions of
the logic processor, they are generated by an emulation compiler whose task is to
partition a logic design given by the user into several clusters. These clusters are
formed such that each one can fit in a single emulation processor. The emulation
compiler then converts these clusters into a set of control words. The control store
is filled up with these words prior to emulation. During emulation these control
words are read by the processor to choose an operation to be performed in a specific
emulation cycle [15].
The number of these instructions (i.e. the depth of the control store) is equal to
the maximum number of emulation steps done in a single design clock cycle. The
fields of these instructions are shown in Figure 3.8 and described in Table 3.2, where
Q is the size of the memory word.

Figure 3.8: Memory Processor Control Word Fields (MWA, W/R, CI1, CI2, CIQ)

Table 3.2: Memory Processor Control Word Fields Description

Control Word Field    Description
------------------    -----------
MWA                   Memory word address in the memory store.
W/R                   Write (1) or read (0) a memory word.
CI1                   Choose the 1st bit of the memory word from
                      the external inputs.
CIQ                   Choose the Qth bit of the memory word from
                      the external inputs.

The size of the control word and the number of bits dedicated to each field depend
on three factors: the word size of the memory store Q, the number of emulation steps
in a single design clock cycle N and the total number of outputs of all emulation
processors in an interconnection network P.


The word width Q of the memory store is the same as the number of one-bit inputs to
the memory processor. This is because, ideally, we want an entry in the memory store
for each input of the processor. The size of the memory word address MWA depends
on the size of the memory store. We propose a memory store of size equal to the
maximum number of emulation steps done in a single design clock cycle. Therefore
the size of MWA in bits is $\log_2(N)$. The size of a choose input field CI depends on
the total number of outputs of all processors sharing an interconnection network (P).
As a result the size of this field in bits is $\log_2(P)$.
Therefore, the size of a single control word in bits is $\log_2(N) + 1 + Q \times \log_2(P)$.
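A short sketch of this formula is given below. The function name and the example values are hypothetical, and rounding up to whole bits for values of N or P that are not powers of two is an assumption.

```python
def memory_control_word_bits(Q, N, P):
    """Field widths (in bits) of one memory processor control word.

    Q: memory word width (number of one-bit inputs of the processor)
    N: emulation steps per design clock cycle (memory store depth)
    P: total processor outputs on the shared interconnection network
    Assumption: log2 values are rounded up to whole bits.
    """
    def ceil_log2(n):
        return (n - 1).bit_length()

    fields = {
        "MWA": ceil_log2(N),       # address of a word in the memory store
        "W/R": 1,                  # write (1) or read (0)
        "CI": Q * ceil_log2(P),    # Q choose-input fields, CI1 .. CIQ
    }
    fields["total"] = sum(fields.values())
    return fields


# Hypothetical configuration: 8-bit words, 64 steps, 16 outputs.
print(memory_control_word_bits(Q=8, N=64, P=16))
```

With these hypothetical values the word is 6 + 1 + 32 = 39 bits wide, so the choose-input fields dominate the width as Q grows.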

3.4.2 Memory Store

The role of the memory store is to emulate real memory functions; more precisely read
and write memory operations. It contains several words of predetermined width and
number. The memory words can be from either of two sources: emulation support
facilities or other emulation processors. Emulation support facilities, such as an
emulation compiler, fill up the memory store prior to emulation so that the filled
memory words can be read during emulation. Also during emulation the output of
other processors in the interconnection network might be written to the memory store.

3.4.3 Release Memory Word Unit

The purpose of this component is to break up the memory word read from the memory
store into one-bit values. These bits are then supplied as outputs to the
interconnection network to be used as inputs to other processors.

3.4.4 Capture Memory Word Unit

The purpose of this component is to concatenate several external inputs into a single
memory word. The word that is formed after concatenation is entered into
the memory store and is therefore of the same width (Q bits) as the words of the memory store.

3.4.5 Architecture and Operation

The basic architecture of the memory processor is shown in Figure 3.7. The control
store and the memory store are filled up with the control and memory words prior to
emulation through dedicated wires (not shown in the figure). The processor has several
external inputs and outputs. The first external input is the step value, which is
identical for all processors in the emulation engine. The rest of the external inputs
are values chosen from the interconnection network to be written to the memory
store. The external outputs of this processor include the CI fields of the control word,
which are supplied to the interconnection network to choose inputs for the memory
store. The rest of the external outputs are the bits of the word read from the memory
store and broken up by the release memory word unit.
Preferably the depths of the control store and the memory store are the same and
equal to the maximum number of steps (N) in a single design clock cycle. This
ensures an entry for every memory word read or written in the memory store.
The operation of the memory processor is as follows. A step value is supplied
to the processor. The step value is used as an address to the control store where a
control word is read. The fields MWA and W/R are sent to the memory store. Using
MWA a memory word is read or written and using the W/R field we determine if we
are reading ('0') or writing ('1') during this step. In case of a read, a memory word
is read and supplied to the release memory word unit where it is broken up into Q
bits and output to the interconnection network. In case of a write, a memory word is
formed by concatenating Q input bits from the interconnection network. This is the
task of the capture memory word unit which then supplies it to the memory store to
be written.

Figure 3.9: Operation of the Memory Processor (timing of the design clock, emulation steps and emulation clock: read word from the control store; read memory word from the memory store in case of a read operation; write memory word to the memory store in case of a write operation)

In other words, the memory processor performs either one of two operations during
a single emulation step: it can read a memory word from a certain address in the
memory store and then supply it to the interconnection network as a set of single bits
or it can write a memory word supplied by the interconnection network at a certain
address in the memory store.
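The read and write behavior described above can be modeled as follows. The class and the dictionary keys for the control word fields are hypothetical, words are represented as lists of Q bits, and returning `None` on a write step is an assumption, since the text does not say what the processor outputs during a write.

```python
class MemoryProcessor:
    """Behavioral model of one memory emulation processor (illustrative only)."""

    def __init__(self, control_store, memory_store):
        self.control_store = control_store
        # Copy the preloaded words so the caller's data is not modified.
        self.memory_store = [list(word) for word in memory_store]

    def step(self, n, network_outputs):
        """Execute emulation step n; return Q output bits on a read, None on a write."""
        word = self.control_store[n]          # read control word at address n
        address = word["mwa"]

        if word["write"]:
            # Capture unit: concatenate the Q bits chosen by CI1 .. CIQ.
            captured = [network_outputs[ci] for ci in word["ci"]]
            self.memory_store[address] = captured
            return None

        # Release unit: break the stored word up into single bits.
        return list(self.memory_store[address])


program = [
    {"mwa": 1, "write": 1, "ci": [2, 0, 1, 2]},   # step 0: write word 1
    {"mwa": 1, "write": 0, "ci": [0, 0, 0, 0]},   # step 1: read word 1 back
]
mp = MemoryProcessor(program, memory_store=[[0, 0, 0, 0], [0, 0, 0, 0]])
mp.step(0, network_outputs=[1, 0, 1])
print(mp.step(1, network_outputs=[0, 0, 0]))
```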
The above description indicates that all the stages of the operation
have to occur during the same emulation step but not simultaneously.
A certain delay should be allowed between reading a control word from the control
store and reading a memory word from the memory store. Another delay should
be allowed between reading a memory word and writing a memory word to give time
for the bits to be read before they are written. To accommodate that, both edges of
the clock are used, as in the logic processor. As shown in Figure 3.9, at the first
falling edge of the clock a control word is read using step address n. In case of a read,
at the rising edge of the clock a memory word is read from the memory store after
enough time has been given for the address to be derived from the control word. In case of
a write, at the second falling edge of the clock several bits from the interconnection
network are collected and written to the memory store; writing at this edge ensures that
sufficient time was given for these bits to be read. Also at the second falling edge of the
clock a new control word is read using step address n + 1.

3.5 Emulation Module

The first level of hierarchy in our system is the emulation module, shown in Figure
3.10. It consists of R logic processors and S memory processors. Each processor in
one emulation module is connected to every other processor in the same module to
ensure that the output of any processor is readily available as an input to every other
processor. Moreover, each processor has an external input which could be connected
to other processors in other modules.
In addition to the processors, the emulation module contains two other
components: the interconnection network and the sequential filler. The interconnection
network is made up of the set of wires connecting all the processors together and the
module level routing switch.

Figure 3.10: Emulation Module (logic processors LP and memory processors MP connected through the Module Level Routing Switch, with External Inputs and External Outputs)

3.5.1 Module Level Routing Switch

Connecting all the processors together in an emulation module is an interconnection
network controlled by the module level routing switch. This switch is basically made
up of $(R + Q \times S)$ multiplexers, each of size $(R + Q \times S)$-to-1: one multiplexer
for each logic processor and one for every input of each memory processor. The purpose
of this switch is to make the output of each processor readily available to every other
processor to use. Moreover, the switch can supply the processors inside the module with
external inputs that are derived from other modules.
The switch uses the ChooseInput field supplied by each processor as the selection
bits of the multiplexers to route the signals between processors.
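The sizing and behavior of the switch can be illustrated with the sketch below. Both function names are hypothetical, external inputs from other modules are left out of the model, and rounding the select width up to whole bits is an assumption.

```python
def switch_dimensions(R, S, Q):
    """Size of the module level routing switch for R logic and S memory processors.

    Each logic processor contributes one output and one input; each memory
    processor contributes Q outputs and Q inputs.
    """
    lines = R + Q * S
    return {
        "multiplexers": lines,               # one per destination input
        "inputs_per_multiplexer": lines,     # one per source output
        "select_bits": (lines - 1).bit_length(),  # ASSUMED: rounded up
    }


def route(outputs, choose_inputs):
    """Model the switch: destination i receives outputs[choose_inputs[i]]."""
    return [outputs[select] for select in choose_inputs]


# Hypothetical module: 4 logic processors, 2 memory processors, 4-bit words.
print(switch_dimensions(R=4, S=2, Q=4))
print(route(outputs=[1, 0, 1, 1], choose_inputs=[3, 0, 1]))
```

Since both the number of multiplexers and their width equal R + Q × S, the switch grows roughly quadratically with the number of processor outputs, which is one reason the module size has to be bounded.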

3.5.2 Sequential Filler

The limited number of pins on an FPGA makes it impossible for the user to fill up
the control and memory stores of all processors at the same time. For this reason,
a sequential filler is created to fill up the stores in a sequential manner. To choose
which processor to fill up, the sequential filler has two input signals: one to choose
which logic processor and the other to choose which memory processor. The usage of
the filler should not affect performance in any way since it is only used once prior to
emulation and at high speed.

3.5.3 Architecture and Operation

The basic architecture of the emulation module is shown in Figure 3.10. Each
processor has one external input and one external output. The external input could be
chosen from among all the outputs of all the processors in the same module or from a
different source outside the module. The choice of this source will be described later.
Each processor also supplies the interconnection network with an output. The wires
used to fill up the control and memory stores along with the sequential filler are not
shown in Figure 3.10.
All the processors in one emulation module receive an identical step value during
an emulation step. The processors use this value to determine the operation as
described earlier.

3.6 Emulation Chip

The second level of hierarchy in this system is the emulation chip. It consists of
T identical emulation modules and fits on one FPGA. Specific processors inside the
emulation module are connected to specific processors in other emulation modules
to ensure that their outputs are readily available as inputs. Moreover, an emulation
chip has several external inputs and external outputs whose numbers are to be chosen
depending on the availability of pins on the FPGA. In addition to the modules, the
emulation chip has one other component, the chip level routing switch.

Figure 3.11: Emulation Chip (Modules connected through the Chip Level Routing Switch, with External Inputs and External Outputs)

3.6.1 Chip Level Routing Switch

The chip level routing switch connects all the module pins, external inputs and
external outputs together. This switch is made up of several multiplexers: one multiplexer
for each input of every module and one for each external output. The number of
these multiplexers depends on the number of modules inside the emulation chip. As
for their selection capacity, it depends on the resources available on the FPGA to
store their selection bits. More about these multiplexers will be described in the next
chapter.
The switch allows inputs of processors inside one module to choose among certain
outputs of other modules. The switch also allows external outputs to choose among
the outputs of certain processors or external inputs.

3.6.2 Architecture and Operation

The basic architecture of the emulation chip is shown in Figure 3.11. Each module
can choose among several outputs from different modules or external
inputs. Similar to the module, the step value is identical for all processors in the
emulation chip.

3.7 Emulation Engine

An emulation engine is the third and last level of hierarchy. The emulation engine
contains a number of emulation chips connected together in a multi-FPGA system.

3.7.1 Multi-FPGA System

Several FPGA connection schemes are available today. The system shown in Figure
3.12 is an example of an 8-way mesh multi-FPGA system architecture while the
one shown in Figure 3.13 is an example of a fully connected multi-FPGA system
architecture.
Each FPGA contains an emulation chip. The chip has a certain number
of inputs and outputs through which it communicates with other chips. Each
processor inside the chip can choose among several of the external inputs and each of
the external outputs can choose among several outputs of the processors. This gives
each processor in the chip the capability to communicate with other processors in
other chips implemented on other FPGAs.

Figure 3.12: 8-Way Mesh MFS (nine FPGAs)

Figure 3.13: Fully Connected MFS (six FPGAs)

Figure 3.14: Clock Duty Cycle (Emulation Clock)
In addition to that, some of these external outputs can choose among several of
the external inputs, enabling the FPGA to act as a routing switch. This becomes
useful in the case where two emulation chips are implemented on two FPGAs that do
not share a direct connection. Here, intermediate FPGAs serve as routers of
the signal from its source until it reaches its destination. Note that in the case
where full connectivity is ensured for the multi-FPGA system these outputs are not
necessarily useful.
We should note that the selection bits for the signal routing are applied at the rising
edge of the clock in order to precede the writing operations that occur at the falling
edge of the clock in all logic and memory processors.

3.7.2 Scalability Issues

The basic challenge that we have to solve in the multi-FPGA system is speed.
The time between the read and write operations is the
crucial factor when dealing with speed. We have to make sure that the time
between the rising and falling edges of the clock is long enough for the signal to
traverse the system. As mentioned above, the first falling edge of the
clock is when we read the control word, the rising edge is when we read from the
stacks or the memory store and the second falling edge is when we write to the stacks
or the memory stores. The challenge is to connect the FPGAs in such a way that
if a certain processor in one FPGA needs to write a signal (or output) from another
processor in a second FPGA, the time between the rising edge and the second falling
edge is long enough for the signal to travel between the two FPGAs.


Assuming a non-50% duty cycle, let x be the time that follows the falling edge of the
clock (up to the next rising edge) and y the time that follows the rising edge (up to
the next falling edge), as shown in Figure 3.14. The critical time
is y, not x: the interval x only deals with intra-FPGA connections, while the
interval y may have to deal with inter-FPGA connections. More on
the scalability issues will be discussed in the next chapter.
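The constraint on the two clock intervals can be sketched numerically. The function and all delay values below are hypothetical illustrations, not measurements from this work; the model simply assumes that x must cover the control word fetch and that y must cover the stack read, the logic evaluation and the routing to the destination.

```python
def clock_phases(control_fetch, stack_read, logic_delay, routing_delay):
    """Minimum clock intervals for one emulation step (all delays in ns).

    x: falling edge to rising edge, covers the control word fetch
    y: rising edge to next falling edge, covers the stack read, the logic
       function and the routing (possibly across FPGAs) before the write
    All arguments are ASSUMED delay figures supplied by the caller.
    """
    x = control_fetch
    y = stack_read + logic_delay + routing_delay
    period = x + y
    return {"x": x, "y": y, "period": period, "duty_cycle": y / period}


# Hypothetical delays: the routing term dominates once signals cross FPGAs.
intra = clock_phases(control_fetch=2, stack_read=1, logic_delay=2, routing_delay=1)
inter = clock_phases(control_fetch=2, stack_read=1, logic_delay=2, routing_delay=7)
print(intra)
print(inter)
```

In this sketch x stays fixed while y stretches with the routing delay, which is why a duty cycle other than 50% lets the clock period grow only by the amount the inter-FPGA path actually needs.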

Chapter 4

Architecture Exploration and Implementation Results

This chapter discusses the architecture exploration carried out and the
implementation results for the proposed logic emulation system. The first section describes
the implementation target used in this research. Section 2 presents the architecture
exploration and the effect of changing key parameters on the area and performance
of the emulator. Section 3 presents the implementation results.

4.1 Implementation Target

The FPGA used for implementation in this research is the Altera Stratix EP1S40F780C5
FPGA. We now present a detailed description of this FPGA.

4.1.1 Altera Stratix FPGA

The Altera Stratix FPGA Family [3] contains the following resources:
1. Logic Array Blocks (LABs)
2. M512 Blocks
3. M4K Blocks
4. M-RAM Blocks
5. DSP Blocks
6. I/O Elements
Each LAB contains ten logic elements similar to the ones discussed in chapter
2 and shown in Figure 2.3. These logic elements are capable of emulating virtually
any logic function. M512 is a memory block which contains 512 programmable bits
plus parity bits. It can be configured in single-port or simple dual-port mode. The
M4K is another memory block which contains 4,096 programmable bits plus parity
bits and can be configured in single-port, simple dual-port or true dual-port mode.
The third, and largest, memory block is the M-RAM which contains 512 kilobits of
programmable memory plus parity bits. This memory block can be configured in
single-port, simple dual-port or true dual-port mode. The DSP blocks of the Stratix
FPGA are used to implement several forms of multipliers while the I/O elements are
connected to the FPGA pins and support different I/O standards.
T he F PG A used in this research, the A ltera Stratix EP1S40F780C5 FPG A , con
tains 4,125 LABs or 41,250 LEs. It also contains 384 M512, 183 M4K and 4 M-RAM
blocks m aking the to ta l num ber of m emory bits 3,423,744. In addition to th a t, it
contains 14 DSP blocks and 616 I/O pins [3].
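
The total of 3,423,744 memory bits can be cross-checked from the block counts given above. The short Python sketch below does so under one assumption that is not stated in the text: that the total counts the parity bits, giving 576 bits per M512, 4,608 bits per M4K and 589,824 bits per M-RAM. The names used are illustrative only.

```python
# Block capacities in bits, parity included.
# ASSUMPTION: these per-block sizes (data + parity) are not given in the text;
# they are the values under which the quoted total is reproduced.
BITS_PER_BLOCK = {"M512": 576, "M4K": 4_608, "M-RAM": 589_824}

# Block counts for the EP1S40 as quoted in the text.
BLOCK_COUNT = {"M512": 384, "M4K": 183, "M-RAM": 4}


def total_memory_bits(counts, bits_per_block):
    """Sum the capacity of every embedded memory block on the device."""
    return sum(counts[name] * bits_per_block[name] for name in counts)


total = total_memory_bits(BLOCK_COUNT, BITS_PER_BLOCK)
print(f"{total:,}")  # 3,423,744
```

Under this assumption the M-RAM blocks alone account for 2,359,296 bits, roughly two thirds of the device memory, which is relevant to the utilization figures reported later in the chapter.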


4.2 Architecture Exploration

In chapter 3 we described the architecture and operation of the emulation system
without specifying values for important parameters such as the lookup table
size or the number of emulation steps. In this section we describe the architecture
experiments that were performed to determine the effects of varying different
architectural parameters on the area and delay of the proposed emulator.

4.2.1 Key Parameters

The key parameters that were presented in chapter 3 and explored in this design are:
1. M: lookup table size.
2. N: number of emulation steps.
3. P: total number of outputs of all processors in one emulation module.
4. Q: memory word size.
The main goal of the exploration is to choose a value for each of the above
parameters. To accomplish this, each parameter was varied in turn while the others
were held fixed, and the effect on area and performance was checked. The logic
and memory processors were both implemented after each change and the results were
recorded. It is important to note that the effects of changing the parameters were
only considered for individual processors. The routing between these processors, and
in effect the hierarchy of the emulator, was not taken into consideration due to the
complexity of the process. Instead, the effect of each parameter on a single processor
was assumed to be proportional to its effect on the whole system.
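
The one-parameter-at-a-time procedure can be sketched as a small Python generator. The sweep ranges and fixed values are the ones reported in the subsections of Section 4.2.2; the function and variable names are hypothetical, and the sketch only enumerates the configurations. It does not synthesize anything.

```python
# Values held fixed while another parameter is swept (taken from Section 4.2.2).
BASELINE = {"M": 4, "N": 128, "P": 64, "Q": 8}

# Sweep range of each parameter as reported in Section 4.2.2.
SWEEPS = {
    "M": list(range(2, 9)),      # lookup table size: 2, 3, ..., 8
    "N": [64, 128, 256, 512],    # emulation steps, powers of 2
    "P": [32, 64, 128, 256],     # total outputs, powers of 2
    "Q": [1, 2, 4, 8, 16],       # memory word size, powers of 2
}


def one_at_a_time(baseline, sweeps):
    """Yield (swept_name, config) pairs in which only one parameter leaves the baseline."""
    for name, values in sweeps.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            yield name, config


for name, config in one_at_a_time(BASELINE, SWEEPS):
    print(name, config)
```

Each configuration would then be implemented and its logic elements, memory bits and clock speed recorded. Because M affects only the logic processor and Q only the memory processor, not every configuration needs to be built for both processors.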


4.2.2 Effect of Changing Parameters

In this section we aim to monitor the effect of each of the parameters on the area and
performance of the logic and memory processors. For that reason, each parameter
under consideration was changed and its effect observed while the other parameters
were given fixed values. This process was repeated for each parameter on both
processors. In what follows we show in graphs the effect of changing each of the
parameters. The effect on area is measured by the number of logic elements and
memory bits each processor consumes when implemented on the FPGA, while the
performance is measured by emulation clock speed.
It is important to note that the results here were obtained after implementation
and not from mathematical equations. More about the implementation of each
processor will be discussed in later sections of this chapter.

4.2.2.1 Effect of Changing Lookup Table Size

The size of the lookup table of the logic processor determines the logic capacity of
the processor. In other words, it determines how many logic gates each processor can
emulate. To determine how this parameter might affect the area and performance of
the processor, the size of the lookup table was increased in steps of one, starting
with 2 and ending with 8. To ensure that we are reading the effect of the lookup table
size only, the other parameters were never changed. The number of emulation steps and
the total number of outputs were fixed at 128 and 64 respectively. Note that varying
the lookup table size has no effect on the memory processor; it affects the logic
processor alone. The results are shown in Figures 4.1, 4.2 and 4.3.
Figure 4.1: LUT Size vs. Area in LP (x-axis: M, LUT size; y-axis: logic elements)

Figure 4.2: LUT Size vs. Memory Bits in LP (x-axis: M, LUT size; y-axis: memory bits)

Figure 4.3: LUT Size vs. Speed in LP (x-axis: M, LUT size; y-axis: MHz)

It is clear from Figure 4.1 and Figure 4.2 that as the size of the lookup table
increased, the area consumed by the logic processor increased exponentially. This
can be attributed mainly to the effect of the lookup table size on the control store
and the logic elements. The size of the control word of the control store grows
exponentially with the lookup table size: the size of a single control word in bits is
$\log_2(P) + M + 2^M + M \times \log_2(N)$, and thus increasing the lookup table
size results in an exponential increase in the control store size. The exponential
increase in the number of logic elements can be explained in a similar way: the
lookup table is implemented in the FPGA's logic elements, so an increase in its size
means an increase in the number of logic elements consumed.
As for the effect of the lookup table size on the speed of the processor, Figure 4.3
shows that as the lookup table size increased, the performance of the
processor decreased gradually. This is predictable, since the time for processing a
certain number of inputs inside a processor is likely to increase as the number of
these inputs increases.
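
The control word expression above, log2(P) + M + 2^M + M x log2(N), can be evaluated directly to see how quickly the 2^M term comes to dominate. The following Python sketch is an illustration only: the function names are hypothetical, and N and P are assumed to be powers of two, as they are throughout this chapter.

```python
def ceil_log2(value):
    """Number of bits needed to address `value` distinct items (exact for powers of two)."""
    return (value - 1).bit_length()


def control_word_bits(M, N, P):
    """Logic processor control word size: log2(P) + M + 2^M + M*log2(N)."""
    return ceil_log2(P) + M + 2**M + M * ceil_log2(N)


def control_store_bits(M, N, P):
    """One control word per emulation step."""
    return N * control_word_bits(M, N, P)


# Sweep the lookup table size with N = 128 and P = 64, as in the experiment.
for M in range(2, 9):
    print(M, control_word_bits(M, 128, 64), control_store_bits(M, 128, 64))
```

For M = 4, N = 128 and P = 64 the sketch gives a 54-bit control word, which agrees with the value quoted in Section 4.3.1.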


4.2.2.2 Effect of Changing Number of Emulation Steps

The second key parameter to be tested is the number of emulation steps. The number
of emulation steps, N, is a critical parameter in both the logic and the memory
processor. Here we study its effect on both processors.
• Effect on the Logic Processor: The number of emulation steps was varied from
64 to 512 in powers of 2. The size of the lookup table, M, was fixed at
4 while the total number of outputs, P, was fixed at 64. The results are shown
in Figures 4.4, 4.5 and 4.6.
As shown in Figure 4.4, the change in the number of emulation steps barely
had any effect on the number of logic elements used to implement the logic
processor. In contrast, as shown in Figure 4.5, the change in the number of
emulation steps had a linear effect on the number of memory bits. This can
be explained by the fact that the number of emulation steps does not affect the
combinational part of the processor but rather the size of the memory blocks,
that is, the control store and the data stacks.
As for the speed, it is clear from Figure 4.6 that changing the number of
emulation steps had little effect on the speed of the processor.

Figure 4.4: Number of Emulation Steps vs. Area in LP (x-axis: N, emulation steps; y-axis: logic elements)

Figure 4.5: Number of Emulation Steps vs. Memory Bits in LP (x-axis: N, emulation steps; y-axis: memory bits)

Figure 4.6: Number of Emulation Steps vs. Speed in LP (x-axis: N, emulation steps; y-axis: MHz)

• Effect on the Memory Processor: The number of emulation steps also affects
the implementation of the memory processor. Here, the number of steps was
also varied from 64 to 512 in powers of 2. The size of the memory word,
Q, was fixed at 8 and the total number of outputs, P, was fixed at 64. The
results of the implementation are shown in Figures 4.7, 4.8 and 4.9.
Similar to the logic processor, the effect of the number of steps was limited
to the number of memory bits, as shown in Figures 4.7, 4.8 and 4.9. This
is expected because the number of emulation steps determines the size of the
control store and the memory store and does not affect the combinational part
of the processor.

Figure 4.7: Number of Emulation Steps vs. Area in MP (x-axis: N, emulation steps; y-axis: logic elements)

Figure 4.8: Number of Emulation Steps vs. Memory Bits in MP (x-axis: N, emulation steps; y-axis: memory bits)

Figure 4.9: Number of Emulation Steps vs. Speed in MP (x-axis: N, emulation steps; y-axis: MHz)

4.2.2.3 Effect of Changing Total Number of Outputs

The third parameter to be checked is the total number of outputs, P,
which helps determine the numbers of logic and memory processors packed in one
emulation module.
• Effect on the Logic Processor: P was varied from 32 to 256 in powers of
2 while the lookup table size, M, was fixed at 4 and the number of emulation
steps, N, was fixed at 128. The results are shown in Figures 4.10, 4.11 and 4.12.
As can be observed in Figures 4.10, 4.11 and 4.12, the parameter had virtually no
effect on the area and performance of the processor. This can be explained by
the fact that the only effect this parameter has is on the size of the ChooseInput
field of the control word. Each doubling of the number of outputs increases this
field by only one bit.

• Effect on the Memory Processor: P was varied from 32 to 256 in powers
of 2 while the size of the memory word, Q, was fixed at 8 and the number of
emulation steps, N, was fixed at 128. The results are shown in Figures 4.13, 4.14
and 4.15.
As shown in Figure 4.13 and Figure 4.14, the area consumed by the processor
increased as the number of outputs increased. This can be explained by the fact
that the CI field in the control word for each input bit grows as P increases.
As for the speed, Figure 4.15 shows that changing the number of outputs
had no major effect on speed.

Figure 4.10: Number of Total Outputs vs. Area in LP (x-axis: P, number of outputs; y-axis: logic elements)

Figure 4.11: Number of Total Outputs vs. Memory Bits in LP (x-axis: P, number of outputs; y-axis: memory bits)

Figure 4.12: Number of Total Outputs vs. Speed in LP (x-axis: P, number of outputs; y-axis: MHz)

Figure 4.13: Number of Total Outputs vs. Area in MP (x-axis: P, number of outputs; y-axis: logic elements)

Figure 4.14: Number of Total Outputs vs. Memory Bits in MP (x-axis: P, number of outputs; y-axis: memory bits)

Figure 4.15: Number of Total Outputs vs. Speed in MP (x-axis: P, number of outputs; y-axis: MHz)

Figure 4.16: Memory Word Size vs. Area in MP (x-axis: Q, memory word size; y-axis: logic elements)

4.2.2.4 Effect of Changing Memory Word Size

The last key parameter to be checked is the memory word size, Q, which
only affects the memory processor. Here the number of emulation steps, N, was fixed
at 128 and the total number of outputs, P, was fixed at 64. The size of the memory
word was varied from 1 to 16 in powers of 2. The results are shown in Figures
4.16, 4.17 and 4.18.
As can be seen in Figure 4.16 and Figure 4.17, increasing the size of the memory
word had a linear effect on the area of the memory processor. This is
expected since, as the size of the memory word increases, the combinational logic
required for the capture and release memory word units increases. In addition, the
size of the memory store where the memory word is stored increases as the size of the
word increases.
As shown in Figure 4.18, the effect of the memory word size on the speed of the
processor was limited; however, there was a small decrease in
processor speed when the memory word size was increased from 8 to 16. This decrease
in speed could be attributed to the fact that processing the memory word
takes longer as its size increases.

Figure 4.17: Memory Word Size vs. Memory Bits in MP (x-axis: Q, memory word size; y-axis: memory bits)

Figure 4.18: Memory Word Size vs. Speed in MP (x-axis: Q, memory word size; y-axis: MHz)

4.2.3 Choice of Parameters

The choice of the parameters used in our implementation is based on the results
obtained above. The following values were chosen:
• M = 4. The lookup table size was chosen to be 4 because it provides good emulation
speed at a low area cost.
• N = 128. The number of emulation steps was chosen to be 128. The reason
behind this decision was the fact that FPGA resources are limited. M4K and
M512 blocks are both of limited size, and 128 words stored in each of them seems
a reasonable size.
• Q = 8. The size of the memory word was chosen to be 8 mainly because
of the emulation speed. As noted earlier, the emulation speed was relatively
stable until the memory word size was increased from 8 to 16, where the speed
fell comparatively more.
• P = 64. The total number of outputs was chosen to be 64, which in effect meant that
32 logic processors and 4 memory processors were packaged together in one
module. The exploration did not show that any specific value of this parameter
had a major effect on the area and performance of any of the processors.

4.3 Implementation Results

In this section we discuss the implementation results of the system. Note that the
values for the parameters used here were the ones chosen above. The design tool that
was used was Quartus II, which is supplied by Altera [2]. The hardware description
language that was used was VHDL [28].

4.3.1 Logic Processor

The elements of the logic processor were implemented as follows:
• Control Store: using M = 4, N = 128 and P = 64, the size of the control word
is 54 bits. This means that the size of the control store is 128 x 54.
The control store has no combinational logic and only needs to be implemented
as a memory block. The memory block chosen to implement the store was
the M4K. To save on memory bits, the control stores of two processors were
combined and implemented in 3 M4K blocks (each M4K is of size
128 x 36). A decoder was created to later separate the two words from each
other.
• Data Stacks: since the number of emulation steps is 128, the size of each
stack is 128 x 1. Similar to the control store, both the internal and external
stacks need no combinational logic and are implemented as memory blocks
inside the FPGA. The memory blocks chosen to implement the stacks were the
M512 blocks. Since each M512 block can supply at most two outputs at a time,
each M512 was duplicated to ensure that 4 outputs can be supplied at the same
time.
• Logic Element: the logic element is made up of purely combinational logic and
requires no memory blocks. The 2-to-1 multiplexers and the lookup table, which
is in turn a 4-to-1 multiplexer, were implemented in standard VHDL code used
typically to describe multiplexers. The memory array of the lookup table was
also implemented in logic elements and no memory blocks were used.


The implementation of each logic processor requires 1.5 M4K blocks, 4 M512
blocks and 29 FPGA logic elements.

4.3.2 Memory Processor

The elements of the memory processor were implemented as follows:
• Control Store: using Q = 8, N = 128 and P = 64, the size of the control word
is 56 bits. This makes the control store of size 128 x 56. The control
store needs no combinational logic and is implemented in 2 M4K blocks.
• Memory Store: since the size of the memory word is 8 and the number of
emulation steps is 128, the size of the memory store is 128 x 8. Similar
to the control store, the memory store needs no combinational logic and is
implemented in one M4K block.
• Capture and Release Memory Word Units: the two units only require
combinational logic. Their functional behavior was described using standard VHDL
statements used in typical concatenation and breakup instructions.
The implementation of each memory processor requires 3 M4K blocks and 72
FPGA logic elements.

4.3.3 Emulation Module

Each emulation module contains 32 logic processors and 4 memory processors, together
having a total of 64 outputs and sharing one interconnection network. Aside
from these processors, the module contains two other elements: the sequential filler
and the module level routing switch. Both of these elements were implemented in
VHDL and require no memory blocks.


Since the routing switch is made up of multiplexers, the VHDL code used to
describe its functional behavior is that which is typically used to describe the
functionality of multiplexers. As for the sequential filler, its functional behavior was
described in a series of conditional statements which determine which processor is
being filled up before emulation starts.

4.3.4 Emulation Chip

Because of the limited resources of the FPGA, each emulation chip in our design
contains three modules. It requires all 384 M512 blocks, 180 M4K blocks (98% of all
M4K blocks) and all 4 M-RAM blocks. The M-RAM blocks were used to store the
selection bits for the multiplexers of the chip level routing switch.
The chip level routing switch, shown in Figure 4.19, connects all the module pins,
external inputs and external outputs together. This switch is made up of 256 4-to-1
multiplexers and 64 2-to-1 multiplexers; one multiplexer for each input of the three
modules and one for each external output. The switch allows inputs of processors
inside one module to choose among certain outputs of other modules. For example,
the input of processor 0 of module 0 can choose between: output of processor 0 in
module 1, output of processor 0 in module 2, external input 0 or external input 1.
The switch also allows external outputs to choose among outputs of certain processors
or external inputs. For example, external output 0 can choose between: output of
processor 0 in module 0, output of processor 0 in module 1, output of processor 0 in
module 2, or external input 0.
Figure 4.19: Module Level Routing Switch

The emulation chip consumes 10,579 logic elements and 933,888 memory bits.
This puts the FPGA logic utilization at 25% and memory utilization at 27%. The
key limitation in resources was due to the memory blocks, mainly the M512 and
M4K blocks, which were almost fully utilized by our design. The total
memory utilization shows only 27% because the majority of the memory
bits available are in the M-RAM blocks, which were only partially utilized. In
addition to the logic elements and memory blocks, each emulation chip requires 531
pins, making the pin utilization 86%.
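
The block counts and utilization percentages above follow from the per-processor requirements given in Sections 4.3.1 and 4.3.2. The Python sketch below reproduces that bookkeeping; the names are illustrative, and the percentages are truncated rather than rounded, which is an assumption chosen because it matches the figures quoted in the text.

```python
MODULES_PER_CHIP = 3
LPS_PER_MODULE, MPS_PER_MODULE = 32, 4

# Per-processor requirements from Sections 4.3.1 and 4.3.2.
LP = {"M4K": 1.5, "M512": 4, "LE": 29}
MP = {"M4K": 3, "M512": 0, "LE": 72}

# Device totals for the EP1S40F780C5 from Section 4.1.1.
DEVICE = {"M4K": 183, "M512": 384, "LE": 41_250, "mem_bits": 3_423_744, "pins": 616}


def chip_usage(resource):
    """Resource count consumed by all logic and memory processors on one chip."""
    per_module = LPS_PER_MODULE * LP[resource] + MPS_PER_MODULE * MP[resource]
    return MODULES_PER_CHIP * per_module


def percent(used, available):
    """Utilization as a whole percentage (truncated, an assumption about the rounding)."""
    return int(100 * used / available)


print(chip_usage("M4K"), percent(chip_usage("M4K"), DEVICE["M4K"]))  # 180.0 98
print(chip_usage("M512"))                                            # 384
print(chip_usage("LE"))                                              # 3648
print(percent(10_579, DEVICE["LE"]))                                 # 25
print(percent(933_888, DEVICE["mem_bits"]))                          # 27
print(percent(531, DEVICE["pins"]))                                  # 86
```

The processors alone account for 3,648 of the 10,579 logic elements reported for the chip. The text does not break down the remainder; presumably it belongs to the routing switches, sequential fillers and control store decoders.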
The whole design, one emulation chip, was made up of almost 6,100 lines of VHDL
describing the functional behavior of the combinational logic. The memory blocks
were designed and implemented by means of megafunctions, a design tool supplied
by Quartus II to save the time required to write the code in VHDL. The design
contains 7 megafunctions [24].
Each emulation chip is capable of emulating the equivalent of 98,304 ASIC
gates per design cycle. This was calculated by assuming that each logic processor
with a 4-input lookup table can implement 8 ASIC gates per emulation step and
1,024 ASIC gates per design cycle [10]. The emulation chip can also emulate 12,288
memory bits, which is the sum of all the bits stored in all the memory stores of the
memory processors.
Lastly, the emulation clock frequency of a single emulation chip is 24.04 MHz. The
emulated design can run at 187.8 kHz or more, depending on the number of emulation
steps used in the design cycle.
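
The capacity and speed figures above are simple products and quotients of values already given in this chapter. A minimal Python sketch of the arithmetic follows; the function names are hypothetical, and the constant of 8 ASIC gates per logic processor per step is the assumption the text takes from [10].

```python
LOGIC_PROCESSORS = 3 * 32    # three modules of 32 logic processors
MEMORY_PROCESSORS = 3 * 4    # three modules of 4 memory processors
STEPS = 128                  # N, emulation steps per design cycle
WORD_BITS = 8                # Q, memory word size
GATES_PER_LP_PER_STEP = 8    # assumption taken from [10]
EMULATION_CLOCK_HZ = 24.04e6


def asic_gates_per_chip():
    """Equivalent ASIC gates emulated by one chip in one design cycle."""
    return LOGIC_PROCESSORS * GATES_PER_LP_PER_STEP * STEPS


def memory_bits_per_chip():
    """Bits held in the memory stores of all memory processors."""
    return MEMORY_PROCESSORS * STEPS * WORD_BITS


def design_clock_hz(steps_used=STEPS):
    """Emulated design frequency: one design cycle takes `steps_used` emulation clocks."""
    return EMULATION_CLOCK_HZ / steps_used


print(asic_gates_per_chip())            # 98304
print(memory_bits_per_chip())           # 12288
print(round(design_clock_hz() / 1e3, 1))  # 187.8 (kHz)
```

A design that needs fewer than 128 emulation steps per design cycle runs proportionally faster, which is why 187.8 kHz is a lower bound: `design_clock_hz(64)`, for example, is twice that value.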

4.4 Implementation Estimates for Emulation Engine

The implementation of this design only involved the second level of hierarchy, the
emulation chip. The highest level of hierarchy, the emulation engine, was not
implemented. In this section we give some estimates for the implementation of this engine.
A typical emulation engine would be made up of a fully connected multi-FPGA
system, such as the one shown in Figure 3.13. Here six FPGAs are connected together to
act as an emulation engine. The logic capacity of this engine is equivalent to 589,824
ASIC gates and 73,728 memory bits.
As mentioned in chapter 3, the main issue we would need to deal with in such a
system is the speed. The high pulse of the clock, symbolized by y in Figure 3.14, needs
to be long enough for the signal to traverse the longest path of the multi-FPGA
system. On one Printed Circuit Board (PCB) we can assume that this delay is on
average the same for all the connections.
To calculate this delay we assume that the dielectric used for the PCB is FR-4,
the most widely used dielectric for PCBs [5]. This means that the propagation speed
on the PCB is $1.48 \times 10^8$ m/s [1]. Therefore, the time needed by a signal to
traverse half a meter, a typical size of a PCB, is approximately 3.4 ns. Following this
logic, y should be at least 3.4 ns to ensure that the signal has enough time to reach its
destination.
If we choose a 50% duty cycle, then the period of one clock cycle should
be around 7 ns for the signal to traverse the longest path. However, it was
mentioned before that the emulation clock frequency is 24.04 MHz and its period is
41.6 ns. It is clear that the emulation clock period is much longer than the longest path
delay, and therefore the PCB connections would not add any extra delay and should
not decrease the speed of the clock if the board remains of reasonable size.
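
The delay estimate can be reproduced with a few lines of Python. The sketch below uses only the numbers given in the text (propagation speed on FR-4, a half-meter trace, a 50% duty cycle and the 24.04 MHz emulation clock); the function names are illustrative, and the model ignores pin and buffer delays, as the text does.

```python
PROPAGATION_SPEED_M_PER_S = 1.48e8   # signal speed on FR-4, from the text
EMULATION_CLOCK_HZ = 24.04e6


def flight_time_ns(trace_length_m):
    """Time for a signal to travel the given PCB trace length."""
    return trace_length_m / PROPAGATION_SPEED_M_PER_S * 1e9


def min_period_ns(trace_length_m, duty_cycle=0.5):
    """Shortest clock period whose high pulse y still covers the flight time."""
    return flight_time_ns(trace_length_m) / duty_cycle


def clock_period_ns(frequency_hz):
    return 1e9 / frequency_hz


print(round(flight_time_ns(0.5), 2))                 # 3.38
print(round(min_period_ns(0.5), 2))                  # 6.76
print(round(clock_period_ns(EMULATION_CLOCK_HZ), 1))  # 41.6

# Six-chip engine capacity, using the per-chip figures from Section 4.3.4.
print(6 * 98_304, 6 * 12_288)                        # 589824 73728
```

With these numbers the emulation clock period is roughly six times the minimum period imposed by the board, so the trace length could grow considerably before the inter-FPGA connections became the limiting factor.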

4.5 Emulation Example

To illustrate and verify the operation of the emulator, we chose to emulate a four-bit
multiplier on a single emulation chip.

4.5.1 Four-Bit Multiplier

The multiplier has two inputs, each of size 4 bits, and one output of size 8 bits.
Figure 4.20 shows the multiplication process. The goal is to assign each operation of this
multiplication to one logic processor in a process known as scheduling. The symbols
shown as superscripts are the overflow from the previous operations.

Figure 4.20: Operation of the Four-Bit Multiplier (rows, top to bottom: Multiplicand A; Multiplier B; Initial Partial Product C; Multiply A by B0, D; Add, E; Multiply A by B1, F; Add, G; Multiply A by B2, H; Add, I; Multiply A by B3, J; Add, K)
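
The multiply-then-add sequence of Figure 4.20 corresponds to the familiar shift-and-add algorithm. The Python sketch below is a software reference model of that algorithm, not the author's test bench or schedule; a model of this kind could be used to generate expected outputs for the emulated multiplier. The function name is hypothetical.

```python
def multiply_4bit(a, b):
    """Shift-and-add multiplication of two 4-bit operands, returning an 8-bit product."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must be 4-bit unsigned values")
    partial = 0                      # initial partial product
    for i in range(4):
        bit = (b >> i) & 1           # multiplier bit Bi
        row = (a if bit else 0) << i  # "Multiply A by Bi", aligned to column i
        partial += row               # "Add" to the running partial product
    return partial & 0xFF


# Exhaustive check against ordinary integer multiplication.
for a in range(16):
    for b in range(16):
        assert multiply_4bit(a, b) == a * b

print(multiply_4bit(13, 11))  # 143
```

Because the input space has only 256 combinations, exhaustive comparison is cheap, whereas the text reports that several values were tested.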

4.5.2 Scheduling and Implementation

Tables 4.1 and 4.2 show the scheduling of the eight processors used for emulating the
four-bit multiplier. Normally the scheduling process would be automated, but since
our design lacks the CAD tools associated with it, the scheduling was done manually.
It is important to note that this schedule might not be the most efficient one, since
the aim here is only to verify the functionality of the emulator. Cap and Cal in the
tables stand for capture value and calculate value respectively.

Step
LP0
0
Cap(A0)
Cap(A1)
1
Cap(B0)
Cap(B0)
2
Cal(D0)
Cal(D1)
3
Cal(E0),
Cal(E1),
Cap(B1)
Cap(B1)
4
Cal(F0),
LP1
Cap(F3)
LP2
Cal(F1),
LP3
Cap(E1)
Cap(F0)
5
Cap(F3)
Cap(F0)
Cap(B2)
Cal(G0),
Cal(EF0)
Cap(B2)
6
7
Cap(EF2)
Cap(EF2)
8
Cal(G3)
Cal(G4),
Cap(G4)
Cap(G4)
Cal(H1),
Cap(H3)
Cap(G3)
9
Cal(H0),
Cap(H2)
Cap(H2)
Cap(H3)
10
11
Cap(GH1)
Cap(GH1)
12
Cal(I2),
Cal(GH2),
Cap(B3)
Cap(I2)
13
14
Cal(J0),
Cap(J1)
Cap(J1)
Cap(GH2)
Cal(I3),
Cal(I4),
Cap(B3)
Cap(I3)
Cal(J1),
Cap(J2)
Cap(J2)
15
Cap(IJ0)
Cap(IJ0)
16
Cal(K1)
Cal(IJ1)
17
Cap(GH2)
Cal(IJ1)
Cal(IJ1)
Cal(K2)
Cal(IJ2)
18
Table 4.1: Multiplier Scheduling for Processors 0-3


Step
LP4
0
Cap(A2)
Cap(A3)
1
Cap(B0)
Cap(B0)
2
Cal(D2)
Cal(D3)
3
Cal(E2),
LP5
Cap(E2)
Cap(B1)
4
Cal(F2),
LP6
Cal(E3),
LP7
Cap(E3)
Cap(B1)
Cap(F1)
Cap(F1)
Cal(F3),
Cap(F2)
Cap(F2)
5
Cap(EF0)
Cap(EF0)
6
Cal(G1),
Cal(EF1),
Cap(B2)
Cap(G1)
7
Cap(EF1)
Cap(EF1)
Cal(G2),
Cal(EF2),
Cap(B2)
Cap(G2)
Cal(H3),
Cap(H1)
8
9
Cal(H2),
Cap(H0)
Cap(H0)
10
Cal(I0),
Cap(H1)
Cap(GH0)
Cap(GH0)
Cap(GH0)
Cal(I1),
Cal(GH1),
Cap(B3)
Cap(I1)
Cap(B3)
11
12
13
Cap(I4)
Cap(I4)
14
Cal(J2),
Cap(J3)
Cap(J3)
Cal(J3),
Cap(J0)
Cap(J0)
15
Cal(K0)
Cal(IJ0)
16
17
Cap(IJ2)
Cap(IJ2)
18
Cal(K3)
Cal(K4)
Table 4.2: Multiplier Scheduling for Processors 4-7

After scheduling, the control words for each processor were generated and
downloaded onto the processors. Several values were tested and the multiplier gave the
correct results, confirming the validity of our design.
We should note that the emulation of the multiplier was simulated and not downloaded
onto the FPGA. The reason is that we lack the connection circuitry for
the FPGA pins, and building such circuitry would be very time-consuming. A more
important reason is that we do not have a Data Capture Unit to read the outputs of
the processors, and therefore we could not verify the operation in hardware.


C hapter 5

Conclusion and Future Work

The first section of this chapter summarizes the contributions made by this research.
In Section 5.2 we present a brief comparison between our design and previous
processor-based emulator designs. We conclude in Section 5.3 with some remarks on possible
future work.

5.1

Research Contributions

The main contribution made by this research is the design and implementation of a
low-cost processor-based logic emulation system. To reduce cost, the design was
implemented using FPGA technology. Before implementation, architecture exploration
experiments were conducted in order to choose suitable values for key architecture
parameters. The proposed emulator can verify the functionality of logic designs at
relatively high speeds and in real operating environments.
To increase logic capacity, a fully connected multi-FPGA system can be used.
Each FPGA is programmed to act as an emulation chip. The full (multi-FPGA)
design of the emulator was not implemented in this research; only one emulation
chip was implemented, using a single FPGA. Each such emulation chip is capable
of emulating around 98 thousand ASIC gates and 12 thousand memory bits. It can
run at a design-cycle rate of almost 187 kHz or more, depending on the
number of instruction cycles needed in one design cycle.
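
The dependence of emulation speed on schedule length can be sketched as follows,
assuming the simple model in which one design cycle takes a fixed number of
instruction cycles and nothing else. The function names and the 100 MHz instruction
clock are hypothetical placeholders, not values taken from this work; the per-chip
capacities are the approximate figures quoted above, and the 19-step case corresponds
to a schedule that runs from step 0 to step 18 as in Tables 4.1 and 4.2.

```python
def design_cycle_rate_hz(instruction_clock_hz: float,
                         steps_per_design_cycle: int) -> float:
    """Design cycles per second when each design cycle takes a fixed number of steps."""
    if steps_per_design_cycle <= 0:
        raise ValueError("a design cycle needs at least one instruction step")
    return instruction_clock_hz / steps_per_design_cycle


def total_capacity(num_chips: int,
                   gates_per_chip: int = 98_000,
                   memory_bits_per_chip: int = 12_000) -> tuple[int, int]:
    """Aggregate (ASIC gates, memory bits) for identical emulation chips."""
    return num_chips * gates_per_chip, num_chips * memory_bits_per_chip


if __name__ == "__main__":
    HYPOTHETICAL_CLOCK_HZ = 100e6  # assumption, not a measured value from this work
    for steps in (19, 128, 512):
        rate = design_cycle_rate_hz(HYPOTHETICAL_CLOCK_HZ, steps)
        print(f"{steps:4d} steps per design cycle -> {rate / 1e3:.1f} kHz")
    print("4 chips (gates, memory bits):", total_capacity(4))
```

Under this model a shorter schedule gives a proportionally faster design cycle, which
is why the quoted rate is a lower bound. The capacity figure simply scales with the
number of chips and ignores any resources spent on inter-FPGA communication.
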
A four-bit multiplier was emulated to verify the correctness of the proposed
emulator design. Because we lack the CAD tools for bit stream generation, all the
tasks of the CAD flow were carried out manually. The multiplier was emulated
successfully, which verified the correct operation of the emulator.

5.2

Comparisons with Other Systems

To provide context for the contributions made by this research, we present a
brief comparison with previously proposed processor-based logic emulation systems.
In [15], a processor-based emulator was implemented using custom-made chips.
The building block of the system is a processor which can emulate both logic and
memory functions. The two main differences between this design and ours are in
architecture and implementation. In terms of architecture, the processor in this system
performs both logic and memory emulation. In our design, two different processors
are used: one to emulate logic functions and the other to emulate memory functions.
As for implementation, this design was implemented on custom-made chips, which
makes it very expensive. In contrast, our processor-based emulator was implemented
on FPGAs, which effectively makes it a much lower-cost system. We should
note that there are several other similarities and differences in terms of operation and
hierarchy.
In [10], a processor-based emulator was implemented on FPGAs. The emulator
contains several kinds of processors. The main differences between this design and
ours are in the design architecture and hierarchy. This design contains different kinds
of processors in addition to logic and memory processors; our design contains only
those two kinds. Another difference in architecture is in the interconnection network.
The designer in this case chose to use buffers in the interconnections between the
processors to give more flexibility to the CAD tools in terms of routing. We judged
that using such buffers would consume the limited resources of the FPGA and decided
to leave a tighter constraint on the CAD tools.
In comparison with commercially available emulators like the Incisive Palladium II
[18], our emulator reached almost one eighth of Palladium's top speed. Although speed
might be a drawback, our design has a lower cost because of its FPGA implementation.

5.3

Future Work

In our research we focused on the hardware architecture of a processor-based logic
emulation system. The other main part is the set of mapping CAD tools that are required
for a real-world emulation system. The next step would be to design and develop
the mapping CAD tools for this system. These tools would compile the
logic design of the DUT and generate the bit stream, which could be downloaded to
the programmable hardware.
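
To give a flavour of one task such mapping tools would have to automate, the sketch
below assigns the gates of a small netlist to instruction steps so that no step uses
more processors than are available and no gate is evaluated before its fan-in gates.
It is a generic list-scheduling sketch, not the procedure used to produce Tables 4.1
and 4.2 by hand: the function name `schedule`, the netlist format, and the gate names
are hypothetical, and interconnect limits and memory processors are ignored.

```python
from collections import defaultdict


def schedule(gates: dict[str, list[str]], num_processors: int) -> dict[int, list[str]]:
    """Map each step to the gates evaluated in it.

    `gates` maps a gate name to the names that drive it. Names that are not
    keys of `gates` are treated as primary inputs, available from step 0.
    """
    if num_processors <= 0:
        raise ValueError("at least one processor is required")
    done_step: dict[str, int] = {}
    remaining = dict(gates)
    steps: dict[int, list[str]] = defaultdict(list)
    step = 0
    while remaining:
        # A gate is ready once every gate feeding it finished in an earlier step.
        ready = [g for g, fanin in remaining.items()
                 if all(src not in gates or done_step.get(src, step) < step
                        for src in fanin)]
        if not ready:
            raise ValueError("netlist contains a combinational loop")
        for gate in sorted(ready)[:num_processors]:
            steps[step].append(gate)
            done_step[gate] = step
            del remaining[gate]
        step += 1
    return dict(steps)


if __name__ == "__main__":
    # Hypothetical fragment: three partial products feeding a sum and a carry.
    netlist = {
        "p0": ["a0", "b0"],
        "p1": ["a1", "b0"],
        "p2": ["a0", "b1"],
        "s1": ["p1", "p2"],
        "c1": ["p1", "p2"],
    }
    for step, names in schedule(netlist, num_processors=2).items():
        print(step, names)
```

A real mapping flow would also have to partition logic across chips, route values
between processors, and emit the control words, so this covers only the ordering
constraint.
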
Another direction for future work on the hardware side of the project might involve designing
a data capture unit, which would help the designer find errors and automate the
checking process.

References
[1] Altera Corporation. High Speed Board Designs, November 2001.
[2] Altera Corporation. Introduction to Quartus II Version 5.0, April 2005.
[3] Altera Corporation. Stratix Device Handbook, July 2005.
[4] Altera Corporation, http://www.altera.com/, Accessed August 2007.
[5] Altera Corporation. Stratix II Device Handbook, May 2007.
[6] Altera Corporation. Stratix III Device Handbook, May 2007.
[7] J. Babb, T. Russell, M. Dahl, S. Z. Hanono, D. M. Hoki, and A. Agrawal. Logic
emulation with virtual wires. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 16(6):609-626, June 1997.
[8] K. Banovic, M. A. S. Khalid, and E. Abdel-Raheem. FPGA-based rapid
prototyping of digital signal processing systems. In Proc. of the 48th Mid-West
Symposium on Circuits and Systems, pages 647-650, August 2005.
[9] M. Butts. Future directions of dynamically reprogrammable systems. In Proc.
of IEEE Custom Integrated Circuits Conference, pages 487-494, 1995.
[10] M. R. Butts. Logic multiprocessor for FPGA implementation. U.S. Patent
Application 2004/0123258 A1, June 2004.
[11] Cadence Design Systems Incorporated, http://www.cadence.com/, Accessed
August 2007.
[12] E. M. Clarke and R. P. Kurshan. Computer-aided verification. IEEE Spectrum,
33(6):61-67, June 1996.
[13] K. Compton and S. Hauck. Reconfigurable computing: A survey of systems and
software. ACM Computing Surveys, 34(2):171-210, June 2002.
[14] C. Edwards. Tracking down the chip killers. IEE Review, 50(12):44-46, December
2004.
[15] Beausoleil et al. Multiprocessor for hardware emulation. U.S. Patent 5551013,
August 1996.
[16] H. Goldstein. Checking the play in plug-and-play. IEEE Spectrum, 39(6):50-55,
June 2002.
[17] S. Hauck. The roles of FPGA's in reprogrammable systems. Proceedings of the
IEEE, 86(4):615-638, April 1998.
[18] Incisive Palladium II. http://www.cadence.com/products/functional_ver/aceLemul/pseries.aspx,
Accessed August 2007.
[19] P. H. Kelly, K. J. Page, and P. M. Chau. Rapid prototyping of ASIC based
system. In Proc. of the 31st ACM/IEEE Design Automation Conference, pages
460-465, June 1994.
[20] C. Kern and M. R. Greenstreet. Formal verification in hardware design: A survey.
ACM Transactions on Design Automation of Electronic Systems, 4(2):123-193,
April 1999.
[21] M. A. S. Khalid and J. Rose. A novel and efficient routing architecture for
multi-FPGA systems. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 8(1):30-39, February 2000.
[22] D. MacMillen, M. Butts, R. Camposano, D. Hill, and T. W. Williams. An
industrial view of electronic design automation. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 19(12):1428-1448, December
2000.
[23] K. L. McMillan. Fitting formal methods into the design cycle. In Proc. of the
31st ACM/IEEE Design Automation Conference, pages 314-319, June 1994.
[24] Altera Megafunctions, http://www.altera.com/products/ip/altera/mega.html,
Accessed August 2007.
[25] Mentor Graphics Corporation, http://www.mentor.com/, Accessed August 2007.
[26] G. E. Moore. Cramming more components onto integrated circuits. Proceedings
of the IEEE, 86(1):82-85, January 1998.
[27] R. Murgai and M. Fujita. Some recent advances in software and hardware logic
simulation. In Proc. of the 10th IEEE International Conference on VLSI Design,
pages 232-238, January 1997.
[28] Institute of Electrical and Electronics Engineers. IEEE standard VHDL language
reference manual, ANSI/IEEE Std 1076-1993, 1993.
[29] C. Pixley, A. Chittor, F. Meyer, S. McMaster, and D. Benua. Functional
verification 2003: Technology, tools and methodology. In Proc. of the 5th IEEE
International Conference on ASIC, pages 1-5, October 2003.
[30] J. Rose, A. El Gamal, and A. Sangiovanni-Vincentelli. Architecture of
field-programmable gate arrays. Proceedings of the IEEE, 81(7):1013-1029, July 1993.
[31] L. Soule and T. Blank. Parallel logic simulation on general purpose machines.
In Proc. of the 25th ACM/IEEE Design Automation Conference, pages 166-171,
June 1988.
[32] J. Varghese, M. Butts, and J. Batcheller. An efficient logic emulation system.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1(2):171-174,
June 1993.
[33] VStationPRO. http://www.mentor.com/products/fv/emulation/vstation_pro/,
Accessed August 2007.
[34] Xilinx Incorporated, http://www.xilinx.com/, Accessed August 2007.
[35] Xilinx Incorporated. Virtex-5 Family Overview - LX, LXT, and SXT Platforms,
May 2007.

VITA AUCTORIS

Marwan Kanaan was born in Haret Hreik, Lebanon, in 1983. He received his B.E.
in computer and communications engineering in 2005 from the American University of
Beirut in Beirut, Lebanon. He is currently a candidate in the electrical and computer
engineering M.A.Sc. program at the University of Windsor. His research interests
include logic emulation systems, field-programmable technologies and digital design.


