A low-cost processor-based logic emulation system using FPGAs by Kanaan, Marwan
University of Windsor 
Scholarship at UWindsor 
Electronic Theses and Dissertations Theses, Dissertations, and Major Papers 
2007 
A low-cost processor-based logic emulation system using FPGAs 
Marwan Kanaan 
University of Windsor 
Follow this and additional works at: https://scholar.uwindsor.ca/etd 
Recommended Citation 
Kanaan, Marwan, "A low-cost processor-based logic emulation system using FPGAs" (2007). Electronic 
Theses and Dissertations. 4615. 
https://scholar.uwindsor.ca/etd/4615 
This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor 
students from 1954 forward. These documents are made available for personal study and research purposes only, 
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution, 
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder 
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would 
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or 
thesis from this database. For additional inquiries, please contact the repository administrator via email 
(scholarship@uwindsor.ca) or by telephone at 519-253-3000 ext. 3208.
A Low-Cost Processor-Based Logic
Emulation System Using FPGAs
by
Marwan Kanaan
A Thesis
Submitted to the Faculty of Graduate Studies
through Electrical and Computer Engineering
in Partial Fulfillment of the Requirements for the
Degree of Master of Applied Science at the
University of Windsor
Windsor, Ontario, Canada
2007







ISBN: 978-0-494-34930-4 




NOTICE:
The author has granted a non-exclusive license allowing Library and Archives Canada
to reproduce, publish, archive, preserve, conserve, communicate to the public by
telecommunication or on the Internet, loan, distribute and sell theses worldwide, for
commercial or non-commercial purposes, in microform, paper, electronic and/or any
other formats.
The author retains copyright ownership and moral rights in this thesis. Neither the
thesis nor substantial extracts from it may be printed or otherwise reproduced without
the author's permission.
In compliance with the Canadian Privacy Act some supporting forms may have been
removed from this thesis.
While these forms may be included in the document page count, their removal does not
represent any loss of content from the thesis.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
© 2007 Marwan Kanaan
All Rights Reserved. No part of this document may be reproduced, stored or otherwise
retained in a retrieval system or transmitted in any form, on any medium by
any means without prior written permission of the author.
Abstract
Logic emulation systems are used to verify the functionality of logic designs targeted
for integrated circuit implementation. In this thesis, the design and implementation
of a low-cost processor-based logic emulation system are presented. The system contains
multiple interconnected processors packaged in one emulation engine. It is
capable of emulating combinational and sequential logic at relatively high speeds of
187 kHz or more, in real operating environments and with predictable compile time.
The implementation was done on an FPGA to reduce cost. The proposed system is
scalable to a multi-FPGA system in which several identical FPGAs are
connected together to increase the logic capacity of the system.
The architecture and operation of the emulator are described first. Architecture
exploration experiments were then conducted to choose suitable values for the
architecture parameters for implementation on the target FPGA. The design was
implemented on an Altera Stratix FPGA. A four-bit multiplier was emulated to verify
correct operation of the proposed emulation system.
To my family for their unending love and support.
Acknowledgments
I thank God Almighty that this thesis has been completed. I stand here humbly at
the end of this accomplishment confident that I would not have been able to do it
without His support and help. I ask Him, once and again, to continue to shed light
on each and every path I take.
I would like to thank my supervisor, Dr. Mohammed Khalid, for his support, 
guidance and determination throughout the course of this work. I am deeply and
forever grateful for all the invaluable efforts he made. I would also like to thank Dr. 
Abdel-Raheem and Dr. Zhang for sitting on my committee and reviewing my thesis 
and Dr. Kar for sitting in as Chair of Defense.
Thanks to my family for all their love, support and advice. To my mom and 
dad, thanks for all the encouragement, prayers, help and patience. I am what I am
today largely because of my parents and for that I am thankful. To my sister and my
grandmother, I am thankful for all the encouragement, prayers and care.
Thanks to all my friends and fellow graduate students at the University of Windsor.
Jay and Ian, I’ll never forget all the times we spent together. It was wonderful
to have you as officemates. My thanks also go to the current and former members
of our research group: Amir, Kevin, Raymond, Omar, Junsong, Aws, Hongmei and
Thuan. I would also like to thank my friends Harb, Ali, Bahador, Payman, Ashkan,
Mahzad, and Amr for their friendship over the past two years.
Funding and technical support for this research were provided by the Natural
Sciences and Engineering Research Council (NSERC) of Canada, the University of
Windsor, the Canadian Microelectronics Corporation (CMC) and Altera Corporation.
Their contributions are gratefully acknowledged.




Contents
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
List of Symbols
1 Introduction
1.1 Thesis Objectives
1.2 Thesis Organization
2 Background and Previous Work
2.1 Design Verification
2.1.1 Formal Verification
2.1.2 Software Simulation
2.1.3 Hardware-Accelerated Simulation
2.1.4 Rapid Prototyping
2.1.5 Logic Emulation
2.2 Logic Emulation Systems
2.2.1 FPGA-Based Emulation Systems
2.2.1.1 Field Programmable Gate Arrays
2.2.1.2 Architecture and CAD for FBEs
2.2.2 Processor-Based Emulation Systems
2.2.2.1 Emulation Processors
2.2.2.2 Architecture and CAD for PBEs
2.2.3 Commercially Available Logic Emulators
3 System Architecture and Operation
3.1 Introduction and Motivation
3.2 Levels of Hierarchy
3.3 Logic Emulation Processor
3.3.1 Control Store
3.3.2 Data Stacks
3.3.3 Logic Element
3.3.4 Architecture and Operation
3.4 Memory Emulation Processor
3.4.1 Control Store
3.4.2 Memory Store
3.4.3 Release Memory Word Unit
3.4.4 Capture Memory Word Unit
3.4.5 Architecture and Operation
3.5 Emulation Module
3.5.1 Module Level Routing Switch
3.5.2 Sequential Filler
3.5.3 Architecture and Operation
3.6 Emulation Chip
3.6.1 Chip Level Routing Switch
3.6.2 Architecture and Operation
3.7 Emulation Engine
3.7.1 Multi-FPGA System
3.7.2 Scalability Issues
4 Architecture Exploration and Implementation Results
4.1 Implementation Target
4.1.1 Altera Stratix FPGA
4.2 Architecture Exploration
4.2.1 Key Parameters
4.2.2 Effect of Changing Parameters
4.2.2.1 Effect of Changing Lookup Table Size
4.2.2.2 Effect of Changing Number of Emulation Steps
4.2.2.3 Effect of Changing Total Number of Outputs
4.2.2.4 Effect of Changing Memory Word Size
4.2.3 Choice of Parameters
4.3 Implementation Results
4.3.1 Logic Processor
4.3.2 Memory Processor
4.3.3 Emulation Module
4.3.4 Emulation Chip
4.4 Implementation Estimates for Emulation Engine
4.5 Emulation Example
4.5.1 Four-Bit Multiplier
4.5.2 Scheduling and Implementation
5 Conclusion and Future Work
5.1 Research Contributions
5.2 Comparisons with Other Systems
5.3 Future Work
References
Vita Auctoris
List of Figures
2.1 Logic Emulation System
2.2 A Generic FPGA Architecture
2.3 Internal Structure of a Logic Element
2.4 Internal Structure of a Lookup Table
2.5 Multi-FPGA System
2.6 CAD Flow for FBEs
2.7 Processor-Based Emulation System
2.8 CAD Flow for PBEs
3.1 Emulation Design Cycle
3.2 System Hierarchy
3.3 Logic Emulation Processor
3.4 Logic Processor Control Word Fields
3.5 Logic Element
3.6 Operation of the Logic Processor
3.7 Memory Emulation Processor
3.8 Memory Processor Control Word Fields
3.9 Operation of the Memory Processor
3.10 Emulation Module
3.11 Emulation Chip
3.12 8-Way Mesh MFS
3.13 Fully Connected MFS
3.14 Clock Duty Cycle
4.1 LUT Size vs. Area in LP
4.2 LUT Size vs. Memory Bits in LP
4.3 LUT Size vs. Speed in LP
4.4 Number of Emulation Steps vs. Area in LP
4.5 Number of Emulation Steps vs. Memory Bits in LP
4.6 Number of Emulation Steps vs. Speed in LP
4.7 Number of Emulation Steps vs. Area in MP
4.8 Number of Emulation Steps vs. Memory Bits in MP
4.9 Number of Emulation Steps vs. Speed in MP
4.10 Number of Total Outputs vs. Area in LP
4.11 Number of Total Outputs vs. Memory Bits in LP
4.12 Number of Total Outputs vs. Speed in LP
4.13 Number of Total Outputs vs. Area in MP
4.14 Number of Total Outputs vs. Memory Bits in MP
4.15 Number of Total Outputs vs. Speed in MP
4.16 Memory Word Size vs. Area in MP
4.17 Memory Word Size vs. Memory Bits in MP
4.18 Memory Word Size vs. Speed in MP
4.19 Module Level Routing Switch
4.20 Operation of the Four-Bit Multiplier
List of Tables
3.1 Logic Processor Control Word Fields Description
3.2 Memory Processor Control Word Fields Description
4.1 Multiplier Scheduling for Processors 0-3
4.2 Multiplier Scheduling for Processors 4-7
List of Abbreviations
Abbreviation Definition
ASIC Application-Specific Integrated Circuit
CAD Computer Aided Design
DUT Design Under Test
EDA Electronic Design Automation
FBE FPGA-Based Logic Emulation System
FF Flip-Flop
FPGA Field Programmable Gate Array
IC Integrated Circuit
I/O Input/Output
LE Logic Element
LP Logic Emulation Processor
LUT Lookup Table
MFS Multi-FPGA System
MP Memory Emulation Processor
MUX Multiplexer
PBE Processor-Based Logic Emulation System
PCB Printed Circuit Board
VHDL Very High Speed Integrated Circuit Hardware Description Language
VLSI Very Large Scale Integration
List of Symbols
Symbol Definition
M  Lookup table size.
N  Total number of emulation steps in one design cycle.
P  Total number of outputs of all processors in one module.
Q Memory word size.
R  The number of logic emulation processors in one emulation module.
S  The number of memory emulation processors in one emulation module.
T  The number of emulation modules in one emulation chip.
Chapter 1
Introduction
In this day and age, electronic devices, ranging from cell phones to personal computers,
play an essential role in our daily lives. Designing such devices and verifying
their functionality can be an excruciating task for engineers if the necessary tools
are missing. These tools, known as Computer-Aided Design (CAD) tools, have long
been a vital part of chip design research, and considerable research effort
continues to be made to ensure that designers have the
most reliable and efficient tools.
One task of these design tools is design verification, the process by which the
functionality of an electronic device is validated. In the past three decades verification
has become one of the most crucial parts of the design cycle. Its importance is due
to the fact that designers must make sure that their design
is correct prior to fabrication. Even a simple error discovered after production is very
expensive to fix, potentially costing the manufacturing company millions of dollars
in losses [12]. In addition, the increase in chip size [26] and the need to reduce
time-to-market require more capable and robust design verification tools to cope with
the growing industry.
Several design verification tools are available on the market today. The most
effective of these tools are logic emulators. A logic emulator is a design verification
tool in which a reprogrammable system imitates the functional behavior of a logic design.
The system can be programmed to act exactly as the desired chip and thus gives
the user the ability to check the logic design in real-time circuit environments and
conditions before manufacturing [9, 17]. The designer can thereby verify the
Design under Test (DUT) by running tests that the real chip would have to pass.
This process can be repeated several times, and in this manner most errors can be
identified and corrected. Logic emulators give the user the ability to catch almost all
functional errors in a logic design; note, however, that timing requirements
and constraints cannot be verified using this tool.
Currently there are two types of logic emulators: FPGA-Based Logic Emulation
Systems (FBEs) and Processor-Based Logic Emulation Systems (PBEs). In FBEs,
several reprogrammable chips known as Field Programmable Gate Arrays (FPGAs)
are connected together to emulate the functional behavior of a logic design. While
FBEs are considered to be low-cost and efficient emulators, they face a major problem
when it comes to their CAD tools. The second type of logic emulation system is the
processor-based emulator, in which multiple emulation processors are packaged together
in an emulation engine capable of emulating a logic design of significant size and
complexity. PBEs are considered to be efficient verification tools and do
not suffer from problematic CAD tools; however, they are implemented on custom-made
chips and are therefore very expensive.
The motivation behind this thesis is to design a logic emulation system that
combines the advantages of FBEs and PBEs: a system that is as efficient as
a PBE and as inexpensive as an FBE. One way to achieve this goal is
to implement the processor-based emulation system on an FPGA, a reprogrammable
chip known for its low cost. In this thesis we explore the architecture of such a system.
The proposed emulation system would run at relatively high speeds and be capable
of emulating designs of significant logic capacity and complexity.
1.1 Thesis Objectives
The main objective of this thesis is to explore the FPGA design and implementation of
a low-cost processor-based emulation system. To achieve this goal, an architecture of
a logic emulation system is explored and implemented. This thesis has the following
objectives:
1. Explore the architecture of a processor-based logic emulation system that can
be implemented on an FPGA.
2. Implement the emulator by targeting a specific FPGA.
3. Ensure that the PBE is scalable. Several FPGAs should be able to connect
together in a multi-FPGA system to increase logic capacity.
4. Verify the system by emulating a logic design.
To satisfy the first objective, an architecture of a processor-based emulator was
explored in terms of cost, functionality, area and speed, and key design parameters
were chosen accordingly. To satisfy the second objective, an Altera Stratix
FPGA was targeted, and the implementation was tuned specifically for this FPGA. To
address the third objective, a scalability study was done and its results are presented.
For the fourth objective, a four-bit multiplier was designed, scheduled and emulated
on the designed system to verify its correctness.
1.2 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 discusses the background
and previous work done on the subject. Chapter 3 presents the architecture and
operation of the proposed system and its various components. Chapter 4 presents
the architectural exploration and implementation results. Lastly, Chapter 5 concludes
with some discussion of possible future work.
Chapter 2 
Background and Previous Work
This chapter presents the background for the research done in this thesis and briefly 
describes related previous work. The first section begins by defining design verification 
and its significance in today’s industry. It then briefly discusses the five major types
of verification tools available on the market today, along with their main advantages 
and disadvantages. The second section of this chapter focuses on logic emulation 
systems. A brief introduction is given with a discussion of the two main types of 
logic emulators. The chapter concludes by presenting some examples of commercially 
available logic emulation systems.
2.1 Design Verification
Design verification is the process whereby a logic design is checked for functional
errors. In this part of the design cycle, the functional behavior of a logic design is
validated. As chips increased in size and complexity, design verification tools became
more complicated and required more effort. Today the design verification process
alone may consume up to 60% of the whole design cycle in terms of time, resources
and manpower, making it the bottleneck for design development [14, 29, 16].
Over the past few decades design verification has evolved from simple mathematical
techniques, carried out by manual calculation to test the validity of
small designs, to multi-million dollar machines capable of verifying a design consisting
of millions of gates.
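There are several types of design verification tools available on the market today,
each with advantages and disadvantages, and in general they can be categorized into
five types:
1. Formal verification
2. Software simulation
3. Hardware-accelerated simulation
4. Rapid prototyping
5. Logic emulation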
In the following sections we introduce each type of these tools and we describe 
their capabilities and weaknesses.
2.1.1 Formal Verification
In formal verification the designers prove the validity of a logic design using formal
methods: all or part of the design is modeled in a mathematical framework, after
which the designer solves the mathematical equations to verify the correctness
of the design [20, 23].
The main advantage of formal verification is that it is highly effective in catching
design errors. Since it relies on a mathematical approach, formal verification is almost
completely guaranteed to find any functional error. The main disadvantage, however,
is that it is very time consuming. Proving the validity of a design using formal
verification requires extended periods of time from expert designers and hence it is
impractical to use in large IC designs [20].
Formal verification is thus probably the most comprehensive of all verification
tools, but due to its heavy cost in terms of time it can only be used on small
circuits or on specific parts of large designs.
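To make the idea of checking a design against a mathematical model concrete, the following is a minimal sketch in Python. It is not the method of any tool discussed in this thesis: it uses exhaustive enumeration, the simplest possible stand-in for a formal proof, and the full adder and all function names are hypothetical examples. Practical formal tools rely on symbolic techniques rather than enumeration, which is one reason the approach is confined to small circuits.

```python
from itertools import product


def spec_full_adder(a, b, cin):
    """Reference behaviour (assumed example): arithmetic sum of three bits."""
    total = a + b + cin
    return total & 1, total >> 1  # (sum bit, carry bit)


def impl_full_adder(a, b, cin):
    """Gate-level implementation under check (hypothetical example)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def equivalent(f, g, n_inputs):
    """Compare f and g on every one of the 2**n_inputs input combinations.

    Returns (True, None) if they always agree, otherwise
    (False, first_counterexample).
    """
    for bits in product((0, 1), repeat=n_inputs):
        if f(*bits) != g(*bits):
            return False, bits
    return True, None


result = equivalent(spec_full_adder, impl_full_adder, 3)
```

Because every input combination is visited, a `True` result covers the whole input space of this small circuit, but the number of combinations doubles with each added input.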
2.1.2 Software Simulation
Software simulation is without doubt the most popular and widely used verification
tool [29]. It is widely available, inexpensive and above all user friendly. In simulation
the Design under Test (DUT) is represented by software models, which the designer
tests for correctness by applying input test vectors to them and then reading the
outputs to check for errors [27].
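The flow described above (a software model of the DUT, input test vectors, and a comparison of the outputs) can be sketched as follows. This is an illustrative Python sketch only; the shift-and-add multiplier model, the function names and the chosen vectors are assumptions made for the example and are not taken from the thesis.

```python
def multiplier_model(a, b):
    """Software model of a 4-bit by 4-bit multiplier (shift-and-add).

    Only the low four bits of each operand are used; the result is 8 bits.
    """
    product = 0
    for i in range(4):
        if (b >> i) & 1:
            product += (a & 0xF) << i
    return product & 0xFF


def simulate(model, vectors):
    """Apply each (a, b, expected) test vector and collect the mismatches."""
    failures = []
    for a, b, expected in vectors:
        got = model(a, b)
        if got != expected:
            failures.append((a, b, expected, got))
    return failures


# Hypothetical test vectors chosen by the user.
test_vectors = [(0, 0, 0), (3, 5, 15), (15, 15, 225), (9, 7, 63)]
failures = simulate(multiplier_model, test_vectors)
```

An empty `failures` list only says that the model passed the vectors that were supplied; behavior not exercised by any vector remains unchecked.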
Simulation has many advantages over other verification tools. It is generally easy
to use; the user’s task is only to choose the input vectors, after which he or she
waits for the outputs. Simulation is also inexpensive since it requires only a software
platform. It provides the user with high visibility and flexible debugging; the user can
observe each signal traversing the design to check for errors. But probably the most
important advantage is the flexibility that simulation provides. Since it is software-based,
changing and modifying part or all of the design is relatively easy to do.
Software simulation also has several disadvantages. The degree of accuracy of the
verification process depends heavily on the user’s choice of input test vectors. The
choice of these vectors should be comprehensive enough to cover all aspects of the design,
or else some functional behavior of the design might be missed and go unchecked
for errors. A second major disadvantage, and perhaps the most important, is that
simulation is relatively slow [27]. Because of the sequential nature of software processing,
simulating a large design, especially in its real-world operating environment,
could literally take days or even weeks.
2.1.3 Hardware-Accelerated Simulation
Hardware-accelerated simulation shares the same basic principles with software simulation.
The motivation behind this method was simply to overcome the slow speed
of software simulation. A logic design is still modeled in software; however,
this time the simulation is executed on custom-made hardware rather than on a software
platform running on a single processor. The processing power achieved by hardware
accelerates the simulation and gives the advantage of faster simulation speed
[27, 22, 31].
Despite the speedup that this method provides, it still suffers from the same
problem as software simulation: the accuracy of the verification process still
depends on the designer's choice of inputs. In addition, the speedup provided by the
hardware accelerator is limited by the communication media between the host computer
and the hardware accelerator itself [29]. The time needed to generate the input
vectors and to read the output signals is still restricted by the connecting devices.
2.1.4 Rapid Prototyping
As the name suggests, in rapid prototyping a custom-made prototype of the logic
design is built by the designer to verify the functionality of a design [19, 8]. Usually a
custom multi-FPGA system is built for each prototype. In such systems, the FPGAs
are programmed to imitate the functional behavior of the design and permanent
connections are established between them to ensure design connectivity.
The main advantage of prototyping is speed. Since the whole design is implemented
in hardware, rapid prototyping achieves the fastest verification speeds of all
verification tools. In addition to speed, another advantage of rapid prototyping is
that it lets the user test the prototype in its real operating environment. Rather
than using input test vectors to test the system, real inputs are supplied from the
surrounding target system, thus giving the user higher confidence in the validity of
the design.
Nevertheless, rapid prototyping has a major disadvantage when it comes to cost.
Once a prototype is built for a specific design it cannot be modified to implement
another design; in other words, it is a throw-away effort once the user is done with
that one design. The system is therefore not reusable, which makes its cost very high.
2.1.5 Logic Emulation
The newest type of design verification and the most efficient one is logic emulation.
A logic emulator is a reprogrammable system that can be programmed and reprogrammed
to emulate logic designs at relatively high speeds. Once programmed, an emulator
functions exactly as the desired hardware would, without the need for fabrication. In
doing so, an emulator combines the advantages of software and hardware: because it is
reprogrammable it is as flexible as software, and since it utilizes hardware it
achieves very high speeds. However, it is important to note that although an emulator
is programmable it is still quite different from software simulation. The hardware
here is not being modeled in software; it is actually implemented on reprogrammable
hardware.
Compared to other verification tools, logic emulation has many advantages. It is
as flexible as software simulation yet much faster, and although it is not as fast as
rapid prototyping, it is not as costly because it is reprogrammable. But the main
advantage of logic emulation is in-circuit emulation: the capability to function like
an actual IC chip in real-world operating environments. After being programmed, an
emulator can be connected to a target system and tested, giving the user the
opportunity to verify the operation before fabrication. This removes the need to
generate input test vectors and, like rapid prototyping, gives the user higher
confidence in his or her design by relying on real inputs supplied by the target system.
Logic emulation still has some disadvantages, mainly its cost. Logic emulation 
systems are still very expensive and can only be afforded by big companies. Designing 
and manufacturing such a system is still a costly process.
Since it is the main focus of this research, in the following sections of this chapter 
we describe the main types of logic emulators and we discuss their advantages and 
disadvantages in detail.
2.2 Logic Emulation Systems
A typical logic emulation system, shown in Figure 2.1, contains three main elements:
1. Emulation engine (or emulator for short)
2. Emulation support facilities
3. Interface circuitry
An emulator is basically a reprogrammable hardware system that can implement
any logic design. This reprogrammable system could be a set of FPGAs or emulation
processors connected together. Some details about the architecture of the emulator
are discussed in later sections of this chapter.
Emulation support facilities include a host computer along with an emulation
compiler [15]. The task of the host computer is to act as an interface between the
user and the emulator. The compiler is responsible for converting the DUT supplied
by the user into a bit stream to be downloaded to the programmable hardware. The
emulation support facilities might also include other components, such as a Data
Capture Unit used to read the outputs from the emulator and relay them to the user.
Figure 2.1: Logic Emulation System
The interface circuitry of the logic emulation system is used to connect the
emulator to a target system to perform in-circuit emulation.
To use the system, the user supplies the compiler with a logic design (e.g. written
in a hardware description language). The compiler compiles the design and generates
a bit stream which can then be downloaded onto the emulation engine. At this point
the emulator works exactly as the desired chip would. Using the interface circuitry,
the user can connect the emulator to a target system and test the design. This is the
main advantage of emulation: the ability to test a design in its typical operating
environment with real inputs.
To illustrate this, consider an example where the designers are verifying the
functionality of a video card for a personal computer. In this case the DUT is the
logic design for the video card and the target system is the personal computer. To
perform in-circuit emulation, the designers would program the emulator with the design
of the video card and connect it to the personal computer using the interface
circuitry. The personal computer would then be powered up with the emulator acting as
its video card. In this way the emulated design could be checked thoroughly for errors.
Logic emulation systems are currently considered to be the most effective and
fastest method for design verification. They are used by most top semiconductor
vendors to test IC designs before fabrication. The price of such systems varies from
tens of thousands to millions of dollars depending on the type of the system, its
capacity and its speed. Currently there are two main types of logic emulation systems
available on the market:
1. FPGA-Based Emulators (FBEs)
2. Processor-Based Emulators (PBEs)
In what follows we present each of those types, their architectures, design tools 
and operation.
2.2.1 FPGA-Based Emulation Systems
The basic building block of an FBE system is an FPGA. In this emulator, several
FPGAs are connected together to emulate (imitate) the functional behavior of a logic
design. Before discussing the architecture of this system we first introduce FPGAs in
detail.
2.2.1.1 Field Programmable Gate Arrays
A field programmable gate array is a reprogrammable chip that was first introduced
in the 1980s [17]. By means of the reprogrammable logic embedded inside it, an FPGA
can implement virtually any logic design. The main advantages of FPGAs can be
summed up in two points. The first advantage is that they are inexpensive; the
price of a single FPGA starts from a few dollars. In addition, the reprogrammable
capability of the FPGA makes it reusable for many designs, which lowers its cost even
more. The second main advantage of FPGAs is that they have a fast time-to-market.
Unlike custom-made chips, where every single design has to be handled individually,
FPGAs are available off the shelf because they are not custom made.
These advantages have given FPGAs great importance in the industry. More and more
designs are being implemented on FPGAs to save money and time. Companies can use
FPGAs for their designs instead of Application-Specific Integrated Circuit (ASIC)
chips to avoid the lengthy and costly process of designing and building custom-made
chips. However, these gains do not come without a price; FPGAs are still bigger and
slower than their ASIC counterparts. Because of their programmable nature, and since
they are built to suit any design rather than a specific one, FPGAs still suffer from
lower logic utilization (i.e. bigger area) and slower speed. FPGA manufacturers
are addressing this problem now more than ever, and with the emergence of modern,
more sophisticated FPGAs these problems are becoming less important and the
advantages of FPGAs are outweighing their disadvantages.
The programmability of an FPGA is derived from the use of programmable logic
elements able to imitate the functional behavior of any logic function. Several
architectures for FPGAs have been proposed; however, they all share some basic
components. Figure 2.2 is a simplified illustration of a typical FPGA
architecture [30].
Figure 2.2: A Generic FPGA Architecture
(L = Logic Block, C = Connection Block, S = Switching Block, I/O = Input/Output Pad)
FPGAs are made up of several major components. The two most important
components are logic elements and routing resources. A logic element in the FPGA
is responsible for emulating the behavior of a logical function. In other words, a
logic element can imitate the function of any logic gate. A typical logic element,
shown in Figure 2.3, contains three main elements: a lookup table, a flip-flop and a
2-to-1 multiplexer. It is the task of the lookup table to operate as a logic gate. A
typical lookup table is shown in Figure 2.4. The lookup table shown in the figure has
four inputs. It contains a memory array, and each array element is connected to an
input of a 16-to-1 multiplexer. The selection bits for this multiplexer are the inputs
of the lookup table, i.e. the presumed inputs of the logic gate. To program the table,
the compiler sets the bits of the memory array. Based on the selection bits (inputs) of
the multiplexer, one of those array elements is chosen. The lookup table shown in the
figure is an example of a four-input AND gate; only when all the inputs are 1's is the
last element of the array chosen and the output is 1.
Figure 2.4: Internal Structure of a Lookup Table
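The behavior of the four-input lookup table described above can be sketched in a few lines of Python. This is only an illustrative model, not part of the proposed hardware: the class name `LookupTable` and the bit ordering (first input taken as the most significant selection bit) are assumptions made for the example.

```python
class LookupTable:
    """Sketch of an M-input lookup table: a 2**M-bit memory array read through a multiplexer."""

    def __init__(self, num_inputs, memory_bits):
        if len(memory_bits) != 2 ** num_inputs:
            raise ValueError("memory array must hold exactly 2**num_inputs bits")
        self.num_inputs = num_inputs
        self.memory = list(memory_bits)  # programmed once by the compiler

    def evaluate(self, inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        # The inputs act as the multiplexer's selection bits.
        # Assumption: the first input is the most significant selection bit.
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.memory[index]


# Four-input AND gate: only the last of the 16 array elements is 1.
and4 = LookupTable(4, [0] * 15 + [1])
print(and4.evaluate([1, 1, 1, 1]))  # all inputs 1 select the last element
print(and4.evaluate([1, 0, 1, 1]))  # any 0 input selects an element holding 0
```

Changing the contents of the memory array is all that is needed to turn the same structure into a different gate, which is the property the logic element relies on.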
The lookup table only handles the combinational part of the logic element. To
accommodate sequential logic, the logic element contains a flip-flop whose input
is the output of the lookup table. The output of this flip-flop is fed into a 2-to-1
multiplexer whose selection bit is configured by the compiler. This selection bit
decides the output of the logic element.
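A minimal Python sketch of such a logic element is given below. The text only states that the flip-flop output feeds the 2-to-1 multiplexer; the sketch assumes that the other multiplexer input is the direct (combinational) output of the lookup table, which is a common arrangement but is an assumption here. The names `LogicElement`, `use_register` and `clock` are hypothetical.

```python
class LogicElement:
    """Sketch of a logic element: lookup table, flip-flop and 2-to-1 multiplexer."""

    def __init__(self, lut_bits, use_register):
        self.lut_bits = list(lut_bits)    # 2**M memory bits set by the compiler
        self.use_register = use_register  # multiplexer selection bit, set by the compiler
        self.flip_flop = 0                # stored state, initially 0

    def _lut(self, inputs):
        # The inputs select one element of the memory array.
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.lut_bits[index]

    def clock(self, inputs):
        # On a clock edge the flip-flop captures the lookup table output.
        self.flip_flop = self._lut(inputs)

    def output(self, inputs):
        # Assumption: the multiplexer chooses between the registered value
        # and the combinational lookup table output.
        return self.flip_flop if self.use_register else self._lut(inputs)


XOR2 = [0, 1, 1, 0]
combinational = LogicElement(XOR2, use_register=False)
registered = LogicElement(XOR2, use_register=True)
registered.clock([1, 0])  # the flip-flop now holds XOR(1, 0)
```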
Besides the logic elements, FPGAs contain routing resources to connect these
elements together. The routing resources are basically made up of connection blocks,
switching blocks and a set of wires that run vertically and horizontally across the
FPGA. Connection blocks situated between the logic elements can be programmed
to connect the outputs of these logic elements to any vertical or horizontal wire.
Switching blocks situated between the connection blocks can in turn be programmed
to connect the wires together [13].
By programming the logic elements, connection blocks and switching blocks, a user
can implement any logic design on the FPGA. Nevertheless, mapping a logic design
onto an FPGA is not an easy task and is the major challenge in FPGA research.
In addition to the logic elements and routing resources, FPGAs contain two other
important components: embedded memory blocks and input/output pads. Typically
the FPGA has several memory blocks of different sizes to store data; these are used
to implement memory arrays or registers in a logic design. I/O pads, on the other
hand, are used to connect the FPGA to the outside world. Both memory blocks and I/O
pads are connected to other elements of the FPGA via the routing resources mentioned
above.
Currently there are two major FPGA vendors: Altera Corporation and Xilinx
Incorporated [4, 34]. The latest FPGAs produced by these companies contain hundreds
of thousands of logic elements capable of emulating the equivalent of millions of
ASIC logic gates [6, 35]. The role of FPGAs in the industry is growing and significant
research is being carried out to enhance their performance and capabilities.









Figure 2.5: Multi-FPGA System
2.2.1.2 Architecture and CAD for FBEs
In an FPGA-based emulation system, several FPGAs are connected together to be able
to emulate a design of significant size. Several architectures have been proposed to
create the Multi-FPGA System (MFS) [21, 32, 7]. A typical architecture is shown in
Figure 2.5. Here eight FPGAs are connected to each other by means of a programmable
interconnection network. Such a system is highly flexible since all inter-FPGA
connections are programmable.
The CAD flow for a multi-FPGA system, shown in Figure 2.6, is as follows. The
user supplies the compiler with a logic design. After performing logic synthesis and
technology mapping, the compiler partitions the design into several parts such that
each part can fit on one FPGA. Each part of the design is then assigned to a specific
FPGA inside the MFS and the compiler routes the signals or connections between all
the FPGAs; this is known as inter-FPGA routing. After inter-FPGA routing is done, the
compiler places and routes each part of the design in its specific FPGA; this is
known as intra-FPGA placement and routing. The final step is to generate a bit stream
of the design and download it to the FPGAs [13].
Figure 2.6: CAD Flow for FBEs
FBEs are an efficient verification tool; they can emulate any design and can
accommodate any logic capacity by simply increasing the number of FPGAs. FBEs also
have a relatively high emulation speed because they exploit parallelism in hardware.
Another major advantage is that they have a relatively low price, starting from only
several thousand dollars.
Nonetheless, FBEs still face a major problem: CAD tools. Mapping a logic design
to a multi-FPGA system by programming the FPGAs and the inter-FPGA routing
resources is very problematic. Partitioning a design, placing it on FPGAs and then
routing the signals has been one of the main focuses of FPGA research. Although
many algorithms have been proposed and architectures suggested, CAD tools for
FPGAs are still very complex. Compiling a design for a multi-FPGA system has an
unpredictable compile time and may never even succeed. In addition, FBEs have very
limited visibility and debugging support, which makes it very difficult for the designer
to catch errors. Also, if an error is discovered and fixed, the change might trigger
a chain reaction in the whole system and the design would need to be compiled and
downloaded again.
2.2.2 Processor-Based Emulation Systems
The second major type of logic emulation system is the processor-based emulator. The
basic building block of a PBE is what is known as an emulation processor, which can
emulate a large number of logic gates and memory functions. Several of these
emulation processors are connected together and run in parallel to emulate the
functional behavior of a logic design [15]. Before we discuss the architecture of this
system in detail, it is useful to briefly describe the emulation processors and their
operation.
2.2.2.1 Emulation Processors
Similar to a logic element in an FPGA, an emulation processor can perform the
logical operation for any given function. Although it is made from custom hardware,
it is still programmable; this is achieved because embedded inside the processor is
a reconfigurable lookup table. The structure of this lookup table is exactly the same
as that of the one inside the FPGA. The main difference between this processor and
the logic element of the FPGA is that a logic element of the FPGA is programmed
only once before emulation starts and therefore can only implement one logic function
during the whole emulation cycle. In contrast, the processor can reprogram its lookup
table during emulation to emulate different logical functions. The array elements for
the lookup table are stored inside the processor and then loaded into its lookup
table during emulation to change the logic function at any time. The ability to change
its operation type during emulation is what gives the processor its advantage over the
logic element of the FPGA. More on the processor's architecture and operation is
described in later chapters.
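The difference from an FPGA logic element can be illustrated with a small Python sketch in which the lookup table contents are reloaded from an internal store at every step. The class name `EmulationProcessorSketch` and the list-based store are hypothetical simplifications, not the processor architecture presented in later chapters.

```python
class EmulationProcessorSketch:
    """Sketch: one lookup table whose contents are reloaded during emulation."""

    def __init__(self, stored_luts):
        # One set of lookup table array elements per step, stored inside the processor.
        self.stored_luts = [list(bits) for bits in stored_luts]

    def run_step(self, step, inputs):
        lut = self.stored_luts[step]  # reload the lookup table for this step
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return lut[index]


AND2 = [0, 0, 0, 1]
OR2 = [0, 1, 1, 1]
processor = EmulationProcessorSketch([AND2, OR2])
print(processor.run_step(0, [1, 0]))  # step 0 behaves as an AND gate
print(processor.run_step(1, [1, 0]))  # step 1 behaves as an OR gate
```

An FPGA logic element would correspond to a store holding a single entry that never changes during emulation.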
It is worth noting that inside a PBE there might be several kinds of emulation
processors. A PBE could have all homogeneous processors, in which case each
processor would have to be able to perform any function of the logic design.
Alternatively, a PBE could have heterogeneous processors, in which case specific
processors would perform specific tasks (e.g. several processors would perform logic
operations while others would perform memory functions) [15, 10].
2.2.2.2 Architecture and CAD for PBEs
A typical architecture of a processor-based emulator is shown in Figure 2.7. The
emulation processors are connected together via a programmable interconnection
network to ensure that a signal can travel from one processor to another. It should be
noted that unlike FBEs, where the interconnections are fixed during emulation, this
interconnection can be reprogrammed during emulation. The reader should keep in
mind that reprogramming the processors during emulation is quite different from
programming them prior to emulation. The latter is done by the emulation support
facilities, while the former is done by the emulator itself with no connection to the
facilities.
Figure 2.7: Processor-Based Emulation System
The CAD flow for PBEs, shown in Figure 2.8, is as follows. The user supplies the
compiler with the logic design. The compiler performs logic synthesis and technology
mapping, then partitions the design into several parts such that each part can fit
in one emulation processor. After each of those parts is assigned to a specific
processor, a process called scheduling starts. During scheduling, the different
logic functions assigned to each processor are allotted different time slots
throughout the emulation period. For example, an emulation processor would perform a
logical AND during one time slot and a logical OR during another. After scheduling is
done, a bit stream is generated and downloaded onto the processors before emulation
starts [15].
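To make the idea of scheduling concrete, the following Python sketch assigns the logic functions of one processor to consecutive time slots so that every function runs after the functions it depends on. It uses a plain topological ordering from the standard library; this is a stand-in chosen for illustration and is not claimed to be the scheduling algorithm used by any emulation compiler. The function name and the gate names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule_on_one_processor(dependencies, num_slots):
    """Assign each logic function a time slot after all functions it depends on.

    dependencies maps a function name to the names of the functions whose
    outputs it reads. One function is executed per time slot.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    if len(order) > num_slots:
        raise ValueError("more logic functions than available time slots")
    return {name: slot for slot, name in enumerate(order)}


# Hypothetical cluster: an XOR that reads the results of an AND and an OR.
cluster = {"and_gate": [], "or_gate": [], "xor_gate": ["and_gate", "or_gate"]}
slots = schedule_on_one_processor(cluster, num_slots=128)
print(slots)
```

A real compiler would also have to schedule the transfer of values between processors, which this sketch ignores.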
PBEs have several advantages. They have very efficient and fast CAD tools compared
to FBEs. In addition, they have much better visibility and debugging support. Finding
an error and fixing it in a PBE is a standard procedure and usually does not trigger a
chain reaction in the whole emulator. When an error is found, the designer only has
to fix the affected processor and not the whole design, unlike with FBEs. CAD tools
for PBEs are much less complicated than those for FBEs and have a predictable
compile time.
Figure 2.8: CAD Flow for PBEs
PBEs also have some disadvantages. They are comparatively slower than FBEs.
Because processors in PBEs have to reprogram themselves periodically, the emulation
speed is reduced. Nonetheless, current PBEs are becoming faster and faster and are
able to compete with their FBE counterparts. The major disadvantage of PBEs is their
price, which is due to the fact that the whole system is built on custom hardware.
Their prices are currently on the order of millions of dollars.
2.2.3 Commercially Available Logic Emulators
To give the reader an idea of the emulation technology on the market today, we
present two examples of logic emulators manufactured by leading Electronic Design
Automation (EDA) companies, Cadence Design Systems and Mentor Graphics [11, 25].
The Incisive Palladium II is a PBE system supplied by Cadence Design Systems
[18]. This machine is capable of simulation acceleration and in-circuit emulation and
can reach speeds of up to 1.5 MHz. This emulator can compile up to 30 million gates
per hour on a single workstation and has a maximum capacity of 256 million gates.
The VStationPRO is an example of an FPGA-based emulation system [33]. This
product is manufactured by Mentor Graphics. It has a scalable capacity from 1.6 to
120 million gates and can reach speeds of up to 1 MHz. This emulator can compile at
a rate of 5 million gates per hour.
Chapter 3
System Architecture and Operation
This chapter presents the architecture and operation of the proposed logic emulation 
system. The first and second sections include an introduction and a general view of 
the emulator. Sections 3 and 4 discuss the two basic components, logic and memory 
emulation processors, in detail. Sections 5, 6 and 7 discuss the emulation module, 
emulation chip and emulation engine respectively.
3.1 Introduction and Motivation
Chapter 2 introduced the two major types of logic emulation systems, FPGA-based
emulators and processor-based emulators, along with the advantages and disadvantages
of each. Keeping that in mind, the motivation behind this work is to design an
emulator that combines the most important advantage of each system: the low cost of
FBEs and the high efficiency of PBEs. To achieve that we have to design a PBE that
can be implemented on an FPGA.
It is important to note that this research only deals with the hardware part of
the proposed system. The main goal is to design an efficient architecture for a PBE.
The CAD tools necessary to operate this emulator are beyond the scope of this work.
Before delving into the details of the system architecture, it is important to
highlight one aspect of a processor-based emulator. The main clock in an emulator,
known as the design clock, is shown at the top of Figure 3.1. It is the frequency of
this clock that determines the speed of a PBE. During each period of the design clock
a counter, known as the emulation step, increments from zero to a specific number
(127 in Figure 3.1). Shown at the bottom of the figure is the emulation clock, whose
clock period corresponds to a single emulation step.
During each emulation step, an emulation processor can perform a different operation
type, which in effect means that a single processor can perform a maximum of
128 different operations given that the number of emulation steps in a single design
cycle is 128.
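Since one design clock period spans all the emulation steps, the design clock frequency is the emulation clock frequency divided by the number of steps. The short Python sketch below works this out; the 64 MHz emulation clock is a hypothetical figure chosen only to make the arithmetic concrete, and the function names are illustrative.

```python
def design_clock_hz(emulation_clock_hz, steps_per_design_cycle):
    """Design clock frequency when one design cycle spans all emulation steps."""
    return emulation_clock_hz / steps_per_design_cycle


def emulation_steps(steps_per_design_cycle):
    """Values taken by the emulation step counter within one design cycle."""
    return range(steps_per_design_cycle)


steps = list(emulation_steps(128))  # 0, 1, ..., 127 as in Figure 3.1
# Hypothetical 64 MHz emulation clock with 128 steps per design cycle.
speed = design_clock_hz(64e6, 128)
print(steps[0], steps[-1], speed)
```

The same relation shows the trade-off involved: more steps per design cycle let each processor emulate more operations, but lower the design clock frequency for a given emulation clock.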
We now discuss the details of the architecture and operation of the logic emulation
system, starting with the basic components. We should note that the architecture
proposed for this design is based on the architectures of [15] and [10] but differs
substantially from them.
Figure 3.1: Emulation Design Cycle
3.2 Levels of Hierarchy
To enhance scalability, the design contains three levels of hierarchy connected together
using different topologies. These levels are:
1. Emulation module
2. Emulation chip
3. Emulation engine
The building blocks of this emulation system are the logic emulation processor 
and the memory emulation processor. A specific number of each type of these two 
processors are connected together by an interconnection network to form an emulation 
module making it the first level of hierarchy.
The second level of hierarchy is the emulation chip, which contains a certain number
of identical emulation modules. All the modules inside one emulation chip are
connected by an interconnection network similar to the one inside the emulation
module itself. Each emulation chip would fit on one FPGA, hence the name chip.
The third level of hierarchy is the emulation engine. To increase logic capacity,
several emulation chips would be implemented in a specially designed multi-FPGA
system. This multi-FPGA system is known as the emulation engine, which is capable
of emulating a design of significant size.
Figure 3.2 gives an overview of the hierarchy of the system. Here, the emulation
engine is made up of 8 emulation chips and each of those chips contains 8 emulation
modules. Inside each of those modules is a number of logic and memory processors.
Note that the interconnections between the chips and between the modules are not
shown.
Figure 3.2: System Hierarchy
3.3 Logic Emulation Processor
The most basic component of the system is the Logic Emulation Processor (LP). The
sole purpose of this processor is to emulate the functional behavior of logic gates.
Each gate is represented as a lookup table that can be programmed to imitate any
desired logic function. The total number of logic gates that a single processor can
emulate depends on its lookup table size and the number of emulation steps executed
in a single design cycle. The proposed logic processor has three main elements:
1. Control store
2. Data stacks
3. Logic element
An architectural overview of this processor is shown in Figure 3.3.
Figure 3.3: Logic Emulation Processor
3.3.1 Control Store
The control store is used to store a unique control program for each processor to
determine the operation type during each emulation step. The control store contains
several instructions of predetermined width that are generated by an emulation
compiler whose task is to partition a logic design given by the user into several
clusters.
ChooseInput  LUT  SelA  RAA  SelB  RAB  SelM  RAM
Figure 3.4: Logic Processor Control Word Fields
Table 3.1: Logic Processor Control Word Fields Description
Control Word Field   Description
ChooseInput          Picks an external input from the interconnection network.
LUT                  The array elements of the lookup table.
SelA                 Selects the source of the first input to the lookup table
                     (internal or external stack).
RAA                  Read Address A: the address for the first input of
                     the lookup table.
SelM                 Selects the source of the Mth input to the lookup table
                     (internal or external stack).
RAM                  Read Address M: the address for the Mth input of
                     the lookup table.
These clusters are formed such that each one can fit into a single emulation processor.
The emulation compiler then converts these clusters into a set of control words. The
control store is filled with these words prior to emulation. During emulation, these
control words are read to instruct the processor on what to do during a specific step
[15].
The number of these instructions (i.e. the depth of the control store) is equal to
the maximum number of emulation steps needed in a single design clock cycle. The
fields of these instructions are shown in Figure 3.4 and described in Table 3.1, where
$M$ is the size of the lookup table (its number of inputs).
The number of bits dedicated to each field of the control word depends on two
factors: the size of the internal and external stacks and the lookup table size. The
size of the internal and external stacks is equal to the maximum number of emulation
steps in a single design cycle, as discussed later.
Since the inputs of the logic element are located in the stacks, the size of the
address of each of these inputs is equal to $\log_2(N)$, where $N$ is the number of
emulation steps. The size of the LUT field depends on the size of the lookup table
inside the logic element. More precisely, the LUT field size is equal to $2^M$, where
$M$ is the lookup table size. The size of the ChooseInput field depends on the
interconnection network, more precisely on the number of processors (both logic and
memory) and the number of their outputs in the interconnection network. The size of
the ChooseInput field is $\log_2(P)$, where $P$ is the total number of outputs of all
the processors sharing one interconnection network. The only field that is independent
of any external factors is the Sel field. The size of this field is only one bit since
it is used to choose between only two types of stacks, the external or the internal
stack.
Therefore, the size of a single control word in bits is Log2(P) + M  + 2M +  M  x 
Log2(N ).
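The expression above can be checked with a short calculation. The following is a minimal sketch, not part of the original design flow; the function names are hypothetical, and it assumes that address widths are rounded up when N or P is not a power of two. The values M = 4, N = 128 and P = 64 are chosen only for illustration.

```python
def address_bits(entries: int) -> int:
    """Bits needed to address `entries` locations, i.e. ceil(log2(entries))."""
    return (entries - 1).bit_length()


def logic_control_word_bits(m: int, n: int, p: int) -> int:
    """Width of one logic processor control word, following the expression above."""
    choose_input = address_bits(p)        # ChooseInput field
    sel = m                               # one 1-bit Sel field per lookup table input
    lut = 2 ** m                          # LUT field: contents of the lookup table
    read_addresses = m * address_bits(n)  # RAA ... RAM
    return choose_input + sel + lut + read_addresses


# Illustrative values only: M = 4, N = 128, P = 64
print(logic_control_word_bits(4, 128, 64))  # 6 + 4 + 16 + 28 = 54
```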
3.3.2 Data Stacks
The data stacks are used to store one-bit values provided as inputs to the logic element.
The proposed design has two stacks: an internal stack and an external stack. Ideally
the two stacks are of the same width (one bit) and the same depth. The depth of the
stacks is typically equal to the maximum number of emulation steps executed in a
single design clock cycle.
The internal stack is used to store values generated internally to the processor,
specifically values from previous operations done during different emulation steps.
The external stack is used to store values generated externally to the processor,
specifically values from other logic or memory emulation processors. Both stacks
have one write port and M read ports, which are provided as inputs to the lookup
table of the logic element.
At each emulation step, the internal stack provides output values on its read
ports using the addresses (RAA...RAM) supplied to it from the control word. These
outputs are used as inputs to the logic element. Also during the same emulation step,
the internal stack stores the value of the current operation (i.e. the output of the
logic element) in the address derived from the step value.
Similarly, during each emulation step, the external stack provides output values
on its read ports using the addresses (RAA...RAM) supplied to it from the control word.
These outputs are used as inputs to the logic element. Also during the same emulation
step, the external stack stores an input value external to the processor in the address
derived from the step value.
Note that both internal and external stacks supply the logic element with the
same number of inputs (M) at the same time. It is the task of the logic element to
choose between these inputs using the select values (SelA...SelM) in the control word.
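The behaviour of the two stacks can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name DataStack and its methods are hypothetical, the stack depth of 8 is arbitrary, and the model ignores timing entirely.

```python
class DataStack:
    """One-bit-wide stack with one write port and M read ports (behavioural model)."""

    def __init__(self, depth: int):
        self.bits = [0] * depth

    def read(self, addresses):
        # One value per read port; the addresses play the role of RAA ... RAM.
        return [self.bits[a] for a in addresses]

    def write(self, step: int, bit: int):
        # The write address is the current step value.
        self.bits[step] = bit & 1


internal = DataStack(depth=8)
external = DataStack(depth=8)

# Steps 0 and 1: each stack stores one bit per step.
internal.write(0, 1)
external.write(0, 0)
internal.write(1, 0)
external.write(1, 1)

# A later step: both stacks are read with the same addresses (M = 2 here).
addresses = [0, 1]
from_internal = internal.read(addresses)
from_external = external.read(addresses)
print(from_internal, from_external)
```

Both stacks return M candidate bits; choosing between them is left to the logic element.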
3.3.3 Logic Element
The logic element is used to calculate the logic output for the processor. The logic
element contains several multiplexers and a lookup table. The number of these
multiplexers is equal to the number of inputs to the lookup table, or lookup table size. The
Sel fields in the control word are used as selectors in these multiplexers to choose the
sources of inputs for the lookup table (either internal or external stack). The logic
element also contains an M-input lookup table implemented as a $2^M \times 1$ memory array
and a $2^M$-to-1 multiplexer. The elements of the array are filled by the LUT field in the
control word. By doing so we are defining the type of logical function to be emulated
during a specific emulation step. Figure 3.5 gives an overview of the logic element.
Inputs shown in black are received from the internal stack, while inputs shown in grey












Figure 3.5: Logic Element
are received from the external stack. Select values and the LUT contents are supplied
from the control word. The lookup table of the logic element is identical to the one
described in chapter 2 and shown in Figure 2.4.
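The logic element can be sketched in a few lines of software. This is an illustration only: the function name is hypothetical, and two conventions are assumed because the text does not fix them, namely that a Sel value of 0 selects the internal stack and 1 the external stack, and that the first input is the least significant bit of the lookup table index.

```python
def logic_element(internal_bits, external_bits, sel, lut):
    """Evaluate an M-input lookup table whose inputs come from two stacks.

    Assumed conventions (not stated in the text): sel[i] == 0 picks the
    internal stack, sel[i] == 1 picks the external stack, and input A is
    the least significant bit of the lookup table index.
    """
    # One 2-to-1 multiplexer per lookup table input.
    inputs = [e if s else i for i, e, s in zip(internal_bits, external_bits, sel)]
    # The selected bits form the index into the 2**M-entry array.
    index = 0
    for position, bit in enumerate(inputs):
        index |= bit << position
    return lut[index]


AND2 = [0, 0, 0, 1]  # LUT field contents for a 2-input AND
out = logic_element([1, 0], [0, 1], sel=[0, 1], lut=AND2)
print(out)  # first input from internal (1), second from external (1)
```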
3.3.4 Architecture and Operation
The basic architecture of the logic processor is shown in Figure 3.3. The control store
is filled with the control words prior to emulation through dedicated wires (not
shown in the figure). The processor has two external inputs and two external outputs.
The first external input is the step value, which is identical for all processors in the
emulation engine, and the second external input is used as an input for the external
stack, where it is stored for subsequent operations. The first external output is
the ChooseInput field of the control word, which is supplied to the interconnection
network to choose an input for the external stack, as will be described later in
more detail. The second external output is the output of the logic operation, which
is supplied directly to the interconnection network to be used by other processors.
Preferably the depths of the control store, internal stack and external stack are
the same and equal to the maximum number of emulation steps in a single design
clock cycle. This ensures an entry for every operation output in the internal
stack, making this output available to any subsequent operation. As for the external
stack, its usage enables the processor to make use of other logic or memory outputs
supplied to it through the interconnection network. Another advantage of having the
same number of entries in all three memories is that the step value is used both
as a read address for the control store and as a write address for the stacks at the
same time.
The operation of the logic processor is as follows. A step value is supplied to the
processor. The step value is used as an address to the control store, where a control
word is read. The address fields, RAA...RAM, are sent to the stacks and M bits are read
from each stack at the same time, then sent to the logic element. Using the Sel fields
of the control word, the logic element selects the sources of its inputs, either the
internal stack or the external stack. The lookup table, which is filled using the LUT
field from the control word, performs the logic emulation and supplies the output. The
output is then written to the internal stack, where the step value is used as a write
address. At the same time, an external input is written to the external stack. This input
is chosen among the outputs of the other processors in the interconnection network
using the ChooseInput field of the control word. Again the step value is used
as a write address.
In other words, the logic processor performs three operations in one emulation
step. It first executes a logical function using its lookup table, then writes the output
of this function to the internal stack. In addition to that, the processor picks an













Figure 3.6: Operation of the Logic Processor
external input from the interconnection network and writes it to the external stack.
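The three operations of one emulation step can be put together in a behavioural sketch. It is a simplification, not the implemented hardware: the class and field names are hypothetical, the Sel encoding (0 for internal, 1 for external) and the input bit ordering are assumptions, and the outputs of the interconnection network are passed in as a plain list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlWord:
    choose_input: int     # ChooseInput: which network output feeds the external stack
    sel: List[int]        # Sel fields (assumed: 0 = internal stack, 1 = external stack)
    lut: List[int]        # LUT field: 2**M lookup table entries
    read_addr: List[int]  # RAA ... RAM


class LogicProcessor:
    def __init__(self, control_store, depth):
        self.control_store = control_store
        self.internal = [0] * depth
        self.external = [0] * depth

    def step(self, n, network_outputs):
        word = self.control_store[n]                         # fetch control word n
        ints = [self.internal[a] for a in word.read_addr]    # read both stacks
        exts = [self.external[a] for a in word.read_addr]
        inputs = [e if s else i for i, e, s in zip(ints, exts, word.sel)]
        index = sum(bit << k for k, bit in enumerate(inputs))
        output = word.lut[index]                             # evaluate the lookup table
        self.internal[n] = output                            # write both stacks at address n
        self.external[n] = network_outputs[word.choose_input]
        return output


program = [
    ControlWord(choose_input=0, sel=[0, 0], lut=[1, 1, 1, 1], read_addr=[0, 0]),  # constant 1
    ControlWord(choose_input=0, sel=[0, 1], lut=[0, 1, 1, 0], read_addr=[0, 0]),  # XOR
]
lp = LogicProcessor(program, depth=2)
first = lp.step(0, network_outputs=[1])
second = lp.step(1, network_outputs=[0])
print(first, second)
```

In the second step the XOR reads the bit stored in the internal stack at step 0 and the bit captured into the external stack at step 0.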
The operation described above requires three memory accesses, which have to occur
in a single emulation step but not simultaneously. These accesses are:
1. Reading a control word from the control store.
2. Reading inputs from the data stacks.
3. Writing values to the data stacks.
These accesses cannot be executed at the same time because a certain
delay has to be allowed between them. To read inputs from the data stacks,
one has to wait for the address fields from the control word, and to write values to
the data stacks, one has to wait for the output of the logical operation to be ready.
To accommodate that, both edges of the clock are used, since each emulation step
corresponds to only one clock period. As shown in Figure 3.6, at the first falling edge
of the clock a control word is read using step value n. At the rising edge of the clock,
after sufficient time has been given to fetch the control word, inputs to the lookup table
are read from the stacks using addresses derived from the control word. At the second
falling edge of the clock, after sufficient time has been given for inputs from the stacks
to be read and the logic function to be executed, the output of this logic function and an
external input are written to the stacks at address n. Also at the second falling edge,
another control word is read from the control store using address n + 1. This
scheme ensures that all the memory accesses required for a logical operation are done
in a single emulation step at different times.
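The overlap between consecutive steps at each falling edge can be made explicit by listing the accesses per clock edge. The following sketch only enumerates the schedule described above; the function name and the textual labels are hypothetical.

```python
def edge_schedule(num_steps):
    """Map each clock edge to the memory accesses that happen on it."""
    schedule = {}
    for n in range(num_steps):
        schedule.setdefault(("fall", n), []).append(f"read control word {n}")
        schedule.setdefault(("rise", n), []).append(f"read stacks for step {n}")
        schedule.setdefault(("fall", n + 1), []).append(f"write stacks at address {n}")
    return schedule


schedule = edge_schedule(2)
for edge, accesses in schedule.items():
    print(edge, accesses)
```

The entry for the second falling edge contains both the write for step 0 and the control word read for step 1, which is the overlap that lets each emulation step fit in one clock period.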
3.4 Memory Emulation Processor
In this section we present the memory processor, the second basic component of 
a complete processor-based emulation system implemented on an FPGA. The sole

















Figure 3.7: Memory Emulation Processor
purpose of the memory processor is to emulate memory registers and their functions.
The total number of memory bits that this processor can emulate depends on the
size of the embedded memory arrays and the number of emulation steps that are




3. Capture memory word unit
4. Release memory word unit
Figure 3.7 gives an architectural overview of the memory processor.
3.4.1 Control Store
The control store is used to store a unique control program for each processor to 
instruct the processor what to do during each emulation step. The control store 
contains several instructions of predetermined width. Similar to the instructions of 
the logic processor, they are generated by an emulation compiler whose task is to
MWA  W/R  CI1  CI2  ...  CIQ
Figure 3.8: Memory Processor Control Word Fields
Table 3.2: Memory Processor Control Word Fields Description
Control Word Field    Description
MWA                   Memory word address in the memory store.
W/R                   Write (1) or read (0) a memory word.
CI1                   Choose the 1st bit of the memory word from the external inputs.
CIQ                   Choose the Qth bit of the memory word from the external inputs.
partition a logic design given by the user into several clusters. These clusters are
formed such that each one can fit in a single emulation processor. The emulation
compiler then converts these clusters into a set of control words. The control store
is filled with these words prior to emulation. During emulation these control
words are read by the processor to choose an operation to be performed in a specific
emulation cycle [15].
The number of these instructions (i.e. the depth of the control store) is equal to
the maximum number of emulation steps done in a single design clock cycle. The
fields of these instructions are shown in Figure 3.8 and described in Table 3.2, where
Q is the size of the memory word.
The size of the control word and the number of bits dedicated to each field depend
on three factors: the word size of the memory store Q, the number of emulation steps
in a single design clock cycle N, and the total number of outputs of all emulation
processors in an interconnection network P.
The word width Q of the memory store is the same as the number of one-bit inputs to
the memory processor. This is because, ideally, we want an entry in the memory store
for each input of the processor. The size of the memory word address MWA depends
on the size of the memory store. We propose a memory store of size equal to the
maximum number of emulation steps done in a single design clock cycle. Therefore
the size of MWA in bits is $\log_2(N)$. The size of each choose input field CI depends on
the total number of outputs of all processors sharing an interconnection network (P).
As a result the size of this field in bits is $\log_2(P)$.
Therefore, the size of a single control word in bits is $\log_2(N) + 1 + Q \log_2(P)$.
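As a quick check of this expression, the width can be computed for sample values. This is a minimal sketch with hypothetical function names; it assumes address widths are rounded up when N or P is not a power of two, and the values Q = 8, N = 128 and P = 64 are illustrative only.

```python
def address_bits(entries: int) -> int:
    """Bits needed to address `entries` locations, i.e. ceil(log2(entries))."""
    return (entries - 1).bit_length()


def memory_control_word_bits(q: int, n: int, p: int) -> int:
    """Width of one memory processor control word, following the expression above."""
    mwa = address_bits(n)             # MWA: memory word address
    write_read = 1                    # W/R flag
    choose = q * address_bits(p)      # CI1 ... CIQ
    return mwa + write_read + choose


# Illustrative values only: Q = 8, N = 128, P = 64
print(memory_control_word_bits(8, 128, 64))  # 7 + 1 + 48 = 56
```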
3.4.2 Memory Store
The role of the memory store is to emulate real memory functions, more precisely read
and write memory operations. It contains a predetermined number of words of
predetermined width. The memory words can come from either of two sources: emulation
support facilities or other emulation processors. Emulation support facilities, such as
an emulation compiler, fill the memory store prior to emulation so that the filled
memory words can be read during emulation. Also, during emulation the outputs of
other processors in the interconnection network may be written to the memory store.
3.4.3 Release Memory Word Unit
The purpose of this component is to break up the memory word read from the memory
store into one-bit values. These bits are then supplied as outputs to the interconnection
network to be used as inputs to other processors.
3.4.4 Capture Memory Word Unit
The purpose of this component is to concatenate several external inputs into a single
memory word. The memory word that is formed after concatenation is entered into
the memory store and is therefore of the same size as a word of the memory store.
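The release and capture units are inverse operations, which can be shown with two short functions. This is a sketch only: the function names are hypothetical, and placing the first bit in the least significant position is an assumed ordering that the text does not specify.

```python
def capture_word(bits):
    """Concatenate Q one-bit inputs into a single memory word (first bit = LSB, assumed)."""
    word = 0
    for position, bit in enumerate(bits):
        word |= (bit & 1) << position
    return word


def release_word(word, q):
    """Break a memory word up into Q one-bit values, in the same bit order."""
    return [(word >> position) & 1 for position in range(q)]


bits_in = [1, 0, 1, 1]
word = capture_word(bits_in)
bits_out = release_word(word, q=4)
print(word, bits_out)
```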
3.4.5 Architecture and Operation
The basic architecture of the memory processor is shown in Figure 3.7. The control
store and the memory store are filled with the control and memory words prior to
emulation through dedicated wires (not shown in the figure). The processor has several
external inputs and outputs. The first external input is the step value, which is
identical for all processors in the emulation engine. The rest of the external inputs
are values chosen from the interconnection network to be written to the memory
store. The external outputs of this processor include the CI fields of the control word,
which are supplied to the interconnection network to choose inputs for the memory
store. The rest of the external outputs are the bits read from the memory store and then
broken up by the release memory word unit.
Preferably the depths of the control store and the memory store are the same and
equal to the maximum number of steps (N) in a single design clock cycle. This
ensures an entry for every memory word read or written in the memory store.
The operation of the memory processor is as follows. A step value is supplied
to the processor. The step value is used as an address to the control store, where a
control word is read. The fields MWA and W/R are sent to the memory store. Using
MWA a memory word is read or written, and using the W/R field we determine whether we
are reading ('0') or writing ('1') during this step. In case of a read, a memory word
is read and supplied to the release memory word unit, where it is broken up into Q
bits and output to the interconnection network. In case of a write, a memory word is
formed by concatenating Q input bits from the interconnection network. This is the








Figure 3.9: Operation of the Memory Processor
task of the capture memory word unit, which then supplies it to the memory store to
be written.
In other words, the memory processor performs either one of two operations during 
a single emulation step: it can read a memory word from a certain address in the 
memory store and then supply it to the interconnection network as a set of single bits 
or it can write a memory word supplied by the interconnection network at a certain 
address in the memory store.
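The read-or-write behaviour of a single step can be sketched as follows. This is a behavioural illustration rather than the implemented design: the class and field names are hypothetical, memory words are kept as lists of bits for readability, and returning no bits on a write is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryControlWord:
    mwa: int           # MWA: memory word address
    write: int         # W/R: 1 = write, 0 = read
    choose: List[int]  # CI1 ... CIQ: network outputs to capture on a write


class MemoryProcessor:
    def __init__(self, control_store, memory_store):
        self.control_store = control_store
        self.memory_store = memory_store  # each word kept as a list of Q bits

    def step(self, n, network_outputs):
        word = self.control_store[n]      # fetch the control word for step n
        if word.write:
            # Capture unit: gather Q bits from the network into one word.
            self.memory_store[word.mwa] = [network_outputs[c] for c in word.choose]
            return []                     # assumption: nothing is released on a write
        # Release unit: hand the stored word back as Q single bits.
        return list(self.memory_store[word.mwa])


program = [
    MemoryControlWord(mwa=1, write=1, choose=[2, 0]),
    MemoryControlWord(mwa=1, write=0, choose=[0, 0]),
]
mp = MemoryProcessor(program, memory_store=[[0, 0], [0, 0]])
mp.step(0, network_outputs=[1, 0, 0])             # captures outputs 2 and 0
released = mp.step(1, network_outputs=[0, 0, 0])  # reads the word back
print(released)
```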
The stages of the operation
described above have to occur during the same emulation step but not simultaneously.
A certain delay should be allowed between reading a control word from the control
store and reading a memory word from the memory store. Another delay should
be allowed between reading a memory word and writing a memory word to give time
for the bits to be read before they are written. To accommodate that, both edges of
the clock are used, as in the logic processor. As shown in Figure 3.9, at the first
falling edge of the clock a control word is read using step address n. In case of a read,
at the rising edge of the clock a memory word is read from the memory store after
enough time has been given for the address to be derived from the control word. In case
of a write, at the second falling edge of the clock several bits from the interconnection
network are collected and written to the memory store, which ensures that sufficient time
has been given for these bits to be read. Also at the second falling edge of the clock a
new control word is read using step address n + 1.
3.5 Emulation Module
The first level of hierarchy in our system is the emulation module, shown in Figure
3.10. It consists of R logic processors and S memory processors. Each processor in
one emulation module is connected to every other processor in the same module to
ensure that the output of any processor is readily available as an input to every other













Figure 3.10: Emulation Module
processor. Moreover, each processor has an external input which could be connected
to other processors in other modules.
In addition to the processors, the emulation module contains two other components:
the interconnection network and the sequential filler. The interconnection
network is made up of the set of wires connecting all the processors together and the
module level routing switch.
3.5.1 Module Level Routing Switch
Connecting all the processors together in an emulation module is an interconnection
network controlled by the module level routing switch. This switch is basically made
up of $(R + Q \times S)$ multiplexers, each $(R + Q \times S)$-to-1: one multiplexer for
each logic processor and one for every input of each memory processor. The purpose of
this switch is to make the output of each processor readily available to every other
processor. Moreover, the switch can supply the processors inside the module with external inputs
that are derived from other modules.
The switch uses the ChooseInput field supplied by each processor as the selection
bits of the multiplexers to route the signals between processors.
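The size and behaviour of the module level routing switch can be sketched in software. This is an illustration under simplifying assumptions: the function names are hypothetical, external inputs from other modules are ignored, and the values R = 4, S = 1 and Q = 2 are arbitrary.

```python
def switch_dimensions(r: int, s: int, q: int):
    """Number of multiplexers and inputs per multiplexer for R logic and S memory processors."""
    lines = r + q * s  # one output per logic processor, Q per memory processor
    return lines, lines


def route(processor_outputs, choose_inputs):
    """Each multiplexer forwards the output selected by its ChooseInput value."""
    return [processor_outputs[c] for c in choose_inputs]


mux_count, mux_inputs = switch_dimensions(r=4, s=1, q=2)
outputs = [1, 0, 1, 1, 0, 0]       # current outputs of all processors in the module
selections = [5, 4, 3, 2, 1, 0]    # one ChooseInput value per multiplexer
routed = route(outputs, selections)
print(mux_count, mux_inputs, routed)
```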
3.5.2 Sequential Filler
The limited number of pins on an FPGA makes it impossible for the user to fill
the control and memory stores of all processors at the same time. For this reason,
a sequential filler is created to fill the stores in a sequential manner. To choose
which processor to fill, the sequential filler has two input signals: one to choose
which logic processor and the other to choose which memory processor. The usage of
the filler should not affect emulation performance since it is only used once, prior to
emulation and at high speed.
3.5.3 Architecture and Operation
The basic architecture of the emulation module is shown in Figure 3.10. Each
processor has one external input and one external output. The external input could be
chosen from among all the outputs of all the processors in the same module or from a
different source outside the module. The choice of this source will be described later.
Each processor also supplies the interconnection network with an output. The wires
used to fill the control and memory stores, along with the sequential filler, are not
shown in Figure 3.10.
All the processors in one emulation module receive an identical step value during 
an emulation step. The processors use this value to determine the operation as 
described earlier.











Figure 3.11: Emulation Chip
3.6 Emulation Chip
The second level of hierarchy in this system is the emulation chip. It consists of
T identical emulation modules and fits on one FPGA. Specific processors inside the
emulation module are connected to specific processors in other emulation modules
to ensure that their outputs are readily available as inputs. Moreover, an emulation
chip has several external inputs and external outputs, whose numbers are chosen
depending on the availability of pins on the FPGA. In addition to the modules, the
emulation chip has one other component, the chip level routing switch.
3.6.1 Chip Level Routing Switch
The chip level routing switch connects all the module pins, external inputs and external
outputs together. This switch is made up of several multiplexers: one multiplexer
for each input of every module and one for each external output. The number of
these multiplexers depends on the number of modules inside the emulation chip. As
for their selection capacity, it depends on the resources available on the FPGA to
store their selection bits. These multiplexers are described further in the next
chapter.
The switch allows inputs of processors inside one module to choose among certain
outputs of other modules. The switch also allows external outputs to choose among
the outputs of certain processors or external inputs.
3.6.2 Architecture and Operation
The basic architecture of the emulation chip is shown in Figure 3.11. Each module
is able to choose among several outputs from different modules or external
inputs. As in the module, the step value is identical for all processors in the
emulation chip.
3.7 Emulation Engine
An emulation engine is the third and last level of hierarchy. The emulation engine
contains a number of emulation chips connected together in a multi-FPGA system.
3.7.1 Multi-FPGA System
Several FPGA connection schemes are available today. The system shown in Figure
3.12 is an example of an 8-way mesh multi-FPGA system architecture, while the
one shown in Figure 3.13 is an example of a fully connected multi-FPGA system
architecture.
Each FPGA contains an emulation chip. The chip has a certain number
of inputs and outputs through which it communicates with other chips. Each
processor inside the chip can choose among several of the external inputs and each of
the external outputs can choose among several outputs of the processors. This gives
each processor in the chip the capability to communicate with other processors in
other chips implemented on other FPGAs.













Figure 3.13: Fully Connected MFS
Figure 3.14: Clock Duty Cycle
In addition to that, some of these external outputs can choose among several of
the external inputs, enabling the FPGA to act as a routing switch. This becomes
useful in the case where two emulation chips are implemented on two FPGAs that do
not share a direct connection. Here, intermediate FPGAs serve as routers of
the signal from its source until it reaches its destination. Note that in the case
where full connectivity is ensured for the multi-FPGA system these outputs are not
necessarily useful.
We should note that the selection bits for signal routing are applied at the rising
edge of the clock in order to precede the write operations that occur at the falling
edge of the clock in all logic and memory processors.
3.7.2 Scalability Issues
The basic challenge that we have to solve in the multi-FPGA system is
speed. The time between the read and write operations is the
crucial factor. We have to make sure that the time
between the rising and falling edges of the clock is long enough for the signal to
traverse the system. As mentioned above, the first falling edge of the
clock is when we read the control word, the rising edge is when we read from the
stacks or the memory store, and the second falling edge is when we write to the stacks
or the memory stores. The challenge is to connect the FPGAs in such a way that
if a certain processor in one FPGA needs to write a signal (or output) from another
processor in a second FPGA, the time between the rising edge and the second falling
edge is long enough for the signal to travel between the two FPGAs.
Assuming a non-50% duty cycle, let x be the duration of the clock phase that follows
the falling edge and y the duration of the phase that follows the rising edge, as shown
in Figure 3.14. This means that the critical time is y, not x. The phase x deals with
intra-FPGA connections; the phase y may have to deal with inter-FPGA connections. More on
the scalability issues will be discussed in the next chapter.
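The timing constraint can be expressed as a simple budget check. The sketch below rests on an interpretation: x is taken as the time from a falling edge to the next rising edge and y as the time from the rising edge to the next falling edge. The function names and all delay values are hypothetical and serve only to illustrate why a non-50% duty cycle can help.

```python
def clock_phases(period_ns: float, high_fraction: float):
    """Split one clock period into x (after the falling edge) and y (after the rising edge)."""
    y = period_ns * high_fraction  # rising edge to next falling edge
    x = period_ns - y              # falling edge to next rising edge
    return x, y


def meets_timing(period_ns, high_fraction, intra_fpga_ns, inter_fpga_ns):
    """x must cover the intra-FPGA delay; y must cover the inter-FPGA delay."""
    x, y = clock_phases(period_ns, high_fraction)
    return x >= intra_fpga_ns and y >= inter_fpga_ns


# Hypothetical delays: 4 ns inside one FPGA, 12 ns between FPGAs, 20 ns clock period.
print(meets_timing(20.0, 0.50, intra_fpga_ns=4.0, inter_fpga_ns=12.0))  # y = 10 ns is too short
print(meets_timing(20.0, 0.75, intra_fpga_ns=4.0, inter_fpga_ns=12.0))  # y = 15 ns is enough
```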
Chapter 4
Architecture Exploration and 
Implementation Results
This chapter discusses the architecture exploration carried out and the implementation
results for the proposed logic emulation system. The first section describes
the implementation target used in this research. Section 2 presents the architecture
exploration and the effect of changing key parameters on the area and performance
of the emulator. Section 3 presents the implementation results.
4.1 Implementation Target
The FPGA used for implementation in this research is the Altera Stratix EP1S40F780C5
FPGA. We now present a detailed description of this FPGA.
4.1.1 Altera Stratix FPGA
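The Altera Stratix FPGA family [3] contains the following resources:
1. Logic Array Blocks (LABs)
2. M512 memory blocks
3. M4K memory blocks
4. M-RAM memory blocks
5. DSP blocks
6. I/O Elements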
Each LAB contains ten logic elements similar to the ones discussed in chapter
2 and shown in Figure 2.3. These logic elements are capable of emulating virtually
any logic function. M512 is a memory block which contains 512 programmable bits
plus parity bits. It can be configured in single-port or simple dual-port mode. The
M4K is another memory block, which contains 4,096 programmable bits plus parity
bits and can be configured in single-port, simple dual-port or true dual-port mode.
The third, and largest, memory block is the M-RAM, which contains 512 kilobits of
programmable memory plus parity bits. This memory block can be configured in
single-port, simple dual-port or true dual-port mode. The DSP blocks of the Stratix
FPGA are used to implement several forms of multipliers, while the I/O elements are
connected to the FPGA pins and support different I/O standards.
The FPGA used in this research, the Altera Stratix EP1S40F780C5,
contains 4,125 LABs or 41,250 LEs. It also contains 384 M512, 183 M4K and 4 M-RAM
blocks, making the total number of memory bits 3,423,744. In addition to that, it
contains 14 DSP blocks and 616 I/O pins [3].
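The totals quoted above can be reproduced with a short calculation. The block counts are those given in the text; the per-block sizes including parity (576, 4,608 and 589,824 bits) are an assumption taken from general knowledge of the Stratix memory blocks and are not stated in this section, although the resulting sum agrees with the quoted total.

```python
# (number of blocks, bits per block including parity)
# Block counts are from the text; the per-block sizes are assumed values.
MEMORY_BLOCKS = {
    "M512": (384, 576),       # 512 data bits plus parity
    "M4K": (183, 4608),       # 4,096 data bits plus parity
    "M-RAM": (4, 589824),     # 512 kilobits plus parity
}

total_memory_bits = sum(count * bits for count, bits in MEMORY_BLOCKS.values())
logic_elements = 4125 * 10    # ten logic elements per LAB

print(total_memory_bits)  # 221,184 + 843,264 + 2,359,296
print(logic_elements)
```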
4.2 Architecture Exploration
In chapter 3 we described the architecture and operation of the emulation system
without specifying values for important parameters such as the lookup table
size or the number of emulation steps. In this section we describe architecture
experiments that were performed to determine the effects of varying different architectural
parameters on the area and delay of the proposed emulator.
4.2.1 Key Parameters
The key parameters that were presented in chapter 3 and explored in this design are:
1. M: lookup table size.
2. N: number of emulation steps.
3. P: total number of outputs of all processors in one emulation module.
4. Q: memory word size.
The main goal of the exploration is to choose a value for each of the above
parameters. The best way to accomplish this goal is to vary each of the parameters and
fix the others while checking for the effect on area and performance. To do that, the logic
and memory processors were both implemented after each change and the results were
recorded. It is important to note that the effects of changing the parameters were
only considered for individual processors. The routing between these processors, and
in effect the hierarchy of the emulator, were not taken into consideration due to the
complexity of the process. Instead the effect of each parameter on single processors
was assumed to be proportional to its effect on the whole system.
4.2.2 Effect of Changing Parameters
In this section we aim to monitor the effect of each of the parameters on the area and
performance of the logic and memory processors. For that reason, each parameter
under consideration was changed and its effect observed while the other parameters
were given fixed values. This process was repeated for each parameter on both
processors. In what follows we show in graphs the effect of changing each of the
parameters. The effect on area is measured by the number of logic elements and
memory bits each processor consumes when implemented on the FPGA, while the
performance is measured by emulation clock speed.
It is important to note that the results here were obtained after implementation
and not from mathematical equations. More about the implementation of each
processor will be discussed in later sections of this chapter.
4.2.2.1 Effect of Changing Lookup Table Size
The size of the lookup table of the logic processor determines the logic capacity of
the processor. In other words, it determines how many logic gates each processor can
emulate. To determine how this parameter might affect the area and performance of
the processor, the size of the lookup table was increased in steps of one, starting at 2
and ending at 8. To ensure that we are observing the effect of the lookup table size only,
the other parameters were held constant. The number of emulation steps and the
total number of outputs were fixed at 128 and 64 respectively. Note that varying the
lookup table size affects only the logic processor and has no effect on the memory
processor. The results are shown in Figures 4.1, 4.2 and 4.3.
It is clear from Figure 4.1 and Figure 4.2 that as the size of the lookup table
increased the area consumed by the logic processor increased exponentially. The
reason behind that could be mainly attributed to the effect of the lookup table size
on the control store and the logic elements. The size of the control word of the control
53
R eproduced  with perm ission of the copyright owner. Further reproduction prohibited without perm ission.









90 1 2 4 6 7 83 5
M: LUT Size



















96 7 81 4 50 2 3
M: LUT Size
Figure 4.2: LUT Size vs. Memory Bits in LP
54
R eproduced  with perm ission of the copyright owner. Further reproduction prohibited without perm ission.

















0 1 2 3 5 6 7 8 94
M: LUT Size
Figure 4.3: LUT Size vs. Speed in LP
store is exponentially proportional to the lookup table size, the size of a single control 
word in bits is Log2(P) + M + 2 M + M  x  Log2 (N),  and thus increasing the lookup table 
size will result in an exponential increase in the control store size. The exponential 
increase in the number of logic elements could be explained in a similar way. It is 
due to  the fact th a t the lookup table is implemented in the FPG A ’s logic elements 
and its size increase meant an increase in the number of logic elements consumed.
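As a quick check of the control word expression, the sketch below evaluates it for
the lookup table sizes swept in this section, with N = 128 and P = 64. It is
illustrative only: the function name is hypothetical, and the logarithms are taken
as exact because the parameter values used here are powers of two.

```python
def lp_control_word_bits(m, n, p):
    """Logic-processor control word size: log2(P) + M + 2^M + M * log2(N).

    Assumes n and p are powers of two, so (x - 1).bit_length() equals log2(x).
    """
    log2_p = (p - 1).bit_length()
    log2_n = (n - 1).bit_length()
    return log2_p + m + 2 ** m + m * log2_n


if __name__ == "__main__":
    for m in range(2, 9):
        print(f"M={m}: {lp_control_word_bits(m, 128, 64)} bits")
```

For M = 4 the expression gives 54 bits, which matches the control word size used
in Section 4.3.1; the 2^M term dominates as M grows.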
As for the effect of the lookup table size on the speed of the processor, Figure 4.3
shows that as the lookup table size increased, the performance of the processor
decreased gradually. This is expected, since the time needed to process the inputs
inside a processor is likely to increase as the number of inputs increases.
4.2.2.2 Effect of Changing Number of Emulation Steps
The second key parameter to be tested is the number of emulation steps. The number
of emulation steps, N, is a critical parameter in both the logic and the memory
processor. Here we study its effect in both processors.
• Effect on the Logic Processor: The number of emulation steps was varied from
64 to 512 in powers of 2. The size of the lookup table, M, was fixed at
4 while the total number of outputs, P, was fixed at 64. The results are shown
in Figures 4.4, 4.5 and 4.6.
As shown in Figure 4.4, the change in the number of emulation steps had barely
any effect on the number of logic elements used to implement the logic
processor. In contrast, as shown in Figure 4.5, the change in the number of
emulation steps had a linear effect on the number of memory bits. This can
be explained by the fact that the number of emulation steps does not affect the
combinational part of the processor but rather the size of the memory blocks
(the control store and data stacks).
As for the speed, it is clear from Figure 4.6 that changing the number of
emulation steps had little effect on the speed of the processor.
• Effect on the Memory Processor: The number of emulation steps also affects
the implementation of the memory processor. Here, the number of steps was
also varied from 64 to 512 in powers of 2. The size of the memory word,
Q, was fixed at 8 and the total number of outputs, P, was fixed at 64. The
results of the implementation are shown in Figures 4.7, 4.8 and 4.9.
Similar to the logic processor, the effect of the number of steps was limited
to the number of memory bits, as shown in Figures 4.7, 4.8 and 4.9. This
is expected because the number of emulation steps determines the size of the
control store and the memory store and does not affect the combinational part
of the processor.

[Figure 4.4: caption not recovered; x-axis: N, Emulation Steps]
[Figure 4.5: Number of Emulation Steps vs. Memory Bits in LP; x-axis: N, Emulation Steps]
[Figure 4.6: Number of Emulation Steps vs. Speed in LP; x-axis: N, Emulation Steps]
4.2.2.3 Effect of Changing Total Number of Outputs
The third parameter to be checked for its effect is the total number of outputs, P,
which helps determine the number of logic and memory processors packed in one
emulation module.
• Effect on the Logic Processor: P was varied from 32 to 256 in powers of
2 while the lookup table size, M, was fixed at 4 and the number of emulation
steps, N, fixed at 128. The results are shown in Figures 4.10, 4.11 and 4.12.
As can be observed in Figures 4.10, 4.11 and 4.12, the parameter had virtually no
effect on the area and performance of the processor. This can be explained by
the fact that the only effect this parameter has is on the size of the ChooseInput
field of the control word. Each doubling of the number of outputs increases this
field by only one bit.

[Figure 4.7: caption not recovered; x-axis: N, Emulation Steps]
[Figure 4.8: Number of Emulation Steps vs. Memory Bits in MP; x-axis: N, Emulation Steps]
[Figure 4.9: Number of Emulation Steps vs. Speed in MP; x-axis: N, Emulation Steps]
• Effect on the Memory Processor: P was varied from 32 to 256 in powers
of 2 while the size of the memory word, Q, was fixed at 8 and the number of
emulation steps, N, fixed at 128. The results are shown in Figures 4.13, 4.14
and 4.15.
As shown in Figure 4.13 and Figure 4.14, the area consumed by the processor
increased as the number of outputs increased. This can be explained by the fact
that the CI field in the control word for each input bit increases as P increases.
As for the speed, Figure 4.15 shows that changing the number of outputs
had no major effect on speed.
[Figure 4.10: caption not recovered; x-axis: P, No. of Outputs]
[Figure 4.11: Number of Total Outputs vs. Memory Bits in LP; x-axis: P, No. of Outputs]
[Figure 4.12: caption not recovered; x-axis: P, No. of Outputs]
[Figure 4.13: Number of Total Outputs vs. Area in MP; x-axis: P, No. of Outputs]
[Figure 4.14: caption not recovered; x-axis: P, No. of Outputs]
[Figure 4.15: Number of Total Outputs vs. Speed in MP; x-axis: P, No. of Outputs]
[Figure 4.16: Memory Word Size vs. Area in MP; x-axis: Q, Memory Word Size]
4.2.2.4 Effect of Changing Memory Word Size
The last key parameter to be checked for its effect is the memory word size, Q, which
only affects the memory processor. Here the number of emulation steps, N, was fixed
at 128 and the total number of outputs, P, was fixed at 64. The size of the memory
word was varied from 1 to 16 in powers of 2. The results are shown in Figures
4.16, 4.17 and 4.18.
As can be seen in Figure 4.16 and Figure 4.17, increasing the size of the memory
word had a linear effect on the area consumed by the memory processor. This is
expected, since as the size of the memory word increases the combinational logic
required for the capture and release memory word units increases. In addition, the
size of the memory store where the memory word is stored increases as the size of the
word increases.
As shown in Figure 4.18, the effect of the memory word size on the speed of the
processor was limited; however, there was a small decrease in processor speed when
the memory word size was increased from 8 to 16. This decrease in speed can be
attributed to the fact that processing the memory word takes longer as its size
increases.

[Figure 4.17: caption not recovered; x-axis: Q, Memory Word Size]
[Figure 4.18: Memory Word Size vs. Speed in MP; x-axis: Q, Memory Word Size]
4.2.3 Choice of Parameters
The choice of the parameters used in our implementation is based on the results
obtained above. The following values were chosen:
• M = 4. The lookup table size was chosen to be 4 because it provides good
emulation speed at a low area cost.
• N = 128. The number of emulation steps was chosen to be 128 because FPGA
memory resources are limited: M4K and M512 blocks are both of limited size,
and storing 128 words in each of them is a reasonable fit.
• Q = 8. The size of the memory word was chosen to be 8 mainly because
of the emulation speed. As noted earlier, the emulation speed was relatively
stable until the memory word size was increased from 8 to 16, where the speed
fell comparatively more.
• P = 64. The total number of outputs was chosen to be 64, which in effect meant
that 32 logic processors and 4 memory processors were packaged together in one
module. The exploration did not show that any specific value of this parameter
had a major effect on the area and performance of any of the processors.
4.3 Implementation Results
In this section we discuss the implementation results of the system. Note that the
values for the parameters used here were the ones chosen above. The design tool
used was Quartus II, which is supplied by Altera [2]. The hardware description
language used was VHDL [28].
4.3.1 Logic Processor
The elements of the logic processor were implemented as follows:
• Control Store: using M = 4, N = 128 and P = 64, the size of the control word
is 54 bits. This means that the size of the control store is 128 x 54.
The control store has no combinational logic and only needs to be implemented
as a memory block. The memory block chosen to implement the store was
the M4K. To save on memory bits, the control stores of two processors were
combined and implemented in 3 M4K blocks (each M4K is of size
128 x 36). A decoder was created to later separate the two words from each
other.
• Data Stacks: since the number of emulation steps is 128, the size of each
stack is 128 x 1. Similar to the control store, both the internal and external
stacks need no combinational logic and are implemented as memory blocks
inside the FPGA. The memory blocks chosen to implement the stacks were the
M512 blocks. Since each M512 block can supply at most two outputs at a time,
each M512 was duplicated to ensure that 4 outputs can be supplied at the same
time.
• Logic Element: the logic element is made up of purely combinational logic and
requires no memory blocks. The 2-to-1 multiplexers and the lookup table, which
is in turn a 4-to-1 multiplexer, were implemented in standard VHDL code used
typically to describe multiplexers. The memory array of the lookup table was
also implemented in logic elements and no memory blocks were used.
The implementation of each logic processor requires 1.5 M4K blocks, 4 M512
blocks and 29 FPGA logic elements.
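The figure of 1.5 M4K blocks per logic processor follows from the control store
packing described above. The sketch below reproduces that arithmetic; the helper
name is hypothetical, and it assumes each M4K is used in the 128 x 36
configuration stated in the text.

```python
from math import ceil

M4K_DEPTH = 128   # words per M4K block in the configuration used here
M4K_WIDTH = 36    # bits per word in that configuration


def m4k_blocks_needed(depth, width_bits):
    """Number of 128 x 36 M4K blocks placed side by side for a depth x width store."""
    if depth > M4K_DEPTH:
        raise ValueError("store is deeper than one M4K block in this configuration")
    return ceil(width_bits / M4K_WIDTH)


CONTROL_WORD_BITS = 54  # logic-processor control word for M = 4, N = 128, P = 64

# Two control stores implemented separately versus packed into one wide store.
separate = 2 * m4k_blocks_needed(128, CONTROL_WORD_BITS)
shared = m4k_blocks_needed(128, 2 * CONTROL_WORD_BITS)

if __name__ == "__main__":
    print("separate stores:", separate, "blocks for two processors")
    print("shared store:   ", shared, "blocks for two processors")
    print("per processor:  ", shared / 2)
```

Packing two 54-bit words into one 108-bit word fills exactly three 36-bit-wide
blocks, which is why combining the stores saves one block per pair of processors.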
4.3.2 Memory Processor
The elements of the memory processor were implemented as follows:
• Control Store: using Q = 8, N = 128 and P = 64, the size of the control word
is 56 bits. This makes the control store of size 128 x 56. The control
store needs no combinational logic and is implemented in 2 M4K blocks.
• Memory Store: since the size of the memory word is 8 and the number of
emulation steps is 128, the size of the memory store is 128 x 8. Similar
to the control store, the memory store needs no combinational logic and is
implemented in one M4K block.
• Capture and Release Memory Word Units: the two units only require
combinational logic. Their functional behavior was described using standard VHDL
statements of the kind typically used for concatenation and breakup operations.
The implementation of each memory processor requires 3 M4K blocks and 72
FPGA logic elements.
4.3.3 Emulation Module
Each emulation module contains 32 logic processors and 4 memory processors, together
having a total of 64 outputs and sharing one interconnection network. Aside
from these processors the module contains two other elements: the sequential filler
and the module level routing switch. Both of these elements were implemented in
VHDL and require no memory blocks.
Since the routing switch is made up of multiplexers, the VHDL code used to
describe its functional behavior is that which is typically used to describe
multiplexers. The functional behavior of the sequential filler was described
as a series of conditional statements which determine which processor is being
filled before emulation starts.
4.3.4 Emulation Chip
Because of the limited resources of the FPGA, each emulation chip in our design
contains three modules. It requires all 384 M512 blocks, 180 M4K blocks (98% of all
M4K blocks) and all 4 M-RAM blocks. The M-RAM blocks were used to store the
selection bits for the multiplexers of the chip level routing switch.
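The M512 and M4K counts above can be cross-checked against the per-processor
figures given in Sections 4.3.1 and 4.3.2. The following is only a bookkeeping
sketch; the constant and function names are hypothetical.

```python
MODULES_PER_CHIP = 3
LOGIC_PROCESSORS_PER_MODULE = 32
MEMORY_PROCESSORS_PER_MODULE = 4

# Per-processor block usage as reported in Sections 4.3.1 and 4.3.2.
M4K_PER_LOGIC_PROCESSOR = 1.5
M512_PER_LOGIC_PROCESSOR = 4
M4K_PER_MEMORY_PROCESSOR = 3


def chip_block_usage():
    """Tally M512 and M4K blocks used by the processors of one emulation chip."""
    lps = MODULES_PER_CHIP * LOGIC_PROCESSORS_PER_MODULE
    mps = MODULES_PER_CHIP * MEMORY_PROCESSORS_PER_MODULE
    m512 = lps * M512_PER_LOGIC_PROCESSOR
    m4k = lps * M4K_PER_LOGIC_PROCESSOR + mps * M4K_PER_MEMORY_PROCESSOR
    return {
        "logic_processors": lps,
        "memory_processors": mps,
        "M512": m512,
        "M4K": int(m4k),
    }


if __name__ == "__main__":
    print(chip_block_usage())
```

With three modules the tally comes to 96 logic processors and 12 memory
processors, using 384 M512 blocks and 180 M4K blocks, in agreement with the
totals quoted above.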
The chip level routing switch, shown in Figure 4.19, connects all the module pins,
external inputs and external outputs together. This switch is made up of 256 4-to-1
multiplexers and 64 2-to-1 multiplexers; one multiplexer for each input of the three
modules and one for each external output. The switch allows inputs of processors
inside one module to choose among certain outputs of other modules. For example,
the input of processor 0 of module 0 can choose between: output of processor 0 in
module 1, output of processor 0 in module 2, external input 0 or external input 1.
The switch also allows external outputs to choose among outputs of certain processors
or external inputs. For example, external output 0 can choose between: output of
processor 0 in module 0, output of processor 0 in module 1, output of processor 0 in
module 2, or external input 0.
The emulation chip consumes 10,579 logic elements and 933,888 memory bits.
This puts the FPGA logic utilization at 25% and memory utilization at 27%. The
key limitation in resources was due to the memory blocks, mainly the M512 and
M4K blocks, which were almost fully utilized by our design. The total memory
utilization shows only 27% because the majority of the available memory bits
are in the M-RAM blocks, which were only partially utilized. In
addition to the logic elements and memory blocks, each emulation chip requires 531
pins, making the pin utilization 86%.

[Figure 4.19: Module Level Routing Switch. Recoverable labels: Output 0 of Module 1,
Output 0 of Module 2; Output 1 of Module 1, Output 1 of Module 2; Output 63 of
Module 1, Output 63 of Module 2; Input 0 of Module 0, Input 1 of Module 0;
External Output 63, 64, 65 and 127]
The whole design, one emulation chip, was made up of almost 6,100 VHDL lines
describing the functional behavior of the combinational logic. The memory blocks
were designed and implemented by means of megafunctions, a design tool supplied
by Quartus II to save the time required to write the code in VHDL. The design
contains 7 megafunctions [24].
Each emulation chip is capable of emulating the equivalent of 98,304 ASIC
gates per design cycle. This was calculated by assuming that each logic processor
with a 4-input lookup table can implement 8 ASIC gates per emulation step and
1,024 ASIC gates per design cycle [10]. The emulation chip can also emulate 12,288
memory bits, which is the sum of all the bits stored in all the memory stores of the
memory processors.
Lastly, the emulation clock frequency of a single emulation chip is 24.04 MHz. The
emulated design can run at 187.8 kHz or more, depending on the number of emulation
steps used in the design cycle.
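The capacity and speed figures quoted above follow from simple products and
quotients of the chosen parameters. The sketch below reproduces them under the
stated assumption of 8 ASIC gates per logic processor per emulation step; the
constant and function names are hypothetical.

```python
LOGIC_PROCESSORS = 96          # 3 modules x 32 logic processors
MEMORY_PROCESSORS = 12         # 3 modules x 4 memory processors
EMULATION_STEPS = 128          # N
MEMORY_WORD_BITS = 8           # Q
GATES_PER_STEP = 8             # assumed ASIC gates per 4-input LUT evaluation [10]
EMULATION_CLOCK_HZ = 24.04e6   # measured emulation clock of one chip


def asic_gate_capacity():
    """Equivalent ASIC gates per design cycle for one emulation chip."""
    return LOGIC_PROCESSORS * GATES_PER_STEP * EMULATION_STEPS


def emulated_memory_bits():
    """Total bits held in the memory stores of all memory processors."""
    return MEMORY_PROCESSORS * EMULATION_STEPS * MEMORY_WORD_BITS


def design_clock_hz(steps_used=EMULATION_STEPS):
    """Design clock rate when one design cycle takes steps_used emulation steps."""
    return EMULATION_CLOCK_HZ / steps_used


if __name__ == "__main__":
    print(asic_gate_capacity(), "ASIC gates")
    print(emulated_memory_bits(), "memory bits")
    print(round(design_clock_hz() / 1e3, 1), "kHz with all 128 steps")
```

A design that needs fewer than 128 emulation steps per design cycle runs
proportionally faster, which is why 187.8 kHz is a lower bound.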
4.4 Implementation Estimates for Emulation Engine
The implementation of this design only involved the second level of hierarchy, the 
emulation chip. The highest level of hierarchy, the emulation engine, was not imple­
mented. In this section we give some estimates of the implementation of this engine.
A typical emulation engine would be made up of a fully connected multi-FPGA
system, such as the one shown in Figure 3.13. Here six FPGAs are connected together
to act as an emulation engine. The logic capacity of this engine is equivalent to
589,824 ASIC gates and 73,728 memory bits.
As mentioned in Chapter 3, the main issue we would need to deal with in such a
system is the speed. The high pulse of the clock, symbolized by y in Figure 3.14,
needs to be long enough for the signal to traverse the longest path delay of the
multi-FPGA system. On one Printed Circuit Board (PCB) we can assume that this
delay is on average the same for all the connections.
To calculate this delay we assume that the dielectric used for the PCB is FR-4,
the most widely used dielectric for PCBs [5]. This means that the propagation speed
on the PCB is $1.48 \times 10^8$ m/s [1]. Therefore, the time needed by a signal to
traverse half a meter, a typical size of a PCB, is approximately 3.4 ns. Following
this logic, y should be at least 3.4 ns to ensure that the signal has enough time to
reach its destination.
If we choose to have a 50% duty cycle, then the period of one clock cycle should
be around 7 ns for the signal to traverse the longest path delay. However, as
mentioned before, the emulation clock frequency is 24.04 MHz and its period is
41.6 ns. The emulation clock period is clearly much longer than the longest path
delay, and therefore the PCB connections would not add any extra delay and should
not decrease the speed of the clock as long as the board remains of reasonable size.
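The timing argument above can be checked numerically. The sketch below uses only
the values quoted in the text (propagation speed, half-meter trace, 50% duty cycle
and 24.04 MHz emulation clock); the function names are hypothetical.

```python
PROPAGATION_SPEED_M_PER_S = 1.48e8   # FR-4 figure quoted in the text
TRACE_LENGTH_M = 0.5                 # half a meter, taken as a typical PCB size
EMULATION_CLOCK_HZ = 24.04e6
DUTY_CYCLE = 0.5


def trace_delay_ns(length_m=TRACE_LENGTH_M, speed=PROPAGATION_SPEED_M_PER_S):
    """Time for a signal to traverse a trace of the given length, in ns."""
    return length_m / speed * 1e9


def min_clock_period_ns(delay_ns, duty_cycle=DUTY_CYCLE):
    """Shortest clock period whose high pulse lasts at least delay_ns."""
    return delay_ns / duty_cycle


delay = trace_delay_ns()
min_period = min_clock_period_ns(delay)
emulation_period_ns = 1e9 / EMULATION_CLOCK_HZ

if __name__ == "__main__":
    print(f"trace delay:       {delay:.2f} ns")
    print(f"minimum period:    {min_period:.2f} ns")
    print(f"emulation period:  {emulation_period_ns:.2f} ns")
    print("board delay is not limiting:", emulation_period_ns > min_period)
```

The emulation period is roughly six times the minimum period required by the
board, which is the margin the argument relies on.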
4.5 Emulation Example
To illustrate and verify the operation of the emulator we chose to emulate a four bit 
multiplier on a single emulation chip.
4.5.1 Four-Bit Multiplier
The multiplier has two inputs, each of size 4 bits, and one output of size 8 bits.
Figure 4.20 shows the multiplication process. The goal is to assign each operation
of this multiplication to one logic processor in a process known as scheduling. The
symbols shown as superscripts are the overflow from the previous operations.

[Figure 4.20: Operation of the Four-Bit Multiplier. Recoverable row labels: Initial
Partial Product; Multiply A by B1; Add; Multiply A by B2; Add; Multiply A by B3; Add]
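The partial-product scheme of Figure 4.20 is the usual shift-and-add
multiplication. The sketch below is a plain software model of that scheme, written
only to illustrate the operations being scheduled; it does not reproduce the
per-bit signal names of the figure or the processor schedule, and the function
name is hypothetical.

```python
def multiply_4bit(a, b):
    """Shift-and-add product of two 4-bit unsigned values, as an 8-bit result."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must fit in 4 bits")
    product = 0
    for i in range(4):
        b_i = (b >> i) & 1
        partial = (a if b_i else 0) << i   # A gated by bit B_i, weighted by 2^i
        product += partial                 # add to the running sum
    return product & 0xFF


if __name__ == "__main__":
    print(multiply_4bit(13, 11))
```

In the emulator, each gating and each single-bit addition of this loop would be
mapped to a lookup table evaluation on some logic processor at some emulation
step.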
4.5.2 Scheduling and Implementation
Tables 4.1 and 4.2 show the scheduling of the eight processors used for emulating the
four-bit multiplier. Normally the scheduling process would be automated, but since
our design lacks the associated CAD tools, the scheduling was done manually.
It is important to note that this schedule might not be the most efficient one since
the aim here is only to verify the functionality of the emulator. Cap and Cal in the
tables stand for capture value and calculate value respectively.
Step LP0 LP1 LP2 LP3
0 Cap(A0) Cap(A1)
1 Cap(B0) Cap(B0)
[rows for steps 2 and 3 not recovered]
4 Cal(F0), Cap(F3) Cal(F1), Cap(F0)
[rows between steps 4 and 15 largely not recovered; surviving cells: Cap(J1); Cal(J1), Cap(J2); Cap(J2)]
15 Cap(IJ0) Cap(IJ0)
16 Cal(K1) Cal(IJ1) Cal(IJ1) Cal(IJ1)
17 Cal(K2) Cal(IJ2)
18
Table 4.1: Multiplier Scheduling for Processors 0-3

[Table 4.2 body largely not recovered; surviving rows follow]
6 Cal(G1), Cap(B2) Cal(EF1), Cap(G1)
13 Cap(I4) Cap(I4)
Table 4.2: Multiplier Scheduling for Processors 4-7
After scheduling, the control words for each processor were generated and
downloaded onto the processors. Several values were tested and the multiplier gave
the correct results, confirming the validity of our design.
We should note that the emulation of the multiplier was simulated and not downloaded
onto the FPGA. The reason is that we lack the connection circuitry for the FPGA
pins, and building such circuitry would be very time consuming. A more
important reason is that we do not have a Data Capture Unit to read the outputs of
the processors, and therefore could not verify the operation in hardware.
Chapter 5
Conclusion and Future Work
The first section of this chapter summarizes the contributions made by this research. 
In section 2 we present a brief comparison between our design and previous processor- 
based emulator designs. We conclude in section 3 with some remarks on possible 
future work.
5.1 Research Contributions
The main contribution made by this research is the design and implementation of a
low-cost processor-based logic emulation system. To reduce cost, the design was
implemented using FPGA technology. Before implementation, architecture exploration
experiments were conducted in order to choose suitable values for key architecture
parameters. The proposed emulator can verify the functionality of logic designs at
relatively high speeds and in real operating environments.
To increase logic capacity a fully connected multi-FPGA system can be used.
Each FPGA is programmed to act as an emulation chip. The full (multi-FPGA)
design of the emulator was not implemented in this research. Only one emulation
chip was implemented using a single FPGA. Each of these emulation chips is capable
of emulating around 98 thousand ASIC gates and 12 thousand memory bits. The
emulated design can run at approximately 187.8 kHz or more, depending upon the
number of instruction cycles needed in one design cycle.
A four-bit multiplier was emulated to verify the correctness of the proposed
emulator design. Because we lack the CAD tools for bit stream generation, all the
tasks of the CAD flow were carried out manually. The emulated multiplier produced
correct results, verifying the correct operation of the emulator.
5.2 Comparisons with Other Systems
To provide a context regarding the contributions made by this research, we present a 
brief comparison with previously proposed processor-based logic emulation systems.
In [15], a processor-based emulator was implemented using custom made chips.
The building block of the system is a processor which can emulate both logic and
memory functions. The two main differences between this design and ours are in
architecture and implementation. In terms of architecture, the processor in this
system performs both logic and memory emulation. In our design, two different
processors are used, one to emulate logic functions and the other to emulate memory
functions. As for implementation, this design was implemented on custom made chips,
which makes it very expensive. In contrast, our processor-based emulator was
implemented on FPGAs, which would effectively make it a much lower cost system. We
should note that there are several other similarities and differences in terms of
operation and hierarchy.
In [10], a processor-based emulator was implemented on FPGAs. The emulator
contains several kinds of processors. The main differences between this design and
ours are in the design architecture and hierarchy. This design contains different kinds
of processors in addition to logic and memory processors; our design only contains
those two kinds. Another difference in architecture is in the interconnection network.
The designer in this case chose to use buffers in the interconnections between the
processors to give more flexibility to the CAD tools in terms of routing. We judged
that such buffers would consume the limited resources of the FPGA and decided
to leave a tighter constraint on the CAD tools.
In comparison with commercially available emulators like the Incisive Palladium II
[18], our emulator reached almost one eighth of Palladium's top speed. Although speed
might be a drawback, our design has a lower cost because of FPGA implementation.
5.3 Future Work
In our research we focused on the hardware architecture of a processor-based logic
emulation system. The other main part is the mapping CAD tools that are required
for a real world emulation system. The next step would be to design and develop
the mapping CAD tools for this system. The mapping CAD tools would compile the
logic design of the DUT and generate the bit stream which could be downloaded to
the programmable hardware.
Further work on the hardware part of the project might involve designing
a data capture unit, which would help the designer in finding errors and automate the
checking process.
References
[1] Altera Corporation. High Speed Board Designs, November 2001.
[2] Altera Corporation. Introduction to Quartus II  Version 5.0, April 2005.
[3] Altera Corporation. Stratix Device Handbook, July 2005.
[4] Altera Corporation, h ttp ://w w w .altera .com /, Accessed August 2007.
[5] A ltera Corporation. Stratix II  Device Handbook, May 2007.
[6] Altera Corporation. Stratix I I I  Device Handbook, May 2007.
[7] J. Babb, T. Russell, M. Dahl, S. Z. Hanono, D. M. Hoki, and A. Agrawal. Logic 
emulation with virtual wires. IEEE  Transactions on Computer-Aided Design of 
Integrated Circuits and Systems, 16(6):609-626, June 1997.
[8] K. Banovic, M. A. S. Khalid, and E. Abdel-Raheem. FPGA-based rapid pro­
totyping of digital signal processing systems. In Proc. o f the 48th M id-W est 
Symposium on Circuits and Systems, pages 647-650, August 2005.
[9] M. Butts. Future directions of dynamically reprogrammable systems. In Proc. 
o f IEEE  Custom Integrated Circuits Conference, pages 487-494, 1995.
[10] M. R. Butts. Logic multiprocessor for FPG A implementation. U.S. Patent Ap­
plication 2004/0123258 A l, June 2004.
[11] Cadence Design Systems Incorporated, http://w w w .cadence.com /, Accessed Au­
gust 2007.
[12] E. M. Clarke and R. P. Kurshan. Computer-aided verification. IEEE  Spectrum, 
33(6):61-67, June 1996.
[13] K. Compton and S. Hauck. Reconfigurable computing: A survey of systems and 
software. A C M  Computing Surveys, 34(2):171-210, June 2002.
80
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
REFERENCES
[14] C. Edwards. Tracking down the chip killers. IEE  Review, 50(12):44-46, December 
2004.
[15] Beausoleil et al. Multiprocessor for hardware emulation. U.S. Paten t 5551013, 
August 1996.
[16] H. Goldstein. Checking the play in plug-and-play. IEEE Spectrum, 39(6) :50—55, 
June 2002.
[17] S. Hauck. The roles of FPG A ’s in reprogrammable systems. Proceedings of the 
IEEE, 86(4):615—638, April 1998.
[18] Incisive Palladium II. http://w w w .cadence.com /products/functional_ver/ 
aceLemul/pseries.aspx, Accessed August 2007.
[19] P. H. Kelly, K. J. Page, and P. M. Chau. Rapid prototyping of ASIC based 
system. In Proc. o f the 31st A C M /IE E E  Design Automation Conference, pages 
460-465, June 1994.
[20] C. Kern and M. R. Greenstreet. Formal verification in hardware design: A survey. 
A C M  Transactions on Design Automation o f Electronic Systems , 4(2):123-193, 
April 1999.
[21] M. A. S. Khalid and J. Rose. A novel and efficient routing architecture for multi- 
FPG A  systems. IEEE  Transactions on Very Large Scale Integration (VLSI) 
Systems, 8(l):30-39, February 2000.
[22] D. MacMillen, M. Butts, R. Camposano, D. Hill, and T.W . Williams. An in­
dustrial view of electronic design autom ation. IEEE Transactions on Computer 
Aided Design o f Integrated Circuits and Systems, 19(12): 1428—1448, December 
2000 .
[23] K. L. McMillan. F itting  formal m ethods into the design cycle. In Proc. of the 
31st A C M /IE E E  Design Automation Conference, pages 314-319, June 1994.
[24] A ltera Megafunctions, h ttp ://w w w .altera .com /products/ip /altera/m ega.h tm l, 
Accessed August 2007.
[25] Mentor Graphics Corporation, h ttp ://w w w .m entor.com /, Accessed August 2007.
[26] G. E. Moore. Cramming more components onto integrated circuits. Proceedings 
of the IEEE, 86(l):82-85, January 1998.
[27] R. Murgai and M. Fujita. Some recent advances in software and hardware logic
simulation. In Proc. of the 10th IEEE International Conference on VLSI Design,
pages 232-238, January 1997.
[28] Institute of Electrical and Electronics Engineers. IEEE standard VHDL language
reference manual, ANSI/IEEE Std 1076-1993, 1993.
[29] C. Pixley, A. Chittor, F. Meyer, S. McMaster, and D. Benua. Functional
verification 2003: Technology, tools and methodology. In Proc. of the 5th IEEE
International Conference on ASIC, pages 1-5, October 2003.
[30] J. Rose, A. El Gamal, and A. Sangiovanni-Vincentelli. Architecture of
field-programmable gate arrays. Proceedings of the IEEE, 81(7):1013-1029, July 1993.
[31] L. Soule and T. Blank. Parallel logic simulation on general purpose machines.
In Proc. of the 25th ACM/IEEE Design Automation Conference, pages 166-171,
June 1988.
[32] J. Varghese, M. Butts, and J. Batcheller. An efficient logic emulation system.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1(2):171-174,
June 1993.
[33] VStationPRO. http://www.mentor.com/products/fv/emulation/vstation_pro/,
Accessed August 2007.
[34] Xilinx Incorporated. http://www.xilinx.com/, Accessed August 2007.
[35] Xilinx Incorporated. Virtex-5 Family Overview - LX, LXT, and SXT Platforms,
May 2007.
VITA AUCTORIS
Marwan Kanaan was born in Haret Hreik, Lebanon, in 1983. He received his B.E. 
in computer and communications engineering in 2005 from the American University of 
Beirut in Beirut, Lebanon. He is currently a candidate in the electrical and computer 
engineering M.A.Sc. program at the University of Windsor. His research interests 
include logic emulation systems, field-programmable technologies and digital design.
