The co-design methodologies on click router application system by Li, Dan
Université de Montréal
The Co-Design Methodologies
On Click Router Application System
Presented by:
Dan Li
Département d'informatique et de recherche opérationnelle
Faculté des arts et des sciences
Thesis presented to the Faculté des études supérieures
with a view to obtaining the degree of
Master in Computer Science (Maître en Informatique)
December 2004
© Dan Li, 2004
Université
de Montréal
Direction des bibliothèques
NOTICE
The author of this thesis or dissertation has granted a nonexclusive license
allowing Université de Montréal to reproduce and publish the document, in
part or in whole, and in any format, solely for noncommercial educational and
research purposes.
The author and co-authors if applicable retain copyright ownership and moral
rights in this document. Neither the whole thesis or dissertation, nor
substantial extracts from it, may be printed or otherwise reproduced without
the author’s permission.
In compliance with the Canadian Privacy Act some supporting forms, contact
information or signatures may have been removed from the document. While
this may affect the document page count, it does not represent any loss of
content from the document.
Université de Montréal
Faculté des études supérieures
This thesis, entitled
The Co-Design Methodologies
on Click Router Application System
Presented by:
Dan Li
was evaluated by a jury composed of the following members:
Jean Pierre David, President and rapporteur
El Mostapha Aboulhamid, Research director
Gabriela Nicolescu, Co-director
Claude Frasson, Jury member
Thesis accepted in 2005
Résumé
To obtain better performance for today's increasingly complex applications, hardware/software
co-design has grown considerably in importance. At the same time, several efficient
multiprocessor architectures have recently been proposed, and they have become the most popular
solution in the field of embedded hardware/software systems.
Innovative design methodologies are now needed to meet strict constraints on cost, performance
and time-to-market. A key point of these methodologies, and one of the current challenges for
designers, is the hardware/software partitioning step, starting from a high-level specification
of the application to be implemented.
In this context, the contributions of our research work are:
(1) The proposal of a hardware/software partitioning methodology. We used the Click IPv4 router
provided by Berkeley University as the application and analyzed the impact of the methodology on
the performance of this system.
(2) The development and evaluation of a multiprocessor architecture integrating three processors
for the IPv4 router, using the StepNP tool from STMicroelectronics. This approach can be used to
design new architectures integrating more than three processors.
Our evaluation shows a performance gain for hardware/software architectures (our system was twice
as fast when part of the software functions were implemented in hardware). The multiprocessor
architecture processes packets at twice the speed of a single-processor architecture.
Keywords: Hardware/software co-design, partitioning, multiprocessor systems, Click router,
StepNP, SystemC
Abstract
To improve the overall performance of today's increasingly complex applications, the
combination of hardware and software solutions must be considered in the co-design area. In
addition, multiprocessor architectures have been introduced as a generic model for designing many
specific applications, and they have become the most efficient and popular solution in the field
of hardware/software systems.
To respect the strict constraints on cost, performance and time-to-market imposed on designers
of multiprocessor embedded systems, a new generation of methodologies is required.
One of the key points, and an important current challenge, in such methodologies is to perform
hardware/software partitioning starting from a high-level specification of the application to be
implemented.
In this context, the contributions of our work are:
(1) Proposing a methodology for hardware/software partitioning. We use the Click IPv4 software
router (provided by Berkeley University) as an application and analyze the impact of the
methodology on the packet-processing performance of the Click IPv4 router.
(2) Developing and evaluating an architecture implementing the Click IPv4 software router on three
ARM CPUs with the help of the StepNP platform provided by STMicroelectronics. This
approach may be extended to implementing Click software on more than three processors.
As a result, the speed of the IPv4 Click router application system almost doubles with
hardware/software co-design compared with a software-only implementation of the Click router.
In the multiprocessor architecture for the Click router integrating three processors, the overall
packet-processing speed is about 1.84 times that of the single-processor Click router
architecture.
Keywords: Hardware/software co-design, partitioning, multiprocessors, Click router, StepNP,
SystemC
Table of Contents
Résumé
Abstract
Table of Contents
List of Figures
List of Tables
Acknowledgement
CHAPTER 1 Introduction
1.1 Context and motivations
1.2 Objective
1.3 Contribution
1.4 Thesis structure
CHAPTER 2 Hardware/Software Co-Design
2.1 Definition & motivation for hardware/software co-design
2.2 The hardware/software co-design flow
2.3 Target architectures
2.4 Partitioning: definition and problems
2.4.1 Recent problems on hardware/software co-design
2.4.2 Partitioning concerns
2.4.3 The strategies of partitioning
2.5 Our work on the hardware/software co-design
2.6 Our partitioning methodology
CHAPTER 3 Multiprocessor Concepts
3.1 The basic knowledge on multiprocessors
3.2 The related researches on multiprocessors
3.3 Our work on the multiprocessor
3.4 Our methodology for Click implementation on the multiprocessor
CHAPTER 4 Hardware/Software Co-Design Methodology on Partitioning
4.1 The general presentation for methodology
4.2 StepNP platform introduction
4.3 Click router
4.3.1 Click router element connection
4.3.2 Click router features
4.3.3 Click router general forwarding principle
4.3.4 Click configuration for IPv4 router
4.4 Partitioning methodology
4.4.1 Partitioning implementation considerations
4.4.2 Partitioning - simulation considerations
4.5 Algorithm analysis
CHAPTER 5 The Methodology to Implement Click Router on Multiprocessor
5.1 General presentation of the methodology
5.1.1 The architecture description
5.1.2 The benefit of the methodology
5.2 Mapping the program on three processors
5.3 Exchanging data between three processors
5.3.1 Hardware architecture building
5.3.2 Software (Element)
5.4 The overall description of the communication between processors
CHAPTER 6 Performance Evaluation
6.1 Evaluation of the hardware/software co-design methodology in Chapter 4
6.1.1 Synthesis tools
6.1.2 Initial constraints for synthesis
6.1.3 Synthesis result
6.1.4 Performance analysis by result
6.1.5 Software/hardware analysis
6.1.6 Click IPv4 router performance in co-design
6.2 Evaluation of the methodology explained in Chapter 5
6.2.1 Experimental tools and approach
6.2.2 Experimental result
6.2.3 The discussion of results
CHAPTER 7 Conclusion and Future Work
REFERENCE
Appendix A CSDECIPTTL - FM State Graph Style for Process
Appendix B CSFRAGMENT - FM State Graph Style for Process
Appendix C CSSTRTP - FM State Graph Style for Process
Appendix D CSCHECSDM - FM State Graph Style for Process
Appendix E CSDECIPTTL - Schedule Summary
Appendix F CSFRAGMGENTER - Schedule Summary
Appendix G CSLOOKUP ROUTEG TABLE - Schedule Summary
Appendix H CSSTRIP - Schedule Summary
Appendix I CSIPFRAGMENTER - SystemC Code
Appendix J CSDECIPTTL - SystemC Code
List of Figures
Figure 2-1 Hardware/software co-design framework
Figure 2-2 A single-processor architecture
Figure 2-3 A typical multiprocessor architecture
Figure 2-4 Hardware/software mapping
Figure 2-5 Hardware sharing
Figure 2-6 Functional scheduling
Figure 2-7 Interfacing
Figure 3-1 A distributed system on embedded chip
Figure 3-2 A single-bus multiprocessor
Figure 3-3 Multiprocessor architecture with network connection
Figure 3-4 An example to describe the influence of process partitioning on distributed system performance
Figure 3-5 How process allocation affects the performance of distributed computing system
Figure 3-6 Generic architecture model
Figure 3-7 The generic design methodology of a multiprocessor SoC architecture
Figure 4-1 The general framework of the approach on how to map software onto hardware
Figure 4-2 SimplePacket platform in StepNP
Figure 4-3 Click configuration for IPv4 router
Figure 4-4 The crucial path in Click IPv4 router
Figure 4-5 Hardware/software co-design for Click router on SimplePacket platform
Figure 4-6 The code description of DecIPTTL element
Figure 4-7 The code description of IPFragmenter element
Figure 4-8 The code description of Strip element
Figure 4-9 The code description of LookupIPRouter element
Figure 4-10 The methodology about longest matching prefix in routing table
Figure 5-1 The general description on how Click works on three ARM processors
Figure 5-2 Click router code modification for multiprocessor
Figure 5-3-1 ARM processor modification in StepNP
Figure 5-3-2 The instance of Click on three ARM7 processors architecture
Figure 5-4 Hardware architecture with shared lock structure on StepNP
Figure 5-5-1 The internal operations of an ARM processor
Figure 5-5-2 The internal operations of an ARM processor
Figure 5-6 The segment code to write packets into the shared hardware on StepNP platform
Figure 5-7 The segment code to read packets from the shared hardware on StepNP platform
Figure 5-8 IPv4 original file and IPv4 configuration drawing with two new elements
Figure 5-9 The communication between three processors with the help of software & hardware
Figure 6-1 The code segment in IPFragment element
Figure 6-2 The speedup of the performance on multiprocessors
List of Tables
Table 4-1 The Click element computational cycles from simulation
Table 6-1 Result from CoCentric SystemC Compiler
Table 6-2 Cycle number in Strip module execution
Table 6-3 Cycle number in DecIPTTL module execution
Table 6-4 Cycle number in IPFragmenter module execution
Table 6-5 Cycle number in Lookup Routing Table module execution
Table 6-6 Improved status for each element in IPv4 Click router
Table 6-7 The cycles counted for packets passing the Click router on two architectures respectively
Table 6-8 The comparison of the packet processing performances on the two Click router architectures
Acknowledgement
This thesis work has been a challenge throughout my period of study. During this period, I
received much support and help from professors and friends.
Firstly, I would like to thank Prof. El Mostapha Aboulhamid, my director, for his continued
guidance, his encouragement, his care and his support throughout this thesis work. He has
constantly given precious advice on the thesis by pointing out relevant accomplishments in this
research area. I also owe my gratitude to Prof. Gabriela Nicolescu, my co-director, for her
continued support, her encouragement, her thoughtful advice and her meticulous work on this
thesis.
Great appreciation also goes to David Quinn and Mortimer Hubin, who gave me generous help
with this thesis. Their work on the hardware/software co-design of the Click router application
system is the basis on which I was able to complete this thesis. I benefited especially from
discussions with David Quinn, who gave his precious time and invaluable effort to this work.
Many thanks to my friends Qin Lisheng and Zhang Hong for their encouragement and their
valuable comments on my thesis report.
Finally, I would like to give a special acknowledgement to my sister, Li Wei, for her consistent
patience and faith in me. I would also like to thank my parents for their understanding, and
for making all of this come true.
Chapter 1
Introduction
1.1 Context and motivations
Due to the popularity of the Internet, Internet traffic has increased rapidly over the last ten to
fifteen years. The performance of the router, a core part of the network, is considered an
important factor in the overall performance of the whole network. Typically, a router accepts
datagrams, relays them toward their destination and is responsible for determining the route [13].
Ever-increasing requirements from network applications, however, drive routers to supply more
new network services rapidly, such as packet tagging, application-level proxies,
application-specific packet dropping, performance monitoring, intrusion detection, and various
filters and firewalls [4].
While most backbone routers improve their packet forwarding using application-specific
integrated circuits (ASICs), software-based edge routers [2][5] have become attractive thanks to
high-performance network processors.
Thus, modular router architectures have an advantage over conventional routers in flexibility
and extensibility. A typical such system is MIT's Click router [1], a modular architecture built
from different software components (elements). Component-based routers make software networking
easier to program.
However, component techniques, which split a big, complicated computation task into small,
simple computation units (components) and build the links that pass messages (changes) between
components, offer reuse and flexibility but suffer inefficiencies that monolithic software
can avoid [11]. One way to reduce modularity overhead is to apply a hardware/software
co-design methodology that increases the computation in a component and offsets the cost of the
communication among components. This methodology is implemented on the StepNP [6] platform
that we use in our work.
1.2 Objective
In this context, the main objectives of our work are:
1) Bringing a new solution for partitioning an application onto a hardware/software architecture
using high-level simulation in co-design.
2) Exploring how to implement the Click router application system (see Section 4.3) on
multiprocessors and evaluating the performance of such an architecture.
1.3 Contribution
To reach our objectives, the proposed contributions are:
1) Proposing a methodology for application partitioning. We use the Click IPv4 software router as
an application and analyze the impact of the methodology on the packet-processing performance of
the Click IPv4 router.
2) Developing an executable application-specific architecture (see Section 5.1.1) implementing
the Click software router on three ARM CPUs with the help of the StepNP platform, and evaluating
the method. This approach may be extended to implementing Click software on more than three
processors.
1.4 Thesis structure
In the following chapters, we introduce the basics of HW/SW co-design in Chapter 2 and of
multiprocessors in Chapter 3. Then, we present a co-design methodology for partitioning the
Click router into software parts and hardware parts and for implementing those parts on the
StepNP platform (Chapter 4), and a methodology for implementing Click on three processors
(Chapter 5). Chapter 6 evaluates those two methodologies by simulation. In the final chapter,
we also discuss future work.
Chapter 2
Hardware/Software Co-Design
2.1 Definition & motivation for hardware/software co-design
Hardware/software co-design refers to the collaborative design of hardware and software,
considering the hardware and software concurrently to design computing systems that meet all
performance requirements and minimize the amount of hardware resources [14]. Two main
considerations stimulate the need for hardware/software co-design in digital systems design.
From the hardware aspect, dedicated computing systems need to react to external inputs
instantly and to handle increasingly complicated computation, which raises the performance
demands on today's digital systems. With ASICs processing data in parallel with processors,
a hardware solution may shorten execution time, improving the overall performance
of digital systems. However, ASICs are often not produced in high volume; they are often designed
to meet the performance needs of particular application systems. Thus, a hardware solution may
be more expensive than a software solution [15] due to the development cost of ASICs.
From the software aspect, software can be reused on various processors and modified more easily.
This may reduce design effort and shorten time-to-market. Therefore, when
considering cost constraints, designers may value the re-programmability and flexibility of
software solutions. However, software often performs tasks in serial order and may not complete
some operations within the time constraints [15].
In summary, both software and hardware components are of interest for implementing complex
systems. Currently, all systems include both types of components, and the design of these
types of systems requires new methodologies to reduce the design effort and to shorten the
time-to-market.
2.2 The hardware/software co-design flow
The hardware/software co-design process may include creating hardware/software models,
validating them against the desired design, partitioning, and implementing the software (through
compilation) and the physical hardware (through synthesis). The overall design flow is depicted
in Figure 2-1.
The first step in HW/SW co-design is system-level specification, which defines the functionality
of the system at a high level of abstraction. The process then covers major steps such as the
HW/SW partitioning stage, which is discussed in more detail in Section 2.4.
The next step, specification refinement, is what we call modeling: turning the abstract
specification into a program implementation. Software may be coded in C and hardware
may be described in VHDL. The interface between hardware and software is built at this
step. System validation may be done at this phase using formal verification, simulation and
emulation. Formal verification comprises techniques that allow designers to prove properties
of their design [14, 22] using mathematical and logical equations. Simulation checks
whether the functions of the design match the initial specification for a limited set of input
data. By loading hardware units into programmable hardware (e.g. an FPGA) and running them in a
more realistic test environment, emulation provides more practical prototypes, bringing system
designs closer to final products.
After co-synthesis, a set of ASICs is synthesized and the software is compiled into executable
code targeted at specific processors. In fact, validation can be used at any abstraction level in
the design process. Partitioning may be done in a design process to fulfill performance and cost
constraints.
[Figure 2-1: flow chart with the stages System Specification, Co-Estimation, HW/SW Partitioning
(HW parts, interface parts, SW parts), Specification Refinement (HW specifications, SW
specifications), Co-Synthesis (HW synthesis, SW compilation) and Co-Simulation, with validation
(formal verification) alongside and a final "Design ok?" test that ends the flow on "yes".]
Figure 2-1 Hardware/software co-design framework [14, 22]
2.3 Target architectures
Target system architectures [27] may be classified into two categories: single-processor
architectures with one or more ASICs, and multiprocessor architectures [14].
[Figure 2-2: block diagram showing a processor component (e.g. MC68000) running executable
software, a memory, and HW components ASIC(1) to ASIC(n) acting as accelerators.]
Figure 2-2 A single-processor architecture [14]
The overall execution time of a program on a single processor may be shortened by adding ASICs
(also called coprocessors) to the system [14]. The coprocessors are dedicated to implementing
particular functions in hardware, in parallel with the software components running on the
processor. Some co-design tools (e.g. Cosyma, Mickey and Vulcan [14]) support such architectures.
Figure 2-2 describes a typical framework for such target architectures.
[Figure 2-3: block diagram showing processors (e.g. processor-1, an ARM7), a custom processor,
various possible hardware implementation components (ASIC, FPGA, IP), and an interface to
external instruments.]
Figure 2-3 A typical multiprocessor architecture [24]
Multiprocessor architectures include multiple processors and several ASICs (or coprocessors) to
improve the overall performance of the application programs running on such architectures. Some
co-design tools (e.g. StepNP [6] and CoWare [14]) provide a way to build the interface between
hardware and software and to create a simulation model, supporting co-design on multiprocessor
architectures. Designers are also able to select different kinds of processors (e.g. ARM7
and MC68000 processors) and to build heterogeneous platforms.
2.4 Partitioning: definition and problems
As shown in the previous section, partitioning plays a crucial role in successful
hardware/software co-design. Based mainly on the work presented in [14], we introduce in this
section the principal partitioning problems. Hardware/software partitioning refers to dividing the
application system into software components and hardware units, and bridging the software
components and hardware units according to the system specification [14].
2.4.1 Recent problems on hardware/software co-design
Looking at co-design, we observe several relevant problems that remain in software and
hardware:
• embedded system
• partitioning
• software compiler
• software modularity
• parallel computation
• other
Problems in embedded system
Embedded systems fall into two categories: embedded control systems and embedded
data-processing systems [14]. Embedded control systems, also called real-time control systems,
react to external events, execute the corresponding functions on incoming data and produce
the result within a restricted time. At the same time, the demand for lower-cost control creates
strong pressure in the marketplace. The need to balance cost and time constraints encourages
design engineers to develop specific design methodologies that deploy both software and hardware
components through HW/SW co-design.
Meanwhile, embedded systems for telecommunication applications need to deal with more
complex data processing, including data reception, data compression/decompression and routing.
Such complex computing systems require more powerful processors (e.g. DSPs or ASIPs) together
with ASICs. Therefore, researchers are developing advanced approaches to HW/SW
partitioning, performance analysis and evaluation.
Partitioning
Certain segments of a software program may be bottlenecks that degrade the performance of the
program [15]. In this case, the crucial issue in partitioning is to develop
new methodologies that identify those critical segments efficiently at system level with the aid
of CAD tools for modeling, verification and simulation. Thus, interactive and iterative
approaches to hardware/software mapping have drawn the attention of many researchers in this area.
Some of these approaches are automatic, relying on a partitioning algorithm to
identify the critical operations in the software code. Others are guided manually by human
designers with the aid of CAD tools. In addition, resource constraints may force tasks
to execute in serial order, which makes scheduling difficult during partitioning [14].
Software compilers
Since embedded data-processing applications often execute dedicated software programs
for telecommunication, application-specific instruction-set processors (ASIPs), whose
instruction set (IS) is selected to match the specific application software, have emerged to
support specific programs with high performance while being more programmable than ASICs [15].
In terms of performance, power, flexibility and development time, ASIPs lie between
general-purpose processors and ASICs. For example, the design time of ASIPs is lower than that of
ASICs, and the performance of ASICs is higher than that of ASIPs [15]. The specific IS may
support fast execution of particular programs, improving application performance and making ASIPs
more competitive. However, the ASIP solution introduces a software problem related to
instruction-set selection (ISS) and code generation [15]. The choice of instruction set for an
application affects the underlying hardware structure of the ASIP and the design of the
corresponding compilers, which must generate efficient code to perform the desired tasks on that
ASIP [15]. Although retargetable compilers may generate high-quality binary code for the desired
instruction sets and specific hardware architectures, some technical challenges in the
development of such compilers have not yet been completely solved [16], and they require great
effort and much time in a co-design environment.
Software Modularity
Software modularity makes it possible for designers to reuse existing modules [27]. Such
software architectures allow complicated functions to be separated into many simpler modules,
reducing the complexity of software design. However, dividing a complicated task into too many
processes may involve a lot of data exchange among processes, reducing the overall
performance of the whole task. As a result, meeting performance constraints is a
major challenge for such architectures. Exploring a methodology to shorten the
execution time of modular software is therefore especially valuable where reuse matters for
commercial profitability.
Parallel computation
Since most applications for embedded computers must respond to external stimuli instantly,
computations physically distributed on ASICs may be performed concurrently with the computations
on CPUs to satisfy the performance constraints. This increases the difficulty of co-design,
since designers must consider the related scheduling and interfacing problems across several
ASIC units to meet the performance constraints of the applications.
Other
As networks require much faster communication, networks on chip (NoCs) have developed into more
complex systems, putting more stress on packet-processing speed. How to evaluate and analyze the
performance of NoCs in co-design has become a pressing problem [20].
In the CAD field, co-simulation methods have to be explored. Better tools need to be developed to
support the design of better system-level models in hardware/software co-design [15].
2.4.2 Partitioning concerns
Designers may use partitioning algorithms to select functionality, to balance the hardware and
software components, and to meet the requirements of the computing system automatically. Tools
that include such partitioning algorithms complete the whole partitioning process automatically,
with as little designer intervention as possible. Besides using partitioning algorithms, design
engineers may complete the partitioning phase semi-automatically with CAD tools (such as
simulation tools).
Several partitioning concerns related to hardware/software co-design must be taken into account
[14]:
• hardware/software mapping
• hardware sharing
• interfacing
• scheduling
Hardware/software mapping
Given the system specifications, hardware/software mapping refers to choosing some particular
components to be executed on microprocessors as software and determining the other parts to be
implemented on a set of logic circuits as hardware.
Figure 2-4 shows that multiple functions (v1, v2) in a program cannot be executed concurrently
on one processor. When these functions have no data dependency and are mapped onto
different hardware units (processor p1, h2) respectively, these units may execute the functions
(v1, v2) in parallel.
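The effect of mapping on execution time can be illustrated with a minimal Python sketch. It assumes that v1 and v2 are independent, that functions mapped to the same unit run one after another, and that different units run in parallel; the cycle counts and the helper name `completion_time` are hypothetical and do not come from the thesis.

```python
def completion_time(durations, mapping):
    """Total time when functions on the same unit run serially
    and distinct units run in parallel (functions assumed independent)."""
    busy = {}
    for func, unit in mapping.items():
        busy[unit] = busy.get(unit, 0) + durations[func]
    return max(busy.values())


# Hypothetical cycle counts for the two functions of Figure 2-4.
durations = {"v1": 4, "v2": 3}

all_software = {"v1": "p1", "v2": "p1"}   # both functions on processor p1
split = {"v1": "p1", "v2": "h2"}          # v2 moved to a hardware unit

print(completion_time(durations, all_software))
print(completion_time(durations, split))
```

Under these assumptions the all-software mapping takes the sum of the two durations, while the split mapping takes only the longer of the two.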
Hardware sharing
Designers can usually analyze the relationships among the different functions of a program and
place several of those functions into a single hardware unit, taking performance and cost
constraints into account. As a result, many functions can share the same hardware resources, and
one hardware unit may implement several different software functions. This process is called
hardware sharing; its goal is to use a minimum amount of hardware resources to meet the cost
constraints (see Figure 2-5). In addition, when considering hardware sharing, design engineers
have to address the scheduling problem to obtain a better outcome from hardware sharing.
[Figure 2-4: functions v1 and v2 shown first as a software implementation on processor p1
(int v1(); int v2(); void main()) with its software schedule, then as a hardware implementation
with its hardware schedule.]
Figure 2-4 Hardware/software mapping [14]
[Figure 2-5: hardware components and time schedules for two implementations. First HW
implementation: functions v1 to v4 implemented on four hardware components (H1, H2, H3, H4)
respectively. Second HW implementation: functions v1 and v2 share hardware H1-2, functions v3 and
v4 share hardware H3-4.]
Figure 2-5 Hardware sharing [14]
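The trade-off behind hardware sharing can be sketched as follows. This is only an illustration under stated assumptions: functions on a shared unit run serially, the durations, unit costs and deadlines are invented, and `cheapest_sharing` is a hypothetical helper, not a tool described in [14].

```python
def cheapest_sharing(candidates, duration, unit_cost, deadline):
    """Return (cost, time, name) of the cheapest candidate assignment
    whose completion time meets the deadline, or None if none does."""
    feasible = []
    for name, assignment in candidates.items():
        load = {}
        for func, unit in assignment.items():
            # Functions sharing a unit execute one after another.
            load[unit] = load.get(unit, 0) + duration[func]
        time = max(load.values())
        cost = sum(unit_cost[unit] for unit in load)
        if time <= deadline:
            feasible.append((cost, time, name))
    return min(feasible) if feasible else None


# Hypothetical numbers mirroring the two implementations of Figure 2-5.
duration = {"v1": 2, "v2": 2, "v3": 2, "v4": 2}
unit_cost = {"H1": 10, "H2": 10, "H3": 10, "H4": 10, "H1-2": 12, "H3-4": 12}
candidates = {
    "dedicated": {"v1": "H1", "v2": "H2", "v3": "H3", "v4": "H4"},
    "shared": {"v1": "H1-2", "v2": "H1-2", "v3": "H3-4", "v4": "H3-4"},
}

print(cheapest_sharing(candidates, duration, unit_cost, deadline=4))
print(cheapest_sharing(candidates, duration, unit_cost, deadline=3))
```

With a loose deadline the shared implementation wins on cost; with a tighter deadline only the dedicated implementation remains feasible, which is why sharing and scheduling have to be considered together.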
Scheduling
Scheduling decides the start time at which each hardware (HW) unit begins to execute a
particular function. Given the allocation of hardware resources, this process is vital, especially
for hardware sharing. During scheduling, designers have to consider relationships such as
dependencies and priorities among hardware units to prevent conflicts between their operations.
Scheduling limited hardware resources effectively makes it possible to meet the overall
performance of the computing system and to decrease the resource cost at the same time.
Functional pipelining is a typical example of scheduling [14]. Figure 2-6 depicts this idea.
After transferring its outputs to the next HW unit v2, HW unit v1 starts to process new incoming
data again. HW unit v2 depends on the result of HW unit v1: there is a data dependency between v1
and v2. Therefore, unit v2 must wait until unit v1 completes its operations. The functions of the
whole system are scheduled to run concurrently or sequentially according to their data
dependencies and priorities [14].
[Figure 2-6: functions v1 and v2 as a hardware implementation on an ASIC, functions v3 and v4 as
a software implementation on a processor, and the resulting scheduling of v1 to v4 over time.]
Figure 2-6 Functional scheduling [14]
Interfacing
Since the mapping phase distributes computing functions onto different processing units, either
software or hardware, interfacing focuses on how to communicate data between those units (HW and
processor). Usually such communications are mapped to specific channels between the HW units (or
ASICs) and the processor. In addition, the external time for transferring data has to be
considered and scheduled to prevent data conflicts between HW units outputting data and software
functions waiting for that data as input.
Figure 2-7 illustrates what should be considered in the interfacing phase.
Figure 2-7 Interfacing [14]
(The figure shows processor p1, HW unit h1, functions v1 and v2, the channel interface between p1
and h1, and the scheduling of data transfers on that channel.)
2.4.3 The strategies of partitioning
Various system-level partitioning approaches have been developed to divide the behaviour
specification into a set of software processes running on CPUs and co-processors in
hardware/software co-design. Two approaches have attracted the interest of researchers:
software-oriented and hardware-oriented [14].
The Cosyma synthesis tool suite (developed at the University of Braunschweig, Germany) [15] uses a
software-oriented algorithm to partition a computing system into components running on the
processor and components implemented in ASIC hardware, shortening the execution time of the
overall system. Software-oriented approaches start with a compiled software program, determine
the bottlenecks of the software and move the bottleneck segments into dedicated hardware
coprocessors. The approach is an iterative partitioning loop that shifts critical functions into
hardware until all timing constraints of the computing system are fulfilled [14].
A hardware-oriented algorithm was introduced by Gupta and De Micheli and applied in the Vulcan
synthesis tool suite [14, 15]. The approach starts by placing all functions in hardware and then,
while still meeting the performance constraints of the computing system, iteratively moves
functions executed on hardware coprocessors onto the software components running on the
processor [20]. Thus, the functions remaining in hardware for synthesis may be implemented on a
minimum amount of ASIC hardware, lowering the cost of the system design. One benefit of the
approach is that it reduces the cost of hardware resources while fulfilling the overall
performance constraints of the computing system [14].
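The two strategies can be contrasted with a short C++ sketch. It is a simplified illustration under stated assumptions, not the actual Cosyma or Vulcan algorithm: the `Function` record with its three estimates, the sequential timing model in `totalTime`, and the greedy selection rules are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-function estimates; real tools obtain such figures from
// profiling and synthesis.
struct Function {
    int swTime;  // execution time on the processor
    int hwTime;  // execution time on a hardware coprocessor
    int hwCost;  // hardware cost (e.g. area) of the coprocessor version
};

// Total execution time, assuming the functions run one after another.
int totalTime(const std::vector<Function>& f, const std::vector<bool>& inHw) {
    assert(f.size() == inHw.size());
    int t = 0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        t += inHw[i] ? f[i].hwTime : f[i].swTime;
    }
    return t;
}

// Software-oriented: start with everything in software and move the function
// with the largest time saving into hardware until the deadline is met.
std::vector<bool> softwareOriented(const std::vector<Function>& f, int deadline) {
    std::vector<bool> inHw(f.size(), false);
    while (totalTime(f, inHw) > deadline) {
        std::size_t best = f.size();  // f.size() means "no candidate found"
        int bestGain = 0;
        for (std::size_t i = 0; i < f.size(); ++i) {
            int gain = f[i].swTime - f[i].hwTime;
            if (!inHw[i] && gain > bestGain) {
                best = i;
                bestGain = gain;
            }
        }
        if (best == f.size()) break;  // nothing left to move; deadline unmet
        inHw[best] = true;
    }
    return inHw;
}

// Hardware-oriented: start with everything in hardware and try to move each
// function to software, most expensive hardware first, keeping a move only
// if the deadline still holds.
std::vector<bool> hardwareOriented(const std::vector<Function>& f, int deadline) {
    std::vector<bool> inHw(f.size(), true);
    std::vector<bool> tried(f.size(), false);
    for (std::size_t round = 0; round < f.size(); ++round) {
        std::size_t pick = f.size();
        for (std::size_t i = 0; i < f.size(); ++i) {
            if (!tried[i] && (pick == f.size() || f[i].hwCost > f[pick].hwCost)) {
                pick = i;
            }
        }
        tried[pick] = true;
        inHw[pick] = false;
        if (totalTime(f, inHw) > deadline) inHw[pick] = true;  // undo the move
    }
    return inHw;
}
```

Both functions return one flag per function, true meaning "implemented in hardware". The software-oriented loop stops as soon as the timing constraint holds, whereas the hardware-oriented loop keeps removing hardware for as long as the constraint allows.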
2.5 Our work on the hardware/software co-design
Considering commercial profit and shorter time-to-market, design engineers must pay more
attention to the reuse of modular software. Research on how to identify the crucial segments in
modular software and implement them in hardware is drawing more attention in hardware/software
co-design. New approaches to partitioning object-oriented systems are needed so that such systems
can keep the flexibility of modular programming and gain high performance from the combination of
hardware and software. Our work concentrates on this aspect of hardware/software co-design,
exploiting the object-oriented paradigm.
We use the Click router, which inherits the strengths of object-oriented technology, such as
flexibility and reuse. However, the cost of transfers among objects may lower the overall
performance of the Click system. Software optimization may not eliminate this cost, because of the
way in which Click routers process packets (see Chapter 4.3). Therefore, the alternative of
implementing some functions in hardware plays a much more important role in accelerating packet
processing in the network. Thanks to previous research by D. Quinn and M. Hubin [12], who proposed
a method for building the interface between a Click router running on a processor and ASIC
components on the StepNP platform (described in Chapter 4.2), we are able to develop a
system-level approach to identify the bottleneck fragments in Click code, move those fragments
into hardware components, and evaluate the performance of a network-on-chip (with Click fixed
inside) in co-design. In fact, D. Quinn and M. Hubin [12] developed a way to create connections
between hardware parts and software parts in the StepNP environment without discussing how to
partition the whole IPv4 router or how to evaluate its overall performance. We used their method
to run the simulation and to complete the evaluation of our partitioning approach for the whole
Click IPv4 software router on the StepNP platform. Meanwhile, D. Quinn and M. Hubin chose one
specific Click element (CheckIPHeader) on a crucial path to implement their method, without
applying it to the other Click elements on different downstream routers. Since their method may
therefore cause deadlock in simulation, we made some small modifications to it (the details are
discussed in Chapter 4.5, section 1).
2.6 Our partitioning methodology
Our partitioning methodology is based on model-based co-design (MBC) performed at a higher
abstraction level. Model-based co-design (MBC) [24] partitioning refers to using simulation
modeling approaches to explore hardware/software partitioning in co-design. Although more and
more CAD tools have been developed, most hardware/software co-design problems cannot be solved
without guidance from human users [15].
Based on simulation modeling techniques as well as manual analysis of the Click code, we explored
a system-level methodology to partition the Click application between hardware and software,
implementing a Click IPv4 router on a single-processor SoC architecture. Using this methodology,
we evaluated the performance of the entire IPv4 router architecture in hardware/software
co-design.
Chapter 3
Multiprocessor Concepts
As computing for embedded systems gets more complicated, complex computations are distributed
over multiple processes and performed on several processors, with inter-communication links
between the processors [21]. Figure 3-1 is an example of a distributed system that includes both
a DSP (digital signal processor) handling signal processing and a microcontroller dealing with
external user interaction [21].
Furthermore, many embedded systems perform time-critical tasks, such as real-time transactions.
Because of the semiconductor manufacturing process, the yield of small chips from a wafer is
higher than that of large chips [30]. Therefore, using several small processors that perform
simple operations may cost less than using one large processor that performs complex operations
[21]. Driven by these constraints of higher performance and lower cost, the demand for
distributed computing motivates the search for new design approaches oriented to multiprocessor
architectures in hardware/software co-design [27].
Also, although designers must handle the more difficult problem of processor parallelism caused
by distributed computing, current multiprocessor architectures make it possible to add or remove
processors flexibly and improve software performance by distributing computation among multiple
processors [27].
This chapter is organized as follows. Section 3.1 covers some current problems of
multiprocessors. Section 3.2 introduces some existing research on multiprocessor design.
3.1 The basic knowledge on multiprocessors
A multiprocessor architecture is defined by the following major features [31]:
1) Two or more central processing units (CPUs)
2) Shared memory and shared I/O
3) Hardware and software that interact at all levels (including the hardware and operation levels)
4) An operating system for such architectures
Figure 3-1 A distributed system on embedded chip [21]
(The figure shows a microcontroller and a DSP (digital signal processor), each with RAM, ROM and
a bus interface, connected by a bus. A keyboard is attached through a device interface, and an
analog signal is received through an analog-to-digital converter.)
Given the multiprocessor architecture, some existing problems should be considered:
• Basic organization
• Synchronization
• Distributed system scheduling
• Process partitioning
• Distributed process allocation
Basic organization
Based on their physical connections, the basic architectures of multiple processors may be
classified into two typical categories: single-bus multiprocessors and network multiprocessors
[30]. The architecture of a single-bus multiprocessor is depicted in Figure 3-2.
Figure 3-2 A single-bus multiprocessor [30]
(The figure shows processors 1 to n with their caches, the memory and ASICs 1 to n attached to a
single bus.)
By replicating data close to each processor, the caches help to reduce traffic between memory and
processors, lowering the communication pressure on the single bus. Single-bus organizations are
useful and attractive when the number of processors is not large, usually between 2 and 32 [30].
Network multiprocessor: the processors are connected by a network [30], as depicted in Figure 3-3.
Such a structure overcomes the limitation of single-bus designs and may support more processors
running in parallel, from 8 to 256 [30]. Network multiprocessors may use message passing to
exchange data among processors [30].
Figure 3-3 Multiprocessor architecture with network connection [30]
Synchronization
When processors run concurrently, they may need to exchange or share data. Thus, communication is
needed to manage the resources shared between the processors. This process is called
synchronization or coordination [29, 30]. Some efficient techniques are widely used to support
this behaviour.
1) Locking
When a processor is working on the shared memory, it locks the memory, and other processors that
need the same memory have to wait until that processor releases it. This process is called
locking and unlocking [29, 30].
The goal of locking techniques is to prevent write/read conflicts in the shared memory. Otherwise,
a processor may work on old data before the previous processor finishes updating it.
2) Message passing
Message passing is an alternative approach to sharing data. This process is similar to
communication on a local area network. When a processor acting as sender sends a message to a
processor acting as receiver, the sender notifies the receiver that the message has been sent. In
response, when the message reaches the receiver, the receiving processor acknowledges to the
sending processor that the message has been received. This process is called the message passing
technique [30]. In this way, the processors can communicate data to each other.
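Both synchronization techniques can be sketched in C++ with the standard `std::mutex`. This is only an illustration under assumptions: the `SharedCounter` and `Mailbox` classes are hypothetical, software threads stand in for processors, and the acknowledgement is modelled as a simple counter rather than a real reply message.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>

// Locking: only the holder of the lock may touch the shared value, so no
// reader can observe a half-finished update.
class SharedCounter {
public:
    void add(int n) {
        std::lock_guard<std::mutex> lock(mutex_);  // lock; unlocked on scope exit
        value_ += n;
    }
    int value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }
private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

// Message passing: the sender deposits messages and the receiver acknowledges
// each one it takes, so the sender can check that everything was received.
class Mailbox {
public:
    void send(int message) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push(message);
        ++sent_;
    }
    // Returns false when no message is waiting.
    bool receive(int& message) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_.empty()) return false;
        message = pending_.front();
        pending_.pop();
        ++acknowledged_;  // acknowledgement visible to the sender
        return true;
    }
    bool allAcknowledged() const {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(acknowledged_ <= sent_);
        return acknowledged_ == sent_;
    }
private:
    mutable std::mutex mutex_;
    std::queue<int> pending_;
    std::size_t sent_ = 0;
    std::size_t acknowledged_ = 0;
};
```

In a multi-threaded program, several threads could call `add`, `send` and `receive` on the same objects; the lock inside each method serializes the accesses.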
Distributed system scheduling
Process scheduling determines when hardware engines (e.g. processors) start to perform a process
and how the hardware engines use the available time efficiently [29]. Process scheduling directly
affects the overall cost and performance of the distributed system. When devising a scheduling
strategy, designers must consider two major points: the time required to run the processes on the
available processors and the number of processors required to meet the time constraints [21].
Usually, scheduling policies are divided into two categories: static scheduling and dynamic
scheduling [29].
Static scheduling determines, when a program is designed, on which of the existing processors each
process will execute, and even the time at which each process runs. Thus, every process runs on a
fixed processor, which avoids the high overhead that scheduling during system execution would
incur. However, this method may not suit cases in which the number of parallel processes is likely
to change at runtime [29].
Dynamic scheduling can solve the above-mentioned problem, since such a policy may load or unload
processes running on a processor according to the changes of processes at runtime. Although this
policy adds the cost of rescheduling processes, it can achieve better performance when the number
of parallel processes changes frequently and rapidly and the cost of swapping processes is low.
Until now, some important scheduling algorithms have been developed such as single shared ready
queue, co-scheduling and dynamic partitioning.
Single shared ready queue (SSRQ)
In the SSRQ policy, designers use a queue shared by all processors to allocate processes onto
processors, based on rules such as FCFS (first come, first served), SJF (shortest job first) or RR
(round robin) [29].
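A single shared ready queue served in FCFS order can be simulated with a few lines of C++. The sketch below is an assumption-laden illustration rather than an algorithm from [29]: the function name `runSharedQueueFcfs` is hypothetical, each process is reduced to a known run time, and a process runs to completion once it is dispatched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Simulates a single ready queue shared by all processors and served first
// come, first served: whichever processor becomes free first takes the
// process at the head of the queue. `readyQueue` holds the run time of each
// process; the result is the finish time of each process in queue order.
std::vector<int> runSharedQueueFcfs(std::queue<int> readyQueue,
                                    std::size_t processorCount) {
    assert(processorCount > 0);
    std::vector<int> freeAt(processorCount, 0);  // time each processor is idle again
    std::vector<int> finish;
    while (!readyQueue.empty()) {
        // Pick the processor that becomes idle earliest (first one on ties).
        std::vector<int>::iterator next =
            std::min_element(freeAt.begin(), freeAt.end());
        *next += readyQueue.front();  // run the process to completion
        finish.push_back(*next);
        readyQueue.pop();
    }
    return finish;
}
```

Replacing the FIFO queue with a priority queue ordered by run time would turn the same loop into an SJF variant.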
Co-scheduling
Co-scheduling arranges for processes to be executed concurrently on different processors. Such a
policy may suit situations in which processes require frequent data exchange [29].
Dynamic partitioning
In the dynamic partitioning policy, the processes of a program are dynamically scheduled onto a
set of processors chosen from the available processors. The program then periodically obtains a
number of processes from the scheduling server. This number is called the “ideal number”; it
indicates the number of processes with which the program would execute under ideal conditions. If
the ideal number is greater than the number of running processes, the scheduler reactivates
previously suspended processes. If the number of running processes is greater than the ideal
number, some processes are suspended to reach the ideal situation in which the program executes
best [29].
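One periodic check of the dynamic partitioning policy can be written as a small C++ function. This is a sketch of the rule as described above, not code from [29]; the `ProcessCounts` structure and the `adjustToIdeal` function are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Number of running and suspended processes of one program (hypothetical).
struct ProcessCounts {
    int running;
    int suspended;
};

// One periodic check: compare the number of running processes with the
// "ideal number" obtained from the scheduling server, then resume or
// suspend processes to get as close to it as possible.
ProcessCounts adjustToIdeal(ProcessCounts c, int idealNumber) {
    assert(c.running >= 0 && c.suspended >= 0 && idealNumber >= 0);
    if (idealNumber > c.running) {
        // Resume suspended processes, but never more than are available.
        int resume = std::min(idealNumber - c.running, c.suspended);
        c.running += resume;
        c.suspended -= resume;
    } else if (c.running > idealNumber) {
        int suspend = c.running - idealNumber;
        c.running -= suspend;
        c.suspended += suspend;
    }
    return c;
}
```

Note that when too few suspended processes exist, the program simply stays below the ideal number until more processes become available.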
Process partitioning [21]:
Process partitioning refers to dividing functions that have no data dependency between them (a
data dependency means that one function has to wait for another's output as its input) into
several smaller processes and running those processes on different processors. Good partitioning
helps ensure that hardware resources (e.g. processors) are used efficiently and prevents the
situation in which some processors are very busy computing while others are often idle [21]. Such
partitioning may shorten the idle time of a processor, speed up the execution of processes
running in parallel on different processors and improve the overall performance of software
architectures [21].
For example (see Figure 3-4), a process p1 includes two functions, which can be partitioned into
two processes, x and y. Before partitioning, the two functions are executed sequentially, so
process p3, which depends on the values produced by both functions, has to wait a long time,
increasing the idle time of processor p3. After process partitioning, process x and process y may
run on two different processors concurrently. As a result, the results of processes x and y may
be sent to process p3 earlier than before partitioning, shortening the idle time of processor p3
[21].
Some typical contributions in this research area include the process partitioning algorithm
developed by Huang [21] to minimize the delay of inter-process communication [21], and the
branch-and-bound algorithm of Ma et al. to reduce the number of processes and the inter-process
communication cost [21].
Figure 3-4 An example to describe the influence of process partitioning
on distributed system performance [21]
(Before partitioning, process p1 on processor p1 runs two loops one after the other: the first
calls proc(i, a) for i < N1 and then send(w, a); the second calls proc(i, b) for i < N2 and then
send(x, b). Process p3 on processor p3 depends on both results, w and x. After process
partitioning, the two loops become process x and process y, which run simultaneously.)
Distributed process allocation [21]
Distributed process allocation considers how to assign a group of running processes to the
processors. For example, in Figure 3-5 (a), processes f1, f2 and f3 need to exchange data
(messages) frequently. Process f4 has little or no data exchange with the other processes [21].
Thus, we may arrange for f1, f2 and f3 to run on the same processor (P1) and for f4 to run on
processor P2, as depicted in Figure 3-5 (b). Since f1, f2 and f3 are running on the same
processor, we may use much cheaper and faster shared memory as a way of exchanging their
messages. Such an allocation may minimize the communication bandwidth needed on the link (e.g.
bus) between processors P1 and P2, reducing bus conflicts and improving the overall performance
of the distributed system [21].
Figure 3-5 How process allocation affects the performance
of distributed computing system [21]
((a) Before process allocation: P1 and P2 exchange data over the bus. (b) After process
allocation.)
The process allocation can be done either under the designer's guidance or by an automated
algorithm. The first automated algorithm was developed by Stone [21], but it might not suit
allocation onto more than two processors. Since then, many researchers have continued to develop
further approaches to process allocation. For example, with the help of a graph matching
heuristic [21], Shen and Tsai's algorithm may effectively allocate the running processes onto the
available processors, lowering inter-processor communication traffic to a minimum and balancing
the overall system load [21].
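The effect of an allocation on the inter-processor link can be checked with a small calculation. The C++ sketch below is not one of the algorithms cited in [21]; the `linkTraffic` function, the traffic matrix and its units are hypothetical and only illustrate why placing heavily communicating processes together reduces bus traffic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// traffic[i][j]: amount of data process i sends to process j (hypothetical units).
// processorOf[i]: index of the processor that process i is allocated to.
// Returns the traffic that has to cross the link between processors; data
// exchanged on the same processor can use local shared memory instead.
int linkTraffic(const std::vector<std::vector<int> >& traffic,
                const std::vector<std::size_t>& processorOf) {
    assert(traffic.size() == processorOf.size());
    int total = 0;
    for (std::size_t i = 0; i < traffic.size(); ++i) {
        assert(traffic[i].size() == traffic.size());  // square matrix expected
        for (std::size_t j = 0; j < traffic.size(); ++j) {
            if (processorOf[i] != processorOf[j]) total += traffic[i][j];
        }
    }
    return total;
}
```

Evaluating this cost for several candidate allocations and keeping the cheapest one is the simplest form of designer-guided allocation; automated algorithms search the same space more systematically.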
3.2 Related research on multiprocessors
Nowadays, system-on-chip design stimulates much research interest in multiprocessor architecture
design. Several existing approaches have presented various models to support the design of
application-specific (i.e. dedicated, such as embedded systems) multiprocessor systems-on-chip
[27]. Among those research projects, POLIS [14] introduces a target architecture that combines
general-purpose processors with other possible components such as DSPs and ASICs. COSY [14] has
developed a layered communication model based on the POLIS approach [27]. CoWare [14] presents a
more generic architecture and design flow to support application-specific multiprocessor
system-on-chip design [27]. However, the above-mentioned approaches have at least two
limitations in supporting the design of various application-specific multiprocessor
systems-on-chip:
1) Those approaches restrict the use of components, supporting only some specific components.
For example, designers can choose only a DSP or products of a particular type (MC68000, ARM7,
etc.) [27] in those approaches.
2) Their protocol channel libraries may offer only limited protocol types for communication among
components. Only some specific components can be connected with a specific communication link
[27].
Given this situation, Amer.B. and Damie.L introduce a generic model and a methodology to support
heterogeneous processors in application-specific SoC design.
Compared with the above-mentioned methods, this approach proposes a higher-level, more abstract
architecture model, which may guide the building of more generic platforms and may be used in
much wider application fields, without limitations on components and communication models. For
example, the StepNP platform is a practical application project based on the generic model [27],
supporting network-on-chip design. Figure 3-6 depicts this generic architectural model.
In the multiprocessor architectural model, designers must be concerned with the following three
aspects:
Modularity is the property of object-oriented technology that allows a complex system to be
divided into small, simple components which encapsulate specific behaviours. The internal
behaviours may not be modified directly from outside. Thanks to the flexibility of modules,
components may be reused and assembled to support various applications according to different
requirements [27].
Components may be software, hardware or communication networks (e.g. a channel or a bus).
Hardware components include processors, memories or particular peripherals. Components inherit
the quality of modularity. In this model, processor components may be chosen from various CPU
types, such as ARM7 or MC68000, to construct heterogeneous systems that meet the needs of various
practical SoC designs [27].
A multiprocessor architecture model should be scalable. In such an architecture, designers should
be able to increase or reduce the number of components to support various application-specific
systems [27]. Here, components can be CPUs or communication channels [27].
The architecture for the Click application on the StepNP platform, shown in Figure 5-2 of Chapter
5, may be regarded as an instance of the generic multiprocessor architecture model. The details
are given in Chapter 5. Meanwhile, Amer.B. and Damie.L present a generic methodology for
multiprocessor SoC design. The overall design flow is described in Figure 3-7.
Figure 3-6 Generic architecture model [27]
(The figure shows CPU, memory and ASIC components. Its annotation reads: those components are
useful to the scalability of the architecture; components may be added and removed by this
hardware flexibly and conveniently.)
The major steps are as follows:
1. Determine the hardware architecture with fixed parameters, including the CPU types (such as
ARM7, MC68000, etc.), from the hardware aspect. From the software aspect, consider which
processes of the specific application should be executed concurrently on the processors. These
two actions may be taken concurrently.
2. Select parameters, such as the number of CPUs and the communication protocol, for creating the
high-level model.
Furthermore, all parameters and the subsequent steps are presented in Figure 3-7.
Figure 3-7 The generic design methodology of
a multiprocessor SoC architecture [27]
The Select Parameters process (see Figure 3-7), which focuses on hardware design based on
application-specific parameters, generates an abstract architecture description and an allocation
table. The abstract architecture is a rough layout of the architecture platform. The allocation
table consists of detailed information about the architecture, including the interrupt levels
reserved for each CPU [27]. An allocation table may be used to refine the rough layout into a
more detailed architecture for SoC synthesis and to determine various interfaces, including the
connections between processors [27].
The work of software design mainly focuses on compiling programs and generating the binary code
targeted at each CPU [27].
After validation on CAD simulators, the final design results are generated by the SoC synthesis
process.
This approach provides a general guide on how to build an SoC architecture efficiently and
supports the development of various application-specific systems. Designers may follow the design
flow depicted in Figure 3-7 to complete an embedded chip design [27].
3.3 Our work on the multiprocessor
As mentioned in the introduction chapter, a router built from modular components has advantages
in flexibility and extensibility over a conventional router. The Click router is a new
architecture for building flexible and configurable routers with high performance [1]. Analysing
the structure of the Click software router shows that Click configurations are modular and easy
to extend [1]. Such a structure has potential parallelism, which makes it more feasible to run
the Click router on a multiprocessor. In addition, the modularity of the Click router has its own
flaws and limitations (see the explanation at the opening of Chapter 4). Our work on the
multiprocessor focuses on exploring a way to implement the Click router application system (coded
in C++) on a multiprocessor.
When considering the hardware design of a multiprocessor architecture, researchers must consider
whether software applications can run in parallel and whether programs can run faster on a
multiprocessor architecture. In fact, it is even harder to find applications that perform well on
multiple processors and take advantage of such hardware architectures efficiently [30]. Also,
designers should consider whether programs can be reused and whether the cost of modifying the
software for such an architecture can be minimized as the number of processors increases [30].
With regard to software reuse, a program in an object-oriented language has this strength due to
its inherent nature, which “describes a highly mobile concurrent system in which new objects are
created as computation proceeds and the linkage between components changes as references to
objects are passed in communications” [29].
A particular example of a relevant application is the SMP Click router, which originated from the
Click router [1] (detailed in Chapter 4.3) and was designed to run on multiprocessor
architectures. The modular nature of SMP Click allows users to write separate configurations,
achieving configuration-level parallelism [28]. Configuration-level parallelism refers to
splitting a configuration into smaller sub-configurations and executing the sub-configurations on
multiple processors concurrently.
Due to the cost of passing packets between modules, the Click router may be sped up either with
the help of coprocessors or by running on multiprocessors in hardware/software co-design.
However, it may be difficult to implement SMP Click in a multiprocessor network-on-chip (NoC)
co-design for the following reasons:
1) The high cost of CPU scheduling on a centralized task queue of elements led the Click
designers to choose a private task queue of elements for each separate thread running on a
different processor. However, since SMP Click inherits the CPU scheduling of Click (see feature 3
in Chapter 4.3.2), this scheduling scheme may make it difficult to determine the running time of
the elements on different processors. Thus, one processor may be very busy while the others are
idle. How to balance processor resources becomes a problem that needs to be solved [28].
2) Since SMP Click uses separate threads for different processors, designers must consider how to
create shared locks to synchronize the data exchange between processors. That increases the
complexity of handling packet communication between different processors [28].
Therefore, the above-mentioned problems leave much room to explore an approach for executing the
Click router on a multiprocessor SoC architecture. Meanwhile, the modular structure gives the
Click router great flexibility, making it possible to implement such an approach.
3.4. Our methodology for Click implementation on multiprocessor
STMicroelectronics has developed a simple multi-threaded processor architecture on which a
hardware thread executes the overall Click application with the same or a different configuration
file. In their case, there are very few data dependencies and data exchanges between the
different threads executing the Click application. This approach is called inter-packet
processing level [26].
Our approach focuses on using multiple ARM processors to process a single packet with possibly
high inter-block dependencies [26], implementing configuration-level parallelism [28]. In other
words, the Click application is partitioned across the different processors of the architecture.
With the guidance of the generic architectural model, we have explored a system-level design
approach to implementing the Click router with multiple processors and evaluated the method on
the StepNP simulator. The details of the method are introduced in Chapter 5. The approach may be
considered a contribution on how to design a suitable multiprocessor SoC architecture to run
programs written in an object-oriented language (e.g. C++).
Chapter 4
Hardware/Software Co-Design Methodology on
Partitioning
Although the Click element (detailed in section 4.3) offers flexibility and scalability for
various router requirements, such modular elements impose a performance cost on packet processing
in two ways:
1) The modularity of elements incurs the expense of passing packets between elements in the flow
chart of a Click router configuration [1]. For example, one of the three virtual functions
simple_action(Packet *p), pull(int port number) or push(int port number, Packet *p) in each
element (a C++ object) must be called when a packet is forwarded from one element to the next
along the processing path in a Click configuration. A virtual function call costs a certain
amount of CPU time. According to an experimental result on Click performance, the cost of passing
a packet between two elements is about 70 ns [1], and passing through the sixteen elements of a
regular Click IP configuration costs about 1 µs [1] in total.
2) The modular structure may inevitably increase the cost of passing packets, since the Click
router has to call some elements with general functionality [1]. A typical example is the element
named “Classifier”, which includes generic functions that may not be used in an IP Click router.
Such generic functions are not tailored to the specific application of an IP Click router,
because software writers designed them for multiple uses across various Click router
applications. Thus, when used in a Click IP router, such general element code may cause
unnecessary overhead, costing more CPU time [1].
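The first cost can be made concrete with a small C++ sketch of packets being handed from element to element through virtual calls. The `MiniElement` and `Packet` types below are simplified, hypothetical stand-ins and not Click's real classes; only the idea of one virtual call per hand-off and the 70 ns per-hop figure quoted from [1] come from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical packet that merely counts how many elements it has visited.
struct Packet {
    int hops = 0;
};

// Simplified stand-in for a Click element; not Click's real Element class.
class MiniElement {
public:
    virtual ~MiniElement() = default;
    void connect(MiniElement* next) { next_ = next; }
    // Every hand-off between elements goes through this virtual call.
    virtual void push(Packet* p) {
        assert(p != nullptr);
        ++p->hops;
        if (next_ != nullptr) next_->push(p);
    }
private:
    MiniElement* next_ = nullptr;
};

// Builds a linear processing path of `count` connected elements.
std::vector<std::unique_ptr<MiniElement> > makePath(std::size_t count) {
    std::vector<std::unique_ptr<MiniElement> > path;
    for (std::size_t i = 0; i < count; ++i) {
        path.push_back(std::unique_ptr<MiniElement>(new MiniElement()));
    }
    for (std::size_t i = 0; i + 1 < count; ++i) {
        path[i]->connect(path[i + 1].get());
    }
    return path;
}

// Estimated pure hand-off cost of a path, using the per-hop figure from [1].
double passingCostNs(std::size_t elements, double nsPerHop = 70.0) {
    return static_cast<double>(elements) * nsPerHop;
}
```

With sixteen elements at 70 ns each, the estimate is 16 × 70 ns = 1120 ns, which agrees with the total of about 1 µs quoted above.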
These features are inherent in the Click application software, and the overheads are very
difficult to avoid in Click software alone. To counteract these costs, using a hardware/software
co-design methodology is a feasible choice. The methodology is implemented on the StepNP platform.
4.1 The general presentation for methodology
Since my research concentrates on the system-level partitioning process, combining simulation
tools with manual analysis of the source code, my work addresses the approach from three aspects:
hardware, software and simulation platform.
Simulation Platform:
We choose the StepNP platform as our simulation platform. Section 4.2 presents the structure of
the StepNP platform in detail.
Hardware:
Based on the StepNP simulation platform, we use the SystemC¹ modeling language to design the
hardware units of the computational system and to implement the system-level partitioning, the
performance analysis and the evaluation of the Click IPv4 router.
Software:
Our software work mainly concentrates on open source code: the Click router application.
STMicroelectronics has modified the source code to target the ARM7 processor SoC architecture.
Section 4.3 introduces the working principle of the Click IPv4 router and the basics of the Click
router application in detail. Section 4.4 introduces how to move crucial software segments onto a
hardware implementation during partitioning. After creating a hardware architecture in SystemC,
we may analyze the simulation results and identify the bottleneck code segments for partitioning.
Section 4.5 describes the hardware algorithm in SystemC and how to communicate between the
SystemC hardware modules and the software modules written in the object-oriented C++ language.
¹ SystemC is a modeling language built on C++ libraries that may be used to design both hardware
and software. If hardware and software are both designed with object-oriented technology, the
system-level co-simulation of HW/SW may be completed more easily in an environment that supports
SystemC. In addition, some synthesis tools, such as the CoCentric(R) SystemC compiler, can also
perform hardware synthesis at the lower level.
In our research, we modify the Click software, design the hardware and build the simulation
platform, adjusting these three aspects iteratively until the constraints of the application
system are met. Since the executable specification of the Click application has already been
defined and completed in object-oriented C++, the co-simulation between Click and the SystemC
hardware units may be done more easily at a high level. Figure 4-1 depicts the general outline of
the methodology.
[Figure 4-1 diagram: the Click application runs on the StepNP architecture; some elements are
implemented on hardware coprocessors (LookupRT, DecIPTTL, IPFragment, Strip, and CheckSum HW
modules) attached to the SOCP interconnect.]
Figure 4-1 The general framework of the approach on how to map software onto hardware
First, we created the cycle-accurate executable architecture in Figure 4-1 by modifying the existing
simulation platform in the StepNP toolkit. Then, running each element of the IPv4 router
configuration on the ARM processor of the architecture, we used simulation to count the execution
cycles of each element, which indicate the complexity of each part of the Click IPv4 code. Thus, we
were able to identify bottleneck fragments that cost many cycles and iteratively map those critical
operations onto co-processors tailored for the given architecture (here, ARM) [19].
With the support of the StepNP platform, the approach makes it easy to build the executable
application-specific architecture and to complete the partitioning process efficiently by combining
user guidance with simulation-based evaluation, which shortens the time and effort spent on
hardware/software co-design.
4.2 StepNP platform introduction
First, we introduce the principle of the StepNP platform, a System-level Telecom
Experimental Platform for Network Processing [6] that runs on Unix-like operating systems. StepNP
modeling supports top-down co-design verification methodologies and reduces the
time and effort spent on the system-level partitioning stage. StepNP offers libraries of existing
components with which designers can model hardware components and create a system-level model
framework for simulation-based validation and evaluation. The StepNP platform contains three
components relevant to the ARM RISC processor.
1) MIT Click Router Platform in StepNP
StepNP provides some patches to the MIT Click router (user mode) for the ARM simulator and
supplies some new elements as interfaces that allow Click to run on the ARM simulator.
Thus, a user-mode configuration of the MIT Click router can run on StepNP's ARM
simulator. The two newly added elements are FwidlSink, which injects packets from the Unix platform
into an address in StepNP, and FwidlSource, which reads packets from an address in the StepNP
platform. These two elements establish a bridge between the internal environment of the StepNP
platform and the external processing system or operating system platform.
2) SoC tools platform
The SoC tools platform falls into two categories: 1) tools for developing an embedded system on a
single processor, including an instruction-set simulator, a compiler, etc.; 2) tools for developing
a computing system over multiple processors, including control, debugging, and analysis
functionality [6].
3) NPU architecture simulation platform
The three major components of the NPU architecture simulation platform are a modeling
language (SystemC 2.0), a multithreaded processor model, and a SOCP (SystemC Open Core
Protocol) communication channel interface [6].
Using an existing ARM processor component as the master and a SOCP communication channel, we
can build a simple master-slave test architecture (called the SimplePacket platform, shown in Figure
4-2) to support the simulation of the Click router on the ARM CPU. Then, we choose a collection of
Click elements and test it on the SimplePacket platform.
[Figure 4-2 diagram: the SimplePacket platform, with interfaces to the external environment.]
Figure 4-2 SimplePacket platform in StepNP [12]
4.3 Click router
The Click router is a typical object-oriented application system developed by Eddie Kohler at MIT.
Most router systems have predetermined, fixed functions, often implemented in hardware (or
ASICs) before they are launched on the market. Such routers leave little room for an
administrator to configure them flexibly to meet particular requirements or to add new
functions to the router's configuration to support a new protocol or a new network development [1].
The software-based [4] Click router (meaning that most functions are implemented in software) is a
configurable architecture for packet forwarding that avoids those disadvantages. The Click
router has about 60 major elements to support various router configurations. Each element
implements a specific function in packet processing. By deploying Click router elements with
connections (explained in detail in section 4.3.1), users may build a router configuration, that is,
a collection of elements, to support the desired behavior.
In fact, a Click "element" is an object of a C++ class. Some of the element classes are virtual
classes. Thanks to C++ class inheritance, the Click router has two typical features
regarding router configuration:
1. Based on existing elements or virtual classes, users may develop new elements to extend
Click's functionality. After writing a new C++ class with the new functions, users can add
it to the Click router conveniently by following the approach described by the Click router
developers. The new elements may then be combined with the old ones to build various
configurations supporting the desired complex applications [1].
2. An element can implement various functions based on C++ polymorphism. For
example, RED (Random Early Detection) can behave as RED over multiple queues, weighted RED,
or drop-from-front RED according to the requirements of packet processing [1].
4.3.1 Click router element connection
The edges between elements in a configuration diagram are called connections and are used to
forward packets between the elements. Each element may have multiple output ports, each of which
connects to an input port of another element [1].
Click elements have three types of ports: push, pull, and agnostic. On a push connection, the
upstream element passes the packet from its push output port to the push input port of the next
element downstream. On a pull connection, the downstream element obtains a packet by calling on
the upstream pull port along the forwarding path. An agnostic port behaves as a push port if
connected to a push port and as a pull port if connected to a pull port. The only valid connections
between elements are therefore push ports connected to push ports and pull ports connected to
pull ports [1].
The types of connections among the elements are determined during the configuration initialization
phase. Then, once the Click router starts to process packets, a packet is forwarded through the
established connections from an entry element that receives incoming packets, such as FromDevice,
to an exit element that sends packets out of the router, such as ToDevice. The
connections are actually implemented by having the Click scheduling modules call
virtual functions such as push(int port_number, Packet *p) in each element [1]. All elements in
the Click router implement one of three virtual functions: simple_action(Packet *p), pull(int
port_number), or push(int port_number, Packet *p).
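The push and pull mechanism described above can be illustrated with a minimal C++ sketch. It is not Click's actual class hierarchy: ToyPacket, ToyElement, ToyCounter, and ToyQueue are hypothetical names, each element has a single input and a single output, and port numbers are omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a packet; real Click packets carry data and annotations.
struct ToyPacket { int id; };

// Simplified element with one input and one output (assumption: no port numbers).
class ToyElement {
public:
    virtual ~ToyElement() = default;
    // Push: the upstream element hands the packet over and control moves downstream.
    virtual void push(ToyPacket* p) {
        ToyPacket* q = simple_action(p);
        if (q && next_) next_->push(q);
    }
    // Pull: the downstream element asks for a packet and control moves upstream.
    virtual ToyPacket* pull() {
        ToyPacket* p = prev_ ? prev_->pull() : nullptr;
        return p ? simple_action(p) : nullptr;
    }
    // Per-packet work used by both directions; returning nullptr drops the packet.
    virtual ToyPacket* simple_action(ToyPacket* p) { return p; }
    void connect_to(ToyElement& downstream) { next_ = &downstream; downstream.prev_ = this; }
private:
    ToyElement* next_ = nullptr;
    ToyElement* prev_ = nullptr;
};

// Counts every packet that passes through it, in either direction.
class ToyCounter : public ToyElement {
public:
    int seen = 0;
    ToyPacket* simple_action(ToyPacket* p) override { ++seen; return p; }
};

// Explicit storage: the push path ends here and the pull path starts here.
class ToyQueue : public ToyElement {
public:
    explicit ToyQueue(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }
    void push(ToyPacket* p) override {
        if (store_.size() < capacity_) store_.push_back(p); else ++drops;
    }
    ToyPacket* pull() override {
        if (store_.empty()) return nullptr;
        ToyPacket* p = store_.front();
        store_.pop_front();
        return p;
    }
    std::size_t size() const { return store_.size(); }
    int drops = 0;
private:
    std::size_t capacity_;
    std::deque<ToyPacket*> store_;
};
```

In this sketch a source-side element pushes packets until they reach the queue, and a sink-side element later pulls them out, which mirrors the role of Queue as the boundary between push and pull processing.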
4.3.2 Click router features
According to their packet processing functionality, Click router elements may be classified into
several major groups, such as Network Devices, Classification, Checking Validity, Storage, Dropping,
Packet Scheduling, and Duplication [1]. Click router elements have all the features of C++ objects:
the forwarding actions for a packet are encapsulated in each element. In addition, Click elements
have the following typical characteristics.
1. First, the Click router has an explicit Queue element as its storage for holding packets. The
benefit of this feature is that designers control directly how a forwarded packet is stored [1].
2. Secondly, the routing table in conventional routers is shared by the general CPU, the interface
cards, and any other entities in the router. The Click router's routing table, however, is
encapsulated in a single element on the packet forwarding path [1]. Due to this property, the Click
router is better suited to being used as a distributed router on different network cards that
process packets independently.
3. Although the forwarding flow in a Click router may have many branches as a result of routing,
the Click router always runs in a single thread and executes the elements one after another on a
single processor [1]. The Click router loads and schedules the elements along the path in the
configuration until a packet is placed in an explicit store such as a Queue. Then, the Click router
continues with other incoming packets. Using this feature, we are able to implement Click on
a multiprocessor platform (described in Chapter 5).
4.3.3 Click router general forwarding principle
To explain the general forwarding principle, we take the example of the Internet Protocol (IP)
used over Ethernet. When an incoming packet is waiting at the Click router, a packet source element
(e.g. FromDevice) reads it from the network and pushes it to the next elements on the
packet processing path. The packet is thus forwarded through a series of elements in the
configuration until it reaches an explicit storage element (e.g. Queue). Once a packet is stored in
the queue, it stays there until the packet sink (e.g. ToDevice) is ready to send it to the network.
After the ARPQuerier element finds the MAC address for the packet, the downstream packet sink
element (e.g. ToDevice) pulls (or moves) the packet from the explicit storage, places it
on the output device, and transmits it to the next node in the network [4].
4.3.4 Click configuration for IPv4 router
The Internet Protocol (IP) is a major part of the TCP/IP suite and is the protocol most widely
handled on the packet processing path of an IP router. The capability of IP packet processing
therefore plays an important role in determining the performance of a router [3]. For this reason,
we use the Click IP router as the example for explaining the co-design methodology in this report.
We choose sixteen Click elements to construct the IP router configuration depicted in Figure 4-3.
Given the behavior of each element, the setup of the Click IP router must arrange the order of
elements on the forwarding path of the IP configuration so as to follow the IP protocol standard,
guaranteeing that IP packets are processed correctly [1]. The behavior of the elements in the IPv4
Click router, based on the work presented in [1], is as follows:
FromDevice(eth0): Reads packets from an external network device (e.g. port eth0), working in
the Linux kernel.
Classifier(pattern1 ... patternN): Checks and compares incoming packets against the preset patterns,
including ARP queries, ARP responses, IP packets, and others, then forwards the packets to the
corresponding output ports according to the result of the pattern comparison.
Figure 4-3 Click configuration for IPv4 router [1]
Paint(colorX): Sets a mark on a packet so that it can later be determined whether the packet
arrives at and leaves from the same interface, identified by the color X.
Strip(N): Removes N bytes from the front of a packet. The element is usually used to remove the
Ethernet frame header in front of the IP packet header.
CheckIPHeader(...): Validates the IP packet header information, including the IP version, header
length, packet length, and checksum field [1]. The operations of this element mainly concern the IP
header fields. If any header information of an incoming packet is invalid, the packet is dropped or
forwarded to another element, depending on the arrangement in the Click IP configuration.
LookupIPRouter(...): Defines a forwarding table, matches an incoming IP packet against the routes
in the table, and directs the packet to the relevant output port. The forwarding table here is a
linear routing table.
DropBroadcasts: Drops packets with a broadcast address. This gives the router the ability to
isolate broadcasts in the network.
IPGWOptions(Myaddress): Handles the Options segment (present if the IP header is longer than five
32-bit words). Checks some parameters in the optional field, including route recording and IP
timestamping [1]. If the optional field is modified, the checksum field is recomputed. The
parameter Myaddress is the address of the current eth0 port.
FixIPSrc(Myaddress): This element is for ICMPError packets. If the passing packet is an
ICMPError message that uses the current Ethernet port address (Myaddress) as its source address,
the IP source annotation (or flag bit) should be set, and the checksum field is fixed
in this element [1]. Then, the ICMPError packet is sent back to the original source address.
DecIPTTL: Decrements the value of the Time-To-Live field in the IP packet header and
updates the checksum field.
IPFragmenter(MTU): Checks the length of an incoming packet. If the packet is larger than the
MTU, this element fragments it into two pieces, each smaller than the MTU,
and recomputes the checksum of each of the two fragments.
Queue(buffer): Stores passing packets until the downstream elements pull
them from the queue.
ARPQuerier(...): Sends an Ethernet frame following the ARP protocol, obtains from the ARP response
the frame address that corresponds to the destination address of the packet, and encapsulates the
IP packet in an Ethernet frame.
4.4 Partitioning methodology
As shown in Figure 4-2, the StepNP platform is split into two parts. The left part processes IP
packets, and the right part is responsible for injecting and outputting the packets. In order to
perform the simulation, we modified the Click application by adding the functionality required for:
- reading externally injected packets on the StepNP platform (FwidlSource(...))
- communication for the simulation experiment (in Figure 4-3, both the ARPQuerier(...) and
ToDevice(...) elements were replaced, turning distant communication into local communication)
4.4.1 Partitioning — implementation considerations
When choosing which parts of the Click elements to implement as hardware modules, we consider
several factors that influence hardware module efficiency: the complexity of the hardware
algorithm, the resources required (gates, adders, ...), and time constraints. Based on these, we
consider the following three points:
A) Module speedup
Hardware usually performs an algorithm faster than software and can thereby meet time constraints.
We consider using hardware to implement the computationally complex code segments in Click
elements. For example, a hardware solution with a shift register and some XOR gates may
perform the modulo-2 operation of the checksum within the time constraints, while the same
operation may cost more time if implemented in software.
B) Shared modules
By analyzing all the elements in the IPv4 configuration, we find that the checksum algorithm is
shared by most of them, such as CheckIPHeader, IPGWOptions, FixIPSrc, DecIPTTL, and IPFragmenter.
Designers may obtain better overall performance by shortening the execution time of frequently
used modules [30]. Considering this factor, we choose to implement CheckSum in hardware
circuits.
C) Module functionality
Given the way Click is scheduled on the CPU, the path shown in Figure 4-4 is the bottleneck on the
forwarding route of the Click IPv4 router. In particular, for a multi-gigabit IP
router that may handle multiple protocols and forward packets from links in different directions at
high speed, the longest prefix matching of LookupIPRouter lies on the path that every packet must
take and seriously lowers the overall performance of the router [7, 8].
[Figure 4-4 diagram: CheckIPHeader(...) followed by LookupIPRoute(...).]
Figure 4-4. The crucial path in Click IPv4 router [12]
Considering the above-mentioned points, we implement as many code fragments as possible in
hardware circuits, except for the memory transfer operations along the Click elements in the IPv4
diagram. The main idea of the methodology is shown in Figure 4-5, which
outlines the structure of the internal core of the SimplePacket platform.
4.4.2 Partitioning - simulation considerations
Our partitioning methodology is based on the following two considerations:
1) We obtain all timing information from the StepNP simulator. The accuracy of our
methodology therefore depends on the accuracy of the simulator.
2) We use the simulation results when partitioning. Taking computational complexity
and memory into consideration, the candidate parts to move onto hardware
are those that contain much arithmetic computation and account for a large proportion of the
whole software element, not those that merely access memory. For instance, although some parts do
take a large number of execution cycles (e.g. ICMPError), we do not move them onto a
co-processor, because their major operation is just memory transfer (e.g.
"memcpy(...)"). Moving such a part onto a co-processor would not exploit the advantage of
hardware implementation, which is to greatly speed up complicated mathematical computation.
[Figure 4-5 diagram: the ARM master's internal structure connects to the hardware module
implementations. The CheckSum HW module is reached through the CheckSum control circuit
(addresses 0x0d000000 to 0x0d000020) and the Strip HW module through the Strip control circuit
(addresses 0x0d010000 to 0x0d010020); the DecIPTTL and IPFragmenter HW modules are attached in
the same way.]
Figure 4-5 Hardware/software co-design for Click router on SimplePacket platform [12]
Table 4-1 presents the computational complexity of the different elements of the Click IPv4
router. The complexity is expressed as the ratio of each element's execution time to
the total time of all elements in the Click IPv4 router. For each element, the execution cycles of
its operations are counted during simulation. We use the worst-case execution time of the
Click IPv4 router to simulate packet processing. For example, the first incoming packet often takes
more time to traverse the packet processing path than any later packet and
passes through as many elements of the IPv4 router as possible. The simulation results also show
the execution cycles of each element in the Click IPv4 router configuration from Strip to ICMPError.
Element           Software cycles    CS cycles    Element cycles / total cycles
                  (each element)                  of entire IPv4 router
Classifier              356              -             2.44%
Paint                    16              -             0.11%
Strip                    56             44             0.38%
CheckIPHeader           688            320             4.71%
LookupIPRouter         1324           1040             9.06%
DropBroadcasts           32              -             0.22%
PaintTee                184              -             1.26%
IPGWOptions            2108            320            14.42%
FixIPSrc                464            324             3.17%
DecIPTTL                232            152             1.59%
IPFragmenter           4228            840            28.92%
EtherEncap              476              -             3.26%
Queue                   112              -             0.77%
ICMPError              4344            640            29.71%
Total cycles of
all elements          14620
Table 4-1 The Click element computational cycles from simulation
Although ICMPError spends the most execution cycles in the entire Click IPv4 router, it
is not on the crucial path of packet forwarding. Only packets with errors
need to be processed by this element. Moreover, its major function is copying data in memory.
Given the partitioning considerations above, we put our design effort into the
route lookup element (LookupIPRouter), as shown in Table 4-1. This part may be accelerated
greatly once it is implemented on the coprocessor (see Table 6-6).
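The percentages in Table 4-1 can be re-derived from the software cycle counts, as the following sketch shows. The cycle counts are the ones listed in the table; the function names (table_4_1_software_cycles, total_cycles, share_percent, ranked_by_cycles) are hypothetical helpers introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct ElementCycles {
    std::string name;
    int cycles;  // software execution cycles from Table 4-1
};

// Software cycle counts as listed in Table 4-1.
std::vector<ElementCycles> table_4_1_software_cycles() {
    return {
        {"Classifier", 356},      {"Paint", 16},          {"Strip", 56},
        {"CheckIPHeader", 688},   {"LookupIPRouter", 1324}, {"DropBroadcasts", 32},
        {"PaintTee", 184},        {"IPGWOptions", 2108},  {"FixIPSrc", 464},
        {"DecIPTTL", 232},        {"IPFragmenter", 4228}, {"EtherEncap", 476},
        {"Queue", 112},           {"ICMPError", 4344},
    };
}

// Sum of the cycles of all elements.
int total_cycles(const std::vector<ElementCycles>& rows) {
    int total = 0;
    for (const ElementCycles& r : rows) total += r.cycles;
    return total;
}

// Share of one element in the total, in percent.
double share_percent(int cycles, int total) {
    assert(total > 0);
    return 100.0 * cycles / total;
}

// Elements ordered from the most to the least expensive.
std::vector<ElementCycles> ranked_by_cycles(std::vector<ElementCycles> rows) {
    std::stable_sort(rows.begin(), rows.end(),
                     [](const ElementCycles& a, const ElementCycles& b) {
                         return a.cycles > b.cycles;
                     });
    return rows;
}
```

Ranking by cycles alone puts ICMPError and IPFragmenter first, which is why the cycle count has to be combined with the other criteria in the text (position on the forwarding path, arithmetic versus memory-copy work) before a candidate is chosen.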
Given the above two considerations, we are able to identify the critical fragments (presented in
section 4.5, Algorithm analysis) inside the elements (highlighted in Table 4-1) of the Click router
and obtain a large speedup when moving parts of the Click application into hardware (see Chapter 6,
Table 6-6).
4.5 Algorithm analysis
We choose SystemC as the hardware modeling language. The hardware components are implemented
according to the algorithms described in the following sections.
1) Build address connection on StepNP platform
First, we need to allocate an address in the element class definition. We use the DecIPTTL element
as an example to explain how to build an address connection between Click and the SimplePacket
platform and ensure that Click can access the DecIPTTL hardware module.
In Click we have:
#define ADDR_DecIPTTL 0x0d030000
class CSDecIPTTL : public Element { public:
  CSDecIPTTL();
  ~CSDecIPTTL();
 private:
  uatomic32_t _drops;
(1) static uint* AddrCS = ADDR_DecIPTTL;
(2) void simple_action()
  {
    *(AddrCS + 4) = (uint) ip->ip_ttl;  // TTL address in Click IP header
    *AddrCS = (uint) (ip->ip_sum);      // Checksum address in Click IP header
    ip->ip_ttl = *(AddrCS + 4);         // Read from hardware module
    ip->ip_sum = *(AddrCS);             // Read from hardware module
Statement (1) is a pointer that we add to the original DecIPTTL class. The pointer
address is assigned according to the address allocation shown in Figure 4-5.
Then, in statement (2), we write the IP header information to the location the pointer refers to,
so that Click can access the DecIPTTL HW module in this way [12]. In particular, we observe in
Figure 3-3 that packets leaving the LookupIPRouter element can take more than two branches as a
result of route selection. We may consider using more than one processor to process the packets
of the different branches.
Secondly, the pointer in statement (1) of class CSDecIPTTL must be static, because the CSDecIPTTL
elements in the two branches are separate objects of the CSDecIPTTL class. Click
creates the object for each branch when the configuration is initialized, and all these objects
share the same class. The static keyword indicates that the pointer variable is shared by the whole
class. Thus, the value of the address pointer does not change when another CSDecIPTTL object is
defined. Otherwise, if the address pointer value changed while Click was running, the simulation
would produce Click router run-time errors and deadlock.
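The role of the static pointer can be sketched in standalone C++. This is only an illustration under stated assumptions: ToyDecIPTTL, fake_registers, and fake_hardware_step are hypothetical, an ordinary array stands in for the memory-mapped register window (on the real platform this would be a fixed bus address), and the register layout is invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a plain array replaces the hardware register window so that the
// sketch can run on a host machine.
static volatile std::uint32_t fake_registers[2] = {0, 0};

class ToyDecIPTTL {
public:
    // One pointer for the whole class: every instance addresses the same module.
    static volatile std::uint32_t* regs;

    // Writes the TTL to the module, lets the stand-in "hardware" decrement it,
    // and reads the result back.
    std::uint8_t decrement(std::uint8_t ttl) {
        assert(regs != nullptr);
        regs[1] = ttl;
        fake_hardware_step();
        return static_cast<std::uint8_t>(regs[1]);
    }

private:
    // Stand-in for the hardware module's behavior (hypothetical).
    static void fake_hardware_step() {
        if (regs[1] > 0) regs[1] = regs[1] - 1;
    }
};

// Defined once for the class, independent of how many objects are created.
volatile std::uint32_t* ToyDecIPTTL::regs = fake_registers;
```

Creating a second ToyDecIPTTL object, as Click does for the second branch of the configuration, leaves the pointer untouched, so both objects keep addressing the same module.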
2) Element Algorithm Analysis and Implementation
CheckIPHeader: The detailed algorithm is described in D. Quinn and M. Hubin's report [12].
DecIPTTL: The Click router adopts RFC 1624 to update the checksum, as shown in Figure 4-6. The
DecIPTTL hardware implementation uses this algorithm directly in the SystemC modeling language.
The details are shown in Appendix I.
The following "C" code algorithm re-computes the checksum when the
TTL field has been changed
ip->ip_ttl--;
// 19.Aug.1999 - incrementally update IP checksum as suggested by
// SOSP reviewers, according to RFC1141, as updated by RFC1624.
// new_sum = ~(~old_sum + ~old_halfword + new_halfword)
//         = ~(~old_sum + ~old_halfword + (old_halfword - 0x0100))
//         = ~(~old_sum + ~old_halfword + old_halfword + ~0x0100)
//         = ~(~old_sum + ~0 + ~0x0100)
//         = ~(~old_sum + 0xFEFF)
unsigned long sum = (~ntohs(ip->ip_sum) & 0xFFFF) + 0xFEFF;
ip->ip_sum = ~htons(sum + (sum >> 16));
Figure 4-6 The code description of DecIPTTL element [1]
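The incremental update in Figure 4-6 can be checked against a full checksum recomputation with a small standalone sketch. It is not the Click code: the header is modeled as ten 16-bit words already in host byte order (so ntohs/htons are not needed), and Ipv4Header, full_checksum, and dec_ttl_incremental are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 20-byte IPv4 header as ten 16-bit words in host order (assumption: no options).
// Word 4 holds TTL (high byte) and protocol (low byte); word 5 is the checksum.
using Ipv4Header = std::array<std::uint16_t, 10>;

// Full ones'-complement checksum (RFC 1071), treating the checksum word as zero.
std::uint16_t full_checksum(const Ipv4Header& h) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < h.size(); ++i)
        if (i != 5) sum += h[i];
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);  // fold carries
    return static_cast<std::uint16_t>(~sum & 0xFFFF);
}

// Decrements the TTL and patches the checksum as in Figure 4-6:
// new_sum = ~(~old_sum + 0xFEFF), with the end-around carry folded in.
void dec_ttl_incremental(Ipv4Header& h) {
    assert((h[4] >> 8) > 0);  // TTL must be non-zero before decrementing
    h[4] = static_cast<std::uint16_t>(h[4] - 0x0100);
    std::uint32_t sum = (~static_cast<std::uint32_t>(h[5]) & 0xFFFF) + 0xFEFF;
    h[5] = static_cast<std::uint16_t>(~(sum + (sum >> 16)) & 0xFFFF);
}
```

The incremental form needs one addition and one carry fold, whatever the header length, which is what makes it attractive for a small hardware module.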
IPFragmenter: Figure 4-7 describes the IPFragmenter algorithm.
From Figure 4-7, we can see that the checksum algorithm is called twice. Thus, we can
replace the checksum function with the HW Checksum module in the following way.
ip1->ip_sum = 0;
*(AddrCSCHK + 4) = (uint) ip1->ip_hl;
*AddrCSCHK = (uint) (unsigned char *) (p1->data());
ip1->ip_sum = *AddrCSCHK;
qip->ip_sum = 0;
*(AddrCSCHK + 4) = (uint) qip->ip_hl;
qip->ip_sum = *AddrCSCHK;
In addition, the code between (1) and (2) in Figure 4-7 can be implemented directly by the HW
IPFragmenter module. Appendix I gives the SystemC code for the detailed algorithm of the
IPFragmenter module.
The "C" code algorithm IPFragmenter
// Check whether the IP packet exceeds the pre-set length. If so, the algorithm
// separates the IP packet into two fragments within the fixed length.
(1) IPFragmenter begin:
ip1->ip_hl = h1len >> 2;                 // assign IP header length
ip1->ip_off = ((off - hlen) >> 3) + (ipoff & IP_OFFMASK);
if (ipoff & IP_MF)                       // assign IP fragmentation offset
  ip1->ip_off |= IP_MF;
if (off + p1datalen < plen)
  ip1->ip_off |= IP_MF;
ip1->ip_off = htons(ip1->ip_off);
ip1->ip_len = htons(p1len);
(2) IPFragmenter end:
ip1->ip_sum = 0;
ip1->ip_sum = click_in_cksum(p1->data(), h1len);  // re-calculate the checksum
                                                  // of the fragmented packet
qip->ip_sum = 0;
qip->ip_sum = click_in_cksum(reinterpret_cast<unsigned char *>(qip), hlen);
Figure 4-7 The code description of IPFragmenter element [1]
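The offset and More Fragments (MF) handling in Figure 4-7 follows the general IPv4 fragmentation rules, which the sketch below works through for whole packets. It is a simplified illustration and not the Click implementation: plan_fragments and FragmentInfo are hypothetical names, every fragment is assumed to carry a header of the same length, and IP options are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// MF flag inside the 16-bit flags/fragment-offset field of the IPv4 header.
constexpr std::uint16_t kMoreFragments = 0x2000;

struct FragmentInfo {
    std::uint16_t total_len;  // header plus payload bytes of this fragment
    std::uint16_t off_field;  // payload offset in 8-byte units, plus MF if more follow
};

// Splits a packet of total_len bytes (header of hlen bytes) so that no fragment
// exceeds mtu. Payload chunks are multiples of 8 bytes, except for the last one.
std::vector<FragmentInfo> plan_fragments(unsigned total_len, unsigned hlen, unsigned mtu) {
    assert(hlen >= 20 && total_len >= hlen && total_len <= 65535 && mtu >= hlen + 8);
    std::vector<FragmentInfo> plan;
    if (total_len <= mtu) {  // fits as is: a single, unfragmented packet
        plan.push_back({static_cast<std::uint16_t>(total_len), 0});
        return plan;
    }
    const unsigned payload = total_len - hlen;
    const unsigned chunk = ((mtu - hlen) / 8) * 8;
    for (unsigned off = 0; off < payload; off += chunk) {
        const unsigned len = (payload - off < chunk) ? payload - off : chunk;
        unsigned field = off / 8;
        if (off + len < payload) field |= kMoreFragments;  // same test as in Figure 4-7
        plan.push_back({static_cast<std::uint16_t>(hlen + len),
                        static_cast<std::uint16_t>(field)});
    }
    return plan;
}
```

For example, a 2000-byte packet with a 20-byte header and an MTU of 1500 yields two fragments of 1500 and 520 bytes; the first carries the MF flag and the second starts at offset 185 (1480 / 8).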
Strip(N): This element is usually used to remove the Ethernet frame header in front of the IP
header, with N=14, i.e. Strip(14). The algorithm is described in Figure 4-8.
The following "C" code algorithm relocates the IP header's
starting position by adding 14 bytes
// the following algorithm cuts the bytes given by the argument, and can be
// used to cut an Ethernet frame header down to an IP packet header
static uint* AddrCS_Strip = (uint*) ADDR_STRIP;
static uint* AddrTimer_Strip = (uint*) 0x0dd00040;
Packet::pull(uint32_t nbytes)
{
  if (nbytes > length())
    nbytes = length();
(1) _data += nbytes;
}
Figure 4-8 The code description of Strip element [1]
The algorithm in Figure 4-8 is implemented directly by the hardware Strip module. Statement (1) can
be replaced by the following code fragment.
*(AddrCS_Strip + 4) = (uint) nbytes;
*AddrCS_Strip = (uint) (unsigned char *) _data;
(*AddrTimer_Strip) = 1; // wait for the calculation result from the HW module
_data = (unsigned char *) (*AddrCS_Strip);
Note that the algorithm in Figure 4-8 runs inside the Packet class, which protects the
passing packets. Because of the encapsulation of C++ objects, we must modify the code in the
Packet.cc file directly rather than the code in the Strip element class.
LookupIPRouter: The Click router uses the linear search algorithm described in Figure 4-9.
LinearIPLookup uses a linear search that may examine every route for each packet. It is
therefore most suitable for small routing tables in edge routers.
The linear algorithm was applied directly in our hardware module design, with some
modifications. The algorithm in Figure 4-9 is separated into three parts:
searching, appending, and deleting on the shared routing table. The software Click allocates
space for the routing table in memory. In our hardware design, we put the routing table in a
hardware component, which means that reading and writing the routing table is concentrated on one
chip. When simulating in SystemC, we must pay attention to the signals that synchronize the
processes (e.g. add(..), lookup(..), and del(..)). Otherwise, the processor will not
find the correct route from the hardware chip. Figure 4-10 illustrates our hardware
design methodology for longest prefix matching in the routing table.
The following "C" code algorithm describes searching, appending, and deleting
a routing entry in the linear routing table
bool
IPTable::lookup(IPAddress dst, IPAddress &gw, int &index) const
{
  int best = -1;
  // longest prefix match
  // check in the routing table by comparing destination address
  for (int i = 0; i < _v.size(); i++)
    if (dst.matches_prefix(_v[i].dst, _v[i].mask)) {
      if (best < 0 || _v[i].mask.mask_more_specific(_v[best].mask))
        best = i;
    }
  if (best < 0)    // if not found, return "false"
    return false;
  else {
    gw = _v[best].gw;
    index = _v[best].index;
    return true;
  }
}
void IPTable::add(IPAddress dst, IPAddress mask, IPAddress gw, int index)
{
  // add an IP destination address and related address
  // into the routing table, usually for initializing
  // the routing table of the Click router
  dst &= mask;
  struct Entry e;
  e.dst = dst;
  e.mask = mask;
  e.gw = gw;
  e.index = index;
  for (int i = 0; i < _v.size(); i++)
    if (!_v[i].valid()) {
      _v[i] = e;
      return;
    }
  _v.push_back(e);
}
void IPTable::del(IPAddress dst, IPAddress mask)
{
  // delete an IP destination address and related
  // address out of the routing table
  for (int i = 0; i < _v.size(); i++)
    if (_v[i].dst == dst && _v[i].mask == mask) {
      _v[i].dst = IPAddress(1); _v[i].mask = IPAddress(0);
    }
}
Figure 4-9 The code description of LookupIPRouter element [1]
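The three operations in Figure 4-9 depend on Click types such as IPAddress. A self-contained sketch of the same linear longest-prefix-match idea, using plain 32-bit integers, may help to experiment with it. ToyRouteTable, Route, and ip are hypothetical names; masks are assumed to be contiguous, so a numerically larger mask is a more specific one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Route {
    std::uint32_t dst;   // network address, already masked
    std::uint32_t mask;  // contiguous netmask in host order
    std::uint32_t gw;    // gateway address
    int port;            // output port index
    bool valid;          // false once the entry has been deleted
};

class ToyRouteTable {
public:
    // Appends a route, reusing the slot of a deleted entry when one exists.
    void add(std::uint32_t dst, std::uint32_t mask, std::uint32_t gw, int port) {
        const Route r{dst & mask, mask, gw, port, true};
        for (Route& slot : routes_)
            if (!slot.valid) { slot = r; return; }
        routes_.push_back(r);
    }
    // Marks matching entries as deleted instead of shrinking the table.
    void del(std::uint32_t dst, std::uint32_t mask) {
        for (Route& slot : routes_)
            if (slot.valid && slot.dst == (dst & mask) && slot.mask == mask)
                slot.valid = false;
    }
    // Linear scan; returns the index of the most specific match, or -1 if none.
    int lookup(std::uint32_t addr) const {
        int best = -1;
        for (std::size_t i = 0; i < routes_.size(); ++i) {
            const Route& r = routes_[i];
            if (!r.valid || (addr & r.mask) != r.dst) continue;
            if (best < 0 || r.mask > routes_[static_cast<std::size_t>(best)].mask)
                best = static_cast<int>(i);
        }
        return best;
    }
    const Route& at(int i) const {
        assert(i >= 0 && static_cast<std::size_t>(i) < routes_.size());
        return routes_[static_cast<std::size_t>(i)];
    }
private:
    std::vector<Route> routes_;
};

// Builds a host-order IPv4 address from its four dotted-decimal parts.
constexpr std::uint32_t ip(unsigned a, unsigned b, unsigned c, unsigned d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}
```

Each lookup visits every entry, so the cost grows linearly with the table size, which is consistent with the remark that this scheme suits small routing tables.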
The detailed steps of the routing table search algorithm are:
1. When Click initializes the configuration, the IP address entries are appended in a loop to the
routing table (or forwarding table) on the synthesized component (shown in Figure 4-10). Then,
the FPGA module outputs the actual capacity of the routing table to the lookup
router table control circuit (shown in Figure 4-10). Meanwhile, the FPGA module raises a
signal indicating that the FPGA is ready for searching.
2. When a packet passes through the LookupIPRouter element, the ARM CPU sends a startlookup
signal to the control circuit (CC). When the CC receives the signal, it starts to
communicate with the synthesized component and searches for the network number in the
routing table on the synthesized component in a loop, until the next
hop is found.
3. The number of loop iterations is determined by the location of the IP network address in the
routing table of the synthesized component (shown in Figure 4-10) and by the capacity of the
routing table. The component outputs the next-hop number (port) to a register in the ARM
kernel.
4. When Click executes the read instruction in the LookupIPRouter element, the result is returned
to Click.
[Figure 4-10 diagram: the ARM accesses the routing table on the FPGA.]
Figure 4-10 The methodology about longest matching prefix in routing table [12]
Based on the consideration in section 4.4.1 (B), the remaining elements, FixIPSrc(),
IPGWOptions(), and IPFragmenter(), can share the same CheckSum hardware module. Thus, the
whole IPv4 router configuration may perform packet processing with a combination of
hardware and software components in co-design.
Chapter 5
The Methodology to Implement Click Router
on Multiprocessor
5.1 General presentation of the methodology
This section presents an overall picture of the methodology for putting the Click router onto a
multiprocessor and the advantages of this methodology.
The methodology involves three parts: the "Click.exe" source code, StepNP components, and Click
elements.
1) By modifying "Click.exe", we ensure that multiple processors may perform the different
tasks of packet processing concurrently. Section 5.2 explains this issue in detail.
2) By improving the existing StepNP processor components, we made it possible for three ARM
processors to access (read/write) the shared lock used for transferring packet data between the
three processors. Section 5.3.1 introduces how the processor components are modified.
3) By developing a few new Click elements, we build the connections between the different
configuration files executed on different processors. Section 5.3.2 presents this implementation
in detail.
5.1.1 The architecture description
Click's modularity allows packet processing to be pipelined over several configurations that are
executed on multiple processors, implementing a configuration-level pipeline (defined in Chapter
3.3) [28]. Thus, we may take advantage of multiple processors to process packets in parallel without
modifying the Click source code. The methodology is also flexible enough to split or merge the
packet flow according to the number of processors and to use processor resources efficiently.
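The idea of a configuration-level pipeline can be sketched with a deterministic, single-threaded simulation. This is an assumption-laden illustration and not the StepNP implementation: SharedSlot stands in for a lock-protected shared memory area, run_pipeline is a hypothetical driver in which each loop round gives every stage one turn (as if three processors advanced in lock step), and the per-stage work is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-packet mailbox guarded by a lock flag.
struct SharedSlot {
    bool full = false;
    int packet = 0;
    bool try_write(int p) {
        if (full) return false;  // slot still owned by the reader
        packet = p;
        full = true;
        return true;
    }
    bool try_read(int& p) {
        if (!full) return false;  // nothing to consume yet
        p = packet;
        full = false;
        return true;
    }
};

// Runs a three-stage pipeline for the given number of rounds and returns the
// packets that left the last stage. Stages are stepped from last to first so
// that a slot emptied in a round can be refilled within the same round.
std::vector<int> run_pipeline(const std::vector<int>& input, int rounds) {
    assert(rounds >= 0);
    SharedSlot s12, s23;
    std::vector<int> output;
    std::size_t next = 0;
    for (int r = 0; r < rounds; ++r) {
        int p = 0;
        if (s23.try_read(p)) output.push_back(p + 100);          // stage 3 (placeholder work)
        if (!s23.full && s12.try_read(p)) s23.try_write(p * 2);  // stage 2 (placeholder work)
        if (next < input.size() && s12.try_write(input[next])) ++next;  // stage 1
    }
    return output;
}
```

After a start-up latency of two rounds, one packet leaves the pipeline per round, which is the throughput benefit expected from splitting one configuration across several processors.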
We use the Click configuration for the IPv4 router as an example to present how the system-level
design methodology works on a multiprocessor architecture. Figure 5-1 gives the overall picture of
an executable application-specific multiprocessor architecture, including hardware components and
software applications.
Figure 5-1 The general description on how Click works on three ARM processors
[Figure 5-1 diagram, three panels:
(a) Click software router IPv4 before splitting. One Click IPv4 router configuration is split into
three IPv4 configurations.
(b) Click software router IPv4-1 to IPv4-3 configurations. In this methodology, the Click software
router runs on three ARM7 processors, each with different IPv4 configuration parameters.
(c) Hardware architecture with shared lock structure on StepNP. Software Click.exe IPv4-1 is mapped
on processor p1, IPv4-2 on processor p2, and IPv4-3 on processor p3; p1 writes packets into the
locks, p2 and p3 read packets from them, and Lock1-3 are shared in DDR at 0x22000000.]
In Figure 5-1, we place configuration file ipv4-1 on processor p1, configuration file ipv4-2 on
processor p2, and configuration file ipv4-3 on processor p3. Thus, the three configurations can process
packets on p1, p2, and p3 concurrently. By setting up two shared memories among the three processors,
we have created dedicated hardware components to support transferring packets. In addition,
we have designed an element to write incoming packets to the shared memories and two elements
to read the packets out of the shared memories. The internal operations of the three processors have
been modified to perform the task concurrently and to implement pipelining on multiple processors.
The method may be considered an implementation of configuration-level pipelining [28]. The
system-level design of the methodology is simulated and evaluated on the StepNP platform.
5.1.2 The benefit of the methodology
The model of the methodology has two advantages for implementing Click on multiprocessors.
1) The model reduces the time that incoming packets wait to be processed. Given the Click
processing mechanism (see feature 3 in Chapter 4.3.2), only one packet is being processed in a
Click router at a time on one processor. This methodology may process three (or
more) packets simultaneously on three processors, shortening the waiting time for incoming
packets. Considering the availability of hardware resources, the methodology is a
feasible way to improve the performance of the Click router.
2) The architecture model is scalable, meaning that users may easily increase or reduce the number
of processors on the packet processing path, implementing multiple Click configurations in parallel.
From the point of the LookupIPRouter operation, users may easily add or modify Click configuration
files, changing the number of packet processing paths according to the number of Ethernet cards
in the router system.
To ensure that the three processes (e.g. click.exe) work, we must consider two questions:
A. How to map the click.exe program onto three ARM processors, and ensure that the three processors
execute the same click.exe program simultaneously.
B. How to communicate between the three ARM processors.
5.2 Mapping the program on three processors
First, following generic multiprocessor SoC design, the application-specific software is determined
to be the Click router software, the "click.exe" binary code for ARM processors.
The Click elements performing subtasks of packet processing make up a packet path in a Click
router configuration. It is the "click.exe" program that loads each element and executes the element
code along the packet path, processing incoming packets.
Then, as in process partitioning (defined in Chapter 3), the IPv4 configuration is divided into
three sub-configurations, since the Click elements of a configuration may be split and re-assembled
into various configurations according to different packet operations. In Figure 5-1, the
sub-configurations are ipv4-1, ipv4-2, and ipv4-3.
The configuration file is the parameter of "click.exe" according to the Click router [1]. Since the
original "click.exe" program does not support more than one configuration file running on different
processors, we have to modify the "click.exe" program so that the same "click.exe" may run on
multiple processors and load a different configuration file on each processor. Thus, the three
sub-configurations (ipv4-1, ipv4-2, ipv4-3) perform different IP packet processing tasks according to
the collection of elements in the three configuration files. The modified code segment in the Click
router is described in Figure 5-2.
In addition, we have to make some modifications to the StepNP components to support building the
three-processor architecture. Guided by the application-specific multiprocessor SoC design
flow, we choose the following major components from the StepNP library to construct the architecture
with three processors. Three ARM7 processor components (p1, p2, p3) are rewritten according to
the user's operational requirements for processors on the StepNP platform. The three modified ARM7
processors are shown in Figure 5-3-1.
The modified "click.exe" program
1) click.exe modification
int main(int a, char **av)
{
#endif
    // Ask the coprocessor which processor is running this program
    uint *AddrCS;
    AddrCS = (uint *) 0x0d000000;
    unsigned result = (unsigned) (*AddrCS);
    printf("result == %d\n", result);
    int argc = 2;
    char *argv[] = {
        "click",
        "armrouter.click",
        ""
    };
    // else
    // change the configuration file
    if (result == 1) {
        argv[0] = "click";
        argv[1] = "sidlPush-1.click"; // configuration file name -1
        argv[2] = "";
    }
    if (result == 2) {
        argv[0] = "click";
        argv[1] = "sidlPush-2.click"; // configuration file name -2
        argv[2] = "";
    }
    if (result == 3) {
        argv[0] = "click";
        argv[1] = "sidlPush-3.click"; // configuration file name -3
        argv[2] = "";
    }
    printf("starting click main...\n");
Figure 5-2 Click router code modification for multiprocessor
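The selection logic of Figure 5-2 can be isolated from the hardware access, which makes it easier to check. The following is a minimal sketch under stated assumptions: select_config is a hypothetical helper that does not exist in Click, the file names are the ones that appear in Figure 5-2, and falling back to the default "armrouter.click" for an unknown ID is our own choice rather than something the original code states.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Click): maps the processor ID read from
// the memory-mapped ID register to the configuration file that this
// processor should load. File names are taken from Figure 5-2.
std::string select_config(unsigned processor_id) {
    switch (processor_id) {
        case 1: return "sidlPush-1.click";
        case 2: return "sidlPush-2.click";
        case 3: return "sidlPush-3.click";
        default: return "armrouter.click";  // assumed fallback for unknown IDs
    }
}
```

In the modified main(), the value read from address 0x0d000000 would be passed to such a helper, and its result would replace argv[1] before the normal Click start-up continues.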
The p1, p2 and p3 ARM processors modification
// the following code tells the ARM processors to run different Click
// routers in simulation on the StepNP platform
class MyArm1 : public MTARM {
    ArmData memRead(ArmAddr address) {
        if (address >= 0x0d000000 && address < 0x0d000020) {
            printf("Request for the response of ARM processor-1 \n");
            return (uint) 1;
        }
    }
};
class MyArm2 : public MTARM {
    ArmData memRead(ArmAddr address) {
        if (address >= 0x0d000000 && address < 0x0d000020) {
            printf("Request for the response of ARM processor-2 \n");
            return (uint) 2;
        }
    }
};
class MyArm3 : public MTARM {
    ArmData memRead(ArmAddr address) {
        if (address >= 0x0d000000 && address < 0x0d000020) {
            printf("Request for the response of ARM processor-3 \n");
            return (uint) 3;
        }
    }
};
Figure 5-3-1 ARM Processor modification in StepNP
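The three classes in Figure 5-3-1 differ only in the value they return, so the same behaviour can be expressed once with the ID as a parameter. The sketch below is illustrative only: it does not derive from the StepNP MTARM class, and the names IdRegister, kIdBase and kIdSize are assumptions introduced here. Only the address window 0x0d000000 to 0x0d000020 comes from the figure.

```cpp
#include <cassert>
#include <cstdint>

// Address window of the ID register, as used in Figure 5-3-1.
constexpr std::uint32_t kIdBase = 0x0d000000;
constexpr std::uint32_t kIdSize = 0x20;

// Illustrative stand-in for MyArm1/MyArm2/MyArm3: one class, with the
// processor ID given at construction time.
class IdRegister {
public:
    explicit IdRegister(std::uint32_t processor_id) : id_(processor_id) {}

    // Stores the processor ID in `value` and returns true when `address`
    // falls inside the ID window; returns false for any other address.
    bool read(std::uint32_t address, std::uint32_t &value) const {
        if (address >= kIdBase && address < kIdBase + kIdSize) {
            value = id_;
            return true;
        }
        return false;
    }

private:
    std::uint32_t id_;
};
```

Returning a success flag makes the "address not decoded here" case explicit, which the excerpt in Figure 5-3-1 leaves to the rest of the processor model.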
A single bus with the SOCP protocol connects the three processors. An instance of the architecture is
depicted in Figure 5-3-2.
After modifying "click.exe" and the ARM processor components, we succeeded in mapping ipv4-1 on
p1, ipv4-2 on p2, and ipv4-3 on p3 (shown in Figure 5-3-2). Each configuration can process packets
on p1, p2, and p3 concurrently.
5.3 Exchanging data between three processors
After a processor finishes processing incoming packets, it transfers the packets to
one of its two downstream flows (depicted in Figure 5-1) along the packet path, depending on the
routing result. How to synchronize the communication between two processors is a concern for Click
users. For this question, we can find a solution from the hardware and software aspects respectively.
5.3.1 Hardware architecture building
The hardware solution is to set up two individual shared structures, called locks, to buffer a packet
between two processors. The capacity of a lock is large enough to hold a packet. Each lock is given
its own address space, which must be made available to programmers designing the architecture. In my
experiments, we set the two locks' addresses to 0x0e000000 and 0x0e000020 respectively,
depending on the address spaces available to users on the StepNP platform.
For example,
Address(lock-1-2) = 0x0e000000; Address(lock-1-3) = 0x22000000;
Figure 5-4 describes the detailed structure.
Figure 5-3-2 The instance of Click on the three ARM7 processors architecture
Figure 5-4 Hardware architecture with shared lock structure on StepNP (the IPv4-1 configuration running on processor p1 writes a packet into a lock; p2 or p3 reads a packet from the lock)
In addition, we must modify the internal write/read operations of the processor components, so that
processors p1 and p2 may access the lock structure to perform data exchange. Because the existing StepNP
does not contain components that support the behaviour processors p1 and p2 need to complete the data
transfer described in Figure 5-4, we made some modifications to the StepNP platform code as follows:
The processor p1 is modified as in Figure 5-5-1;
The processor p2 is modified as in Figure 5-5-2;
The processor p3 is similar to p2 (refer to Figure 5-5-2);
(Figure 5-4, continued: lock-1-3 is shared in DDR at 0x22000000; the IPv4-2 configuration runs on processor p2 and the IPv4-3 configuration runs on processor p3.)
After rewriting the component, we can see that processor p1 checks whether the shared hardware
structure is locked by using the read() function. If not, the write() function in processor p1 locks the
shared structure and writes the packet into the hardware structure.
The relevant "C" code algorithm: Component Processor 1
// the following shows how to access (R/W) the shared lock structure
// on the multiprocessor architecture in the simulation on StepNP
(1) Key segment begin:
ArmData memRead(ArmAddr address) {
    if (address >= 0x0d010000 && address < 0x0d010020) {
        printf("Read ARM-1 SPD Flag --1 \n");
        return (uint) spd_ARM1_ARM2.inReady;
    }
    if (address >= 0x0e000000 && address < 0x0e000020) {
        printf("Read ARM-1 SPD dataIn and lenIn --1 \n");
        if (address == 0x0e000000)
            return (uint) spd_ARM1_ARM2.dataIn[0];
        else
            return (uint) spd_ARM1_ARM2.lenIn;
    }
}
void memWrite(ArmAddr address, ArmData data) {
    if (address >= 0x0e000000 && address < 0x0e000020) {
        printf("writing SPD dataIn and lenIn \n");
        if (address == 0x0e000000) {
            // store one byte at the currently selected index
            spd_ARM1_ARM2.dataIn[hLen] = data;
        } else {
            if (address == 0x0e000004) spd_ARM1_ARM2.lenIn = data;
            if (address == 0x0e000008) hLen = data;
        }
    }
    else if (address >= 0x0d010000 && address < 0x0d010020) {
        printf("locking/unlocking SPD shared structure \n");
        spd_ARM1_ARM2.inReady = data;
    }
}
(2) The key segment end:
Figure 5-5-1 The internal operations of an ARM processor
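The register layout implied by Figure 5-5-1 can be summarised as a small device model: offset +0 holds a data byte, +4 the packet length, +8 the index of the byte being written, and a separate flag register marks the lock as occupied. The following is a sketch under assumptions, not the StepNP code. SharedLock, kLockBase, kFlagBase and kMaxPacket are names introduced here, and the buffer capacity of 2048 bytes is an assumed value (the text only says that a lock is large enough to hold a packet).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kLockBase = 0x0e000000;  // lock-1-2 data window
constexpr std::uint32_t kFlagBase = 0x0d010000;  // lock-1-2 flag register
constexpr std::size_t kMaxPacket = 2048;         // assumed capacity

// Illustrative model of one lock as seen by the writing processor.
struct SharedLock {
    std::uint32_t in_ready = 0;          // 1 = holds a packet, 0 = free
    std::uint32_t len = 0;               // packet length in bytes
    std::uint32_t index = 0;             // index of the next byte to store
    std::uint8_t data[kMaxPacket] = {};  // packet buffer

    // Handles one memory-mapped write. Returns false when the address is
    // not decoded by this device or the byte index is out of range.
    bool write(std::uint32_t address, std::uint32_t value) {
        if (address == kLockBase) {
            if (index >= kMaxPacket) return false;
            data[index] = static_cast<std::uint8_t>(value);
            return true;
        }
        if (address == kLockBase + 4) { len = value; return true; }
        if (address == kLockBase + 8) { index = value; return true; }
        if (address == kFlagBase) { in_ready = value; return true; }
        return false;
    }
};
```

Unlike the excerpt in the figure, this model checks the byte index against the buffer size, since an unchecked index would write outside the buffer.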
The relevant "C" code algorithm: Component Processor 2
// the following shows how to access (R/W) the shared lock structure
// on the multiprocessor architecture in the simulation on StepNP
(1) Key segment begin:
ArmData memRead(ArmAddr address) {
    if (address >= 0x0e000000 && address < 0x0e000020) {
        if (address == 0x0e000000) {
            // return the next unread byte of the packet
            len = lenIn - hLen;
            hLen--;
            // printf("Read ARM-2 SPD dataIn --2\n");
            return (uint) spd_ARM1_ARM2.dataIn[len];
        } else {
            // printf("Read ARM-2 SPD lenIn --2 \n");
            hLen = (uint) spd_ARM1_ARM2.lenIn;
            lenIn = hLen;
            return (uint) spd_ARM1_ARM2.lenIn;
        }
    }
    else if (address == 0x0d010000) { // && address < 0x0d010020)
        // printf("Request for the response of SPD Flag lock1-2 \n");
        return (uint) spd_ARM1_ARM2.inReady;
    }
}
void memWrite(ArmAddr address, ArmData data) {
    // printf("wrote flag to address\n");
    if ((address == 0x0d010000) && (address < 0x0d010020)) {
        // printf("writing SPD inReady Flag lock1-2 \n");
        spd_ARM1_ARM2.inReady = data;
    }
}
(2) The key segment end:
Figure 5-5-2 The internal operations of an ARM processor
5.3.2 Software (Element)
To prevent write/read conflicts on the lock, accesses to the lock must respect the locking policy.
Since all storage is explicit in Click, and those two locks are actually underlying hardware, Click
applications do not provide elements to access those two lock structures explicitly. Therefore, we
must write two new specific elements to push packets into the lock and to pull packets from the
lock, supporting the locking policy.
The two new elements are the following:
SpdSinkCS: contains the implicit behaviour to write a packet into the shared lock structure.
SpdSourceCS: contains the implicit behaviour to read a packet from the shared lock structure.
SpdSinkCS:
SpdSinkCS(...) is a push element ("push" is explained in Chapter 4.2.1).
The key segment of SpdSinkCS for writing is described in Figure 5-6:
(1) AddrCSChip1 = (uint*) 0x0e000000
(2) AddrCSFlag = (uint*) 0x0d010000
In line (1), the address pointer AddrCSChip1 holds the address of lock-1-2;
In line (2), the address pointer AddrCSFlag holds a value indicating whether lock-1-2 is occupied or
released;
SpdSinkCS locks or unlocks the shared memory by setting the flag to '1' or '0'.
If lock-1-2 is occupied, processor p1 will not write a packet into the shared structure.
SpdSourceCS:
SpdSourceCS(...) is a pull element ("pull" is explained in Chapter 4.2.1).
The key segment of SpdSourceCS for reading is described in Figure 5-7:
(1) AddrCSChip1 = (uint*) 0x0e000000
(2) AddrCSFlag = (uint*) 0x0d010000
The relevant "C" code algorithm: SpdSinkCS
(1) Key segment begin:
AddrCSFlag = (uint*) 0x0d010000;
AddrCSChip1 = (uint*) 0x0e000000;
*AddrCSFlag = 0;
void SpdSinkCS::push(int, Packet *p)
{
    if (p != 0) {
        unsigned flag_push = (unsigned) (*AddrCSFlag);
        // check whether the shared lock structure is available; if not, wait
        while (flag_push == 1) {
            printf("AddrCSFlag-1: %d\n", flag_push);
            flag_push = (unsigned) (*AddrCSFlag);
        }
        // fill the packet into the shared lock structure
        if (flag_push == 0) {
            int len = (int) p->length();
            *(AddrCSChip1 + 1) = len;
            unsigned char *dataIn_push = (unsigned char *) p->data();
            for (int i = 0; i < len; i++) {
                *(AddrCSChip1 + 2) = i;           // index of the byte in the lock buffer
                *AddrCSChip1 = *dataIn_push;      // the byte itself
                dataIn_push++;
            }
            *AddrCSFlag = 1; // set inReady = 1, telling that the shared lock
                             // structure is not available now
        }
        p->kill();
    }
}
(2) The key segment end:
Figure 5-6 The code segment to write packets into the shared hardware
on the StepNP platform
In line (1), the address pointer AddrCSChip1 holds the address of lock-1-2;
In line (2), the address pointer AddrCSFlag holds a value indicating whether lock-1-2 is occupied or
released;
SpdSourceCS locks or unlocks the shared memory by setting the flag to '1' or '0'.
Processor p2 will not read from the shared structure until lock-1-2 is released.
The relevant "C" code algorithm: SpdSourceCS
(1) Key segment begin:
AddrCSFlag = (uint*) 0x0d010000;
AddrCSChip1 = (uint*) 0x0e000000;
*AddrCSFlag = 0;
Packet *SpdSourceCS::pull(int)
{
    struct SPD &spd = *(SPD *) addr;
    spd.inReady = (int) (*AddrCSFlag);
    // check whether the shared lock structure is available; if not, wait
    while (spd.inReady != 1) {
        printf("AddrCSFlag-spdsource: %d\n", spd.inReady);
        spd.inReady = (int) (*AddrCSFlag);
    }
    // read the packet out of the shared lock structure
    if (spd.inReady == 1) {
        spd.lenIn = (int) *(AddrCSChip2 + 1);
        if (spd.lenIn < 0 || spd.lenIn >= SPDMTU) return 0;
        for (int i = 0; i < spd.lenIn; i++) {
            spd.dataIn[i] = (unsigned char) *AddrCSChip2;
        }
        WritablePacket *p = Packet::make(spd.dataIn, spd.lenIn);
        memcpy(p->data(), spd.dataIn, spd.lenIn);
        *AddrCSFlag = 0; // release lock structure
        return p;
    }
}
(2) The key segment end:
Figure 5-7 The code segment to read packets from the shared hardware on
the StepNP platform
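Taken together, Figures 5-6 and 5-7 describe a one-slot handshake: the writer proceeds only while the flag is 0 and sets it to 1 when the packet is complete, and the reader proceeds only while the flag is 1 and resets it to 0 when done. The sketch below models that protocol in ordinary C++ so that it can be checked without the simulator. It is an illustration, not the Click elements themselves: Mailbox, try_write and try_read are hypothetical names, the calls return immediately instead of busy-waiting as the figures do, and the model is single-threaded (a version shared between real threads would need atomic access to the flag).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative one-slot mailbox following the flag protocol of the locks.
class Mailbox {
public:
    // Writer side (the role of SpdSinkCS): succeeds only while the slot
    // is free, then marks the slot as occupied.
    bool try_write(const std::vector<std::uint8_t> &packet) {
        if (full_) return false;  // flag == 1: previous packet not read yet
        slot_ = packet;
        full_ = true;             // flag <- 1: packet ready
        return true;
    }

    // Reader side (the role of SpdSourceCS): succeeds only when a packet
    // is waiting, then releases the slot.
    bool try_read(std::vector<std::uint8_t> &out) {
        if (!full_) return false; // flag == 0: nothing to read
        out = slot_;
        full_ = false;            // flag <- 0: slot released
        return true;
    }

private:
    std::vector<std::uint8_t> slot_;
    bool full_ = false;
};
```

Because the slot holds a single packet, the writer can run at most one packet ahead of the reader, which is the coupling between p1 and p2 (or p3) that the pipeline relies on.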
After completing the design of the software and some hardware components, the original IPv4 router
may be changed into the new one described in Figure 5-8.
Figure 5-8 The original IPv4 configuration and the new IPv4 configuration drawing with the two newly added Click elements, SpdSinkCS(addr2) and SpdSourceCS(addr2)
5.4 Overall description of communication between processors
In general, Figure 5-9 shows how data (e.g. packets) are exchanged between two processors (p1
to p2, or p1 to p3).
Figure 5-9 The communication between three processors with the help of software and hardware
(a) Three different IPv4-1 to IPv4-3 configurations: IPv4-1 running on processor p1 ends in SpdSinkCS(addr1) and SpdSinkCS(addr2); IPv4-2 running on processor p2 and IPv4-3 running on processor p3 start with SpdSourceCS(addr1) and SpdSourceCS(addr2), each followed by DropBroadcasts and CheckPaint
(b) Hardware architecture with shared lock structure on StepNP
In Figure 5-9, after routing by the element "LookupIPRouter", the packet has a destination address
to go to. Thus, the packet is transferred to a SpdSinkCS element as an incoming packet. There
are two choices at this moment:
1) If the packet takes the "addr1" path, element SpdSinkCS(addr1) running on processor p1 must
check, by a read operation, whether the hardware lock-1-2 is occupied. Processor p1 must check
and wait repeatedly until lock-1-2 is released. Once it is unlocked, processor p1 writes the packet into
the lock and continues to process other incoming packets by executing elements such as
"Classifier(...)" in the ipv4-1 flow. At that moment, if processor p2 is executing the SpdSourceCS element
code and the lock holds the packet, p2 reads the packet out of lock-1-2 and passes the
packet on to the element "DropBroadcasts". Thus, processors p1 and p2 may process packets
in a pipeline, executing the different configurations ipv4-1 and ipv4-2.
2) If the packet takes the "addr2" path, element SpdSinkCS(addr2) running on processor p1 must
check, by a read operation, whether the hardware lock-1-3 is occupied. Processor p1 must check
and wait repeatedly until lock-1-3 is released. Once it is unlocked, processor p1 writes the packet into
the lock and continues to process other incoming packets by executing elements such as
"Classifier(...)" in the ipv4-1 flow. At that moment, if processor p3 is executing the SpdSourceCS element
code and the lock holds the packet, p3 reads the packet out of lock-1-3 and passes the
packet on to the element "DropBroadcasts". Thus, processors p1 and p3 may process packets
in a pipeline, executing the different configurations ipv4-1 and ipv4-3.
In this way, processors p1, p2 and p3 may each execute their own task in parallel, and
transfer packets in a pipeline.
In addition to the above-mentioned supports, the StepNP platform can execute software code on a
cycle-accurate instruction-set simulator (ISS) [27] based on standard RISC processors, and uses the
SystemC Open Protocol (SOCP) for communication [6] between components. All the code is written in
the SystemC language (defined in Chapter 4.1). Thus, after designing the software for the multiprocessor
architecture and building a cycle-accurate executable architecture [27], we can start system-level
simulation to verify whether the processors on the architecture can run in parallel correctly.
While the hardware parts (e.g. ARM processor, memory, channel, etc.) of the architecture can be
modeled in SystemC, the Click router software parts may be executed in the StepNP ISS core to support
SoC validation at system level [27] and to obtain the performance evaluation result for the overall
multiprocessor architecture.
Chapter 6
Performance Evaluation
With the support of the StepNP simulation platform, we obtain evaluation results to estimate design
quality costs such as execution time in cycles, to validate whether my methodology is feasible,
and to calculate the overall Click IPv4 acceleration achieved with the help of co-processors, shortening
the design process in HW/SW co-design.
Performance evaluation mainly consists of two sections:
6.1 hardware/software partitioning on one processor with ASIC hardware (described in Chapter 4);
6.2 the evaluation of the IPv4 router on an application-specific three-processor SoC architecture
(described in Chapter 5).
6.1 Evaluation of the hardware/software co-design methodology in Chapter 4
This section presents the result of the hardware/software partitioning methodology implemented on
the Click IPv4 router and the way in which this result was obtained.
6.1.1 Synthesis tools
This section introduces the tools for simulating and synthesizing the HW/SW design.
1) Software simulation: we choose the STMicroelectronics StepNP platform for developing and testing
our design.
2) Hardware module: we choose Synopsys CoCentric SystemC Compiler to synthesize the behavioral
design.
6.1.2 Initial constraints for synthesis
bc_enable_analysis_info = "true"
bcmulticycle = "false"
cycle period = "25ns"
IO module = "superstate"
effort module = "low"
6.1.3 Synthesis result
CoCentric SystemC Compiler presents reports about resources, timing and FSM status for the four
hardware modules (for the CheckSum module, see D. Quinn and M. Hubin's report [12]).
bc_shell> create_clock clk -p 25
The cycle period is 25 ns.
bc_shell> report_schedule -abstract_fsm > csdecttl.fsm
We used the above command to get the FSM report (see Appendix A-1).
The statistics are given in Table 6-1. The detailed information is shown in Appendix A-1.
Hardware     Estimated           Area                                  Resource Status
Module       Cycles    (FSM)     Combination   Sequential   Total      ADD/SUB/CM   Operation (bits)
ccstrip      1         3         188           733          921        1/0/0        32
csipfragp    2         4         2427          1614         4042       1/1/1        32
csdecipttl   1         3         247           565          812        3            8-16
csiptable    1         6         6396          55882        62278      1            32
Table 6-1 Result from CoCentric SystemC Compiler [12]
6.1.4 Performance analysis by result
We use the evaluation methodology described in D. Quinn and M. Hubin's report [12].
We set breakpoints in the elements and counted the number of cycles with the aid of the StepNP
platform. From the simulation, we thus obtained a group of figures comparing the running cycles of
software and hardware (except for the CheckSum element, which is explained in D. Quinn and
M. Hubin's report [12]).
Then, we supposed that the Click software runs on an ARM CPU at 150 MHz.
Therefore, the cycle period is different for HW and SW:
Software period/cycle = 6.6 ns
Hardware period/cycle = 25 ns
Meanwhile, we suppose that an instruction executes in one cycle. Thus, we get the following results
about software/hardware for the different HW modules.
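Because the two clock periods differ, a hardware cycle count and a software cycle count are only comparable after conversion to time. The following small sketch makes that conversion explicit, using the two periods stated above; the names cycles_to_ns, kSoftwareNsPerCycle and kHardwareNsPerCycle are ours. For instance, 10 hardware cycles last 250 ns, which is roughly the time of 38 software cycles.

```cpp
#include <cassert>
#include <cmath>

// Cycle periods stated in Section 6.1.4.
constexpr double kSoftwareNsPerCycle = 6.6;   // ARM CPU at 150 MHz
constexpr double kHardwareNsPerCycle = 25.0;  // synthesis clock period

// Converts a cycle count to nanoseconds for a given cycle period.
double cycles_to_ns(unsigned cycles, double ns_per_cycle) {
    return cycles * ns_per_cycle;
}
```

A comparison such as cycles_to_ns(hw_cycles, kHardwareNsPerCycle) against cycles_to_ns(sw_cycles, kSoftwareNsPerCycle) then states both sides in the same unit.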
6.1.5 Software/hardware analysis
              Strip Module   Total (including call functions)   Actual in Element
Hardware      2              5                                  3
Software      133            152                                152
Table 6-2 Cycle Number in Strip module execution [12]
              DecIPTTL Module   Total (including call functions)   Actual in Element
Hardware      2                 6                                  4
Software      21                44                                 44
Table 6-3 Cycle Number in DecIPTTL module execution [12]
              IPFragmenter Module   Total (including call functions)   Actual in Element
Hardware      3                     13                                 10
Software      93                    200                                200
Table 6-4 Cycle Number in IPFragmenter module execution [12]
In Tables 6-2 to 6-4, the hardware cycles for DecIPTTL, IPFragmenter and Strip are obtained from
CoCentric SystemC Compiler.
From the above figures we must take note of two points:
1. The period of a HW cycle is longer than that of a software cycle. We must now consider how to make
the HW module and the SW code run in parallel. To avoid the software code waiting to read the result
from the hardware, we can move the writing of data to the hardware forward and keep some length of
code between the write to and the read from the HW, letting the CPU execute other necessary code
(which the CPU must execute anyway) in the same element. After the CPU finishes executing this other
code, it begins to read the results from the hardware module. In this way, we can avoid the latency of
waiting for the results of the hardware module.
2. When we evaluate the hardware performance, we must consider the cycles for writing an instruction
to the HW; we suppose one cycle for this step. Figure 6-1 illustrates the two above-mentioned points:
/*
*(AddrCS + 1) = hlen;
*(AddrCS + 2) = off;
*(AddrCS + 3) = len;
*(AddrCS + 4) = ipoff;
*(AddrCS + 5) = plen;
*(AddrCS + 6) = p1_datalen;
*(AddrCS) = p1_len;
// Between these two code segments, other
// Click router instructions are executed, so that the software
// runs in parallel with the hardware component (or ASICs)
ip1->ip_sum = 0;
ip1->ip_hl = *(AddrCS + 1);
ip1->ip_off = *(AddrCS + 2);
ip1->ip_len = *(AddrCS + 3);
*/
Figure 6-1 The code segment in the IPFragmenter element
As for the routing table, one cycle is taken for searching, and the capacity of the routing table array
is set to 20 in our testing.
We suppose that the worst case is to parse the whole data array, and that the first packet always costs
the most time to look up the output port in a routing table. Thus we get that the hardware cycle count
for searching the routing table is 25 cycles.
              Routing table Module   Total (including call functions)   Actual in Element
Hardware      25                     30                                 5
Software      926                    1040                               1040
Table 6-5 Cycle Number in Lookup Routing Table module execution [12]
6.1.6 Click IPv4 router performance in co-design
We use Amdahl's formula [12] to calculate the acceleration obtained by HW/SW co-design. This formula
is applied to measure the extent to which a co-design methodology improves the software.
$$Gain = \frac{1}{(1 - p) + \frac{p}{acc}}$$
p : Instructions implemented by HW / Instructions in Total Elements (1%)
acc : Instructions before using HW / Instructions after using HW (100)
Gain : Evaluation parameter for the performance improvement
Normally, the larger the proportion of the software that is improved, the larger the Gain. The bigger
the Gain, the better the performance of the software.
For example, from Table 6-5 (LookupIPRouter), we can get:
Acc = 1040/5 = 208
Gain = 4.58
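The formula is easy to evaluate numerically, and it can also be inverted to see which fraction p a reported Gain implies. The sketch below does both. The function names are ours, and the value p of about 0.785 for LookupIPRouter is back-solved from the reported Gain = 4.58 and Acc = 208; it is an inference for illustration, not a figure stated in the text.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's formula as used in Section 6.1.6.
// p   : fraction of the instructions moved to hardware, in [0, 1]
// acc : acceleration of that fraction (cycles before / cycles after)
double amdahl_gain(double p, double acc) {
    return 1.0 / ((1.0 - p) + p / acc);
}

// Inverse relation: the fraction p implied by an observed gain and acc.
// Follows from 1/gain = 1 - p * (1 - 1/acc); requires acc != 1.
double implied_fraction(double gain, double acc) {
    return (1.0 - 1.0 / gain) / (1.0 - 1.0 / acc);
}
```

Under these assumptions, implied_fraction(4.58, 208.0) is about 0.785, and feeding that value back into amdahl_gain reproduces a Gain of about 4.58. The formula also shows why the Gain is bounded by 1 / (1 - p) however large acc becomes.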
Table 6-6 gives the speedup of each element and the overall speedup for the whole IPv4 router.
Elements in IPv4 Router   Gain
Classifier                1.00
Paint                     1.00
Strip                     3.50
CheckIPHeader             1.85
LookupIPRouter            4.58
DropBroadcasts            1.00
PaintTee                  1.00
IPGWOptions               1.18
FixIPSrc                  3.24
DecIPTTL                  2.76
IPFragmenter              1.05
EtherEncap                1.00
Queue                     1.00
ICMPError                 1.17
Average                   1.82-1.87
Table 6-6 Improved status for each element in IPv4 Click router
Therefore, we can see that the overall performance of the Click IPv4 router has been improved to
1.82-1.87 times that of the original after using the methodology (mentioned in Chapter 3). The
actual performance is normally better than the data show, because the Click router can be applied
as the NPU in each network card. This may be called a Distributed Router Architecture [3]. A
normal packet usually passes along the time-critical path, which includes the Strip, CheckIPHeader,
LookupIPRouter, DecIPTTL, ARPQuerier and Queue elements. On this path, routing table searching
is a time-consuming packet forwarding step and a bottleneck that seriously lowers the
performance of a router. With the co-design methodology, we can see from the Gain of the
LookupIPRouter element in Table 6-6 that routing table (RB) searching gets the most improvement.
The non-time-critical path includes the IPGWOptions, FixIPSrc and IPFragmenter elements [3], which
cost many instruction cycles to transfer data in memory. If a packet does not pass these elements, the
Gain will increase. Thus the performance of the IPv4 Click router is much higher most of the time.
The methodology has a disadvantage, however: for the LookupIPRouter element, the hardware
time consumed searching the routing table is too long (25 cycles). The interval between writing to and
reading from the hardware module is about 10 hardware cycles (25 ns/cycle). Table 6-5 shows that the
hardware costs 25 hardware cycles for searching the routing table with our co-design methodology.
Therefore, the methodology needs to be improved by using a better data structure and search algorithm.
In addition, applying multiple NPUs is a possible way to improve the performance of the Click router.
6.2 Evaluation of the methodology explained in Chapter 5
This section presents the result of the methodology for implementing an object-oriented program
on multiple processors, using the Click IPv4 router as an example. The section also gives the
method used to evaluate this methodology and obtain the result.
6.2.1 Experimental tools and approach
The experimental tools consist of the StepNP platform, which offers a simulation environment and is
developed by STMicroelectronics. The purpose of the experimental approach is to estimate how
much the time to process packets may be shortened, by comparing Click executed on three ARM
processors with Click running on a single ARM processor, and to verify whether the methodology
(described in Chapter 5) may improve the overall performance of the Click router on three ARM
processors. We choose "cycles per packet" as the execution time for a packet, meaning how many
cycles are required to process a packet in an IPv4 Click router configuration.
Based on the StepNP tool, we designed two experimental architectures as follows:
(1) An experimental architecture with three multi-threaded ARM processors.
(2) An experimental architecture with a single multi-threaded ARM processor.
In architecture (1), the three processors each execute the same Click program with three
different router configuration parameters. As a whole, this router architecture undertakes the
same task of processing incoming IP packets as architecture (2) does (the reason is explained
in Chapter 5). Then, we set up a check point in the "Print" and "Strip" elements (C++ objects) to count
the number of executed cycles in the StepNP simulation. In the following step, we use "InfiniteSource",
offered by the Click software router, to generate two streams of packets that pass through the two
above-mentioned router architectures. To make the gathered experimental data realistic, the two
streams of packets take different routes while passing through the two router architectures. By
gradually increasing the number of packets passing through those router architectures, we gather the
experimental results for analysis.
6.2.2 Experimental result
(a) The cycles required for packets to pass the Click router on a single ARM processor.
Packets   Start cycle   End cycle      Cycles per packet
1         20679838      20857502       177664
2         20679862      21187070       253604
3         20679862      21512638       277592
5         20679862      22165366       297101
7         20679862      22818094       305462
10        20679862      23798166       311830
12        20679862      24452030       314347
14        20699358      25132798       316674
18        20699358      26432126       318487
20        20710622      27100966       319517
35        20710622      32003638       322658
50        20710622      36919744       324182
80        20710622      46726054       325193
100       20710622      53277630       325670
250       20681646      102292734      326444
500       20681646      184043526      326724
902       20681646      315520334      326872
2000      20701846      674883886      327091
3000      20701846      1002177166     327158
4000      20681646      1329470446     327197
5000      20701846      1656764054     327212
10000     20701846      3293231310     327253
50000     20685046      16373670990    327060
100000    20685046      32728341502    327077
(b) The cycles required for packets to pass the Click router on three ARM processors.
Packets   Start cycle   End cycle      Cycles per packet
1         9488038       9665638        177600
2         9488062       9848174        180056
3         9488062       10022022       177987
5         9488062       10375942       177576
7         9488062       10729910       177407
10        9488062       11263694       177563
12        9488062       11616230       177347
14        9488062       11972318       177447
18        9493870       12678422       176920
20        9496742       13037374       177032
35        9496742       15686758       176858
50        9496742       18342214       176909
80        9496742       23645446       176859
100       9496742       27183790       176870
250       9478486       53799942       177286
500       9478486       98131614       177306
902       9478486       169417782      177316
2000      9498582       364391870      177447
3000      9498582       541965150      177489
4000      9498582       719538430      177510
5000      9498582       897112038      177523
10000     9498582       1784979294     177548
50000     9493174       8893030366     177671
100000    9493174       17778100878    177686
Table 6-7 The cycles counted for packets passing the
Click router on two architectures respectively
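The last column of Table 6-7 appears to be derived from the two cycle readings and the packet count: it matches the difference of the readings divided by the number of packets, rounded to the nearest integer. The sketch below states that relation; cycles_per_packet is a name introduced here, and the rounding rule is inferred from the rows we checked (for example 1, 2 and 5 packets in part (a)), not stated in the text.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Average number of cycles per packet between two cycle-counter readings,
// rounded to the nearest integer. Requires packets > 0.
long long cycles_per_packet(std::uint64_t start_cycle,
                            std::uint64_t end_cycle,
                            std::uint64_t packets) {
    assert(packets > 0);
    const double total = static_cast<double>(end_cycle - start_cycle);
    return std::llround(total / static_cast<double>(packets));
}
```

For the 5-packet row of part (a), (22165366 - 20679862) / 5 = 297100.8, which rounds to the tabulated 297101.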
The original performance estimation data are shown in Table 6-7. We should note that a
packet does not require so many cycles to pass the Click router in a real router setup. We only want
to observe the change in performance between the two Click router architectures by comparing the
number of cycles per packet. Therefore, whether or not the data reflect the real time for a packet to
pass the Click router does not affect the accuracy of the performance analysis based on these
experimental results.
6.2.3 The discussion of results
Based on the data in Table 6-7, we work out the comparison of the performance
of the two Click router architectures, shown in Table 6-8.
Packets   Cycles per packet (one processor)   Cycles per packet (three processors)   Speedup
1         177664                              177600                                 1.000
2         253604                              180056                                 1.408
3         277592                              177987                                 1.560
5         297101                              177576                                 1.673
7         305462                              177407                                 1.722
10        311830                              177563                                 1.756
12        314347                              177347                                 1.772
14        316674                              177447                                 1.785
18        318487                              176920                                 1.800
20        319517                              177032                                 1.805
35        322658                              176858                                 1.824
50        324182                              176909                                 1.832
80        325193                              176859                                 1.839
100       325670                              176870                                 1.841
250       326444                              177286                                 1.841
500       326724                              177306                                 1.843
902       326872                              177316                                 1.843
2000      327091                              177447                                 1.843
3000      327158                              177489                                 1.843
4000      327197                              177510                                 1.843
5000      327212                              177523                                 1.843
10000     327253                              177548                                 1.843
50000     327060                              177671                                 1.841
100000    327077                              177686                                 1.841
Table 6-8 The comparison of the packet processing performances
on the two Click router architectures
According to Table 6-8, we are able to draw the diagram described in Figure 6-2.
[Figure 6-2: line chart of the speedup (vertical axis, 0.000 to 2.000) against Packet_Number (horizontal axis)]
Figure 6-2 The Speedup of the Performance on Multi_Processors
In Figure 6-2, the curve rises steeply at the start and then levels off as the number of packets
processed increases. We may observe two points from this phenomenon.
1) As the number of packets arriving at the Click router architecture increases, the speedup
(cycles per packet on one processor ÷ cycles per packet on three processors) increases
gradually until it reaches a stable point (1.843 in Figure 6-2 from 902 packets on). This
shows that the Click router running on three processors may process packets 1.843 times
faster than on one processor, and that this advantage grows with the number of incoming
packets until the stable point is reached.
2) If the number of packets continues to increase beyond a certain point (from 50000 packets
on in Table 6-8), the speedup no longer increases and even drops slightly, from 1.843 to
1.841. This indicates that packet-processing performance becomes difficult to improve
further, and that it may even degrade when the number of incoming packets becomes very large.
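The speedup values discussed above can be recomputed from the two cycles-per-packet columns of Table 6-8. The sketch below does this for a few rows copied from the table. The names `speedup`, `Row`, and `kRows` are illustrative and are not part of the thesis.

```cpp
// Speedup as used in Table 6-8: cycles per packet on one processor divided
// by cycles per packet on three processors. A value above 1 means that the
// three-processor architecture needs fewer cycles per packet.
constexpr double speedup(double one_cpu_cycles, double three_cpu_cycles) {
    return one_cpu_cycles / three_cpu_cycles;
}

// A few rows copied from Table 6-8.
struct Row {
    long packets;
    double one_cpu;    // cycles per packet, one ARM processor
    double three_cpu;  // cycles per packet, three ARM processors
};

constexpr Row kRows[] = {
    {1, 177664, 177600},
    {2, 253604, 180056},
    {902, 326872, 177316},
    {100000, 327077, 177686},
};
```

Calling `speedup(kRows[2].one_cpu, kRows[2].three_cpu)` reproduces the plateau value of about 1.843, and the last row gives about 1.841, the slight drop noted in point 2.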
Given the above analysis, we may draw the following conclusion.
The Click router implemented on three processors processes packets nearly twice as fast as on a
single processor, which shows that the methodology described in Chapter 5 is feasible. If the
number of packets to process is small, the advantage of the Click multiprocessor architecture is
not very noticeable. As the number of incoming packets increases, the time needed to process a
packet on three processors shortens rapidly relative to a single processor. The Click
multiprocessor architecture is therefore well suited to processing large numbers of packets.
However, this processing ability may be limited once the number of packets exceeds certain
bounds, meaning that a packet has to wait longer to pass through the Click router. This
indicates that the hardware resources have become saturated. In that case, we may consider adding
processors or applying other methodologies to improve packet-processing performance on the Click
router. The methodology depicted in Chapter 5 is scalable: based on it, we can add processors to
design a new multiprocessor architecture easily, provided that we take the design constraints of
the hardware resources into account.
Chapter 7
Conclusion and Future Work
We presented a HW/SW co-design methodology implemented in the Click router. As a software-based
router, Click is flexible and extensible thanks to its modular architecture. The router
architecture can be assembled into different configurations to serve different requirements. To
avoid the disadvantages of software, we chose some key Click elements and implemented them as
hardware modules. Using co-design methodologies, we can increase Click's performance in the
packet-forwarding process and improve the performance of the overall Click router. The
methodology was tested on the multithreaded ARM CPU of the StepNP platform. Measured in cycles
per packet, packet processing is about 1.8 times faster than before the methodology was applied.
In addition, we also found a methodology to implement different parts of Click IPv4 on multiple
ARM processors. This approach reduces the overhead of passing packets between elements. The
current routing-table data structure means that Click is better suited to edge routers.
Our future work may include implementing Click IPv4 with a combination of multiprocessors
and HW/SW co-design. For large routing tables, we will try to replace the current linear
algorithm in the LookupIPRoute element with a binary search tree algorithm [7, 8] and test the
performance. Some research suggests that such a binary lookup tree algorithm may shorten the time
needed to find a route, especially when searching large networks [8]. If such an algorithm can be
used in the Click router and implemented in hardware, we may improve the performance of Click
while keeping its flexibility advantage over conventional routers. Furthermore, because Click can
be flexibly assembled into various router configurations according to different router functions
or usages, such as a firewall or an Ethernet switch, we will try to implement practical elements
such as RED and IPFilter, which are key elements [1] in those configurations, and evaluate their
effect on the performance of the overall router configuration. This future work may cover more
parts of the Click router and achieve higher Click performance in network applications.
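To illustrate the difference between a linear route lookup and a tree-based one, the sketch below contrasts a linear longest-prefix scan with a simple binary trie. It is an illustration under assumptions only: the names `Route`, `mask_of`, `lookup_linear`, and `BinaryTrie` are hypothetical, the code is not taken from Click, and a binary trie is just one common form of binary lookup tree, not necessarily the schemes described in [7, 8].

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical route entry: prefix and address are in host byte order.
struct Route {
    std::uint32_t prefix;  // network prefix
    int length;            // prefix length in bits, 0..32
    int port;              // output port, assumed to be >= 0
};

inline std::uint32_t mask_of(int length) {
    return length == 0 ? 0u : ~std::uint32_t{0} << (32 - length);
}

// Linear scan: every route is examined and the longest match is kept,
// so the cost grows with the size of the routing table.
inline int lookup_linear(const std::vector<Route>& table, std::uint32_t addr) {
    int best_port = -1;
    int best_len = -1;
    for (const Route& r : table) {
        const std::uint32_t m = mask_of(r.length);
        if ((addr & m) == (r.prefix & m) && r.length > best_len) {
            best_len = r.length;
            best_port = r.port;
        }
    }
    return best_port;
}

// Binary trie: one level per address bit, so a lookup visits at most 32
// nodes below the root regardless of how many routes are stored.
class BinaryTrie {
public:
    void insert(const Route& r) {
        assert(r.length >= 0 && r.length <= 32);
        Node* node = &root_;
        for (int i = 0; i < r.length; ++i) {
            const int bit = (r.prefix >> (31 - i)) & 1;
            if (!node->child[bit]) node->child[bit].reset(new Node());
            node = node->child[bit].get();
        }
        node->port = r.port;
    }

    // Returns the port of the longest matching prefix, or -1 if none matches.
    int lookup(std::uint32_t addr) const {
        const Node* node = &root_;
        int best = node->port;
        for (int i = 0; i < 32; ++i) {
            node = node->child[(addr >> (31 - i)) & 1].get();
            if (!node) break;
            if (node->port >= 0) best = node->port;
        }
        return best;
    }

private:
    struct Node {
        int port = -1;  // -1 marks a node that holds no route
        std::unique_ptr<Node> child[2];
    };
    Node root_;
};
```

Both functions return the same port for a given address; the difference lies in how much work a lookup does as the table grows, which is the motivation given above for moving away from the linear algorithm.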
References
[1] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. Frans Kaashoek, “The Click Modular
Router”, ACM Transactions on Computer Systems 18(3), August 2000, pp. 263-297.
http://www.pdos.lcs.mit.edu/papers/click:tocs00/
[2] D. Decasper, Z. Dittia, G. Parulkar, and B. Plattner, “Router Plugins: A Software Architecture
for Next-Generation Routers”, IEEE/ACM Transactions on Networking, Feb. 2000, pp. 2-15.
[3] J. Aweya, “IP Router Architectures: An Overview”, CiteSeer.IST, 1999.
http://citeseer.nj.nec.com/aweya99ip.html
[4] Y. Gottlieb and L. Peterson, “A Comparative Study of Extensible Routers”, Open
Architectures and Network Programming Proceedings, 2002 IEEE, June 2002, pp. 51-62.
[5] D. L. Mills, “The Fuzzball”, in Proceedings of the SIGCOMM '88 Symposium, Stanford, CA,
USA, Aug. 1988, pp. 115-122.
[6] P. G. Paulin, C. Pilkington, and E. Bensoudane, “StepNP: A System-Level Exploration Platform
for Network Processors”, Design & Test of Computers, IEEE, Nov.-Dec. 2002, pp. 17-26.
[7] D. Pao, C. Liu, A. Wu, L. Yeung, and K. S. Chan, “Efficient Hardware Architecture for Fast IP
Address Lookup”, Computers and Digital Techniques, IEE Proceedings, Jan. 2003,
pp. 43-52.
[8] T. Harbaum, D. Meier, M. Zitterbart, and D. Brokelmann, “Hardware-Assist for IPv6 Routing
Table Lookup”, International Symposium on Broadband European Networks (SYBEN),
Zurich, Switzerland, May 1998. http://citeseer.nj.nec.com/aweya99ip.html
[9] S. Floyd and V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance”,
IEEE/ACM Transactions on Networking, Aug. 1993, pp. 397-413.
[10] A. Moestedt, P. Sjodin, and T. Kohler, “Header Processing Requirements and Implementation
Complexity for IPv4 Routers”, HP Labs Technical Reports, 1998.
http://www.hpl.hp.com/techreports/98/HPL-IRI-98-001.html
[11] E. Kohler, R. Morris, and B. Chen, “Programming language optimizations for modular router
configurations”, ACM SIGOPS Operating Systems Review, Session: Communication
abstractions and optimizations, Dec. 2002, pp. 251-263.
[12] D. Quinn and M. Hubin, “Approche co-design pour des optimisations de haut niveau”, IF6221
Synthèse des systèmes numériques, Dec. 2002.
[13] W. Stallings, “Data and Computer Communications”, Prentice Hall, Upper Saddle River,
N.J., c2004.
[14] R. Niemann, “Hardware/Software Co-Design for Data Flow Dominated Embedded
Systems”, Kluwer Academic Publishers, Boston, c1998.
[15] G. De Micheli, “Computer-aided hardware-software codesign”, Micro, IEEE, Aug. 1994,
pp. 10-16.
[16] G. De Micheli and K. Gupta, “Hardware/Software Co-Design”, Proceedings of the IEEE,
March 1997, pp. 349-365.
[17] J. Levman, G. Khan, and J. Alirezaie, “Hardware-Software Co-Design of Embedded
Systems: A Brief Survey”, Department of Electrical and Computer Engineering, Ryerson
University.
[18] A. Kalavade and Edward A. Lee, “A Hardware-Software Codesign Methodology for DSP
Applications”, Design & Test of Computers, IEEE, Sept. 1993, pp. 16-28.
[19] A. Jantsch, P. Ellervee, J. Oberg, and A. Hemani, “A Case Study on Hardware/Software
Partitioning”, FPGAs for Custom Computing Machines, Proceedings. IEEE Workshop
on, 10-13 April 1994, pp. 111-118.
[20] W. Wolf, “A Decade of Hardware/Software Codesign”, Computer, IEEE, April 2003,
pp. 38-43.
[21] Wayne H. Wolf, “Hardware-Software Co-Design of Embedded Systems”, Proceedings of
the IEEE, July 1994, pp. 967-989.
[22] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, and A. S-Vincentelli, “A Formal
Methodology for Hardware/Software Co-design of Embedded Systems”, IEEE Micro,
August 1994, pp. 26-36. http://citeseer.ist.psu.edu/chiodo94formal.html
[23] Daniel D. Gajski, F. Vahid, and S. Narayan, “A System-Design Methodology: Executable
Specification Refinement”, European Design and Test Conference, 1994. EDAC, The
European Conference on Design Automation. ETC European Test Conference. EUROASIC,
The European Event in ASIC Design, Proceedings, IEEE, March 1994, pp. 458-463.
[24] S. Schulz, J. W. Rozenblit, K. Buchenrieder, and M. Mrva, “Concepts for Model Compilation in
Hardware/Software Codesign”, World Computing Congress, Beijing, 2000, pp. 413-420.
[25] R. Domer, A. Gerstlauer, P. Kritzinger, and M. Olivarez, “The SpecC System-Level Design
Language and Methodology, Part 2, Class 349”, Embedded Systems Conference, San
Francisco, March 2002. http://citeseer.ist.psu.edu/559949.html
[26] P. G. Paulin, C. Pilkington, E. Bensoudane, and M. Langevin, “Application of a Multi-
Processor SoC Platform to High-Speed Packet Forwarding”, Design, Automation and Test
in Europe Conference and Exhibition, 2004 Proceedings, Feb. 2004, pp. 58-63.
[27] A. Baghdadi, D. Lyonnard, N. Zergainoh, and A. A. Jerraya, “An Efficient Architecture
Model for Systematic Design of Application-Specific Multiprocessor SoC”, Design,
Automation and Test in Europe, 2001. Conference and Exhibition 2001. Proceedings,
March 2001, pp. 55-62.
[28] B. Chen and R. Morris, “Flexible Control of Parallelism in a Multiprocessor PC Router”,
The USENIX 2001 Annual Technical Conference, June 2001.
[29] T. L. Casavant, P. Tvrdik, and F. Plasil, “Parallel Computers: Theory and Practice”, IEEE
Computer Society Press, c1996.
[30] J. L. Hennessy and D. A. Patterson, “Computer Organization and Design: The
Hardware/Software Interface”, Morgan Kaufmann Publishers, Inc., San Francisco,
California, c1998.
[31] H. Philip and J. Enslow, “Multiprocessors and Parallel Processing”, N.Y., Toronto, Wiley,
c1974.
Appendix A
***************************************************************************
Date    : Thu Oct 9 17:56:59 2003
Version : 2003.06
Design  : csdipttl
***************************************************************************
***********************************************
* State graph style report for process run: *
***********************************************
present          next
state     input  state    actions
s_0_0     c1     s_0_1    (no actions)
s_0_1     c2     s_1_2    a_0   main_loop_DecIPTTL/ipsum_46 (read)
                          a_1   main_loop_DecIPTTL/ip_ttl_45 (read)
                          a_3   main_loop_DecIPTTL/outsum_55 (write)
                          a_5   main_loop_DecIPTTL/outttl_54 (write)
                          a_8   main_loop_DecIPTTL/add_48 (operation)
                          a_11  main_loop_DecIPTTL/add_50 (operation)
                          a_14  main_loop_DecIPTTL/add_52 (operation)
s_0_1     c4     s_1_2    a_2   main_loop_DecIPTTL/start_44 (read)
s_1_2     c5     s_1_2    a_0   main_loop_DecIPTTL/ipsum_46 (read)
                          a_1   main_loop_DecIPTTL/ip_ttl_45 (read)
                          a_3   main_loop_DecIPTTL/outsum_55 (write)
                          a_5   main_loop_DecIPTTL/outttl_54 (write)
                          a_8   main_loop_DecIPTTL/add_48 (operation)
                          a_11  main_loop_DecIPTTL/add_50 (operation)
                          a_14  main_loop_DecIPTTL/add_52 (operation)
s_1_2     c6     s_1_2    a_2   main_loop_DecIPTTL/start_44 (read)
+++++     c7     s_0_0    (no actions)
*********** Branch Conditions ***********
state  condition source
c1     true
c2     (and (branch 0 of conditional main_loop_DecIPTTL/BB5SPLITL4410)
            true)
c4     true
c5     (branch 0 of conditional main_loop_DecIPTTL/BB5SPLITL4410)
c6     true
c7     true
Appendix B
cl s_O_1
c2 s_1_2
a2
a3
a4
aS
a 38
c4 s_1_2
c5 s_1_2
a 26
c6 s_1_3
a 15
c7 s_1_3
a9
c8 s_1_3
dO s_1_3
c12 s_1_3
c14 s_1_3
c16 s_1_2
a2
a3
a4
aS
a 38
c17 s_1_2
c18 s_1_2
a 26
c19 s_O_O
(no actions)
al loop47/hlenS4 (read)
loop47/ipoff 55 (read)
loop 47/0ff 53 (read)
loop47/pldatalens7 (read)
loop47/plenS6 (read)
loop47/sub6l (operation)
a_6 loop47/start5O (read)
a_18 loop47/add64 (operation)
loop47/1t64 (operation)
aO loop47/hulen52 (read)
loop47/add6l toperation)
a_7 loop47/out_iphl67 (write)
loop47/outipoff 68 (write)
(masked out)
(masked out)
(masked out)
(masked out)
al loop47/hlen54 (read)
loop47/ipoff 55 (read)
loop 47/off 53 (read)
loop47/pldatalens7 (read)
loop47/plen56 (read)
loop47/sub6l (operation)
a6 loop47/start5O (read)
a_18 loop47/adds4 (operation)
loop47/1t64 (operation)
(no actions)
*********** Branch Conditions ***********
state condition source
cl
c2
c4
c5
true
(and (branch O of conditional loop47/B35SPLITL5O1O)
true
true)
(and (brandi O of conditional loop47/BBSSPLITLSO1O)
true)
(branci O of conditional loop47/BB5SPLITLSO1O)
Date : Thu Oct 9 18:11:35 2003
Version : 2003.06
Design : csipfragp
***************************************************************************
***********************************************
* State graph style report for process run: *
***********************************************
next
input state actions
present
state
sOO
sOl
sOl
sOl
s12
s12
s12
s12
s12
s12
s13
s13
s13
+++++
c6
lccp47/BBSSPLITL5O1O)
lccp47/BBESPLITL6234)
lcop47/BB9SPLITL6443)
lccp47/BB6SPLITL6234)
lcop47/BB9SPLITL6443)
lccp47/BB5SPLITLSO1O)
c7
cS
dO
d12
d14
d16
c17
dB
d19
(branch
(branch
(branch
(branch
(branch
(branch
true
O cf conditional
O cf conditicnal
O of ccnditicnal
1 of ccnditicnal
1 cf ccnditional
O cf ccnditicnal
(branch O cf ccnditicnal lccp47/BB5SPLITL5O1O)
true
Appendix C
***************************************************************************
Date    : Thu Oct 9 18:28:19 2003
Version : 2003.06
Design  : csstrip
***************************************************************************
***********************************************
* State graph style report for process run: *
***********************************************
present          next
state     input  state    actions
s_0_0     c1     s_0_1    (no actions)
s_0_1     c2     s_1_2    a_0   processloop/data_51 (read)
                          a_1   processloop/in_50 (read)
                          a_3   processloop/result_56 (write)
                          a_6   processloop/add_54 (operation)
s_0_1     c4     s_1_2    a_2   processloop/start_47 (read)
s_1_2     c5     s_1_2    a_0   processloop/data_51 (read)
                          a_1   processloop/in_50 (read)
                          a_3   processloop/result_56 (write)
                          a_6   processloop/add_54 (operation)
s_1_2     c6     s_1_2    a_2   processloop/start_47 (read)
+++++     c7     s_0_0    (no actions)
*********** Branch Conditions ***********
state  condition source
c1     true
c2     (and (branch 0 of conditional processloop/BB5SPLITL4710)
            true)
c4     true
c5     (branch 0 of conditional processloop/BB5SPLITL4710)
c6     true
c7     true
Appendix D
***************************************************************************
Date    : Thu Oct 9 18:30:41 2003
Version : 2003.06
Design  : cs
***************************************************************************
***********************************************
* State graph style report for process run: *
***********************************************
present          next
state     input  state    actions
s_0_0     c1     s_0_1    (no actions)
s_0_1     c2     s_1_2    a_2   loop_38/start_38 (read)
s_0_1     c4     s_2_3    a_0   loop_40/in_46 (read)
                          a_3   loop_40/result_49 (write)
                          a_6   loop_40/add_47 (operation)
                          a_14  loop_40/add_47_2 (operation)
                          a_17  loop_40/add_48 (operation)
s_0_1     c6     s_2_3    a_1   loop_40/start_43 (read)
s_0_1     c7     s_2_3    a_2   loop_38/start_38 (read)
s_0_1     c8     s_2_3    (masked out)
s_0_1     c9     s_2_3    (masked out)
s_1_2     c11    s_1_2    a_2   loop_38/start_38 (read)
s_1_2     c12    s_2_3    a_0   loop_40/in_46 (read)
                          a_3   loop_40/result_49 (write)
                          a_6   loop_40/add_47 (operation)
                          a_14  loop_40/add_47_2 (operation)
                          a_17  loop_40/add_48 (operation)
s_1_2     c13    s_2_3    a_1   loop_40/start_43 (read)
s_1_2     c14    s_2_3    a_2   loop_38/start_38 (read)
s_1_2     c15    s_2_3    (masked out)
s_1_2     c16    s_2_3    (masked out)
s_2_3     c17    s_2_3    a_0   loop_40/in_46 (read)
                          a_3   loop_40/result_49 (write)
                          a_6   loop_40/add_47 (operation)
                          a_14  loop_40/add_47_2 (operation)
                          a_17  loop_40/add_48 (operation)
s_2_3     c18    s_2_3    (masked out)
s_2_3     c19    s_2_3    a_1   loop_40/start_43 (read)
s_2_3     c20    s_2_3    (masked out)
s_2_3     c21    s_2_3    (masked out)
+++++     c22    s_0_0    (no actions)
*********** Branch Conditions ***********
state  condition source
c1     true
c2     (and true
            (not (branch 1 of conditional loop_38/BB4SPLITL3832)))
c4     (and (branch 0 of conditional loop_40/BB5SPLITL4312)
            true
            (branch 1 of conditional loop_38/BB4SPLITL3832))
c6     (and true
            (branch 1 of conditional loop_38/BB4SPLITL3832))
c7     (and true
            (branch 1 of conditional loop_38/BB4SPLITL3832))
c8     (and true
            (branch 1 of conditional loop_38/BB4SPLITL3832))
c9     (and (branch 1 of conditional loop_40/BB5SPLITL4312)
            true
            (branch 1 of conditional loop_38/BB4SPLITL3832))
c11    (not (branch 1 of conditional loop_38/BB4SPLITL3832))
c12    (and (branch 0 of conditional loop_40/BB5SPLITL4312)
            (branch 1 of conditional loop_38/BB4SPLITL3832))
c13    (branch 1 of conditional loop_38/BB4SPLITL3832)
c14    (branch 1 of conditional loop_38/BB4SPLITL3832)
c15    (branch 1 of conditional loop_38/BB4SPLITL3832)
c16    (and (branch 1 of conditional loop_40/BB5SPLITL4312)
            (branch 1 of conditional loop_38/BB4SPLITL3832))
c17    (branch 0 of conditional loop_40/BB5SPLITL4312)
c18    (branch 0 of conditional loop_40/BB5SPLITL4312)
c19    true
c20    (branch 1 of conditional loop_40/BB5SPLITL4312)
c21    (branch 1 of conditional loop_40/BB5SPLITL4312)
c22    true
Appendix E
***************************************************************************
Date    : Tue Sep 16 17:26:31 2003
Version : 2003.06
Design  : csdipttl
***************************************************************************
*************************************
* Summary report for process run: *
*************************************
Timing Summary
  clock period 25.00
  Loop timing information:
    run                  2 cycles (cycles 0 - 2)
    main_loop_DecIPTTL   1 cycle  (cycles 1 - 2)
Area Summary
  Estimated combinational area   247.342819
  Estimated sequential area      565.640015
  TOTAL                          812.982834
  3 control states
  3 basic transitions
  2 control inputs
  2 control outputs
Resource types
  Register Types
  Operator Types
    (8_8->8)-bit DW01_add           1
    (16_16->17)-bit DW01_add        2
  I/O Ports
    1-bit input port                1
    8-bit registered output port    1
    16-bit registered output port   1
    32-bit input port               2
Appendix F
***************************************************************************
Date    : Tue Sep 16 17:43:05 2003
Version : 2003.06
Design  : csipfragp
***************************************************************************
*************************************
* Summary report for process run: *
*************************************
Timing Summary
  clock period 25.00
  Loop timing information:
    run        3 cycles (cycles 0 - 3)
    loop_47    2 cycles (cycles 1 - 3)
Area Summary
  Estimated combinational area   2427.434082
  Estimated sequential area      1614.905029
  TOTAL                          4042.339111
  4 control states
  4 basic transitions
  5 control inputs
  7 control outputs
Resource types
  Register Types
    1-bit register                  3
    14-bit register                 1
    32-bit register                 1
  Operator Types
    (32_32->1)-bit DW01_cmp2        1
    (32_32->32)-bit DW01_add        1
    (32_32->32)-bit DW01_sub        1
  I/O Ports
    1-bit input port                1
    8-bit registered output port    1
    16-bit registered output port   1
    32-bit input port               6
Appendix G
***************************************************************************
Date    : Tue Sep 16 18:35:21 2003
Version : 2003.06
Design  : csiptable
***************************************************************************
****************************************
* Summary report for process lookup: *
****************************************
Timing Summary
  clock period 25.00
  Loop timing information:
    lookup       5 cycles (cycles 0 - 5)
    reset_loop   4 cycles (cycles 1 - 5)
    loop_29      1 cycle  (cycles 2 - 3)
      exit at line 29 (cycle 2)
    main_loop    1 cycle  (cycles 4 - 5)
Area Summary
  Estimated combinational area   6396.606445
  Estimated sequential area      55882.113281
  TOTAL                          62278.719727
  6 control states
  8 basic transitions
  7 control inputs
  20 control outputs
Resource types
  Register Types
  Operator Types
    (32->32)-bit DW01_inc           1
  I/O Ports
    1-bit input port                1
    1-bit registered output port    1
    32-bit input port               6
    32-bit registered output port   83
Appendix H
***************************************************************************
Date    : Tue Sep 16 18:11:14 2003
Version : 2003.06
Design  : csstrip
***************************************************************************
*************************************
* Summary report for process run: *
*************************************
Timing Summary
  clock period 25.00
  Loop timing information:
    run           2 cycles (cycles 0 - 2)
    processloop   1 cycle  (cycles 1 - 2)
Area Summary
  Estimated combinational area   188.465607
  Estimated sequential area      733.520020
  TOTAL                          921.985626
  3 control states
  3 basic transitions
  2 control inputs
  2 control outputs
Resource types
  Register Types
  Operator Types
    (32_32->32)-bit DW01_add        1
  I/O Ports
    1-bit input port                1
    32-bit input port               2
    32-bit registered output port   1
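Appendix I
/*****************************************************************************
  CSFragp.h --
  Author: Dan Li
 *****************************************************************************/
/*****************************************************************************
  MODIFICATION LOG - modifiers, enter your name, affiliation, date and
  changes you are making here.
  Name, Affiliation, Date:
  Description of Modification:
 *****************************************************************************/
#define NB_BITS_HEADLEN 8
#define NB_BITS_OUT 16
#define NB_BITS_DATA 32
#define IP_OFFMASK 0x1fff   /* mask for the fragment offset bits */
#define IP_RF 0x8000        /* reserved fragment flag */
#define IP_DF 0x4000        /* don't fragment flag */
#define IP_MF 0x2000        /* more fragments flag */

SC_MODULE(csipfragp) {
  sc_in<bool> reset;
  sc_in<bool> start;
  sc_in<int> h1len;
  sc_in<int> off;
  sc_in<int> hlen;
  sc_in<int> ipoff;
  sc_in<int> plen;
  sc_in<int> pldatalen;
  sc_out<sc_uint<NB_BITS_HEADLEN> > out_iphl;
  sc_out<sc_uint<NB_BITS_OUT> > out_ip_off;
  sc_in_clk clk;

  SC_CTOR(csipfragp) {
    SC_CTHREAD(run, clk.pos());
    watching(reset.delayed() == true);
  }
  // csipfragp() { cs("IPfragP"); };
  void run();
};

/*****************************************************************************
  CS: IPFragmenter
 *****************************************************************************/
/*****************************************************************************
  csipfragp.cpp --
  Author: Dan Li
 *****************************************************************************/
#include <systemc.h>
#include "csfragp.h"
#include <sys/param.h>

void csipfragp::run() {
  int inh1len, inoff, inoffip;
  int inhlen, inipoff;
  int inplen, inpldatalen;

  // Reset loop: clear the local values, then wait for start.
  while (1) {
    inh1len = 0;
    inoff = 0;
    inhlen = 0;
    inipoff = 0;
    inplen = 0;
    inpldatalen = 0;
    wait_until(start.delayed() == true);
    wait();

    // Main loop: compute the header length and offset fields of one fragment.
    while (1) {
      if (start == 1) {
        inh1len = h1len.read();
        inoff = off.read();
        inhlen = hlen.read();
        inipoff = ipoff.read();
        inplen = plen.read();
        inpldatalen = pldatalen.read();
        inh1len = inh1len >> 2;   // header length in 32-bit words
        inoffip = ((inoff - inhlen) >> 3) + (inipoff & IP_OFFMASK);
        if (inipoff & IP_MF)
          inoffip |= IP_MF;       // original packet already had more fragments
        if (inoff + inpldatalen < inplen)
          inoffip |= IP_MF;       // this is not the last fragment
        out_iphl.write(inh1len);
        out_ip_off.write(inoffip);
      }
      else inoff = 0;
      wait();
    }
  }
}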
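The field arithmetic in the listing above can be tried out without SystemC. The sketch below repeats the same computation as a plain function (C++14 or later) so that the results for a concrete packet can be inspected. The names `FragFields` and `fragment_fields` are illustrative, and the interpretation of the inputs (byte offsets and lengths, with `off` counted from the start of the original packet including its header) is an assumption based on the listing, not something the thesis states.

```cpp
constexpr int IP_OFFMASK = 0x1fff;  // fragment offset bits of ip_off
constexpr int IP_MF = 0x2000;       // "more fragments" flag of ip_off

// Hypothetical result type: the two header fields written by the module.
struct FragFields {
    int ip_hl;   // header length in 32-bit words
    int ip_off;  // flags plus fragment offset in 8-byte units
};

// Same arithmetic as the main loop of the listing, without ports or clocks.
constexpr FragFields fragment_fields(int h1len, int off, int hlen,
                                     int ipoff, int plen, int pldatalen) {
    int ip_off = ((off - hlen) >> 3) + (ipoff & IP_OFFMASK);
    if (ipoff & IP_MF) ip_off |= IP_MF;           // original already fragmented
    if (off + pldatalen < plen) ip_off |= IP_MF;  // more data follows this one
    return FragFields{h1len >> 2, ip_off};
}
```

As a usage example under these assumptions, take a 1500-byte packet with a 20-byte header that is cut into 552-byte payload pieces. The fragment starting at byte 572 gets offset 69 (552 / 8) with the more-fragments flag set, while the final fragment starting at byte 1124 gets offset 138 with the flag clear.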
Appendix J
/*****************************************************************************
  CSDepttl.h --
  Author: Dan Li
 *****************************************************************************/
/*****************************************************************************
  MODIFICATION LOG - modifiers, enter your name, affiliation, date and
  changes you are making here.
  Name, Affiliation, Date:
  Description of Modification:
 *****************************************************************************/
//#define NB_BITS_OUT 16
#define NB_BITS_DATA 32
#define NB_BITS_IPTTL 8
#define NB_BITS_IPSUM 16

SC_MODULE(csdipttl) {
  sc_in<bool> reset;
  sc_in<bool> start;
  sc_in<unsigned> ip_ttl;
  sc_in<unsigned> ipsum;
  sc_out<sc_uint<NB_BITS_IPTTL> > outttl;
  sc_out<sc_uint<NB_BITS_IPSUM> > outsum;
  sc_in_clk clk;

  SC_CTOR(csdipttl) {
    SC_CTHREAD(run, clk.pos());
    watching(reset.delayed() == true);
  }
  // csdipttl() { cs("decipttl"); };
  void run();
};

/******************************************************************************
  CS: DecIPTTL
 *****************************************************************************/
/*****************************************************************************
  cs.cpp --
  Author: Dan Li
 *****************************************************************************/
#include <systemc.h>
#include "csdipttl.h"

void csdipttl::run() {
  sc_uint<NB_BITS_IPTTL> inttl;
  sc_uint<NB_BITS_DATA> insum32;

  reset_loop_DecIPTTL: while (1) {
    inttl = 0;
    insum32 = 0;
    wait_until(start.delayed() == true);
    wait();

    main_loop_DecIPTTL: while (1) {
      if (start == 1) {
        inttl = ip_ttl.read();
        insum32 = ~ipsum.read();
        inttl--;                                                  // decrement the TTL
        insum32 = insum32.range(15,0) + 0xFEFF;                   // adjust the checksum for TTL - 1
        insum32 = insum32.range(15,0) + insum32.range(31,16);     // fold the carry back in
        outttl.write(inttl);
        outsum.write(~insum32.range(15,0));
      }
      else insum32 = 0;
      wait();
    }
  }
}
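The constant 0xFEFF in the listing above is the one's complement of 0x0100: the TTL occupies the high byte of its 16-bit header word, so decrementing the TTL by one lowers that word by 0x0100, and the header checksum can be adjusted without summing the whole header again. The sketch below (C++14 or later) checks this on a sample header by comparing the incremental update with a full recomputation. The names `ip_checksum`, `checksum_after_ttl_decrement`, `kBefore`, and `kAfter` are illustrative, and the sample header values are an arbitrary example, not data from the thesis.

```cpp
#include <cstddef>
#include <cstdint>

// Full one's-complement checksum over 16-bit header words. The checksum
// word itself must be zero in the input.
constexpr std::uint16_t ip_checksum(const std::uint16_t* words, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += words[i];
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);  // fold carries
    return static_cast<std::uint16_t>(~sum & 0xFFFF);
}

// Incremental update after the TTL is decremented by one, following the
// same three steps as the listing: complement, add 0xFEFF, fold, complement.
constexpr std::uint16_t checksum_after_ttl_decrement(std::uint16_t old_sum) {
    std::uint32_t sum = (~static_cast<std::uint32_t>(old_sum) & 0xFFFF) + 0xFEFF;
    sum = (sum & 0xFFFF) + (sum >> 16);
    return static_cast<std::uint16_t>(~sum & 0xFFFF);
}

// Sample 20-byte header as ten 16-bit words, checksum word (index 5) zeroed.
// Word 4 holds TTL (high byte) and protocol (low byte): TTL 0x40, then 0x3F.
constexpr std::uint16_t kBefore[10] = {0x4500, 0x0073, 0x0000, 0x4000, 0x4011,
                                       0x0000, 0xC0A8, 0x0001, 0xC0A8, 0x00C7};
constexpr std::uint16_t kAfter[10] = {0x4500, 0x0073, 0x0000, 0x4000, 0x3F11,
                                      0x0000, 0xC0A8, 0x0001, 0xC0A8, 0x00C7};
```

For this sample header, the full checksum before the decrement is 0xB861, and both the full recomputation over `kAfter` and the incremental update of 0xB861 give 0xB961.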
