Bus scheduling implementation on the cell processor by Chivukula, Deepti K.
© 2010 Deepti Kumar Chivukula
BUS SCHEDULING IMPLEMENTATION ON THE CELL PROCESSOR
BY
DEEPTI KUMAR CHIVUKULA
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2010
Urbana, Illinois
Adviser:
Associate Professor Marco Caccamo
ABSTRACT
Real-time computing is the study of hardware and software systems that are
subject to a “real-time constraint”, i.e., strict deadline guarantees. Because the
cache and front side bus of modern computer architectures are not under software
control, it is difficult to estimate a tight bound on the worst case execution time
(WCET). The Cell Broadband Engine Architecture (CBEA), a new-generation computer
architecture, has a software-controlled front side bus (the Element Interconnect
Bus) that helps mitigate the problem of unpredictable task execution time. The
CBEA is a heterogeneous chip system containing one Power Processing Element
(PPE) and eight Synergistic Processing Elements (SPEs), each SPE having its own
independent local storage memory.
In this thesis, using CBEA as a platform, I implemented an interrupt-based
scheduling framework that uses the Element Interconnect Bus (EIB) in a temporally
predictable manner. The framework is built by abstracting away low-level
architectural features. Experiments were also performed to show that the real-time
transactions of feasible transaction sets complete before their deadlines when
scheduled according to a real-time scheduling algorithm, while the same transactions
can miss their deadlines when scheduled according to an arbitrary (non-real-time)
scheduling policy.
To my husband Sujan, for his love, motivation and support
ACKNOWLEDGMENTS
First and foremost, I would like to thank my adviser, Dr. Marco Caccamo, who
has been more than just an academic and research adviser over the last two years.
He has been a mentor and a valuable guide, giving me complete freedom to choose
a problem that I am really excited about, and thus made my research experience
fantastic.
For their advice, inspiration, insights and discussions, I would like to thank
other professors and fellow graduate students in the Beckman Institute and Siebel
Center, including Dr. Narendra Ahuja, Dr. Lui R. Sha, Rodolfo Pellizzoni and
Bach D. Bui.
I would also like to thank Shakti Kapoor, Senior Technical Staff Member at
IBM Austin, for his timely support in connecting me to the right people
to talk to.
Thanks are also due to my close friends: Manoj, Ramya, Sreekanth, Siva
Kumar Sastry Hari, and many more for making my life at UIUC so enjoyable.
Finally, I would like to thank my parents for their love, encouragement and
support.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Thesis Organization
CHAPTER 2 CELL BROADBAND ENGINE ARCHITECTURE
2.1 The Three Walls
2.2 Architectural Overview of CBEA
2.3 Chapter Conclusion
CHAPTER 3 PROGRAMMING ENVIRONMENT ON CBEA
3.1 Partitioning of the Applications
3.2 Threads on PPE and SPE
3.3 Communication between PPE and SPE
3.4 Chapter Conclusion
CHAPTER 4 BACKGROUND WORK: BUS SCHEDULING ALGORITHMS
4.1 Terms and Terminology
4.2 Real-Time Bus Transaction and Scheduling Model
4.3 Scheduling Algorithms for Ring Buses
4.4 Chapter Conclusion
CHAPTER 5 IMPLEMENTATION
5.1 Scheduling Framework: PPE Side
5.2 Scheduling Framework: SPE Side
5.3 Development Structure
5.4 Flow of Implementation Code
5.5 Chapter Conclusion
CHAPTER 6 EXPERIMENTAL RESULTS, CONCLUSION AND FUTURE WORK
6.1 Experimental Setup
6.2 Experimental Results
6.3 Conclusion
6.4 Future Work
APPENDIX A IMPLEMENTATION CODE AND BIT ORDERING IN CELL PROCESSOR
A.1 PPU: simple.c
A.2 PPU: cFile.c
A.3 PPU: Makefile
A.4 SPU: simple spu.c
A.5 SPU: Makefile
A.6 SPU: barrier heavy.h
A.7 SPU: spu slih reg.c
A.8 SPU: spu slih reg.h
A.9 Bit Ordering and Numbering
REFERENCES
LIST OF TABLES
3.1 Values of Behavior
5.1 Schedule Table
6.1 DMA Transmission Time
LIST OF FIGURES
2.1 The Cell Broadband Engine
2.2 SPE Architectural Block Diagram
2.3 Memory Flow Controller
2.4 CBEA: Element Interconnect Bus
2.5 Sending Phase
2.6 Command Phase
2.7 Data Phase
2.8 Receiving Phase
3.1 Application Partitioning Models
3.2 PPE-Centric Models: Multistage Pipeline Model and Parallel Stages Model
3.3 PPE-Centric Service Model
3.4 The Cell Storage Domains
4.1 Types of Transactions
4.2 Overlap Contention
4.3 Overload Contention
4.4 Non-circular Transaction Set
4.5 Circular Transaction Set
4.6 Indexed Straight Line Representation
4.7 An Example of the POBase Algorithm
4.8 Scheduling Intervals on the Execution Timeline
4.9 Constructed Graph G
5.1 Different Versions of Implemented Code
5.2 Flow of Implementation Code
5.3 Snapshot of User Interface
6.1 Experimental Transaction Set
A.1 Big Endian Ordering
CHAPTER 1
INTRODUCTION
A real-time system can be a hardware or a software system, whose application is
considered to be mission critical. The total correctness of an operation, for such
a system, depends not only on its logical correctness but also upon the time in
which it is performed. In fact, the classical concept states that in a “hard” real-
time system, the completion of an operation after its deadline is considered to be
useless. Thus, real-time systems are used when there is a need for the deadlines
of all the tasks to be guaranteed: to analyze the temporal behavior of the system,
the worst case execution times (WCET) must be reliably estimated. Examples of
such hard real-time systems include car engine systems, avionics systems, medical
systems (like heart pacemakers) and industrial process controllers.
A significant source of randomness in estimating WCET lies in the uncon-
trollability of interconnection architectures, specifically both shared cache and
front side bus (FSB). These architectures are used by CPU and direct memory
access (DMA)-enabled peripherals to communicate amongst each other and with
the main memory.
This problem, of randomness in estimating WCET, is potentially more severe
in multiprocessor systems with a multitasking operating system (OS), as they have
more entities which are concurrently competing for bus access. In such a system,
execution time of a task can be unexpectedly extended by the execution of other
tasks or DMA-enabled peripherals. The authors of [1] observed that extensions in
execution time may be as high as 44%.
There have been significant research efforts on interconnection architectures
for multi-core processors which have features suited for real-time systems, espe-
cially in the field of networks-on-chip (NOC). These works have been surveyed
earlier in [2]. Commercial multi-core processors with software-controlled inter-
connections have also been developed, such as the IBM Cell Broadband Engine
Architecture (CBEA) which is well distinguished for its high performance.
The main research issue is how to provide software designers with: (1) a
practical and accurate abstraction of the real scheduling problem on multiproces-
sor bus and (2) an effective scheduling methodology that optimizes multiproces-
sor bus utilization. The present research focuses on addressing this problem on a
specific multiprocessor bus architecture, specifically CBEA.
CBEA [3] is a new architecture that extends the 64-bit PowerPC Architecture.
It is the result of collaboration between Sony, Toshiba and IBM, popularly known
as STI, which was formally started in early 2001. It consists of nine process-
ing elements, including an IBM 64-bit Power Architecture core called the Power
Processing Element (PPE), augmented with eight co-processors called Synergistic
Processor Elements (SPEs), each of which contains a private 256 kB local storage (LS). All
processing units and peripherals communicate with the main memory and other
units through the Element Interconnect Bus (EIB) [4], which acts like an FSB.
The EIB supports multiple transactions at the same time and its accesses are soft-
ware controllable. It provides more control for moving or accessing memory data
of the processing elements. Also, it provides more control over the cache usage,
thus making this architecture intrinsically more predictable. Due to its hardware
features, CBEA overcomes three important limitations of modern microprocessor
performance - power use, memory use and processor frequency.
Support provided by the unique software features of CBEA also enhances control
over the EIB and the cache. Some of these features include dedicated instruction
sets for each type of processor (PPE and SPEs), flexibility in developing
applications in C, C++ or assembly language, application programming interface
(API) support through the software development kit (SDK), and enhanced ability
for parallelization (PPE and SPE Linux threads). These features are discussed in
Chapter 3.
As mentioned earlier, our research group focused on developing an abstraction
of the scheduling problem as well as an effective scheduling methodology that
predictably schedules the multiprocessor bus. My key contribution to this effort,
which is the focus of this thesis, was to build a framework that allows the
team to implement and test the performance of a class of scheduling algorithms
for CBEA, while maintaining high throughput. The motivation for building the
framework is that the CBE’s low-level arbitration is complex, and analyzing how
requests compete for access to the bus is not easy. The framework hides the
low-level physical bus implementation from the user. Further, it
allows table-based scheduling, through which we can eliminate all contention for
access to the bus and make it completely predictable. A key consideration while
building the framework was to ensure that it is simple for developers to
understand and use.
I have also conducted experiments to show that the real-time transactions of
feasible transaction sets are executed before deadline when executed according
to the real-time scheduling algorithm, while the same transactions can miss the
deadlines if my scheduling framework is not used.
The implemented framework is interrupt-based: it receives a timer interrupt
at the beginning of each time slot (or quantum), and it makes a scheduling
decision for the current slot based on a table that is generated offline by a real-time
scheduling algorithm. The framework abstracts away the low-level physical
bus implementation, which includes the data and command arbiter. If a transaction
is scheduled in a slot, one DMA packet of that transaction, of a given size, is
transferred from source to destination through the EIB in that slot. The size of a
DMA packet is decided on the basis of the bus bandwidth and the slot
size. The source and destination of DMA packets are SPE units on the chip. Each
SPE receives data specific to the transactions it is involved in. The interrupt
mechanism is implemented on the SPEs, and interrupts are fired periodically depending on
the timer value.
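The slot-by-slot dispatch described above can be sketched as a small simulation. This is a minimal illustration in Python, not the Cell SDK code of the thesis; the table layout, the names `Transaction`, `packet_size` and `run_schedule`, and all numeric values are assumptions made for the example. Each loop iteration stands in for one timer interrupt.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """A bus transaction between two SPEs (hypothetical layout)."""
    name: str
    src_spe: int
    dst_spe: int
    remaining_bytes: int


def packet_size(bandwidth_bytes_per_us, slot_us, align=16):
    """Largest packet that fits in one slot, rounded down to `align` bytes."""
    raw = int(bandwidth_bytes_per_us * slot_us)
    return (raw // align) * align


def run_schedule(table, transactions, packet_bytes):
    """Replay an offline-generated table; one table entry per time slot.

    `table[i]` lists the names of the transactions scheduled in slot i.
    Returns a log of (slot, transaction name, bytes sent) tuples.
    """
    log = []
    for slot, names in enumerate(table):   # one iteration = one timer interrupt
        if not names:                      # nothing scheduled in this slot
            log.append((slot, "idle", 0))
        for name in names:
            t = transactions[name]
            sent = min(packet_bytes, t.remaining_bytes)
            t.remaining_bytes -= sent      # one DMA packet per scheduled slot
            log.append((slot, name, sent))
    return log
```

Because the table is fixed before execution, the run-time decision reduces to a lookup, which is what makes the bus usage predictable in this model.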
Many hardware and software hurdles (Chapters 2 and 3) had to be overcome
to successfully run offline generated schedules on the implemented framework.
This merits a more detailed discussion covering the unique challenges that were
faced during the project.
In terms of hardware, design of EIB as well as its arbiter in CBEA is inherently
complex. Thus, during the implementation of the framework I faced several chal-
lenges including incorporating multiple degrees of parallelism on the EIB, elimi-
nating initial delay in the first phase of task execution, avoiding delays caused by
the cell arbiter as well as incorporating alignment restrictions imposed by DMA.
These are explained in more detail below.
The EIB comprises a 4-ring structure (2 clockwise and 2 anticlockwise direc-
tions) and has multiple degrees of parallelism: each bus ring can carry up to three
concurrent transactions. The bus ring can start a new task after every three cycles,
which causes an initial delay in the first phase of the task execution. Details about
the different phases of a task on the EIB are covered in Chapter 2. Tasks on the
bus can delay each other further if they share the same bus segment. Details of
the types of contentions on EIB and how they affect the task execution time are
described in Chapter 4.
The central arbiter is responsible for handling the individual data transactions and
scheduling them so that they move around the ring and eventually reach their
respective destination units. The data arbiter implements round-robin bus arbitration
with two levels of priority: the memory interface controller (the interface between
the main memory and the EIB) has the highest priority, and the rest of the units on
the chip have a lower priority. Data flows on the EIB in a stepwise manner
around the ring. The data arbiter does not grant a ring to a data transaction
whose path would extend more than halfway around the ring. That is, a data
ring is granted to a requester only so long as the circuit path requested is not more
than six hops in either direction. An important design consideration was that if the
request is greater than six hops, the requester must wait until a data ring operating
in the opposite direction becomes free.
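The two-level priority rule can be illustrated with a short sketch. It is only a model of the policy as described above, not the hardware arbiter; the function name `pick_next`, the unit names, and the way the round-robin position is tracked are assumptions made for the example.

```python
MIC = "MIC"  # memory interface controller: the high-priority requester


def pick_next(requests, order, last_granted):
    """Choose the next requester to be granted the bus.

    requests     -- set of unit names with a pending request
    order        -- list of the lower-priority units in round-robin order
    last_granted -- lower-priority unit granted most recently, or None
    """
    if MIC in requests:                    # priority level 1: MIC always wins
        return MIC
    if not order:
        return None
    # Priority level 2: round robin, starting just after the last grant.
    start = order.index(last_granted) + 1 if last_granted in order else 0
    for i in range(len(order)):
        unit = order[(start + i) % len(order)]
        if unit in requests:
            return unit
    return None                            # no pending requests
```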
Another important hardware challenge was the alignment restrictions imposed
by DMA. Non-aligned DMAs (not on a 16-byte boundary) are not supported by
CBEA. If a non-aligned DMA operation is encountered, MFC command queue
processing is suspended and a DMA alignment interrupt is generated. Resolving
these exceptions takes up a lot of processor time, which again delays
the execution of tasks on the bus. DMA accesses in the main storage domain
are atomic (128-bit) if they meet the requirements of the PowerPC architecture.
All other DMA transfers, if greater than a quadword (16 bytes) or non-aligned, are
performed as a set of smaller (1, 2, 4 and 8 bytes), disjoint atomic accesses. The
number and alignment of these accesses are implementation dependent. Further
details of the alignment issues are given in Chapter 3.
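A simple pre-check of the 16-byte rule can be sketched as follows. This is an illustrative Python helper, not an SDK routine; the names `check_dma` and `pad_to_quadword` are hypothetical, and the sketch covers only the quadword boundary rule stated above, not every size rule of the MFC.

```python
QUADWORD = 16  # bytes; the DMA alignment boundary discussed above


def check_dma(local_addr, effective_addr, size):
    """Return a list of alignment problems; an empty list means none found."""
    problems = []
    if local_addr % QUADWORD:
        problems.append("local storage address is not 16-byte aligned")
    if effective_addr % QUADWORD:
        problems.append("effective address is not 16-byte aligned")
    if size <= 0 or size % QUADWORD:
        problems.append("size is not a positive multiple of 16 bytes")
    return problems


def pad_to_quadword(size):
    """Round a buffer size up to the next multiple of 16 bytes."""
    return (size + QUADWORD - 1) & ~(QUADWORD - 1)
```

Checking addresses and sizes before a command is issued avoids the alignment interrupt and the delay it adds to bus tasks.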
There were some hurdles during software implementation as well. My framework
was implemented using a PPE-centric model called the parallel stages
model, as discussed in Chapter 3. The parallel stages model processes different
portions of the data in parallel by dividing the data among different processing elements
(SPEs): the PPE creates software threads (the types of software threads that run on
the PPE and SPEs are explained in Chapter 3) and places them on SPEs, which process
the data in parallel. Thus, the main issue was to locate exploitable concurrency in
a task in order to divide the data for parallel computation. Other challenges
faced while trying to parallelize the computation were data dependencies and the
overhead of synchronizing concurrent memory accesses or transferring data
between different processor elements (SPEs).
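As an illustration of the data division step, the sketch below splits a buffer into contiguous chunks, one per SPE, with every chunk boundary on a 16-byte multiple. It is a hedged example in Python; the function name `partition` and the even-split policy are assumptions, not the partitioning used in the thesis code.

```python
def partition(total_bytes, num_spes=8, align=16):
    """Split [0, total_bytes) into one (offset, length) chunk per SPE.

    Chunk offsets are multiples of `align`; only the final non-empty chunk
    may have a length that is not a multiple of `align`.
    """
    blocks = -(-total_bytes // align)       # ceiling division: number of blocks
    base, extra = divmod(blocks, num_spes)  # spread blocks as evenly as possible
    chunks = []
    offset = 0
    for i in range(num_spes):
        nblocks = base + (1 if i < extra else 0)
        length = min(nblocks * align, total_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks
```

For example, `partition(1024)` gives each of the eight SPEs a 128-byte chunk.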
All the above mentioned problems were resolved and I was also able to lever-
age the low level features of EIB and central arbiter to successfully implement the
described framework. More details on the implementation are provided in later
chapters.
1.1 Thesis Organization
This thesis is divided into two main parts. Before describing the framework im-
plementation, it is critical to have a good understanding of CBEA. Thus, in Part
1 of the thesis, I describe the architecture and various software features in detail.
Chapter 2 provides an overview of the architecture of CBE. Chapter 3 covers the
programming process on PPE and SPEs of the CBE.
Part 2 of the thesis deals with the actual implementation of the framework and the
experimental results. Chapter 4 begins with a description of the bus model and an
explanation of the terminology that will be used in Chapter 5. This chapter also
provides a brief overview of a novel dynamic scheduling algorithm, which is important
background material needed to understand the scope of the framework and
my experimentation work. Chapter 5 explains the actual implementation of the
proposed framework. Chapter 6 describes the results of the experiments that were
conducted, and presents concluding remarks and details of future work.
CHAPTER 2
CELL BROADBAND ENGINE
ARCHITECTURE
The first-generation Cell Broadband Engine (CellBE) is the first incarnation of a new
family of microprocessors, called Cell Broadband Engine Architecture (CBEA),
that extends the 64-bit PowerPC Architecture. CBEA is the result of collaboration
between Sony, Toshiba, and IBM, known as STI, formally started in early 2001.
CBEA, as shown in Figure 2.1, consists of 12 core components connected
through the Element Interconnect Bus (EIB): one Power Processing Element
(PPE), eight Synergistic Processing Elements (SPEs), one Memory Interface
Control Unit (MIC), and two I/O Interface Control Units (IOIF0 and IOIF1).
PPE is more adept at control-intensive tasks and is quicker at task-switching.
SPEs are more adept at compute-intensive tasks and are slower at task switching.
While either processor is capable of doing both types of tasks, this specialization
has increased the efficiency in implementation of both PPE and, especially, SPEs.
It is a significant factor in improving CBE’s peak computational performance, area
and power efficiency, by an order of magnitude (approximately) over conventional
PC processors.
Figure 2.1: The Cell Broadband Engine
This chapter is organized in three sections. Section 2.1 discusses the limitations
of modern microprocessors and how CBEA overcomes them. Section
2.2 describes the hardware features of CBEA, including the Power Processor
Element (PPE), Synergistic Processor Elements (SPEs), Memory Interface
Controller (MIC), Element Interconnect Bus (EIB) and Cell Broadband Engine
Interface (BEI). Section 2.3 concludes the chapter.
2.1 The Three Walls
As mentioned earlier, CBE overcomes three important limitations of modern mi-
croprocessor performance: power use, memory use and processor frequency. Fol-
lowing is a detailed description of how CBEA overcomes these hurdles.
2.1.1 Power Wall
Microprocessor performance is often limited by achievable power dissipation,
rather than by number of available integrated circuit resources (transistors and
wires). Thus, power efficiency has to be improved to significantly increase micro-
processor performance. One way to increase power efficiency is to differentiate
between processors optimized to run an operating system and control-intensive
code, and processors optimized to run compute-intensive applications.
CBE achieves this by providing a general purpose PPE to run the operating
system and other control-plane code, and eight SPEs specialized for computing
data-rich (data-plane) applications.
2.1.2 Memory Wall
Latency to DRAM memory for multi-gigahertz symmetric multiprocessors, even
those with integrated memory controllers, is currently approaching 1,000 cycles.
As a result, program performance is dominated by the activity of moving data
between main storage (effective-address space that includes main memory) and
the processor. Compilers and even application writers must increasingly manage
this movement of data explicitly, even though hardware cache mechanisms are
supposed to relieve them of this task.
CBE’s SPEs use two mechanisms to deal with long main-memory latencies:
(a) 3-level memory structure (main storage, local storages in each SPE, and large
register files in each SPE) and (b) asynchronous DMA transfers between main and
local storage. These features allow programmers to schedule simultaneous data
and code transfers to cover long latencies effectively. As a result, CBE can usually
support 128 simultaneous transfers between the eight SPE local storages and main
storage. This surpasses the number of simultaneous transfers on conventional
processors by a factor of almost 20.
2.1.3 Frequency Wall
Conventional processors require increasingly deeper instruction pipelines to
achieve higher operating frequencies. This technique has reached a point of di-
minishing returns - and even negative returns if power is taken into account.
CBEA, on which the CBE is based, allows both PPE and SPEs to be designed
for high frequency without excessive overhead. PPE achieves efficiency primar-
ily by executing two threads simultaneously, rather than by optimizing single-
thread performance. Each SPE achieves efficiency by using a large register file,
which supports many simultaneous in-flight instructions without the overhead of
register-renaming or out-of-order processing. Each SPE also achieves efficiency
by using asynchronous DMA transfers, which support many concurrent memory
operations without overhead of speculation.
2.1.4 Cell’s Solution
CBE takes care of problems posed by power, memory and frequency limitations,
by individually optimizing control-plane and data-plane processors. As a result
a processor, with the power budget of a conventional PC processor, can theo-
retically be expected to provide an approximately ten-fold improvement in peak
performance compared to a conventional processor.
Of course, actual application performance varies. Some applications may ben-
efit little from SPEs, while others show a performance increase well in excess of
ten-fold. In general, compute-intensive applications that use 32-bit or smaller data
formats (such as single-precision floating-point and integer) are excellent candi-
dates for the CBE.
2.2 Architectural Overview of CBEA
In the following sections, I describe the architectural features and the hardware
environment of the CBEA.
2.2.1 PowerPC Processor Element (PPE)
PPE is the main processor that contains a 64-bit PowerPC Architecture reduced
instruction set computer (RISC) core with a traditional virtual memory subsystem.
It consists of two main units, the Power Processor Unit (PPU) and the Power
Processor Storage Subsystem (PPSS).
PPE performs multiple functions, like running the operating system, manag-
ing system resources, as well as control processing, including the allocation and
management of SPE threads. It can run legacy PowerPC Architecture software
and performs well executing system-control code. It supports both the PowerPC
instruction set and the Vector/SIMD Multimedia Extension instruction set.
2.2.2 Synergistic Processor Elements (SPEs)
CBE includes eight SPEs that are SIMD processors optimized for data-rich operations
allocated to them by the PPE. Each of these identical elements
contains a RISC core, a 256 kB software-controlled local storage (LS) for instructions
and data, and a large (128-bit, 128-entry) unified register file. Each SPE consists of
two main units (shown in Figure 2.2), the Synergistic Processor Unit (SPU) and
the Memory Flow Controller (MFC). Also, each SPE has full access to coherent
shared memory, including the memory-mapped I/O space.
SPEs support a special SIMD instruction set, and they rely on asynchronous
DMA transfers to move data and instructions between main storage (the effective-
address space that includes main memory) and their local storages (LSs). SPE
DMA transfers access main storage using PowerPC effective addresses. As on
the PPE, address translation is governed by PowerPC Architecture segment and
page tables. An SPE is a 128-bit processing unit that has a Memory Flow Controller
(MFC), which controls the memory management unit and the DMA engine.
The method considered in this work for moving data between LSs, and between
system memory and LSs, is DMA. A software designer operates an SPE's DMA engine
by issuing commands to its MFC. Each command must specify the starting and the
ending addresses of the transaction and its size. Each SPE contains a bus interface
unit (BIU) that provides an interface from the SPE to EIB.
SPE is optimized for running compute-intensive applications, and not for running
an operating system. The name synergistic for this processor was chosen
carefully because there is a mutual dependence between PPE and the SPEs. The
latter depend on the former to run the operating system and, in many cases, the
top-level control thread of an application. On the other hand, PPE depends on
SPEs to provide the bulk of the application performance.
Figure 2.2: SPE Architectural Block Diagram
2.2.3 Memory Flow Controller
As shown in Figure 2.3, each SPU has its own Memory Flow Controller (MFC)
that serves as the SPU’s interface to main storage, other processor elements and
system devices. MFC’s primary role is to interface its LS storage domain with
the main-storage domain. It does this by means of a DMA controller that moves
instructions and data between its LS and main storage. MFC also supports other
functions, including storage protection on the main storage side of its DMA
transfers, synchronization between main storage and the LS, as well as
communication functions (such as mailbox and signal-notification messaging) with PPE,
other SPEs and devices.
In this thesis I focus only on DMA transfers and mailboxes, since I used
these two mechanisms in implementing my framework.
Figure 2.3: Memory Flow Controller
2.2.4 Memory Interface Controller (MIC)
MIC provides the interface between Element Interconnect Bus (EIB) and main
storage. It supports two Rambus Extreme Data Rate (XDR) I/O (XIO) memory
channels and memory accesses on each channel of 1 - 8, 16, 32, 64, or 128 bytes.
2.2.5 Element Interconnect Bus (EIB)
PPE and SPEs communicate coherently with each other, with main storage as well
as with I/O through the EIB (see Figure 2.4). The bus includes a 16-byte wide 4-
ring structure (two clockwise and two counterclockwise) for data transfer, a data
and command arbiter, and a tree structure for commands.
Each participant on the EIB has one 16-byte read port and one 16-byte write
port. The bus has multiple degrees of parallelism: each bus ring can carry up to
three concurrent transactions. Data flows on an EIB channel stepwise around the
ring. Since there are 12 participants or units, the total number of steps around the
channel back to the point of origin is 12. Six steps is the longest distance between
any pair of participants. An EIB channel is not permitted to convey data requiring
more than six steps; such data must take the shorter route around the circle in the
other direction. The number of steps involved in sending the packet has very little
impact on transfer latency: clock speed driving the steps is very fast relative to
other considerations. However, longer communication distances are detrimental
to the overall performance of the EIB as they reduce available concurrency. The
EIB’s internal bandwidth is 96 bytes per cycle, and it can support more than 100
outstanding DMA memory requests between main storage and the SPEs.
Figure 2.4: CBEA: Element Interconnect Bus
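The six-step routing rule can be checked with a short calculation. The sketch below is illustrative only: it numbers the 12 ring positions 0 to 11 in clockwise order, which is an assumption and does not reflect the physical placement of units on the chip, and the function name `route` is hypothetical.

```python
RING_SIZE = 12  # participants on the EIB
MAX_STEPS = 6   # longest path an EIB channel is permitted to carry


def route(src, dst, ring_size=RING_SIZE):
    """Return (direction, steps) for the shorter way around the ring."""
    clockwise = (dst - src) % ring_size
    counterclockwise = (src - dst) % ring_size
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)
```

With 12 positions, the shorter direction never needs more than six steps, so a transfer that would need more than six steps one way always has a permitted route the other way.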
The transfer of a transaction on the bus takes place in four main phases: sending
phase, command phase, data phase and receiving phase. An arbitration phase is added
when transactions contend for the bus. Each of these phases is detailed in the
sections below.
2.2.5.1 Sending Phase
Sending Phase, shown in Figure 2.5, is responsible for initiating a transaction. It
consists of all processor and DMA controller activities needed before transactions
are injected into any components.
At the end of this phase, a command is issued to the command bus to begin
the next phase.
Figure 2.5: Sending Phase
2.2.5.2 Command Phase
Command Phase shown in Figure 2.6 coordinates end-to-end transfers across the
EIB. This phase is also responsible for coherency checking, synchronization, and
inter-element communication. EIB informs the read or write target element of the
transaction in progress to allow the target to set up the transaction (i.e., data fetch
or buffer reservation).
Figure 2.6: Command Phase
2.2.5.3 Arbitration Phase
A low-level round robin scheduler arbitrates bus accesses between contending
transactions stored in the post-command phase queue (arbitration phase together
with data phase is called post-command phase). This phase occurs when the BIU
has to wait for bus access due to the contention between atomic transactions on
the bus.
2.2.5.4 Data Phase
As shown in Figure 2.7, the Data Phase handles data ring arbitration and actual data
transfers across the ring. This phase grants packets access to one of
the four data rings when a ring becomes free and ensures that no more than six
hops are needed along the ring. End-to-end transport of packets happens over a
granted, pipelined, circuit-switched EIB ring.
Figure 2.7: Data Phase
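The six-hop bound can be illustrated with a small calculation. The sketch below is not part of the framework: it assumes a ring connecting 12 elements (this count and the helper name ring_hops are assumptions made for illustration) and always takes the shorter of the two directions around the ring.

```c
#include <assert.h>

/* Hypothetical helper: number of hops between two positions on a
 * bidirectional ring when the shorter direction is always chosen. */
unsigned ring_hops(unsigned src, unsigned dst, unsigned n_elements)
{
    assert(n_elements > 0 && src < n_elements && dst < n_elements);
    unsigned clockwise = (dst + n_elements - src) % n_elements;
    unsigned counter   = n_elements - clockwise;
    return clockwise < counter ? clockwise : counter;
}
```

Under the 12-element assumption, the farthest pair of elements is half the ring apart, so no transfer needs more than six hops.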
2.2.5.5 Receiving Phase
Receiving Phase, shown in Figure 2.8, concludes the transaction by transferring
received data from the receiving node’s BIU to its final destination at that receiver,
such as local storage memory or the system memory.
Figure 2.8: Receiving Phase
2.2.6 Cell Broadband Engine Interface (BEI)
BEI manages data transfers between the EIB and I/O devices. It provides address
translation, command processing, an internal interrupt controller, and bus interfac-
ing. It supports two Rambus FlexIO external I/O channels. One channel supports
only non-coherent I/O devices. The other channel can be configured to support
either non-coherent transfers or coherent transfers that extend the logical EIB to
another compatible external device, such as another CBE.
2.3 Chapter Conclusion
In this chapter, I gave a brief overview of the CBE architecture, as well as the
functions of each of its hardware components. As discussed earlier, the unique
architecture of the CBE allows it to overcome serious limitations of modern mi-
croprocessors and deliver approximately ten-fold improvement in performance. In
the next chapter I will describe unique and important architecture details, knowl-
edge of which is required for programming on CBEA.
CHAPTER 3
PROGRAMMING ENVIRONMENT ON
CBEA
Support provided by the unique software features of CBEA enhances our control
over the interconnection architecture. Thus, it is important to understand the pro-
gramming environment of CBEA to know the complexity involved in abstracting
from the low level bus architectural constraints. This chapter gives an overview of
programming on the nine processor elements of CBEA, i.e. PPE and eight SPEs.
As already discussed in Chapter 2, PPE on the CBEA is optimized to run the
operating system and any control-intensive code, while the SPEs are optimized
to run compute-intensive applications. Running an application on these proces-
sors requires an understanding of how applications partition tasks among different
processors as well as how these tasks are actually executed on them. For further
reference a detailed description of the programming process is given in [5] and
[6].
This chapter is organized as follows. Partitioning of applications is described
in Section 3.1. Section 3.2 deals with the creation of threads, which are used to
send processing requests to different processors, and Section 3.2.3 describes the
instruction sets that are used to process these tasks. Finally, Section 3.3 gives
an overview of how the PPE and SPEs interact.
3.1 Partitioning of the Applications
All applications running on the Cell Broadband Engine need to divide the work
among the nine processors. The following considerations have to be taken into
account while deciding on how to distribute the workload and data.
• Program structure: Any application will usually have both compute inten-
sive tasks as well as control intensive tasks. In CBEA, the two types of tasks
are assigned to different types of processors. Thus, the choice of a task par-
titioning model, i.e. PPE-centric or SPE-centric as shown in
Figure 3.1, will depend on whether the application is more compute-intensive
or more control intensive. The model choice ensures that the application is
run optimally on the processors.
• Program data access and data flow patterns: After deciding the type of par-
titioning, the developer will determine if data is going to be sent in a parallel
fashion or serially to SPEs, depending on the processor’s coded functional-
ity.
• Optimizing cost: Context switching can cause the processing time to in-
crease substantially as SPEs have to be stopped while local storages are
reloaded. Thus, the developer needs to be careful to write code to minimize
context switching.
Our implementation used a PPE-centric model, which is described below. De-
tails about the justification for using this model are given in Chapter 5.
Figure 3.1: Application Partitioning Models
3.1.1 PPE-Centric Model
In this model, the main application runs on the PPE, and individual tasks are
offloaded to the SPEs. PPE then waits for, and coordinates, the results returning
from the SPEs. This model fits an application with serial as well as with parallel
data computation.
The PPE-centric model can be classified into three sub-models depending on
how SPEs are used:
• Multistage pipeline model (shown on left of Figure 3.2): It is used when
a task requires sequential stages, where the SPEs can act as a multi-stage
pipeline. Each SPE is responsible for one stage of the process. This model
is not suitable for parallel processing owing to difficulty in load balancing.
Additionally, this model increases the costs due to greater data movement.
• Parallel stage model (shown on right of Figure 3.2): Suitable when tasks in-
volve a large amount of data that can be partitioned and executed in parallel.
Thus, each SPE is used to execute different portions of data simultaneously.
• Services model (shown in Figure 3.3): Here, PPE assigns different services
to different SPEs, and the PPE’s main process calls upon the appropriate
SPE when a particular service is needed. For example one SPE processes
data encryption, another SPE processes MPEG encoding, and a third SPE
processes curve analysis. Fixed static allocation of SPE services should be
avoided. These services should be virtualized and managed on a demand-
initiated basis.
We used the parallel stage model in our framework. More details for our
choice of sub-model are presented in Chapter 5.
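To make the parallel stage model concrete, the following is a minimal sketch of how a data set could be divided evenly among the eight SPEs. The type chunk_t and the helper spe_chunk are hypothetical names introduced only for this illustration; the partitioning actually used by the framework is described in Chapter 5.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SPES 8  /* the CBEA provides eight SPEs */

typedef struct {
    size_t offset;  /* index of the first item handled by this SPE */
    size_t length;  /* number of items handled by this SPE */
} chunk_t;

/* Hypothetical helper: contiguous portion of an n_items data set
 * assigned to SPE number spe. When n_items is not a multiple of
 * NUM_SPES, the first (n_items % NUM_SPES) SPEs get one extra item. */
chunk_t spe_chunk(size_t n_items, unsigned spe)
{
    assert(spe < NUM_SPES);
    size_t base  = n_items / NUM_SPES;
    size_t extra = n_items % NUM_SPES;
    chunk_t c;
    c.length = base + (spe < extra ? 1 : 0);
    c.offset = spe * base + (spe < extra ? spe : extra);
    return c;
}
```

With 20 items, for example, SPEs 0 to 3 each receive three items and SPEs 4 to 7 each receive two, so the chunk lengths differ by at most one item.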
Figure 3.2: PPE-Centric Models: Multistage Pipeline Model and Parallel Stages
Model
3.2 Threads on PPE and SPE
A software developer also needs to understand how applications are executed on
CBEA. This will give the developer more control and he/she will be able to effec-
tively abstract the low level design.
Figure 3.3: PPE-Centric Service Model
The architecture can support several types of operating systems. Our research
group used the Linux operating system (Fedora Core 7), running on a Sony
PlayStation 3 (PS3).
Every program is a process to the operating system. A process is a task that
competes for execution time on the microprocessor, and has resources. Linux
allows a program to execute multiple threads of execution. One program can
create independent threads, which share the resources of the parent process, but
execute independently of each other, and the parent process. A thread running in
the Linux OS environment is referred to as Linux thread.
In CBEA, the main thread of a program is a Linux thread running on the PPE.
Any Linux thread running on a PPE is called a PPE thread. The main thread can
spawn one or more CBE Linux tasks. A CBE Linux task has one or more Linux
threads associated with it that may execute on either a PPE or an SPE. All Linux
threads within the task share the task’s resources.
The operating system defines the mechanism and policy for scheduling an
available SPE. It must prioritize among all the Cell Broadband Engine Linux ap-
plications in the system, and it must schedule SPE execution independent from
regular Linux threads. It is also responsible for runtime loading, passing parame-
ters to SPE programs, notification of SPE events and errors, and debugger support.
A thread running on SPE is called an SPE thread. It has its own SPE context
which includes the 128x128-bit register file, program counter, and MFC command
queues. Further, it can communicate with other execution units or with effective
address memory through the MFC channel interface. All architectures create SPE
threads; however, SPE architecture of CBEA is new and so it needs to create SPE
threads in a specific way. It is thus important to understand how to load, run and
destroy threads on SPEs, which is described in the next section.
Additionally, programming on the CBE processor requires an understanding
of parallel programming as it is a multi-core system. Section 3.2.2 reviews differ-
ent styles of parallel programming on the CBE processor.
3.2.1 SPE Thread Creation
Programs to be run on an SPE are often written in C or C++ (or assembly language)
and can use the SPE data types and intrinsics defined in the SPU C/C++ Language
Extensions. A PPE module starts running an SPE module by first creating a thread
on the SPE. It manages SPE threads using the spe_context_create, spe_program_load,
spe_context_run, pthread_join and spe_context_destroy library calls, all of which
are provided in the SPE runtime management library except pthread_join.
3.2.1.1 Creating Threads
spe_context_ptr_t spe_context_create(unsigned int
flags, spe_gang_context_ptr_t gang)
• flags - A bit-wise OR of modifiers that are applied when the new context is
created. Of the values accepted for this parameter, we use SPE_MAP_PS, which
requests permission for memory-mapped access to the SPE thread’s problem
state area.
• gang - The collection of contexts (gang) of which the context being created
should be made a part.
Loading the SPEs
int spe_program_load(spe_context_ptr_t spe,
spe_program_handle_t *program)
• spe - The SPE context in which the specified program is to be loaded.
• program - Indicates the program to be loaded into the SPE context.
Running the SPEs
int spe_context_run(spe_context_ptr_t spe,
unsigned int *entry, unsigned int runflags,
void *argp, void *envp,
spe_stop_info_t *stopinfo)
• spe - The SPE context to be run.
• entry - Initial value of the instruction pointer at which the SPE program
should start executing. Upon return from the spe_context_run call, the value
pointed to by entry contains the next instruction to be executed upon
resumption of the program.
• runflags - A bit-wise OR of modifiers that request specific behavior when the
SPE context is run. A value of 0 selects the default behavior, in which no
modifiers are applied.
• argp - An optional pointer to application specific data. It is passed as the
second parameter of the SPU program.
• envp - An optional pointer to environment specific data. It is passed as the
third parameter of the SPU program.
• stopinfo - An optional pointer to a structure of type spe_stop_info_t that
provides information about why the SPE stopped running.
Destroying the SPE threads
int spe_context_destroy(spe_context_ptr_t spe)
Destroys the SPE context spe. The SPE threads that were initially created are
destroyed.
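The order in which these calls must be made can be summarized as a small state machine. The sketch below is a simplified model written for illustration only: it does not call the SPE runtime management library, and the names ctx_state_t, ctx_op_t and ctx_apply are hypothetical. It accepts the sequence create, load, run (possibly resumed), destroy, and rejects calls made out of order.

```c
#include <assert.h>

/* Hypothetical model of the life cycle of one SPE context. */
typedef enum { CTX_NONE, CTX_CREATED, CTX_LOADED, CTX_STOPPED, CTX_DESTROYED } ctx_state_t;
typedef enum { OP_CREATE, OP_LOAD, OP_RUN, OP_DESTROY } ctx_op_t;

/* Returns 1 and advances *state if op is legal in the current state;
 * returns 0 and leaves *state unchanged otherwise. */
int ctx_apply(ctx_state_t *state, ctx_op_t op)
{
    assert(state != 0);
    switch (op) {
    case OP_CREATE:   /* models spe_context_create */
        if (*state != CTX_NONE) return 0;
        *state = CTX_CREATED;
        return 1;
    case OP_LOAD:     /* models spe_program_load */
        if (*state != CTX_CREATED) return 0;
        *state = CTX_LOADED;
        return 1;
    case OP_RUN:      /* models spe_context_run, which returns when the SPE stops */
        if (*state != CTX_LOADED && *state != CTX_STOPPED) return 0;
        *state = CTX_STOPPED;
        return 1;
    case OP_DESTROY:  /* models spe_context_destroy */
        if (*state == CTX_NONE || *state == CTX_DESTROYED) return 0;
        *state = CTX_DESTROYED;
        return 1;
    }
    return 0;
}
```

Running from the stopped state models the resumption of a program at the instruction left in entry by the previous run.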
3.2.2 Parallel Programming
For efficient computation it is important to understand how tasks can be executed
in parallel, on the eight SPEs. The key to parallel programming is to locate ex-
ploitable concurrency in a task. The basic steps for parallelizing any program
are:
• Locate concurrency: Time spent analyzing the program, its algorithms and
data structures will be repaid many-fold in the implementation and coding
phase. The most important question that we need to ask ourselves is - Will
the anticipated speedup from parallelizing a program be greater than the
effort to parallelize a program, which includes any overhead for synchro-
nizing different tasks or access to shared data? Another question we can
ask ourselves is - Which parts of the program are the most computationally
intensive? It is worthwhile to do initial performance analysis on typical data
sets, to be sure the hot spots in the program are being targeted. When you
know which parts of the program can benefit from parallelization, you can
consider different patterns for breaking down the problem. Key elements to
examine are:
– Function calls
– Loops
– Large data structures that could be operated on in chunks
• Structure the algorithm(s) to exploit concurrency: Ideally, you can identify
ways to parallelize the computationally intensive parts:
– Break down the program into tasks that can execute in parallel.
– Identify data that is local to each subtask.
– Group subtasks so that they execute in the correct order.
– Analyze dependencies among tasks.
The major challenges faced during parallelization of the programs are:
• Data dependencies exist.
• Overhead in synchronizing concurrent memory accesses or transferring data
between different processor elements and memory might exceed any perfor-
mance improvement.
• Partitioning work is often not obvious and can result in unequal units of
work.
• What works in one parallel environment might not work in another, due to
differences in bandwidth, topology, hardware synchronization primitives,
and so forth.
Many levels of parallelism are already available with the CBE processor. These
features can be used to our advantage to mitigate some of the challenges posed by
software parallelization. From the lowest, fine-grained level (SIMD processing) up
to the highest, coarse-grained level (networked multiprocessing), the CBE
processor provides many opportunities for concurrent computation. The levels of
parallelization include:
• Dual-issue superscalar microarchitecture
• Multithreading
• Multiple execution units with heterogeneous architectures and differing ca-
pabilities
• Shared-memory multiprocessing
• SIMD processing
3.2.3 Instruction Sets
As mentioned earlier, in CBEA the PPE is focused on control-intensive tasks and
the SPEs are focused on computation-intensive tasks. The two types of processors
have different instruction sets due to their different functionality.
Based on our previous discussion in Chapter 2 the instruction set for the PPE
is an extended version of the PowerPC instruction set. The extensions consist of
the Vector/SIMD Multimedia Extension instruction set plus a few additions and
changes to PowerPC instructions. The instruction set for the SPEs is a new SIMD
instruction set, the Synergistic Processor Unit Instruction Set Architecture, with
accompanying C/C++ intrinsics. It also has a unique set of commands for man-
aging DMA transfer (gets and puts - discussed in later sections of this chapter),
external events, interprocessor messaging (mailboxes), and other functions.
Although the PPE and the SPEs execute SIMD instructions, their instruction
sets are different, and programs for the PPE and SPEs must be compiled using
different compilers. These compilers generate code streams for two completely
different instruction sets.
To conclude, even though a high-level language such as C or C++ can be used
to program the CBE processor, an understanding of the PPE and SPE machine
instructions adds considerably to a software developer’s ability to produce
efficient, optimized code.
3.3 Communication between PPE and SPE
Before understanding how processors communicate with each other, it is impor-
tant to know about the types of storage domains the processors use to move data.
In CBE there are three types of storage domains - one main-storage domain, eight
SPE local storage domains, and eight SPE channel domains, as shown in Figure
3.4. The main-storage domain, which is the entire effective-address space, can
be configured by the PPE operating system to be shared by all processors and
memory-mapped devices in the system (all I/O is memory-mapped). However,
the local-storage and channel problem-state (user-state) domains are private to the
SPU, LS, and MFC of each SPE.
Figure 3.4: The Cell Storage Domains
Finally, using the defined communication mechanism implemented in hard-
ware between the processors, we move data to enable interaction between them.
The three primary communication mechanisms between the PPE and SPEs are
mailboxes, signal notification registers, and DMAs. This thesis focuses on under-
standing only mailboxes and DMA transfers, because only these two mechanisms
have been used in implementing the framework. They are explained in the next
two sub-sections.
3.3.1 Mailboxes
Mailboxes are the best way to send fast, dedicated 32-bit messages between
processors. Thus, whenever the processors need to communicate information that
fits in 32 bits, mailboxes are our best bet; in our framework, for example,
addresses of certain structures and variables, acknowledgements on receiving
specific data, etc., are communicated using the mailboxes. Each SPE has three
mailboxes for sending, receiving and buffering 32-bit messages between the SPE
and the PPE. Two mailboxes (the SPU Write Outbound Mailbox and the SPU Write
Outbound Interrupt Mailbox) are provided for sending messages from the SPE to
the PPE. One mailbox (the SPU Read Inbound Mailbox) is provided for sending
messages from the PPE to the SPE.
PPE is often used as an application controller, managing and distributing work
to the SPEs. A large part of this task may involve loading main storage with data
to be processed, then notifying an SPE by means of a mailbox. The SPE can also
use its outbound mailboxes to inform the PPE that it has finished with a task. An
SPE sends a mailbox message by writing the 32-bit message value to either of
its two outbound mailbox channels. The PPE can read a message in an outbound
mailbox by reading the MMIO register in the SPE’s MFC that is associated with
the mailbox channel. Likewise, the PPE sends messages to the SPE’s inbound
mailbox by writing the associated MMIO register.
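Because a mailbox message is 32 bits wide, any wider value has to be split across several messages. The following sketch shows one way to do this for a 64-bit effective address. It is an illustration only: the helpers ea_to_words and words_to_ea are hypothetical, and whether the framework needs such a split depends on the address width used by the PPE program, which is not stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a 64-bit effective address into two
 * 32-bit mailbox words (high word first). */
void ea_to_words(uint64_t ea, uint32_t words[2])
{
    assert(words != 0);
    words[0] = (uint32_t)(ea >> 32);          /* high 32 bits, sent first  */
    words[1] = (uint32_t)(ea & 0xFFFFFFFFu);  /* low 32 bits, sent second  */
}

/* Hypothetical helper: rebuild the address on the receiving side. */
uint64_t words_to_ea(const uint32_t words[2])
{
    assert(words != 0);
    return ((uint64_t)words[0] << 32) | words[1];
}
```

Sender and receiver must agree on the word order; here the high word is assumed to travel first.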
Following is a detailed explanation of how SPU Inbound and Outbound Mail-
boxes work along with examples of how to use the functions. This is essential
because: (1) it enables us to follow the low level code of framework, and (2)
knowing the mechanism of how mailboxes work may also help developers to im-
plement their own mailbox messages.
3.3.1.1 SPU Outbound Mailboxes
The MFC provides two one-entry mailbox channels - the SPU Write Outbound
Mailbox and the SPU Write Outbound Interrupt Mailbox - for sending messages
from the SPE to the PPE. We use the write outbound mailbox channel in our
implementation for communication. Further details about the channel as well as
the functions that are used at the PPE and SPE side are described below.
SPU Write Outbound Mailbox Channel: SPE software writes to the SPU
Write Outbound Mailbox channel to put a mailbox message in the SPU Write Out-
bound Mailbox. This write-channel instruction will return immediately if there is
sufficient space in the SPU Write Outbound Mailbox Queue to hold the message
value. If there is insufficient space, the write-channel instruction will stall the SPU
until the PPE reads from this mailbox.
SPE Side: On the SPE side, the function spu_write_out_mbox() writes to the SPU
Write Outbound Mailbox and stalls until space is available. Once the mbox data
is successfully written to the SPU Write Outbound Mailbox, the next step is for
the PPE to read the value. The function used by the SPE to send the data is
given below:
unsigned int mbox_data;
spu_write_out_mbox(mbox_data);
PPE Side: Before PPE software can read data from one of the SPU Write Outbound
Mailboxes, it must first read the Mailbox Status Register to determine that
unread data is present in the SPU Write Outbound Mailbox; otherwise, stale or
undefined data may be returned. To do so, PPE software reads the Mailbox Status
Register and extracts the count value from the SPU_Out_Mbox_Count field. If the
count is non-zero, then at least one unread value is present. If the count is 0,
PPE software should not read the SPU Write Outbound Mailbox Register, because it
would get incorrect data; it should instead keep polling the Mailbox Status
Register. The function used to read the Mailbox Status Register is
int spe_out_mbox_status(spe_context_ptr_t spe), where the parameter spe specifies
the SPE context whose SPU Outbound Mailbox status is to be read.
The function used by PPE to read SPU Write Outbound Mailbox Channel is:
#include <libspe2.h>   /* library of routines for accessing the SPEs */
int spe_out_mbox_read(spe_context_ptr_t spe,
unsigned int *mbox_data, int count)
The description of parameters of the function is as follows:
• spe - Specifies the SPE context for which the SPU Outbound Mailbox has
to be read.
• mbox_data - A pointer to an array of unsigned integers of size count to
receive the 32-bit mailbox messages read by the call.
• count - The maximum number of mailbox entries to be read by this call.
This function returns an integer value that is greater than 0, equal to 0, or -1:
• >0 - the number of 32-bit mailbox messages read
• 0 - no data to be read
• -1 - an error occurred, and errno is set appropriately
3.3.1.2 SPU Inbound Mailboxes
The MFC provides one mailbox for a PPE to send information to an SPU: the SPU
Read Inbound Mailbox. This mailbox has four entries; that is, the PPE can have
up to four 32-bit messages pending at a time in the SPU Read Inbound Mailbox.
More details about the channel and the functions used at the SPE and PPE side
are given in the following subsections.
SPU Read Inbound Mailbox Channel: If the SPU Read Inbound Mailbox channel has
a message, the value read from the mailbox is the oldest message written to the
mailbox. If the inbound mailbox is empty, the SPU Read Inbound Mailbox channel
count will read as 0.
PPE Side: The function spe_in_mbox_write, given below, writes up to count
messages to the SPE inbound mailbox for the SPE context spe. This call may be
blocking or non-blocking, depending on the behavior parameter. The blocking
version of this call is particularly useful to send a sequence of mailbox
messages to an SPE program without further need for synchronization. The
non-blocking version may be advantageous when using SPE events for
synchronization in a multi-threaded application. To ensure that data can be
written, spe_in_mbox_status can be called before writing to the SPU inbound
mailbox. The function spe_in_mbox_status(spe_context_ptr_t spe) fetches the
status of the SPU Inbound Mailbox for the SPE context specified by the spe
parameter. A value of 0 is returned if the mailbox is full. A non-zero value
specifies the number of available (32-bit) mailbox entries. The function to
write to the SPU Read Inbound Mailbox is given below:
#include <libspe2.h>
int spe_in_mbox_write (spe_context_ptr_t spe,
unsigned int *mbox_data, int count,
unsigned int behavior)
• spe - Specifies the SPE context of the SPU Inbound Mailbox to be written.
• mbox_data - A pointer to an array of count unsigned integers containing the
32-bit mailbox messages to be written by the call.
• count - The maximum number of mailbox entries to be written by this call.
• behavior - Specifies whether the call should block until mailbox messages
are written.
There are three possible values for behavior, given in Table 3.1.
Table 3.1: Values of Behavior
Value                      Description
SPE_MBOX_ALL_BLOCKING      The call blocks until all count mailbox messages
                           have been written
SPE_MBOX_ANY_BLOCKING      The call blocks until at least one mailbox message
                           has been written
SPE_MBOX_ANY_NONBLOCKING   The call writes as many mailbox messages as possible,
                           up to a maximum of count, without blocking
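The difference between the three behaviors can be expressed as a simple predicate on the number of free mailbox entries. The sketch below is a model for illustration, not library code: the enum model_behavior_t and the function write_would_block are hypothetical names chosen so that they do not clash with the real library constants.

```c
#include <assert.h>

/* Hypothetical stand-ins for the three behavior values of Table 3.1. */
typedef enum {
    MODEL_ALL_BLOCKING,
    MODEL_ANY_BLOCKING,
    MODEL_ANY_NONBLOCKING
} model_behavior_t;

/* Returns 1 if a write of count messages would have to wait for the
 * SPE to free mailbox entries before the call can return, 0 otherwise. */
int write_would_block(model_behavior_t behavior, unsigned count, unsigned free_slots)
{
    assert(count > 0);
    switch (behavior) {
    case MODEL_ALL_BLOCKING:    return free_slots < count;  /* needs room for all  */
    case MODEL_ANY_BLOCKING:    return free_slots == 0;     /* needs room for one  */
    case MODEL_ANY_NONBLOCKING: return 0;                   /* never waits         */
    }
    return 0;
}
```

For example, with two free entries and three messages to write, only the all-blocking variant has to wait.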
After the PPE writes the mbox value, the SPE has to read the same from the
SPU Read Inbound Mailbox.
SPE Side: SPE software can use a read-channel function on the SPU Read Inbound
Mailbox channel to read the contents of its SPU Read Inbound Mailbox. This
channel read will return immediately if any data written by the PPE is waiting
in the SPU Read Inbound Mailbox. The read-channel function will cause the SPU to
stall if the SPU Read Inbound Mailbox is empty.
unsigned int mbox_data;
mbox_data = spu_read_in_mbox();
Although the mailboxes are primarily intended for communication between
PPE and SPEs, they can also be used for communication between an SPE and
other SPEs, processors, or devices.
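The queueing behavior described in this section (a one-entry outbound mailbox, a four-entry inbound mailbox, and oldest-first reads) can be captured in a small software model. The sketch below is only a model for reasoning about the protocol; mbox_model_t and its functions are hypothetical names and do not touch the hardware. A failed write corresponds to the case where the real writer would stall or block, and a failed read to the case where the status count is 0.

```c
#include <assert.h>
#include <stdint.h>

#define MBOX_MAX_ENTRIES 4  /* the inbound mailbox has four entries */

/* Hypothetical software model of one mailbox queue. */
typedef struct {
    uint32_t entry[MBOX_MAX_ENTRIES];
    unsigned capacity;  /* 1 for an outbound mailbox, 4 for the inbound one */
    unsigned count;     /* number of unread messages */
    unsigned head;      /* index of the oldest unread message */
} mbox_model_t;

void mbox_init(mbox_model_t *m, unsigned capacity)
{
    assert(capacity >= 1 && capacity <= MBOX_MAX_ENTRIES);
    m->capacity = capacity;
    m->count = 0;
    m->head = 0;
}

/* Returns 1 if the message was queued, 0 if the mailbox is full. */
int mbox_try_write(mbox_model_t *m, uint32_t value)
{
    if (m->count == m->capacity)
        return 0;
    m->entry[(m->head + m->count) % m->capacity] = value;
    m->count++;
    return 1;
}

/* Returns 1 and stores the oldest message in *value, or 0 if empty. */
int mbox_try_read(mbox_model_t *m, uint32_t *value)
{
    if (m->count == 0)
        return 0;
    *value = m->entry[m->head];
    m->head = (m->head + 1) % m->capacity;
    m->count--;
    return 1;
}
```

The count field plays the role of the value that the status functions report for an outbound mailbox, so checking it before a read mirrors the polling described above.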
3.3.2 Direct Memory Access
When the PPE and SPEs have to transfer larger amounts of data or instructions,
they use DMAs. Architecturally, the DMAs are implemented to support the transfer
of large amounts (maximum 16 kB) of data or instructions between the processors
in a single command. As our framework needs to support data transfers of size
10 kB, knowing how to use and implement DMAs properly is essential.
MFC’s DMA Controller (DMAC) implements DMA transfers of instructions
and data between the SPU’s LS and main storage. Programs running on the
associated SPU, or the PPE, can issue the DMA commands. The MFC executes DMA
commands autonomously, which allows the SPU to continue execution in parallel
with the DMA transfers. Each DMAC can initiate up to 16 independent DMA
transfers to or from its LS.
3.3.2.1 DMA Transfers
To initiate a DMA transfer, software on an SPE uses a channel instruction to write
the transfer parameters to the MFC command queue channels. An SPE can only
fetch instructions from its own LS. An SPE or PPE performs data transfers be-
tween the SPE’s LS and main storage primarily using DMA transfers controlled
by the MFC DMA controller for that SPE. Software on the SPE’s SPU interacts
with the MFC through channels, which enqueue DMA commands (length of the
queue being 16) and provide other facilities, such as mailboxes and signal notifi-
cation. An SPE program accesses its own LS using a local storage address (LSA).
The LS of each SPE is also assigned a real address (RA) range within the system’s
memory map. This allows privileged software to map LS areas into the effective
address (EA) space, where the PPE, SPEs, and other devices that generate EAs
can access the LS. Each SPE’s MFC serves as a data-transfer engine. DMA transfer
requests contain both an LSA and an EA. Thus, they can address both an SPE’s
LS and main storage and thereby initiate DMA transfers between the domains.
The MFC accomplishes this by maintaining and processing an MFC command
queue. The queued requests are converted into DMA transfers. Each MFC can
maintain and process multiple in-progress DMA command requests and DMA
transfers. The MFC can also autonomously manage a sequence of DMA trans-
fers in response to a DMA-list command from its associated SPU. Each DMA
command is tagged with a 5-bit Tag Group ID. Software can use this identifier
to check or wait on the completion of all queued commands in one or more tag
groups.
DMA commands: The majority of MFC commands initiate DMA transfers; these are
called DMA commands. The basic DMA commands are get and put. Since the LSs of
the SPEs and the I/O subsystems are typically mapped into the effective address
space, DMA commands can transfer data between the LS and these areas as well.
Regardless of the initiator (SPU, PPE, or other device), a DMA command transfers
up to 16 kB of data between an LS and main memory or between LSs. An MFC supports
naturally aligned DMA transfer sizes of 1, 2, 4, 8, and 16 bytes and multiples
of 16 bytes. The performance of a DMA transfer can be improved when the source
and destination addresses have the same quadword offsets within a 128-byte cache
line.
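The size and alignment rules above can be checked in software before a command is issued. The following is a minimal sketch, assuming a 256 kB local storage (an assumption used only for the sanity check); the helper names dma_size_is_valid and same_quadword_offset are hypothetical and are not part of any MFC interface.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_MAX_BYTES (16u * 1024u)   /* maximum size of one DMA command */
#define LS_SIZE       (256u * 1024u)  /* assumed local storage size      */

/* Hypothetical helper: returns 1 if size is 1, 2, 4, 8 or a multiple
 * of 16 bytes not exceeding 16 kB, and 0 otherwise. */
int dma_size_is_valid(uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return 1;
    return size >= 16 && size <= DMA_MAX_BYTES && size % 16 == 0;
}

/* Hypothetical helper: returns 1 if the local storage address and the
 * effective address fall on the same quadword (16-byte) offset within
 * a 128-byte cache line. */
int same_quadword_offset(uint32_t lsa, uint64_t ea)
{
    assert(lsa < LS_SIZE);
    return ((lsa & 0x7Fu) >> 4) == (uint32_t)((ea & 0x7Fu) >> 4);
}
```

A 10 kB transfer (10240 bytes) passes the size check, since it is a multiple of 16 bytes and below the 16 kB limit.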
• put (put[s]) command transfers the number of bytes specified by the transfer
size parameter from the local storage address of the corresponding SPU to
the effective address (EA).
(void) mfc_put(volatile void *ls, uint64_t ea,
uint32_t size, uint32_t tag, uint32_t tid,
uint32_t rid)
The arguments to this function correspond to the arguments of the
spu_mfcdma64 command: ls is the local-storage address, ea is the effective
address in system memory, size is the DMA transfer size (maximum is 16
kB), tag is the DMA tag, tid is the transfer class identifier, and rid is the
replacement class identifier.
• get command transfers the number of bytes specified by the transfer size
parameter from the effective address to the local storage address of the cor-
responding SPU.
(void) mfc_get(volatile void *ls, uint64_t ea,
uint32_t size, uint32_t tag, uint32_t tid,
uint32_t rid)
The arguments to this function correspond to the arguments of the
spu_mfcdma64 command: ls is the local-storage address, ea is the effective
address in system memory, size is the DMA transfer size (maximum is 16
kB), tag is the DMA tag, tid is the transfer class identifier, and rid is the
replacement class identifier, same as above.
When MFC commands are entered into the command queue, each command
in the queue is tagged with a 5-bit tag-group identifier, called the MFC command
tag identifier. The identification tag can be any value between 0 and 31. The same
identifier can be used for multiple MFC commands to create a tag group contain-
ing all the commands currently in the queue with the same command tag. Software
can use the MFC command tag to check the completion of all queued commands
in a tag group. In our implementation the transfer class identifier and replacement
class identifier are both 0 and more information on them is not required.
After MFC commands are entered into the command queue, we might want to
know the status of these MFC commands. There are certain MFC DMA command
functions that can be used to check the completion of MFC commands or the status
of entries in the MFC DMA queue. The function used to determine the status of
the MFC DMA command is mfc_read_tag_status. Each bit of the returned value
indicates the status of one tag group. If a bit is set, the corresponding tag
group has no outstanding operation (that is, its commands have completed) and is
not masked by the query.
In our implementation, out of the many provided MFC DMA status functions,
we use mfc_write_tag_mask and mfc_read_tag_status_all(). We explain these two
functions in more detail below.
• mfc_write_tag_mask - Sets the tag mask that selects the MFC tag groups to be
included in the query operation, where the parameter mask in the function
is the DMA tag group query mask. Each bit of the mask corresponds to one tag
group. Implementation:
(void) mfc_write_tag_mask (uint32_t mask)
For example, suppose the tag group identifier given in an mfc_get() command
is 7; then bit 7 of the tag mask must be set to 1. Thus, the function is called
as mfc_write_tag_mask(1<<7).
• mfc_read_tag_status_all() - A request is sent to update the tag status when
all enabled MFC tag groups have a “no operation outstanding” status. The
processor waits for the status to be updated. Implementation:
(uint32_t) mfc_read_tag_status_all(void)
Thus, the call waits until all the DMA commands in the selected tag groups
(those whose bits were previously set in the tag mask) have completed.
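The relation between a tag group identifier, the tag mask and the returned status word is plain bit arithmetic. The sketch below shows it without calling the MFC functions themselves; tag_mask_for and tag_groups_done are hypothetical helper names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with only the bit of the given tag group
 * set. Tag group identifiers are 5 bits wide, so tag must be 0..31. */
uint32_t tag_mask_for(unsigned tag)
{
    assert(tag < 32);
    return (uint32_t)1 << tag;
}

/* Hypothetical helper: returns 1 if every tag group selected by mask
 * is reported as complete (bit set) in the status word. */
int tag_groups_done(uint32_t status, uint32_t mask)
{
    return (status & mask) == mask;
}
```

For tag group 7 the mask is 0x80, which is the value passed in the example above.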
3.4 Chapter Conclusion
In this chapter, I explained the various unique software features of CBEA that en-
hance our control over the interconnection architecture. This helps to understand
the complexity involved in abstracting away from the low level bus architectural
constraints to successfully implement the scheduling framework.
I also described in detail the different application partitioning models, cre-
ation of threads that are used to send processing requests to different processors,
the PPE and SPE instruction sets and finally an overview of how PPE and SPEs
interact.
CHAPTER 4
BACKGROUND WORK: BUS
SCHEDULING ALGORITHMS
Our research group focused on developing an abstraction of the scheduling prob-
lem as well as an effective methodology that predictably schedules the multipro-
cessor bus. My key contribution to this effort was to build a framework that would
allow the team to implement and test the performance of a class of scheduling
algorithms for CBEA, while maintaining high throughput. While the focus of this
thesis is the framework, it is also important to know the scheduling algorithm,
as it is background material needed to appreciate the overall solution. In this
chapter, I briefly describe the real-time algorithm that was designed.
The research team proposed to employ a software-controllable Multi-Domain
Ring Bus (MDRB) architecture to increase system predictability and tighten WCET
estimation. The problem of scheduling periodic real-time transactions on MDRB
is challenging because the bus allows multiple non-overlapping transactions to
be executed concurrently, and because the degree of concurrency depends on the
topology of the bus and of executed transactions. The team proposed a practical
abstraction for the scheduling problem together with novel scheduling algorithms.
The first algorithm is optimal for transaction sets under restrictive assumptions
while the second induces a competitive sufficient schedulable utilization bound
for more general transaction sets.
This chapter is divided into three sections. The first section briefly reviews
important terminology that is needed to understand the algorithms and their
implementation; it will also help in following the description of the framework
implementation (Chapter 5). The second section describes the real-time bus
transaction and scheduling model. The last section presents the scheduling
algorithms for the proposed real-time transaction sets on ring buses, together
with the relevant proofs that demonstrate their effectiveness.
4.1 Terms and Terminology
4.1.1 Transaction
An event in which data is transferred from one processor to another is referred
to as a transaction. We model three types of transactions in our implementation,
which are described below.
4.1.1.1 Atomic Transaction
An atomic transaction (see Figure 4.1) is defined as the smallest non-interruptible
transaction on the bus. The size of an atomic transaction on the EIB is 128 bytes.
Figure 4.1: Types of Transactions
4.1.1.2 Element Transaction
An element transaction is defined as a transaction that is started by a transfer
command issued by the bus scheduler. In other words, an element transaction is
a sequence of atomic transactions that are scheduled without interrupts from the
bus scheduler’s standpoint. The size of an element transaction is defined in terms
of the number of atomic transactions it is composed of.
4.1.1.3 Data Transaction
A data transaction represents a request made by an application for transferring a certain amount
of data between CBEA’s components. A data transaction comprises one or more
element transactions. The size of a data transaction is defined in terms of the
number of atomic transactions it is composed of.
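The size relations among the three transaction types can be sketched as follows. This is only an illustration, not the code of the framework: the function names and the `max_atomics_per_element` parameter are hypothetical, and only the 128-byte atomic size comes from the text.

```python
ATOMIC_BYTES = 128  # size of an atomic transaction on the EIB (from the text)

def atomic_count(num_bytes):
    """Number of atomic transactions needed to move num_bytes (rounded up)."""
    return -(-num_bytes // ATOMIC_BYTES)

def split_into_elements(num_bytes, max_atomics_per_element):
    """Split a data transaction into element transactions.

    Returns the list of element sizes, each expressed in atomic transactions.
    max_atomics_per_element is a hypothetical scheduler parameter.
    """
    remaining = atomic_count(num_bytes)
    elements = []
    while remaining > 0:
        size = min(remaining, max_atomics_per_element)
        elements.append(size)
        remaining -= size
    return elements
```

For instance, under these assumptions a 1024-byte data transaction with at most three atomic transactions per element would be split into elements of sizes 3, 3 and 2.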
The bus scheduler starts a transaction by first putting the data of the transaction
into a DMA buffer of the MFC and then issues a sending command to the MFC.
The transaction is transferred by the MFC in a hop-by-hop manner from the source
to the destination through the shortest possible bus segment. For example, in
Figure 2.4, the route of a transaction from SPE1 to SPE2 is SPE1-PPEMIC-SPE0-
SPE2 on either of the two counterclockwise rings. We define two transactions to
be overlapping if they share a segment of their route on the same ring.
Let us call a transaction that is started by a sending command an element
transaction. A 128-byte chunk is an atomic transaction which operates in a non-
interrupted manner. However, between atomic transactions of an element trans-
action, the bus may transfer atomic transactions of other element transactions on
the same bus segment. In other words, an element transaction can be interrupted
between its atomic transactions.
As mentioned in Chapter 2, every atomic transaction goes through five phases
sequentially, of which three are the most significant for building schedules:
the command phase, the arbitration phase, and the data phase. The latter two,
the arbitration and data phases, are together called the post-command phase.
During the command phase of an atomic transaction, the arbiter sets up a route
for the transaction and puts it into a post-command phase queue. Let the command
phase latency be Lc. The command phase latency can be hidden by pipelining.
4.1.2 Contention
A bus interface unit (BIU) can issue one command every cycle, even while the
command phase of its previous atomic transaction has not been completed. The
arbitration phase starts at the end of the command phase. Due to bus
constraints, contention between atomic transactions may occur in their
post-command phases. A low-level round-robin scheduler arbitrates bus accesses
between contending transactions stored in the post-command phase queue. The
arbitration phase is where the BIU has to wait for bus access when there is
contention between atomic transactions on the bus. There are three types of
contention, described in the following.
4.1.2.1 Overlap Contention
Overlap contention occurs when an atomic transaction finishes its command phase
while other overlapping transactions are in their post-command phases. The
atomic transaction has to wait at least until the other active overlapping
transactions complete. Suppose that, on one of the clockwise rings, two element
transactions, SPE0 to SPE1 and SPE3 to SPE5 (Figure 4.2), are in their
post-command phases. An atomic transaction from SPE1 to SPE5 on the same ring
has to wait for the overlapping transaction SPE3 to SPE5 to finish. Even though
only two transactions are active and the bus architecture allows three
simultaneous atomic transactions on the ring, the atomic transaction from SPE1
to SPE5 has to wait because it overlaps with an active transaction on a bus
segment.
Figure 4.2: Overlap Contention
4.1.2.2 Overload Contention
Overload contention occurs when there is an atomic transaction finishing the com-
mand phase while, on the same ring, there are at least three non-overlapping trans-
actions in their post-command phases (see Figure 4.3). Because each bus ring can
support at most three atomic transactions at a point of time, the atomic transaction
has to wait at least until any one of the previously active atomic transactions com-
pletes. For example, suppose there are three transactions SPE0 to SPE1, SPE3 to
SPE5 and SPE4 to SPE2; if you want to start another transaction SPE2 to SPE0,
you need to wait for any one of the above transactions to complete because at any
given time there can be only three non-overlapping simultaneous transactions on
the bus ring.
Figure 4.3: Overload Contention
4.1.2.3 Start-Time Contention
Start-time contention occurs when there is an atomic transaction finishing the
command phase while on the same ring there are k < 3 other non-overlapping
transactions which have been in their post-command phases for less than three cy-
cles. Since a bus ring can only start the data phase of one atomic transaction every
Lo = 3 cycles, the atomic transaction has to wait for at most k ∗Lo cycles. Let us
define the delay due to the start-time contention to be the start-time latency.
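The three contention types can be told apart with a few lines of code. The sketch below is a simplified model, not the behaviour of the actual EIB arbiter: a transaction is represented by a hypothetical set of ring segments (hops), and the function name and return format are assumptions. Only the limit of three concurrent atomic transactions and Lo = 3 come from the text.

```python
L_O = 3             # cycles between data-phase starts on one ring (from the text)
MAX_CONCURRENT = 3  # atomic transactions a ring can carry at once (from the text)

def classify_contention(new_hops, active):
    """Classify the contention met by an atomic transaction that has just
    finished its command phase.

    new_hops: set of ring segments used by the new transaction.
    active:   list of (hops, cycles_in_post_command) for the transactions
              already in their post-command phases on the same ring.
    Returns (kind, bound), where bound is the start-time latency bound in
    cycles, or None when the wait depends on when another transaction ends.
    """
    if any(new_hops & hops for hops, _ in active):
        return ("overlap", None)
    if len(active) >= MAX_CONCURRENT:
        return ("overload", None)
    k = sum(1 for _, age in active if age < L_O)
    if k > 0:
        return ("start-time", k * L_O)
    return ("none", 0)
```

With two recently started, non-overlapping transactions on the ring, for example, the sketch reports start-time contention with a bound of 2 ∗ Lo = 6 cycles.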
4.2 Real-Time Bus Transaction and Scheduling Model
We consider a scheduling problem where applications request periodic data
transfers (data transactions) on the bus. A data transaction comprises an
infinite number of periodic jobs.
Without loss of generality, let the bus elements be indexed clockwise. We
define $T$ as the set of data transactions $T = \{\tau_i : i \in [1, N]\}$. A
data transaction $\tau_i$ is characterized by a tuple
$\tau_i = (e_i, p_i, \eta_i^1, \eta_i^2)$ where $e_i$ is the time that the bus
spends to transmit a job of $\tau_i$, $p_i$ is the period of $\tau_i$, and
$\eta_i^1$, $\eta_i^2$ are the indexes of the two endpoints, which are called
the first and the second index of $\tau_i$, respectively. Each job must complete
within its period, i.e. relative deadlines are equal to periods. A transaction
has two endpoints $\eta_i^1$ and $\eta_i^2$ if it uses all bus elements in
$[\eta_i^1, \eta_i^2]$ in the clockwise direction. Transaction $\tau_i$ is said
to go through element $\eta$ if $\eta \in [\eta_i^1, \eta_i^2)$ (excluding the
second endpoint of $\tau_i$). The bus utilization $u_i$ of $\tau_i$ is
calculated as $u_i = e_i / p_i$. We assume that all data transactions arrive at
time 0. Let the hyper-period $h$ of $T$ be the least common multiple of the
periods of all transactions in $T$.
Two transactions are said to overlap and cannot be transferred concurrently
on the bus if they use the same bus segment between any two elements. Based on
the endpoint definition, it is obvious that two transactions overlap if and only if
they go through the same element. Given a data transaction set $T$, we define an
overlap indicating function $OV : T \times T \mapsto \{0, 1\}$ where
$OV(\tau_i, \tau_j) = 1$ if $\tau_i$ and $\tau_j$ overlap, and 0 otherwise.
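The "go through" relation and the overlap indicating function translate directly into code. The following is a minimal sketch for non-circular transactions; the `Transaction` record and its field names (`first`, `second` for the two endpoint indexes) are illustrative choices, not names from the thesis.

```python
from collections import namedtuple

# first/second are the two endpoint indexes; the field names are illustrative.
Transaction = namedtuple("Transaction", "name e p first second")

def goes_through(tx, element):
    """True if tx uses the element, i.e. the element lies in [first, second)."""
    return tx.first <= element < tx.second

def ov(a, b):
    """Overlap indicator OV: 1 if a and b go through a common element, else 0."""
    return 1 if a.first < b.second and b.first < a.second else 0
```

Because the second endpoint is excluded, two transactions that merely touch at one element (one ending where the other begins) are reported as non-overlapping.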
A pairwise overlap set (PO-set) $D$ is defined as a maximal subset of $T$ such
that $\forall \tau_i, \tau_j \in D : OV(\tau_i, \tau_j) = 1$. For convenience,
we consider that a non-overlapping transaction belongs to a PO-set that contains
only that transaction. In general a transaction may belong to more than one
PO-set. Figure 4.4 shows an example of a transaction set with four PO-sets:
$D_1 = \{\tau_1, \tau_2, \tau_3, \tau_4\}$, $D_2 = \{\tau_2, \tau_4, \tau_5\}$,
$D_3 = \{\tau_4, \tau_5, \tau_6\}$, $D_4 = \{\tau_7, \tau_8\}$. Let the total
number of PO-sets in a transaction set be $N_D$. Notice that although $\tau_2$
and $\tau_6$ have one common endpoint (element 5), they do not overlap because
they do not share any bus segment. Since each PO-set differs from every other
PO-set in at least one element and transactions are arranged in a
one-dimensional space, $N_D \le N$.
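For a non-circular transaction set, the PO-sets can be enumerated by scanning the bus elements, since transactions that pairwise overlap on a line all go through a common element. The sketch below relies on that observation; the function name and the dictionary input format are assumptions. The endpoints used in the usage note are hypothetical values chosen to be consistent with the PO-sets listed for Figure 4.4 (the figure itself is not reproduced here).

```python
def po_sets(transactions):
    """Return the PO-sets (maximal pairwise-overlapping subsets) of a
    non-circular transaction set given as {name: (first, second)}."""
    elements = sorted({first for first, _ in transactions.values()})
    # E(k): transactions that go through element k, i.e. k in [first, second).
    # On a line, every maximal pairwise-overlapping set equals E(k) for the
    # first index k of one of its members.
    candidates = []
    for k in elements:
        e_k = frozenset(n for n, (f, s) in transactions.items() if f <= k < s)
        if e_k not in candidates:
            candidates.append(e_k)
    # Keep only the candidates that are not strictly contained in another one.
    return [c for c in candidates if not any(c < other for other in candidates)]
```

With the hypothetical endpoints t1 = (1, 4), t2 = (2, 5), t3 = (2, 4), t4 = (3, 7), t5 = (4, 6), t6 = (5, 8), t7 = (8, 10), t8 = (9, 11), the function returns the four PO-sets of the example, and t2 and t6 share endpoint 5 without overlapping.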
Figure 4.4: Non-circular Transaction Set
A transaction set is said to be circular if its overlapping transactions create
a cycle on the bus and to be non-circular otherwise. Figures 4.4 and 4.5 show
an example of a non-circular and a circular transaction set, respectively. A non-
circular transaction set can be represented as a set of overlapping intervals on
an indexed straight line where each interval corresponds to a transaction and the
straight line is indexed by the indexes of the bus elements. Figure 4.6 shows the
indexed straight line representation of the non-circular transaction set shown in
Figure 4.4. Let the left-most transaction on the straight line be the first
transaction. If there is more than one left-most transaction, any one of them
can be the first transaction. For simplicity, we index the bus elements such
that the first index of the first transaction is the smallest index.
Figure 4.5: Circular Transaction Set
Figure 4.6: Indexed Straight Line Representation
Due to the discrete nature of transactions’ execution times and periods, we
adopt the discrete scheduling model used in [7]. More specifically, we assume
that every transaction’s execution time and period are integral values. Scheduling
decisions are also made at integral values of time, starting from 0. The real interval
between time $t \in \mathbb{N}$ and time $t + 1$, i.e. $[t, t + 1)$, is called
slot $t$. A schedule $S$ is defined as a function
$S : T \times \mathbb{N} \mapsto \{0, 1\}$ where $S(\tau_i, t) = 1$ if and only
if $\tau_i$ is scheduled at slot $t$. A schedule $S$ is valid if and only if no
transaction is ever scheduled in the same slot as another transaction that
overlaps with it.
Given the constraint on overlapping transactions, a necessary condition on the
schedulability of a transaction set can be easily derived as in Theorem 4.2.1.
Theorem 4.2.1. A transaction set $T$ is schedulable only if:
$$\forall D \subset T : \; u_D = \sum_{\tau_i \in D} u_i \le 1. \qquad (4.2.1)$$
Proof. Since, by definition, no two transactions of a PO-set $D$ can be scheduled
concurrently, all transactions of $D$ must be scheduled in sequence. In other
words, the transactions of $D$ can be considered to be sharing one resource.
Therefore, Inequality 4.2.1 must be satisfied.
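Checking the necessary condition amounts to summing utilizations per PO-set. A minimal sketch follows, using exact rational arithmetic to avoid rounding issues at the boundary value 1; the function name and the input format (PO-sets as collections of names, with dictionaries for execution times and periods) are assumptions.

```python
from fractions import Fraction

def satisfies_necessary_condition(po_sets, e, p):
    """Check Theorem 4.2.1: every PO-set utilization is at most 1.

    po_sets: iterable of collections of transaction names.
    e, p:    dicts mapping a name to its execution time and period.
    """
    return all(sum(Fraction(e[n], p[n]) for n in d) <= 1 for d in po_sets)
```

A PO-set whose transactions have execution times 2, 1, 2 and 3 and a common period of 8 has utilization exactly 1 and therefore still passes the check.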
Let $E(k)$ be the set of all transactions in $T$ that go through the same bus
element indexed $k$. The following lemma is necessary for later discussion.
Lemma 4.2.1. Given a transaction set $T$ that satisfies the necessary condition,
the following inequality holds:
$$\sum_{\tau_i \in E(k)} u_i \le 1.$$
Proof. Since transactions in $E(k)$ pairwise overlap, there exists a PO-set $D$
such that $E(k) \subseteq D$. Therefore the lemma is implied by Theorem 4.2.1.
4.3 Scheduling Algorithms for Ring Buses
In this section we present our scheduling algorithms for the proposed real-time
transaction sets on the ring buses. The discussion is divided into three parts.
First, we propose an algorithm, namely POBase, which schedules every non-
circular transaction set whose transactions have the same period. We will prove
that the necessary condition (Theorem 4.2.1) is also the sufficient condition for
same-period non-circular transaction set to be schedulable by POBase. Therefore,
POBase is optimal for these transaction sets.
Second, a scheduling algorithm, namely POGen, is proposed to schedule
non-circular transaction sets whose transactions do not have the same period.
POGen, which is built on POBase, can schedule all transaction sets whose PO-set
utilizations satisfy the following utilization bound:
$$\forall D \subset T : \; u_D \le \frac{L - 1}{L}, \qquad (4.3.1)$$
where $L$ is defined as the greatest common divisor of all transaction periods.
Although the utilization bound is only sufficient, it approaches 1 when $L$ is
large. We believe that this assumption holds in most practical real-time
applications [8]. As
we will show in the implementation section, with the speed of the state of the art
multicore chip buses [4], the practical time slot size is about 1 µs to 1 µs. Mean-
while, the period granularity in practical real-time applications [8] is at the level
of milliseconds. That means L has practical values ranging from 10 to 100 time
units.
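The utilization bound of Inequality 4.3.1 is easy to evaluate for a given set of periods. A small sketch, assuming the periods are expressed as integer numbers of time slots (the function name is illustrative):

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def pogen_bound(periods):
    """Return (L, (L - 1) / L), where L is the gcd of all periods in slots."""
    L = reduce(gcd, periods)
    return L, Fraction(L - 1, L)
```

For periods of 20, 30 and 50 slots, L is 10 and the bound is 9/10; for periods 2, 3 and 6, L is 1 and the bound degenerates to 0, which illustrates why a large L matters.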
Finally, we will discuss the issue of scheduling circular transaction sets and
our proposed initial solution.
4.3.1 The POBase Algorithm
The problem of scheduling a non-circular same-period transaction set is similar
to the problem of interval graph vertex coloring [9]. However, the optimal col-
oring algorithm in [9] can only schedule transactions which all have the same
execution time. POBase is a modification of this algorithm to handle the prob-
lem at hand. POBase is a first-fit algorithm with respect to a transaction ordering.
More specifically, in POBase, the transactions are ordered by their first indexes.
Then, in ascending order, each transaction is assigned to the earliest slots to
which no smaller-ordered overlapping transaction has already been assigned.¹
Figure 4.7 shows an example of the schedule generated by POBase for the
transaction set shown in Figure 4.4 whose transactions have period equal to 8
and execution times $e_1 = 2$, $e_2 = 1$, $e_3 = 2$, $e_4 = 3$, $e_5 = 4$,
$e_6 = 1$, $e_7 = 4$, $e_8 = 4$. Consider the schedule of the transactions of
$D_2 = \{\tau_2, \tau_4, \tau_5\}$. Transaction $\tau_5$ is scheduled in slots
$\{0, 1, 3, 4\}$ because its smaller-ordered overlapping transactions $\tau_2$
and $\tau_4$ are scheduled in slots $\{2, 5, 6, 7\}$.
¹ The transactions can also be ordered by their second indexes, with their
schedules generated in descending order of the ordered list.
Algorithm 1 POBase
Input: transaction set $T$ such that $\forall \tau_i \in T : p_i = p$ where $p$ is a constant
Output: schedule $S$ for period $p$
1: $\mathcal{L} \leftarrow$ the list of all $\tau_i \in T$ ordered according to $\eta_i^1$
2: for each $\tau_i \in \mathcal{L}$ in ascending order do
3:   for each $t \in [0, p)$ do
4:     if $\sum_{x \in [0,p)} S(\tau_i, x) < e_i$ then
5:       if $\forall \tau_j \in \mathcal{L} : OV(\tau_i, \tau_j) = 0$ or $S(\tau_j, t) = 0$ then
6:         $S(\tau_i, t) \leftarrow 1$
7:       end if
8:     end if
9:   end for
10: end for
Figure 4.7: An Example of the POBase Algorithm
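The first-fit procedure of Algorithm 1 can be sketched in Python as follows. This is an illustrative rendering, not the framework's implementation: the function names and the input format are assumptions, and the endpoints in the usage note are hypothetical values consistent with the PO-sets of Figure 4.4.

```python
def overlaps(a, b):
    """Half-open intervals [first, second) share a bus element."""
    return a[0] < b[1] and b[0] < a[1]

def pobase(transactions, p):
    """First-fit schedule for same-period transactions.

    transactions: {name: (e, first, second)}.
    Returns {name: sorted list of slots in [0, p)}.
    """
    # Step 1: order by first index (sorted() is stable, so ties keep input order).
    order = sorted(transactions, key=lambda n: transactions[n][1])
    slots = {n: [] for n in order}
    for n in order:                       # Step 2
        e, first, second = transactions[n]
        for t in range(p):                # Step 3
            if len(slots[n]) >= e:        # Step 4: already has e slots
                break
            # Step 5: slot t is free if no overlapping transaction uses it.
            busy = any(
                m != n and t in slots[m]
                and overlaps((first, second), transactions[m][1:])
                for m in order
            )
            if not busy:
                slots[n].append(t)        # Step 6
    return slots
```

Running it with period 8, the execution times of the example and the hypothetical endpoints t1 = (1, 4), t2 = (2, 5), t3 = (2, 4), t4 = (3, 7), t5 = (4, 6), t6 = (5, 8), t7 = (8, 10), t8 = (9, 11) places t5 in slots 0, 1, 3 and 4, while t2 and t4 together occupy slots 2, 5, 6 and 7, in agreement with the discussion of the example.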
Theorem 4.3.1. POBase is optimal for same-period non-circular transaction sets.
Proof. The generated schedule is valid because the condition at Step 5 guarantees
that a transaction is not scheduled in the same slot with its overlapping
transactions. It remains to show that if the transaction set satisfies the
necessary condition, then at the end of the algorithm,
$$\forall \tau_i : \; \sum_{x \in [0,p)} S(\tau_i, x) = e_i. \qquad (4.3.2)$$
We will prove this by induction.
Base case: Consider the first iteration of the for-loop starting at Step 2. In
this iteration, the schedule of $\tau_1$ in $\mathcal{L}$ is generated. Since
$\tau_1$ is the first transaction whose schedule is generated and $e_1 \le p$,
at the end of the iteration we have $\sum_{x \in [0,p)} S(\tau_1, x) = e_1$.
Furthermore, since $\eta_1^1 \le \eta_2^1$, the following induction condition
also holds at the end of the iteration: $\forall \tau_i \in T$, if
$\sum_{x \in [0,p)} S(\tau_i, x) > 0$ then $\eta_i^1 \le \eta_2^1$.
Induction case: Assume after iteration $k$ of the for-loop starting at Step 2,
Equation 4.3.2 holds for all transactions $\{\tau_i : i \in [1, k]\}$ and the
following induction condition also holds at the end of iteration $k$:
$\forall \tau_i \in T$, if $\sum_{x \in [0,p)} S(\tau_i, x) > 0$ then
$\eta_i^1 \le \eta_{k+1}^1$. Consider iteration $k + 1$. By contradiction, assume
that at the end of the iteration,
$\sum_{x \in [0,p)} S(\tau_{k+1}, x) < e_{k+1}$. Let $E(\eta_{k+1}^1)$ be the set
of transactions that go through $\eta_{k+1}^1$. Since $T$ is non-circular, by the
induction condition, we have: $\forall \tau_i \in T$, if
$OV(\tau_i, \tau_{k+1}) = 1$ and $\sum_{x \in [0,p)} S(\tau_i, x) > 0$ then
$\tau_i \in E(\eta_{k+1}^1)$. In other words, among all the transactions that
overlap with $\tau_{k+1}$, only transactions in $E(\eta_{k+1}^1)$ have their
schedule generated. Therefore, the contradiction assumption occurs only when:
$$\sum_{\tau_i \in E(\eta_{k+1}^1)} \; \sum_{x \in [0,p)} S(\tau_i, x) = p. \qquad (4.3.3)$$
Since the following is true:
$$\forall \tau_i \in E(\eta_{k+1}^1) \setminus \{\tau_{k+1}\} : \; \sum_{x \in [0,p)} S(\tau_i, x) \le e_i,$$
by the contradiction assumption and Equation 4.3.3 we have
$\sum_{\tau_i \in E(\eta_{k+1}^1)} e_i > p$. This contradicts Lemma 4.2.1, which
implies that $\sum_{\tau_i \in E(\eta_{k+1}^1)} e_i \le p$. Therefore, at the end
of the iteration, Equation 4.3.2 must hold for $\tau_{k+1}$ and the induction
condition also holds. This completes the proof.
Algorithm analysis: An efficient comparison-based sorting algorithm has time
complexity $O(N \log N)$. In addition, Step 5 can be implemented to have a time
complexity of $O(N)$. Therefore the time complexity of POBase to build a
schedule of $p$ slots for $N$ transactions is $O(N^2 \cdot p)$.
4.3.2 The POGen Algorithm
In this subsection we propose a scheduling algorithm (POGen) for non-circular
transaction sets whose transactions do not have the same period. In POGen, the
execution timeline from 0 to the hyper-period $h$, i.e. $[0, h)$, is divided
into a set of consecutive scheduling intervals:
$\{int^k = [t_k, t_{k+1}) : k \in \mathbb{N} \wedge 0 \le t_k < t_{k+1} \le h\}$.
Let $|int^k| = t_{k+1} - t_k$. In each scheduling interval $int^k$, each
transaction $\tau_i$ is assigned an interval load $l_i^k$, which is the number of
slots in the interval allocated to schedule $\tau_i$. The interval loads of each
transaction are calculated such that at the end of each interval, the
transaction's execution approximates its execution in the fluid scheduling
model [10]. The interval load of a PO-set is the sum of the interval loads of
its transactions. Given the interval loads of all transactions in interval
$int^k$, POBase is used to generate the schedule of $int^k$. As shown in the
previous subsection, the interval schedule given by POBase will be feasible if
and only if:
$$\forall D \subset T : \; \sum_{\tau_i \in D} l_i^k \le |int^k|.$$
A schedule of a transaction set, which is generated by POGen, is feasible if it
satisfies the following two conditions:
• Condition 1: for each transaction $\tau_i$, the sum of the interval loads over
each of the transaction's periods is equal to $e_i$.
• Condition 2: there is a feasible schedule for every scheduling interval.
In the following paragraphs, we will discuss our solution to identify the scheduling
intervals and the interval loads, which induces a feasible schedule.
Our proposed solution is inspired by the work in [11, 12]. However, since
neither of these works uses the transaction overlap assumption, their proposed
algorithms cannot be used for the problem at hand. In POGen, a scheduling
interval is defined as the interval between two consecutive arrival times (also
deadlines) of any two transactions. Figure 4.8 shows an example of the
scheduling intervals induced by the set of three transactions
$\tau_1 = \{e_1 = 1, p_1 = 2, \eta_1^1 = 1, \eta_1^2 = 3\}$,
$\tau_2 = \{e_2 = 1, p_2 = 3, \eta_2^1 = 1, \eta_2^2 = 4\}$ and
$\tau_3 = \{e_3 = 1, p_3 = 6, \eta_3^1 = 2, \eta_3^2 = 5\}$.
Figure 4.8: Scheduling Intervals on the Execution Timeline
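The scheduling intervals follow mechanically from the periods: every multiple of a period within the hyper-period is an interval boundary. A minimal sketch, with an illustrative function name and assuming integer periods:

```python
from functools import reduce
from math import gcd

def scheduling_intervals(periods):
    """Return the scheduling intervals [t_k, t_{k+1}) covering [0, h)."""
    h = reduce(lambda a, b: a * b // gcd(a, b), periods)  # hyper-period (lcm)
    # Arrival times (= deadlines) of all jobs, including the end point h.
    points = sorted({t for p in periods for t in range(0, h + 1, p)})
    return list(zip(points, points[1:]))
```

For the three transactions above, with periods 2, 3 and 6, the boundaries are 0, 2, 3, 4 and 6, which gives the four intervals [0, 2), [2, 3), [3, 4) and [4, 6).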
With regard to the interval loads, we define for each transaction $\tau_i$ and
scheduling interval $int^k$ a lag function:
$$\mathrm{lag}(\tau_i, int^k) = u_i \cdot t_{k+1} - \sum_{x \in [0, t_k)} S(\tau_i, x).$$
The function calculates how much time $\tau_i$ must be executed in interval
$int^k$ such that at the end of $int^k$ it is scheduled as if by the fluid
scheduling model [10]. We also define for each PO-set $D$ a similar lag function:
$$\mathrm{lag}(D, int^k) = u_D \cdot t_{k+1} - \sum_{\tau_i \in D} \; \sum_{x \in [0, t_k)} S(\tau_i, x).$$
In POGen, at each interval $int^k$, all interval loads must satisfy Inequality
4.3.4. The lower bound and the upper bound of the interval loads in Inequality
4.3.4 are the closest integral values of the lag functions.
$$\forall \tau_i \in T : \; \lfloor \mathrm{lag}(\tau_i, int^k) \rfloor \le l_i^k \le \lceil \mathrm{lag}(\tau_i, int^k) \rceil. \qquad (4.3.4)$$
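The lag function and the two bounds of Inequality 4.3.4 can be explored numerically on the transaction set of Figure 4.8. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that every chosen load is scheduled in full within its interval, which is exactly what may fail when a single bound is used throughout.

```python
from fractions import Fraction
from math import floor, ceil

def lag(u, t_next, done):
    """lag = u * t_{k+1} minus the slots already scheduled before t_k."""
    return u * t_next - done

def run_with(bound, utils, intervals):
    """Give every transaction floor(lag) or ceil(lag) in each interval and
    return the total load per interval. Assumes (for illustration only)
    that each chosen load is actually scheduled in full."""
    done = [0] * len(utils)
    totals = []
    for _, t_next in intervals:
        loads = [bound(lag(u, t_next, d)) for u, d in zip(utils, done)]
        totals.append(sum(loads))
        done = [d + l for d, l in zip(done, loads)]
    return totals

utils = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # Figure 4.8 set
intervals = [(0, 2), (2, 3), (3, 4), (4, 6)]
lower_totals = run_with(floor, utils, intervals)  # last interval gets 3 > 2
upper_totals = run_with(ceil, utils, intervals)   # first interval gets 3 > 2
```

Since the three transactions pairwise overlap, the total load of an interval must not exceed its length; using only lower bounds overloads the interval [4, 6), and using only upper bounds overloads [0, 2).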
Note that if all loads satisfy the lower bounds of Inequality 4.3.4, then the
generated schedule satisfies Condition 1. The reason is as follows. Consider the
last scheduling interval of a period of transaction $\tau_i$:
$int = [t, a \cdot p_i)$ where $t$ and $a$ are some integers; the lag function of
$\tau_i$ is:
$$\mathrm{lag}(\tau_i, int) = a \cdot u_i \cdot p_i - \sum_{x \in [0, t)} S(\tau_i, x).$$
Since $u_i \cdot p_i = e_i$ is an integer, and so is $S(\tau_i, x)$,
$\lfloor \mathrm{lag}(\tau_i, int) \rfloor = \mathrm{lag}(\tau_i, int)$. That
means the total interval load of $\tau_i$ up to slot $a \cdot p_i$, which is
calculated as
$$\lfloor \mathrm{lag}(\tau_i, int) \rfloor + \sum_{x \in [0, t)} S(\tau_i, x),$$
is equal to $a \cdot e_i$ and satisfies Condition 1. However, using only the
lower bound loads does not guarantee the existence of a feasible schedule in
each scheduling interval (Condition 2). This is also true if only upper bound
loads are used. The following example illustrates this point. Consider again the
example of the transaction set in Figure 4.8. If the algorithm runs with
interval loads set to their lower bound loads, then the schedule of interval
$[4, 6)$ is not feasible because the total load in this interval is 3. If, on
the other hand, only the upper bound loads are used, then the schedule of
interval $[0, 2)$ is also not feasible because the total load in this interval
is 3. An algorithm that generates feasible schedules must use a combination of
these values, and computing this is not trivial. For ease of presentation, we
split our discussion into two parts. First, we assume to have the GenerateLoad
procedure (used in Step 2 of POGen) which satisfies the following proposition:
Proposition 4.3.1. Assume that all PO-sets satisfy the utilization bound in
Inequalities 4.3.1. If the following inequalities hold before the execution of
GenerateLoad for an interval $int^k$:
$$\forall D \subset T : \; \lfloor \mathrm{lag}(D, int^k) \rfloor \le |int^k|, \qquad (4.3.5)$$
$$\forall D \subset T : \; \sum_{\tau_i \in D} \; \sum_{x \in [0, t_k)} S(\tau_i, x) \ge \lfloor u_D \cdot t_k \rfloor, \qquad (4.3.6)$$
then GenerateLoad generates a set of interval loads for $int^k$ which satisfy
both Inequalities 4.3.4 and:
$$\forall D \subset T : \; \lfloor \mathrm{lag}(D, int^k) \rfloor \le \sum_{\tau_i \in D} l_i^k \le |int^k|. \qquad (4.3.7)$$
Inequalities 4.3.7 set conditions on the total interval load of each PO-set. The
right side of Inequality 4.3.7 guarantees that each PO-set with the generated
interval loads is schedulable in $int^k$ by POBase. Let us call a set of
interval loads of a scheduling interval that satisfies both Inequality 4.3.4 and
4.3.7, and therefore Conditions 1 and 2, a feasible load set. We will prove in
Lemma 4.3.1 that, if Proposition 4.3.1 is true, then the conditions in
Inequalities 4.3.5 and 4.3.6 are indeed always satisfied for every interval
$int^k$. Given the defined GenerateLoad procedure and Lemma 4.3.1, it is then
obvious that POGen generates a feasible schedule of $T$. In the second part
(Section 4.3.3), we will detail how to construct GenerateLoad and prove that
Proposition 4.3.1 holds.
Algorithm 2 POGen
Input: transaction set $T$
Output: schedule $S$
1: for each scheduling interval $int^k$ do
2:   $\{l_i^k : \forall i \in [1, N]\} \leftarrow$ GenerateLoad($T$, $int^k$)
3:   $T' \leftarrow \{\{l_i^k, |int^k|, \eta_i^1, \eta_i^2\} : \forall i \in [1, N]\}$
4:   $S$ for interval $int^k \leftarrow$ POBase($T'$)
5: end for
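Lemma 4.3.1. If Proposition 4.3.1 is true, then Inequalities 4.3.5 and 4.3.6 hold
before the execution of GenerateLoad for every interval $int^k$.
Proof. We prove by induction.
Base step: Consider the first scheduling interval $int^0 = [0, t_1)$.
Inequalities 4.3.5 for this interval hold because
$$\forall D \subset T : \; \lfloor \mathrm{lag}(D, int^0) \rfloor = \lfloor u_D \cdot t_1 \rfloor \le |int^0|,$$
and Inequalities 4.3.6 hold because
$$\forall D \subset T : \; \sum_{\tau_i \in D} \; \sum_{x \in [0, 0)} S(\tau_i, x) = 0 = \lfloor u_D \cdot 0 \rfloor.$$
Induction step: Assume that Inequalities 4.3.5 and 4.3.6 hold in every
scheduling interval up to $int^k$. We prove that Inequalities 4.3.5 and 4.3.6
also hold before the execution of GenerateLoad at interval $int^{k+1}$. Since
Inequalities 4.3.5 and 4.3.6 are satisfied at interval $int^k$, GenerateLoad
generates a feasible load set and POBase generates a feasible schedule for the
interval. Therefore after Step 4, we have:
$$\forall D \subset T : \; \sum_{\tau_i \in D} \; \sum_{x \in [t_k, t_{k+1})} S(\tau_i, x) = \sum_{\tau_i \in D} l_i^k.$$
Then by the left side of Inequalities 4.3.7, we obtain the following, which
proves that Inequalities 4.3.6 hold for $int^{k+1}$:
$$\begin{aligned}
\forall D \subset T : \; \sum_{\tau_i \in D} \; \sum_{x \in [0, t_{k+1})} S(\tau_i, x)
&= \sum_{\tau_i \in D} \; \sum_{x \in [0, t_k)} S(\tau_i, x) + \sum_{\tau_i \in D} l_i^k \\
&\ge \sum_{\tau_i \in D} \; \sum_{x \in [0, t_k)} S(\tau_i, x) + \lfloor u_D \cdot t_{k+1} \rfloor - \sum_{\tau_i \in D} \; \sum_{x \in [0, t_k)} S(\tau_i, x) \\
&= \lfloor u_D \cdot t_{k+1} \rfloor.
\end{aligned}$$
Now consider Inequalities 4.3.5. Notice that since $S(\tau_i, x)$ is an integer,
we have:
$$\forall D \subset T : \; \lfloor \mathrm{lag}(D, int^{k+1}) \rfloor = \lfloor u_D \cdot t_{k+2} \rfloor - \sum_{\tau_i \in D} \; \sum_{x \in [0, t_{k+1})} S(\tau_i, x).$$
Since Inequalities 4.3.6 hold for $int^{k+1}$, Inequalities 4.3.5 also hold
because:
$$\begin{aligned}
\forall D \subset T : \; \lfloor \mathrm{lag}(D, int^{k+1}) \rfloor
&= \lfloor u_D \cdot t_{k+2} \rfloor - \sum_{\tau_i \in D} \; \sum_{x \in [0, t_{k+1})} S(\tau_i, x) \\
&\le \lfloor u_D \cdot t_{k+2} \rfloor - \lfloor u_D \cdot t_{k+1} \rfloor \\
&\le \lceil u_D \cdot (t_{k+2} - t_{k+1}) \rceil \le |int^{k+1}|.
\end{aligned}$$
This completes the proof.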
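The last chain of inequalities in the proof above relies on two elementary facts that are not spelled out. The following short derivation fills them in; it is an added explanatory step, with $a = u_D \cdot t_{k+2}$ and $b = u_D \cdot t_{k+1}$, and it assumes $u_D \le 1$, which follows from the utilization bound 4.3.1.

```latex
% Fact 1: a difference of floors is at most the ceiling of the difference.
% Since \lfloor a \rfloor \le a and \lfloor b \rfloor > b - 1,
\lfloor a \rfloor - \lfloor b \rfloor < a - b + 1 ,
% and the left side is an integer, so
\lfloor a \rfloor - \lfloor b \rfloor \le \lceil a - b \rceil .

% Fact 2: the interval length is an integer and u_D \le 1, so
\lceil u_D \cdot (t_{k+2} - t_{k+1}) \rceil
  \le \lceil t_{k+2} - t_{k+1} \rceil
  = t_{k+2} - t_{k+1}
  = |int^{k+1}| .
```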
4.3.3 The GenerateLoad Procedure
As we mentioned, procedure GenerateLoad searches for a feasible load set of
each scheduling interval. There are two questions that have to be answered: (1)
Is there a feasible load set? (2) Is there an efficient algorithm to find it? We
will show that the problem at hand is equivalent to the problem of circulations
in graphs with loads and lower bounds [13]. This is the problem of finding a
feasible circulation flow in a directed graph where each edge has a capacity and
a lower bound. Furthermore, we will prove that if the utilization of each PO-set
is smaller than the utilization bound expressed by Inequalities 4.3.1, there always
exists a feasible solution, therefore answering Question 1. Then, since the Ford-
Fulkerson algorithm [13] can be used to solve the problem, Question 2 is also
answered.
In the following, we will intuitively describe the construction of a directed
graph from the input of GenerateLoad. Each vertex of the constructed graph
represents a PO-set $D_j$. For each vertex, a PO-set edge $g_j^D$ is defined
which exits from the vertex and whose flow value $f_j^D$ represents the interval
load of the corresponding PO-set. A lower bound value $b_j^D$ and a capacity
$c_j^D$ are defined for each of the PO-set edges such that Inequalities 4.3.7
are imposed on their flow values:
$$\forall D_j \subset T : \; b_j^D = \lfloor \mathrm{lag}(D_j, int^k) \rfloor \le f_j^D \le c_j^D = |int^k|. \qquad (4.3.8)$$
Furthermore, for each transaction $\tau_i$, a transaction edge is defined whose
flow value $f_i$ represents the interval load of the corresponding transaction.
A lower bound value $b_i$ and a capacity $c_i$ are defined for each of the
transaction edges such that Inequalities 4.3.4 are imposed on their flow values:
$$\forall \tau_i \in T : \; b_i = \lfloor \mathrm{lag}(\tau_i, int^k) \rfloor \le f_i \le c_i = \lceil \mathrm{lag}(\tau_i, int^k) \rceil. \qquad (4.3.9)$$
The flow of a transaction edge entering a vertex represents the contribution of the
corresponding transaction’s interval load to the corresponding PO-set’s interval
load. The endpoints and the direction of each edge are defined in such a way that
the values of the flows in and out a vertex preserve the relationship between the
interval load of the corresponding PO-set and that of its transactions. The graph
has a feasible circulation flow which represents a feasible load set.
The following definition is necessary for the graph construction. Let the index
PO-set order of a transaction set $T$ be an ordered list of all PO-sets in $T$
where a PO-set $D_j$ with smaller $\min_{\tau_i \in D_j} \eta_i^2$ has smaller
index. Ties are broken arbitrarily. Since each PO-set has only one value
$\min_{\tau_i \in D_j} \eta_i^2$, the order is well-defined. The transaction set
in Figure 4.4 has the index PO-set order $\{D_j : j \in [1, 4]\}$ where
$D_1 = \{\tau_1, \tau_2, \tau_3, \tau_4\}$, $D_2 = \{\tau_2, \tau_4, \tau_5\}$,
$D_3 = \{\tau_4, \tau_5, \tau_6\}$, $D_4 = \{\tau_7, \tau_8\}$. Figure 4.9 shows
the graph $G$ constructed from the transaction set in Figure 4.4. Transaction
edges are represented by solid lines while PO-set edges are represented by
dotted lines.
Graph construction: let us define a tuple $G = (V, E)$ as follows:
• For each PO-set $D_j$ in the index PO-set order, define a vertex $v_j$.
• For each PO-set $D_j$ in the index PO-set order, define a directed edge $g_j^D$
with capacity $c_j^D = |int^k|$ and lower bound
$b_j^D = \lfloor \mathrm{lag}(D_j, int^k) \rfloor$. Let $g_j^D$ be a PO-set edge.
• For each transaction $\tau_i$, define a directed edge $g_i$ with capacity
$c_i = \lceil \mathrm{lag}(\tau_i, int^k) \rceil$ and lower bound
$b_i = \lfloor \mathrm{lag}(\tau_i, int^k) \rfloor$. Let $g_i$ be a transaction
edge.
• $\{g_i : \tau_i \in D_1\}$ are edges that enter $v_1$; $g_1^D$ is the edge that
exits $v_1$.
• $\forall j : 1 < j \le N_D$, $\{g_i : \tau_i \in D_j \setminus D_{j-1}\}$ and
$g_{j-1}^D$ are edges that enter $v_j$;
$\{g_i : \tau_i \in D_{j-1} \setminus D_j\}$ and $g_j^D$ are edges that exit
$v_j$. This construction step deals with the situation where two PO-sets
$D_{j-1}$, $D_j$ share some transactions. Intuitively, to preserve the
relationship between the interval loads of the PO-sets and those of their
transactions, the transaction edge of a transaction common to the two PO-sets
would have to enter the two corresponding vertexes $v_{j-1}$, $v_j$. Since in a
qualified graph each directed edge can enter at most one vertex, this situation
must be avoided. This can be accomplished by representing the interval loads of
the common transactions on the second PO-set ($v_j$) as the interval load of the
first PO-set (i.e., $g_{j-1}^D$ enters $v_j$) minus the interval load of the
transactions that are only in the first set (i.e.,
$\{g_i : \tau_i \in D_{j-1} \setminus D_j\}$ exit $v_j$). Lemma 4.3.2 will detail
the proof of this argument.
• $V = \{v_j : j \in [1, N_D]\}$.
• $E = \{g_i : i \in [1, N]\} \cup \{g_j^D : j \in [1, N_D]\}$.
Finally, the graph flow is subject to the flow conservation constraint [13], in
which, given a vertex, the sum of the flow values entering it minus the sum of
the flow values exiting it is zero.
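The construction steps above can be sketched as a small routine that only builds the structure of the graph. This is an illustration under simplifying assumptions: bounds and capacities are left out, the edge naming (`gD1`, `g_5`, ...) and the function name are hypothetical, and a missing endpoint is represented by `None`.

```python
def build_graph(po_sets):
    """Build the edge list of G from PO-sets given in index PO-set order.

    po_sets: list of sets of transaction names, D_1 .. D_ND.
    Returns {edge: [tail, head]} with vertices numbered 1..ND; None marks
    a missing endpoint. Bounds and capacities are omitted in this sketch.
    """
    edges = {}
    nd = len(po_sets)
    for j, d in enumerate(po_sets, start=1):
        # PO-set edge g^D_j exits v_j and enters v_{j+1} when one exists.
        edges[f"gD{j}"] = [j, j + 1 if j < nd else None]
        prev = po_sets[j - 2] if j > 1 else set()
        for tx in d - prev:            # transactions new in D_j enter v_j
            edges.setdefault(f"g_{tx}", [None, None])[1] = j
        for tx in prev - d:            # transactions only in D_{j-1} exit v_j
            edges.setdefault(f"g_{tx}", [None, None])[0] = j
    return edges
```

Applied to the four PO-sets of Figure 4.4 (with transactions numbered 1 to 8), the edges entering vertex 2 are `gD1` and `g_5`, and the edges exiting it are `gD2`, `g_1` and `g_3`.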
Figure 4.9: Constructed Graph G
As a graph construction example, consider vertex $v_2$ that represents PO-set
$D_2$. The vertex has an output PO-set edge $g_2^D$ which represents the interval
load of $D_2$. Since $D_1$ has $\tau_2$ and $\tau_4$ in common with $D_2$ but not
$\tau_1$ and $\tau_3$, $v_2$ has an input PO-set edge $g_1^D$ which represents
the interval load of $D_1$ and two output transaction edges $g_1$ and $g_3$ that
represent the interval loads of $\tau_1$ and $\tau_3$, respectively. Finally,
$v_2$ has an input transaction edge $g_5$ that represents the interval load of
$\tau_5$. Lemma 4.3.2 shows that $G$ is indeed a directed graph.
Lemma 4.3.2. $G$ is a directed graph.
Proof. Since every edge of $G$ is directed, it remains to show that each edge has
only one or two endpoints. There is one edge defined for each PO-set and one edge
defined for each transaction.
For each PO-set $D_j$, the PO-set edge $g_j^D$ exits only $v_j$. In addition,
$g_j^D$ enters only $v_{j+1}$ when $j < N_D$. Therefore each PO-set edge exits
exactly one vertex and enters at most one vertex.
By the index PO-set ordering, if $\tau_i \in D_j \setminus D_{j-1}$, then
$\tau_i \notin D_k \setminus D_{k-1}$ where $j < k \le N_D$. Therefore, the
elements of the following set are disjoint:
$A = \{\{g_i : \tau_i \in D_1\}, \{g_i : \tau_i \in D_j \setminus D_{j-1}\} : j \in (1, N_D]\}$.
By definition, $A$ contains the transaction edges of $G$ that enter some
vertices. Also the union of the elements of $A$ is $\{g_i : \tau_i \in T\}$.
Therefore, each transaction edge enters exactly one vertex.
By a similar proving technique, we can show that each transaction edge exits
at most one vertex. Due to space constraints, we skip the detailed proof. In
conclusion, every edge of $G$ has at most two endpoints and is directed.
It remains to show that Proposition 4.3.1 holds and therefore POGen generates
a feasible schedule for all transaction sets that satisfy the utilization bound
of Inequalities 4.3.1. For simplicity of exposition, we split the proof into
multiple lemmas. Lemma 4.3.3 shows that a feasible load set can be found from an
integral feasible flow of the corresponding graph $G$. Then, Lemma 4.3.5 proves
that graph $G$ has a feasible flow if Inequalities 4.3.5 and 4.3.6 are satisfied
for interval $int^k$ and furthermore all PO-sets satisfy a utilization
constraint based on $|int^k|$. Note that we know from [13] that if graph $G$ has
a feasible flow, then it has an integral feasible flow which can be found by the
Ford-Fulkerson algorithm [13]. Finally, we will show that the utilization bound
of Inequalities 4.3.1 implies the utilization bound used in Lemma 4.3.5. Hence,
Proposition 4.3.1 holds.
Lemma 4.3.3. If there is an integral feasible flow in graph G, then there is a
feasible load set where $\forall \tau_i \in T : l^k_i = f_i$.
Proof. Given an integral feasible flow, $\forall \tau_i \in T$ let $l^k_i = f_i$. The following
inequality holds:
\[ \forall \tau_i \in T : \lfloor \mathrm{lag}(\tau_i, int_k) \rfloor \le l^k_i \le \lceil \mathrm{lag}(\tau_i, int_k) \rceil. \]
Thus the interval loads satisfy Inequality 4.3.4. We now have to prove that the
interval loads also satisfy Inequality 4.3.7. We prove this by induction over the
ordered set of vertices.
Base case: By the flow conservation constraint at vertex $v_1$, we have
\[ \sum_{\tau_i \in D_1} l^k_i = \sum_{\tau_i \in D_1} f_i = f^D_1. \]
Then, by the edge constraints of PO-set edge $g^D_1$, Inequality 4.3.7 holds for $D_1$:
\[ \lfloor \mathrm{lag}(D_1, int_k) \rfloor \le \sum_{\tau_i \in D_1} l^k_i \le |int_k|. \]
Induction case: Assume Inequality 4.3.7 is satisfied up to $D_{j-1}$ and the following
induction condition holds:
\[ \sum_{\tau_i \in D_{j-1}} f_i = f^D_{j-1}. \]
We prove that Inequality 4.3.7 and the induction condition are also satisfied for
$D_j$. Given the induction assumption and the flow conservation constraint at vertex
$v_j$, the following equalities hold:
\begin{align*}
\sum_{\tau_i \in D_j} l^k_i
  &= \sum_{\tau_i \in D_j \setminus D_{j-1}} f_i + \sum_{\tau_i \in D_j \cap D_{j-1}} f_i \\
  &= \sum_{\tau_i \in D_{j-1} \setminus D_j} f_i + f^D_j - f^D_{j-1} + \sum_{\tau_i \in D_j \cap D_{j-1}} f_i \\
  &= f^D_{j-1} + f^D_j - f^D_{j-1} = f^D_j.
\end{align*}
Then, by the edge constraints of PO-set edge $g^D_j$, Inequality 4.3.7 holds for $D_j$:
\[ \lfloor \mathrm{lag}(D_j, int_k) \rfloor \le \sum_{\tau_i \in D_j} l^k_i \le |int_k|. \]
Furthermore, the induction condition also holds:
\[ \sum_{\tau_i \in D_j} f_i = f^D_j. \]
This completes the proof.
The following lemma is necessary for later discussion and is a direct result
from the induction condition in the proof of Lemma 4.3.3.
Lemma 4.3.4. The following equalities hold for every vertex $v_j$:
\[ \sum_{\tau_i \in D_j} f_i = f^D_j. \]
Lemma 4.3.5. There exists a feasible flow in graph G if Inequalities 4.3.5 and
4.3.6 are satisfied for interval $int_k$ and furthermore the PO-set utilizations satisfy
the following condition:
\[ \forall D_j \subset T : u^D_j \le \frac{|int_k| - 1}{|int_k|}. \tag{4.3.10} \]
Proof. First note that Inequalities 4.3.5 are necessary for the edge constraints on
each PO-set edge (Inequality 4.3.8) to be satisfied. Let us construct a flow as
follows:
\begin{align*}
\forall \tau_i \in T &: f_i = \mathrm{lag}(\tau_i, int_k) \\
\forall D_j \subset T &: f^D_j = \mathrm{lag}(D_j, int_k).
\end{align*}
We have to prove that the constructed flow satisfies the edge constraints and
the flow conservation constraints. Given the constructed flow, it is easy to verify
that the edge constraints of each transaction edge (Inequality 4.3.9) and the left-side
edge constraints of each PO-set edge (Inequality 4.3.8) are satisfied. The right-side
edge constraints of each PO-set edge are satisfied because, by the definition
of the lag function and by Inequalities 4.3.6, before the execution of GenerateLoad
for interval $int_k$ we have the following:
\begin{align*}
\mathrm{lag}(D_j, int_k) &= u^D_j \cdot t_{k+1} - \sum_{\tau_i \in D_j} \sum_{x \in [0, t_k)} S(\tau_i, x) \\
  &\le u^D_j \cdot t_{k+1} - \lfloor u^D_j \cdot t_k \rfloor \\
  &< u^D_j \cdot t_{k+1} - u^D_j \cdot t_k + 1.
\end{align*}
Now by Inequalities 4.3.10, the following holds:
\[ \mathrm{lag}(D_j, int_k) < u^D_j \cdot t_{k+1} - u^D_j \cdot t_k + 1 \le |int_k|. \]
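The last step can be spelled out as follows. This is only a sketch, and it assumes that the interval length is $|int_k| = t_{k+1} - t_k$, which the notation suggests but this section does not restate:

```latex
\begin{align*}
u^D_j \le \frac{|int_k| - 1}{|int_k|}
  &\;\Longrightarrow\; u^D_j \cdot |int_k| \le |int_k| - 1 \\
  &\;\Longrightarrow\; u^D_j \cdot (t_{k+1} - t_k) + 1 \le |int_k| .
\end{align*}
```

Under that assumption, the left-hand side of the second line is exactly the upper bound on the lag derived above, so the right-side edge constraint follows.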
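It remains to verify that the flow conservation constraints at each vertex also hold.
By Lemma 4.3.4, the total flow value entering vertex $v_j$ can be calculated as:
\begin{align*}
\sum_{\tau_i \in D_j \setminus D_{j-1}} f_i + f^D_{j-1}
  &= f^D_j - \sum_{\tau_i \in D_j \cap D_{j-1}} f_i + f^D_{j-1} \\
  &= f^D_j - \sum_{\tau_i \in D_j \cap D_{j-1}} f_i + \sum_{\tau_i \in D_{j-1}} f_i \\
  &= f^D_j + \sum_{\tau_i \in D_{j-1} \setminus D_j} f_i,
\end{align*}
which equals the total flow value exiting $v_j$, i.e.:
\[ f^D_j + \sum_{\tau_i \in D_{j-1} \setminus D_j} f_i. \]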
We can finally state our main theorem.
Theorem 4.3.2. Transaction set T is schedulable by POGen if:
\[ \forall D_j \subset T : u^D_j \le \frac{L - 1}{L}. \]
Proof. Since $L \le \min_k(|int_k|)$ and $(x - 1)/x$ is increasing in $x$, Inequalities
4.3.10 hold. Assume that Inequalities
4.3.5 and 4.3.6 hold for a specific interval $int_k$. Then by Lemma 4.3.5 and [13],
the constructed graph G has an integral feasible flow. Hence, by Lemma 4.3.3,
algorithm GenerateLoad computes a feasible load set, which proves Proposition
4.3.1. Furthermore, since Inequalities 4.3.5 and 4.3.6 hold for every interval $int_k$
according to Lemma 4.3.1, it follows that Inequalities 4.3.4 and 4.3.7, and therefore
feasibility Conditions 1 and 2, also hold for every interval. This concludes
the proof.
Algorithm analysis: The time complexity of the Ford-Fulkerson algorithm,
which is also the time complexity of GenerateLoad, is $O(N \cdot |int_k|)$. Therefore,
the time complexity of POGen is $O(N^2 \cdot h)$.
4.3.4 Scheduling Circular Transaction Sets
Unfortunately, neither algorithm POBase nor POGen can be applied to circular
transaction sets. This is because the necessary condition of Theorem 4.2.1 is not a
sufficient condition for same-period circular transactions. As an example, consider
the transaction set in Figure 4.5. Assume that all transactions have transmission
times equal to 1 and periods equal to 2.
The utilization of each PO-set of this transaction set is 1, hence it satisfies
the necessary condition. However, it is easy to see that there exists no feasible
schedule. In fact, any valid schedule can have at most four transactions scheduled
in the first two slots; therefore, the fifth transaction misses its first deadline.
It is important to stress that in most cases, engineering approaches can be used
to replace a circular transaction set with a functionally equivalent non-circular
one. We believe that in many applications, software designers can avoid the trans-
action cycle by selecting the endpoints of transactions. That is possible because
endpoints are determined by the placement of software tasks which produce or
use the transaction data.
When this technique cannot be applied, we propose a solution to convert a
circular transaction set into a non-circular one as follows: select an element on
the ring; then, for each transaction τi that goes through the selected element, split τi
into two transactions such that one of them runs from the first endpoint of τi to the
selected element and the other runs from the selected element to the second endpoint
of τi. Note that the two transactions do not overlap, hence the cycle is broken at
the selected element. However, to respect the ordering between transactions, it might be
necessary to buffer the transaction data at the selected element for one full transaction period.
4.4 Chapter Conclusion
In this chapter I described the bus scheduling algorithm that was developed, along
with its mathematical proof. The real-time bus transaction and scheduling model
was also discussed, along with the basic terminology required to understand
the algorithm. Since the real-time scheduling algorithm and its derivation
were not my contribution, I have included only a brief discussion as background
material. For a more detailed overview, one can contact my research group (in
particular Bach D. Bui and Rodolfo Pellizzoni) in the Real-Time Laboratory at
the Department of Computer Science, University of Illinois.
CHAPTER 5
IMPLEMENTATION
This chapter explains the implementation of the bus scheduling framework for
real-time applications. It describes the rationale and functions used in each step of
the framework. As mentioned earlier, we chose to implement this framework on a
CBEA. The components of this architecture including PPE, SPEs, IOIFs, and MIF
are synchronized to an accuracy of nanoseconds, which is good enough for most of
the critical real-time applications. Further, as discussed earlier, due to its hardware
features CBEA overcomes important limitations of modern architectures.
The CBEA implementation platform used in this work is the Sony PlayStation
3 (PS3), which has 6 of its 8 SPEs enabled. The 7th SPE on the chip is disabled
for power purposes and the last SPE is used for running the hypervisor. Note
that all 8 SPEs are enabled on the Cell Blades. Of the two application-partitioning
models mentioned in Chapter 2, our implementation is based on the
PPE-centric model. In this type of model the PPE runs the main application, and
the individual jobs are loaded onto the SPEs for execution. The PPE waits while
the SPEs execute, and then coordinates the results returned by them. Furthermore,
within this framework I implemented the sub-model called the parallel stages model,
the one with serial data and parallel computation.
With the parallel stages model in mind, I developed four versions of the code (see
Figure 5.1), and from these I picked the interrupt based model where the PPE sends
the complete schedule table to all SPEs. This version was optimized to conduct
experiments.
In this chapter I describe the selected code in more detail. This scheduling
framework code is divided into two main parts - PPE side and SPE side, which
are explained in the following sections. Within each section, rationale and func-
tions for each step are discussed at length. In the end, using a flowchart, I explain
the flow of the implementation code, and provide an overview of the build envi-
ronment of this project.
Figure 5.1: Different Versions of Implemented Code
5.1 Scheduling Framework: PPE Side
On the PPE side, all the input arrays and variables are initialized first. Then the
schedule is entered in table form through the User Interface; this table is referred
to as the PPE Table. Next the PPE creates SPE contexts and loads them with this PPE Table.
5.1.1 Input Arrays and Variables
Before turning to the User Interface, let us look at the parameters required
for building the table based schedule, using an example. A data transaction
of size 100 kB is to be transferred from SPE-X (the source SPU id, which sends
the data) to SPE-Y (the destination SPU id, which receives the data). Based on
the experiments described in Chapter 6, the amount of data transferred per slot
is 10 kB, so the 100 kB data transaction has to be
divided into a series of 10 element transactions, each of size 10 kB. In other
words, the amount of data to be DMAed at a time is 10 kB, the size of one
element transaction.
The above example represents only one SPE to SPE DMA. When we place three
simultaneous data transactions of sizes 100 kB, 200 kB and 150 kB, transferring
data from SPE-X to SPE-Y, SPE-A to SPE-B and SPE-I to SPE-J respectively,
together in a table, that table is called the PPE table.
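The example above can be sketched in C. This is a minimal illustration and not the thesis code: the struct `ppe_table_entry`, the constant `ELEMENT_KB` and both helper functions are hypothetical names. Only the 10 kB element size and the source, destination and size fields come from the text. Rounding up for sizes that are not a multiple of 10 kB is an assumption.

```c
#include <stddef.h>

/* Size of one element transaction in kB (per-slot payload from the text). */
#define ELEMENT_KB 10u

/* Hypothetical row of the PPE table: one data transaction. */
struct ppe_table_entry {
    unsigned source_id;   /* SPE that sends the data */
    unsigned dest_id;     /* SPE that receives the data */
    unsigned data_kb;     /* total transaction size in kB */
};

/* Number of 10 kB element transactions needed for one entry.
 * Rounds up so a partial last element still gets a slot (assumption). */
unsigned element_count(const struct ppe_table_entry *e)
{
    return (e->data_kb + ELEMENT_KB - 1u) / ELEMENT_KB;
}

/* Total number of element transactions over a whole table. */
unsigned total_elements(const struct ppe_table_entry *table, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += element_count(&table[i]);
    return total;
}
```

With the three transactions of the example (100 kB, 200 kB and 150 kB), the helpers yield 10, 20 and 15 element transactions respectively.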
5.1.2 User Interface
The technique of picking the various combinations of SPEs for sending and re-
ceiving the data in specific slots is defined as a scheduling algorithm. In our
framework the schedules are built based on an algorithm that is computed offline.
The user is required to enter some details of the scheduling framework into the
arrays on the PPE side. They are described as follows:
/**** Initialize Schedule Table ****
***********************************/
/**** PLEASE ENTER THE FOLLOWING ****
************************************/
1. Number of SPEs (2-6):
************************
Number of SPEs, as the name suggests, is the number of SPEs for which you want
to create contexts; it can be 2, 3, 4, 5 or 6. That many
different SPEs take part in the schedule, and all of them need to be initialized. The
variable that holds this number is called num_spes.
2. Array Size (Number of entries in the
schedule table (1-30))
*************************************************
Array Size is the total number of entries you will enter into the PPE Table.
The value of the variable array_size defines the number of data transactions the
user wants to schedule on the SPEs. The maximum number of entries
in the User Interface is 30, because all the arrays defined in this implementation are
static arrays and the number of entries must stay within the provided range for the
application to work properly.
3. Data Size in Kilobytes (minimum being 10KB)
**********************************************
Data Size is the size of the data transaction DMAed from the source SPU id to
the destination SPU id. To keep the User Interface user friendly, we ask the
user to enter the data size in kB (where 1 kB = 1024 bytes). Our
implementation has a lower limit on the size of the data transaction, which
is 10 kB (the size of an element transaction), and it is advisable to keep the data
transaction a multiple of 10 kB, as this helps in analyzing and scheduling the
initialization and completion of element transactions. Through the User Interface the data
sizes are entered into an array called data_size[i]. These sizes are later converted
from kilobytes to bytes in the function size_conversion() and stored in the variable
spu_dma_data. The size of an element transaction is always 10 kB, irrespective of
the size of the data transaction entered in the User Interface. Data transactions
are divided into element transactions of size 10 kB on the SPU side. The element
transactions are scheduled in the interrupt handler function (described in a later
section). For example, if a data transaction of size 50 kB is to be transferred from
SPE1 to SPE3, then it is divided into five element transactions each of size 10 kB,
and transferred according to its schedule from SPE1.
If in the previous entry we entered the Array Size as 3, we have
three data transactions in our PPE Table. Thus, the data size in kB has to be
entered for all three data transactions; each is later converted to bytes in the
function size_conversion() (1024 * data_size).
4. Enter the Source SPU ids (0-5)
*********************************
As mentioned earlier, the source SPU ids are the SPEs that send the data. The
number of source SPU ids to be entered is equal to the Array Size value
entered previously. The array spu_source_id[i] holds the ids of the source SPEs.
5. Enter the Destination SPU IDs (0-5)
**************************************
Destination SPU ids are the SPEs that receive the data. For every source SPU id
entered above, we need to enter a corresponding destination SPU id. The array that
holds the destination SPU ids is spu_destn_id[i].
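The five inputs above lend themselves to a small validation and conversion step. The sketch below is not the thesis code: `size_conversion_sketch` and `table_is_valid` are hypothetical names, and the limits are taken from the prompts of the User Interface (2-6 SPEs, 1-30 entries, at least 10 kB, ids 0-5). The text only states that size_conversion() multiplies by 1024; everything else here is an assumption.

```c
#include <stddef.h>

#define MIN_SPES     2
#define MAX_SPES     6
#define MAX_ENTRIES  30
#define MIN_DATA_KB  10u
#define BYTES_PER_KB 1024u

/* Convert each data size from kB to bytes (1 kB = 1024 bytes),
 * mirroring what the text says size_conversion() does. */
void size_conversion_sketch(const unsigned *data_size_kb,
                            unsigned *size_bytes, size_t array_size)
{
    for (size_t i = 0; i < array_size; i++)
        size_bytes[i] = BYTES_PER_KB * data_size_kb[i];
}

/* Return 1 if the input respects the limits printed by the
 * User Interface prompts, 0 otherwise. */
int table_is_valid(int num_spes, int array_size,
                   const unsigned *data_size_kb,
                   const int *source_id, const int *dest_id)
{
    if (num_spes < MIN_SPES || num_spes > MAX_SPES)
        return 0;
    if (array_size < 1 || array_size > MAX_ENTRIES)
        return 0;
    for (int i = 0; i < array_size; i++) {
        if (data_size_kb[i] < MIN_DATA_KB)
            return 0;                       /* below one element transaction */
        if (source_id[i] < 0 || source_id[i] >= MAX_SPES)
            return 0;                       /* ids must be in 0-5 */
        if (dest_id[i] < 0 || dest_id[i] >= MAX_SPES)
            return 0;
    }
    return 1;
}
```

Checking the input before any SPE context is created keeps the static arrays within their bounds, which is the reason the text gives for the 30-entry limit.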
5.1.3 Creating SPE Contexts
After all the details of the data transactions are entered into the User Interface, the
PPE creates contexts for the number of SPEs the user requested (num_spes). As seen
in the code snippet below, the for loop runs from 0 to num_spes - 1, creating one
context per SPE.
/*** CREATE CONTEXT ***/
for (i = 0; i < num_spes; i++) {
    if ((id[i] = spe_context_create(SPE_MAP_PS, NULL)) == NULL) {
        perror("Failed creating context");
        exit(1);
    }
}
5.1.3.1 Loading the Program on SPE Contexts
After the contexts are created, the program to be run on the SPE is loaded onto
each context (refer to the piece of code below). In our implementation, all the SPEs
execute the same program, simple_spu. Note, however, that
although they run the same program and contain the same information, the SPEs
are scheduled to execute different schedules.
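/*** LOAD THE SPES WITH THE PROGRAM ***/
for (i = 0; i < num_spes; i++) {
    /* spe_program_load() returns 0 on success */
    if (spe_program_load(id[i], &simple_spu) != 0) {
        perror("Failed loading program");
        exit(1);
    }
}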
5.1.3.2 Running the SPEs
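Once the program is loaded, the next step is to start running the SPEs. The
libspe2 function spe_context_run() serves the purpose of running the SPEs for
which the contexts have already been created. Because spe_context_run() blocks
until the SPE program stops, it is called from a separate thread for each SPE,
which is why the snippet ends with pthread_exit().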
/*** SPE CONTEXT RUN FUNCTION ***/
if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
    perror("Failed running context");
    exit(1);
}
pthread_exit(NULL);
5.1.3.3 Wait for SPEs
After the SPEs have completed their execution, control returns to the
PPE, which has been waiting for them to complete their job. The function
pthread_join() is called once per thread and thus waits for all the SPEs that are running.
for (i = 0; i < num_spes; i++) {
    if (pthread_join(threads[i], NULL)) {
        perror("Failed pthread_join");
        exit(1);
    }
}
5.1.3.4 Destroy SPE Contexts
The function spe_context_destroy() destroys the SPE contexts created earlier. Before we
destroy the contexts, we need to wait for all the SPEs to complete their execution
and return control to the PPE. Once that is complete, the PPE destroys all
the contexts.
/*** Destroy context ***
************************/
for (i = 0; i < num_spes; i++) {
    if (spe_context_destroy(id[i]) != 0) {
        perror("Failed destroying context");
        exit(1);
    }
}
5.1.4 Control Block Structures
The next important section of the PPE code is the Control Block (CB) structure.
There are two CB structures - CONTROL_BLOCK_MAIN and
CONTROL_BLOCK_DATA_ADDR. The Control Block structures provide information
about SPE tables to their respective SPEs. SPEs use this information to
execute the data transactions.
5.1.4.1 CONTROL_BLOCK_MAIN
CONTROL_BLOCK_MAIN, as the name suggests, is the main CB structure.
It holds important information such as the transactions (source and destination SPU
ids), the size of the DMA between them, the DMA exist
value, etc. After the respective arrays are filled through the User Interface, the PPE
creates tables for the SPEs with complete information on all the entered transactions.
In addition to creating the CB for the SPEs, the PPE also writes the CB
addresses into the LS of the SPEs. Providing the CB address to the SPEs makes it
convenient for them to access all the information pertaining to their schedules.
5.1.4.2 CONTROL_BLOCK_DATA_ADDR
The CONTROL_BLOCK_DATA_ADDR structure is sent to all the SPEs from the
PPE. This structure has only one member, dma_addr_array[i], which holds the
local store addresses (LSAs) of all the SPEs. Each destination SPE requires the LSA of
its source SPE so that it can DMA data from that source SPE's LS.
5.1.5 DMA Exist Array
The DMA exist array is an array of type int and of size 6, as the maximum number
of SPEs that can be enabled is 6. The first element of the array is the DMA
exist value of SPE0, the second element is the DMA exist value of
SPE1, and so on up to SPE5. In our implementation, the SPEs that do not
receive any data (i.e., are not a destination for any transaction in the framework) are
not required to enter a certain part of the SPE code (the interrupt handler section). If
such SPEs are required to send data, they do their job and return control
to the PPE. Thus, the SPEs that do not receive any data need not execute
the complete SPE code. If an SPE receives data, its DMA exist value is
loaded into the respective array element as 7; otherwise it is loaded as 0. In other words,
the DMA exist value tells the SPE whether it is a destination for some data transaction. On
the PPE side, the DMA exist array is filled based on the PPE table entered through
the User Interface. All SPEs receive their DMA exist value (either 7 or 0)
through their respective CBs.
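Filling the DMA exist array from the destination column of the PPE table can be sketched as follows. The function name `fill_dma_exist` and the macro names are hypothetical; only the array size of 6 and the values 7 and 0 come from the text.

```c
#include <stddef.h>

#define MAX_SPES      6
#define DMA_EXIST_YES 7   /* value the text uses for "receives data" */
#define DMA_EXIST_NO  0

/* Fill dma_exist[0..MAX_SPES-1] from the destination column of the
 * PPE table: 7 if the SPE is a destination of any entry, else 0. */
void fill_dma_exist(const int *dest_id, size_t array_size,
                    int dma_exist[MAX_SPES])
{
    for (int s = 0; s < MAX_SPES; s++)
        dma_exist[s] = DMA_EXIST_NO;
    for (size_t i = 0; i < array_size; i++) {
        /* ignore ids outside 0-5 instead of writing out of bounds */
        if (dest_id[i] >= 0 && dest_id[i] < MAX_SPES)
            dma_exist[dest_id[i]] = DMA_EXIST_YES;
    }
}
```

An SPE whose entry is 0 can then skip the interrupt handler section, as described above.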
Another version of the implementation was also developed in which the PPE creates
SPE contexts and loads each SPE with its individual SPE Table. These
SPE Tables are derived from the PPE Table and contain only those schedule entries
in which the respective SPE is involved.
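Deriving an individual SPE Table from the PPE Table amounts to filtering the entries by SPE id. The sketch below shows one possible way to do this; `build_spe_table` and its parameters are hypothetical and do not reflect the actual data layout of the thesis code.

```c
#include <stddef.h>

/* Collect the indices of the PPE table entries that involve SPE `spe`
 * as source or destination. Returns how many indices were written
 * to out[], which must have room for array_size entries. */
size_t build_spe_table(int spe, const int *source_id, const int *dest_id,
                       size_t array_size, size_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < array_size; i++) {
        if (source_id[i] == spe || dest_id[i] == spe)
            out[n++] = i;
    }
    return n;
}
```

Storing indices rather than copies keeps each SPE Table small, at the cost of one extra lookup into the PPE Table per entry.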
5.2 Scheduling Framework: SPE Side
SPEs are responsible for compute-intensive tasks. It is critical to implement a
framework that ensures data transfer among the SPEs is fast and efficient. Keeping
this in mind, I developed two types of framework (see Figure 5.1) - polling based
and interrupt based. I focused on the latter because experiments showed it
to be faster, better optimized and more efficient.
5.2.1 Types of Framework
Initially I implemented a polling based scheduling framework. However, after
analyzing its performance several disadvantages were identified. Consequently,
I later switched to an interrupt-based scheduling framework. These are briefly
discussed below.
5.2.1.1 Scheduling Framework : Polling Based
In the initial implementation of the scheduling framework, the program on the
SPE side was polling based: the SPEs poll the time base register,
waiting for the correct time to execute a DMA.
Polling has two disadvantages: (1) while the program polls
the time base register the SPEs cannot perform any work, and (2) the SPEs waste a lot
of computation time waiting for the correct time to start the DMAs. In the newer
version of the implementation we have enhanced the code so that programs on
the SPE side, besides running the schedule (transactions) and transferring data at
the precise time required, can also do their work. This has been accomplished
by making the SPE program interrupt based. In addition to letting the SPEs do their
work, the interrupt handler framework also increases the predictability and
robustness of the code and the application as a whole.
5.2.1.2 Scheduling Framework: Interrupt Based
As mentioned above, the scheduling framework implemented is interrupt based.
To implement the interrupts we need to take advantage of certain hardware and
software features of the CBEA, such as the interrupt handlers, the barrier function, and
touching all pages of the target buffer. Making the framework interrupt based enables
the SPEs to do their work (described in the work function below) when they are
not scheduled to execute any transaction. This prevents the SPEs from going into
an idle state (in the Cell Broadband Engine Architecture, idle state means going into
a sleep mode). If SPEs go into an idle state they need to be awakened at the required
time to execute the transactions, and reawakening the SPEs is considered
an expensive job on the Cell BE platform. Thus, we should try to avoid it under all
circumstances.
Another important requirement when transferring data
among SPEs is that all SPEs begin executing data transactions at the
same time in the required slot. To enable this I have developed a barrier function
(described in the sections below) that holds the SPEs back until they are all
ready to execute simultaneously. The Scheduling Slot structure, as the name suggests,
provides each SPE with the information of its schedule (when it needs to
transfer data and when not). The work the SPEs are required to do is provided in the
Work Function.
While transferring data from one SPE to another, accesses to
the main memory are sometimes made, which is not only time consuming but also results in
TLB and cache misses. To avoid these misses the code was enhanced, as
described in Section 5.2.1.6.
Interrupt Handlers The SPU Timer Library [14] of the CBE software development kit (SDK)
provides services such as timers and a virtual clock for SPU programs, which
help us implement the interrupt handler.
The Interval Timers in the SPU Timer Library enable us to register a user-defined
handler that is called at a specific interval, whose value is also set by
the user. The interval is the time after which the Interval Timer expires
and sends an interrupt to the SPE. In our framework the Interval Timer is set to
1 ms; that is, the SPE receives an interrupt every 1 ms (called a slot). When the
SPE receives an interrupt, it enters its interrupt handler and executes the next
data transaction based on the schedule. After the data transaction is executed,
the SPE restores its state and resumes its work. This continues until all the data
transactions on the SPE side are completed. The virtual clock is a simple, software
managed, monotonically increasing time base counter. It advances by
79 800 000 counts per second; this number is also called the time base frequency
(79 800 000 Hz).
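To make the relation between the time base frequency and the 1 ms slot concrete, the following sketch converts an interval to time base ticks. The helper names are hypothetical and the code does not use the SPU Timer Library; it only reproduces the arithmetic implied by the numbers quoted above.

```c
#include <stdint.h>

/* Time base frequency quoted in the text (ticks per second). */
#define TIMEBASE_HZ 79800000ull

/* Number of time base ticks in an interval given in microseconds. */
uint64_t us_to_ticks(uint64_t interval_us)
{
    return (TIMEBASE_HZ * interval_us) / 1000000ull;
}

/* Index of the slot that contains a given tick count, for a fixed
 * slot length in ticks (slot_ticks must be nonzero). */
uint64_t slot_index(uint64_t now_ticks, uint64_t slot_ticks)
{
    return now_ticks / slot_ticks;
}
```

With these numbers a 1 ms slot corresponds to 79 800 ticks of the time base.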
For servicing the timer requests, the clock and timer services require the use of
first and second level interrupt handlers (FLIH and SLIH, respectively). The SPU
library provides both FLIH and SLIH for handling decrementer interrupt. The use
of the library-supplied SLIH is required for using the clock and timer services.
First Level Interrupt Handler (FLIH) For implementing our framework,
we have used the library supplied FLIH. To use it we need to call the provided
spu_slih_reg() service to register spu_clock_slih() as the SLIH for the
MFC_DECREMENTER_EVENT. This service is a part of the library FLIH and
the symbol reference to it causes it to be linked into the application. However,
if we wish to use our own FLIH, we must register spu_clock_slih() using our
own mechanism. In our implementation, we have used our own FLIH, and I have
registered it in spu_slih_reg(). The decrement counter has been given a value of
1 ms. Thus, every 1 ms the program gets an interrupt, enters the FLIH
and executes the code in the handler. After that, the decrement count is
reset to the original value, the timer is restarted, and execution leaves
the handler. This continues until the interrupts are disabled, which can be done using
the i_disable() function. Similarly, when we initialize the decrement counter and
start the timer, we enable the interrupts as well (for the program to get interrupts)
using the i_enable() function.
Second Level Interrupt Handler (SLIH) The spu_slih_reg, as mentioned
above, is the SPU second level interrupt handler (SLIH) manager. The file
spu_slih_reg.h consists of an interrupt handler dispatch table (spu_slih_handlers),
an interrupt handler registration routine (spu_slih_reg), and a default interrupt
handler (spu_default_slih). This file is readily available on the CBEA platform.
5.2.1.3 Barrier Function
As the name suggests, the barrier_heavy function provides a barrier: it forces
the SPEs to wait until all of them are ready to start executing at
the same time. Even if one SPE is delayed for some reason, all the other ready
SPEs have to wait for it. This function is important because it is necessary to
make sure that all the SPEs start at the same instant (have all SPEs start their
DMAs at the same time), and also because it helps calculate the time taken by these
DMAs more accurately. Thus, the purpose of the barrier_heavy function is to
provide robustness and to maintain synchrony so that all SPEs
start simultaneously. The file barrier_heavy.h is a part of the SPE code, and the
function barrier_heavy receives the value of nprocs or num_spes from the PPE.
5.2.1.4 Work Function
Besides the interrupt handler, the work function work() is an important part of the
SPE code. In this function the SPE performs its actual work. Every 1 ms the SPE
receives an interrupt from the Interval Timer. In response to the interrupt, the
SPE stops doing its work, there is a context switch, and the code for the interrupt
is loaded and executed. After servicing the interrupt, execution returns
to the point at which it was interrupted. The SPE resumes its work
from there (in the work() function) until it is interrupted again after 1 ms.
5.2.1.5 SCHEDULING_SLOT Structure
The SCHEDULING_SLOT structure holds the schedule for all destination SPU ids.
The structure has slots which are scheduled based on the algorithm that is computed
offline. Some slots may be scheduled while others are not.
The amount of data scheduled in each slot is called the element transaction, which is
fixed at 10 kB.
5.2.1.6 Touch all Pages of Target Buffer
Whenever there is a translation lookaside
buffer (TLB) miss or a page table miss, the missing information has to be fetched from
main memory, which takes a large number of cycles in comparison to a fetch
from the cache. This delays the task at hand and can cause deadline misses. The
same applies to our case: a TLB miss can delay the
execution of an element transaction, which in turn can cause deadline misses. To
mitigate this problem we made sure that the scheduling framework touches all the
pages of the target buffer (data to be DMAed from the source SPE) well
before the actual access to the data happens. This way the page translations are
already in the TLB when the data is accessed, so the access is expected to hit and
unnecessary delays can be avoided.
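The idea of touching every page ahead of time can be sketched in plain C. The function `touch_all_pages` is a hypothetical name and the page size is a parameter the caller must supply (4096 bytes in the usage note is only an example); the actual thesis code may proceed differently.

```c
#include <stddef.h>

/* Read one byte at every page_size stride of buf so that the address
 * translation of each page is resolved before the time-critical access.
 * Returns the number of strided reads performed. */
size_t touch_all_pages(const volatile unsigned char *buf, size_t len,
                       size_t page_size)
{
    size_t touched = 0;
    unsigned char sink = 0;

    if (len == 0 || page_size == 0)
        return 0;
    for (size_t off = 0; off < len; off += page_size) {
        sink ^= buf[off];      /* volatile read: cannot be optimized away */
        touched++;
    }
    sink ^= buf[len - 1];      /* covers the last page of an unaligned buffer */
    (void)sink;
    return touched;
}
```

For a 10 kB element transaction and an assumed page size of 4096 bytes, the loop performs three strided reads.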
5.3 Development Structure
The development structure or the code structure describes how the application
code is arranged. The application is divided into two main directories - PPU (the
main directory containing the PPE code) and SPU (a subdirectory of PPU). The
PPU directory contains the file simple.c, which creates the contexts for the SPEs,
loads the program (simple_spu) onto the SPE contexts and starts running them.
The SPU subdirectory contains the file simple_spu.c, which has the code for executing
data transactions (transferring data at the scheduled time). It also contains three
more files - barrier_heavy.h, spu_slih_reg.c and spu_slih_reg.h - whose purposes
have already been described in detail above. There are two main Makefiles in the
application - one is the PPU Makefile and the other is the SPU Makefile.
5.3.1 Compiling and Running
Once you are in the main directory (PPU), compile the application by
typing make clean; make and pressing enter. This creates a binary ELF (Executable
and Linkable Format) image named simple; the name of the ELF image
can be anything of your choice. As the PPE and SPE have different instruction sets,
they are compiled separately. The SPU directory is compiled first and
its executable is added to the PPU library. Then the PPU Makefile compiles the
PPU directory and generates the final ELF image - simple. To run the application
you need to type ./<binary ELF image name>, which in our case is simple.
Typing ./simple runs the application.
5.4 Flow of Implementation Code
We now explain the flow of the implementation code using an example. This example
executes the schedules on the framework from the schedule table that was
generated offline using the real-time algorithm. Please refer to Figure 5.2 to better
follow the example.
Figure 5.2: Flow of Implementation Code
On the PPE side, the schedule table that is generated offline is uploaded into
PPE through the User Interface as shown in Figure 5.3.
Figure 5.3: Snapshot of User Interface
After uploading the schedule table, the PPE creates contexts for the SPEs and loads
the program context (which contains code and data) into them. Next, the PPE sends the CB
structures (MAIN and DATA_ADDR) containing the PPE table and details for
DMAing to all the SPEs.
Table 5.1: Schedule Table
SNo.   Source SPU id   Destn SPU id   Execution time (slots)
1.     0               2              3
On the SPE side, the SPEs receive the CB structures containing the schedule
table (see Table 5.1 for an example). After the source
SPEs load data into their LS and the destination SPEs touch all pages of the target
buffer, the SPEs go into their work functions. In the work function all SPEs do their
work until the Interval Timer expires
(every 1 ms) and each SPE goes into its own interrupt handler. The scheduling slot
structure contains the schedule details the SPEs need for transferring data in
every slot.
After entering the interrupt handler, the SPE checks the schedule table to see
whether it has a DMA scheduled. If so, the timer is first reset, and SPE0 does an
mfc_get of 10 kB from SPE2 and returns control to the work function. If it does
not need to do a DMA, the timer is reset and control still returns to the work function,
where the SPE continues doing its work until it is interrupted again. This repeats
for all the slots of the schedule. Once scheduling is complete,
the SPEs return control to the PPE, which waits for all the SPEs and destroys
their contexts before exiting.
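The per-slot decision made in the interrupt handler can be simulated on any host. The sketch below is an illustration under stated assumptions, not the SPE code: `slot_schedule`, `spe_state` and `on_timer_expiry` are hypothetical names, and memcpy stands in for the DMA that the real handler issues with mfc_get.

```c
#include <stddef.h>
#include <string.h>

#define ELEMENT_BYTES (10u * 1024u)   /* one element transaction */

/* Hypothetical per-slot schedule of one SPE:
 * scheduled[s] != 0 means "transfer one element in slot s". */
struct slot_schedule {
    const unsigned char *scheduled;
    size_t num_slots;
};

/* Hypothetical bookkeeping of one SPE. */
struct spe_state {
    size_t next_slot;        /* slot the next timer expiry belongs to */
    size_t bytes_received;   /* payload copied so far */
};

/* What the handler does on one timer expiry. memcpy stands in for the
 * DMA (mfc_get on real hardware). Returns 1 if a transfer was done. */
int on_timer_expiry(struct spe_state *st, const struct slot_schedule *sch,
                    unsigned char *dst, const unsigned char *src,
                    size_t total_bytes)
{
    int did_dma = 0;

    if (st->next_slot < sch->num_slots &&
        sch->scheduled[st->next_slot] &&
        st->bytes_received + ELEMENT_BYTES <= total_bytes) {
        memcpy(dst + st->bytes_received, src + st->bytes_received,
               ELEMENT_BYTES);
        st->bytes_received += ELEMENT_BYTES;
        did_dma = 1;
    }
    st->next_slot++;         /* the timer is re-armed for the next slot */
    return did_dma;
}
```

For the transaction of Table 5.1 (execution time of 3 slots), three expiries with a scheduled slot move 30 kB in total; expiries in unscheduled slots only advance the slot counter and return to the work function.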
5.5 Chapter Conclusion
In this chapter, I described the interrupt based scheduling framework in detail. I
talked about the different versions of code that were implemented during the entire
project. I also illustrated the flow of code by using an example.
CHAPTER 6
EXPERIMENTAL RESULTS,
CONCLUSION AND FUTURE WORK
In this chapter, we discuss the experiments conducted with the bus scheduling en-
gine on a real system and analyze the obtained results. More importantly, we show
the effect of scheduling enforcement on the real-time behavior of bus transactions.
6.1 Experimental Setup
As already mentioned in Chapter 5, I chose the Sony PlayStation 3 (PS3) as the experimental
platform. I implemented a scheduling engine, one instance of which
runs on each SPE. The engine is in fact a timer based interrupt
handler. An interval timer expires and fires an interrupt at the beginning of
each time slot, and the scheduling engine then makes a scheduling decision for the current
slot. The scheduling decision is made either from a scheduling table that is generated
off-line by the POGen algorithm or by executing the algorithm itself. I chose
the table based approach in the following experiments. If a transaction is scheduled
in a slot, a DMA packet of the transaction with a given size is transferred in
that slot. The size of a DMA packet is dictated by the bus bandwidth and the slot
size.
Although the CBEA bus has two rings in each direction, there exist transaction
layouts that use only one ring in each direction. I chose one of those layouts
for my experiments because the algorithm assumes that the EIB has one ring in
each direction. In addition, all transactions in the experiments have SPEs as
endpoints.
In the first experiment, I determined the slot size and the amount of data to be
transferred in each slot by measuring the transmission time of packets of
various sizes. The second experiment shows the effect of real-time scheduling
enforcement.
Table 6.1: DMA Transmission Time
DMA size (kB)                8    10    12    14    16
Transmission time (ticks)   34    41    48    54    60
6.2 Experimental Results
The first experiment determined the slot size and the amount of data to be
transferred in each slot by measuring the transmission time of DMA packets of
various sizes. The results are shown in Table 6.1, where a tick is an SPE
real-time clock tick and is equal to 12.5 ns. The transmission times are rounded
up to the next integer number of ticks. Table 6.1 shows that the bus achieves a
higher bandwidth when the DMA packet is larger. The same measurements also show
that an SPE spends 2 ticks executing the timer interrupt handler to transfer a
DMA packet. Thus, in the following experiments I chose a slot size of 43 ticks
and a DMA packet size of 10 kB per slot. The slot size comprises 41 ticks of
transmission time for a 10 kB DMA packet and 2 ticks of interrupt handler
processing overhead.
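The slot-size arithmetic can be checked with a short sketch. The values come from Table 6.1 and the text; the type and function names (dma_sample_t, slot_ticks, slot_ns, kb_per_tick, throughput_increases) are hypothetical, and applying the 2-tick handler overhead to packet sizes other than 10 kB is an assumption made only for illustration.

```c
#include <stddef.h>

#define TICK_NS 12.5          /* SPE clock tick length from the text, in ns */
#define HANDLER_TICKS 2u      /* interrupt handler overhead from the text */

/* One column of Table 6.1: DMA size in kB and transmission time in ticks. */
typedef struct {
    unsigned size_kb;
    unsigned ticks;
} dma_sample_t;

static const dma_sample_t table_6_1[] = {
    {8, 34}, {10, 41}, {12, 48}, {14, 54}, {16, 60},
};
static const size_t table_6_1_len = sizeof table_6_1 / sizeof table_6_1[0];

/* Slot length in ticks: transmission time plus handler overhead. */
unsigned slot_ticks(dma_sample_t s)
{
    return s.ticks + HANDLER_TICKS;
}

/* Slot length in nanoseconds. */
double slot_ns(dma_sample_t s)
{
    return slot_ticks(s) * TICK_NS;
}

/* Throughput of the transfer itself, in kB per tick. */
double kb_per_tick(dma_sample_t s)
{
    return (double)s.size_kb / (double)s.ticks;
}

/* Returns 1 if throughput strictly increases with packet size in the table. */
int throughput_increases(void)
{
    for (size_t i = 1; i < table_6_1_len; i++) {
        if (kb_per_tick(table_6_1[i]) <= kb_per_tick(table_6_1[i - 1])) {
            return 0;
        }
    }
    return 1;
}
```

For the 10 kB entry, slot_ticks gives 41 + 2 = 43 ticks, which at 12.5 ns per tick is 537.5 ns per slot.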
Figure 6.1: Experimental Transaction Set
Next, to show the effect of real-time scheduling enforcement, I performed an
experiment with a set of five transactions whose layout is shown in Figure 6.1.
Their timing parameters (the time unit is the number of slots) are
τ1 : e1 = 4, p1 = 20; τ2 : e2 = 6, p2 = 10; τ3 : e3 = 6, p3 = 60;
τ4 : e4 = 6, p4 = 10; τ5 : e5 = 4, p5 = 20. The transaction set has L = 10 and
two PO-sets: D1 = {τ1, τ2, τ3} and D2 = {τ3, τ4, τ5}. The utilization of each
PO-set is 4/20 + 6/10 + 6/60 = 9/10 = (L−1)/L. In the first run, each periodic
transaction is initiated as soon as it arrives in the system; hence the
transactions are scheduled by the low-level round-robin bus arbiter. As a
result, transactions τ2 and τ4 miss their deadlines, with a maximum relative
completion time of 12 slots as opposed to their relative deadline of 10 slots.
When the transaction set is scheduled by POGen, all transactions meet their
deadlines.
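The PO-set utilizations quoted above can be verified with integer arithmetic. The sketch below uses the timing parameters from the text; the type and function names (transaction_t, busy_slots, utilization_is_bound) and the use of a 60-slot hyperperiod are illustrative assumptions, not part of the thesis code.

```c
#include <stddef.h>

/* A periodic transaction: e slots of transfer every p slots. */
typedef struct {
    unsigned e;
    unsigned p;
} transaction_t;

/* Timing parameters quoted in the text (tau1..tau5), in slots. */
static const transaction_t tau[5] = {
    {4, 20}, {6, 10}, {6, 60}, {6, 10}, {4, 20},
};

/* PO-sets from the text, as zero-based indices into tau[]. */
static const size_t D1[] = {0, 1, 2};
static const size_t D2[] = {2, 3, 4};

#define HYPERPERIOD 60u  /* least common multiple of the periods 10, 20, 60 */

/*
 * Utilization of a set scaled by HYPERPERIOD, i.e. the number of busy slots
 * per hyperperiod. Integer arithmetic avoids floating-point rounding.
 */
unsigned busy_slots(const size_t *set, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        const transaction_t t = tau[set[i]];
        sum += t.e * (HYPERPERIOD / t.p);
    }
    return sum;
}

/* Nonzero if the set's utilization equals (L - 1) / L. */
int utilization_is_bound(const size_t *set, size_t n, unsigned L)
{
    return busy_slots(set, n) * L == (L - 1) * HYPERPERIOD;
}
```

Both D1 and D2 occupy 54 of every 60 slots, which is 9/10 and matches (L−1)/L for L = 10.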
6.3 Conclusion
The main research issue is how to provide software designers with (1) a
practical and accurate abstraction of the real scheduling problem on a
multiprocessor bus and (2) an effective scheduling methodology that maximizes
multiprocessor bus utilization. The present research addresses this problem on
a specific multiprocessor bus architecture, the CBEA. In this thesis, I
developed an interrupt-based scheduling framework that abstracts away the
low-level physical bus implementation and provides a platform on which to
implement and test the performance of a class of scheduling algorithms. I have
also conducted experiments showing that the real-time transactions of feasible
transaction sets complete before their deadlines when scheduled according to a
real-time scheduling algorithm, while the same transactions can miss their
deadlines when scheduled according to an arbitrary (non-real-time) scheduling
policy.
I also presented a detailed discussion of the hardware and software features of
the Cell processor and its advantages compared to other modern computer
architectures.
6.4 Future Work
In future work, we would like to (1) schedule the data transactions on-line
rather than computing the schedule off-line, which requires implementing a
scheduling algorithm within the scheduling framework itself; and (2) address
the issue of transforming circular transaction sets into non-circular ones, in
order to apply the proposed scheduling framework to circular transaction sets.
APPENDIX A
IMPLEMENTATION CODE AND BIT
ORDERING IN CELL PROCESSOR
A.1 PPU: simple.c
/* ---------------------------------------------------- */
/* All Rights Reserved.                                 */
/*                                                      */
/* University of Illinois at Urbana Champaign           */
/* Department of Computer Science                       */
/* Real-Time Systems Laboratory                         */
/*                                                      */
/* - Deepti Kumar Chivukula - dchivuk2@illinois.edu     */
/*                                                      */
/* BUS SCHEDULING ON THE CELL PROCESSOR                 */
/*                                                      */
/* - File Name : simple.c                               */
/*                                                      */
/* This is the PPE side code                            */
/* - initializes all the arrays,                        */
/* - includes the 'User Interface' for the user         */
/*   to enter the parameters of the tasks.              */
/* - creates, loads and runs the SPEs threads           */
/*   and wait for them to complete. Also destroys       */
/*   them in the end.                                   */
/* - contains the Control Block structures to           */
/*   create SPE tables                                  */
/*                                                      */
/* ---------------------------------------------------- */
/* PROLOG END TAG zYx                                   */
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <errno.h>
#include <libspe2.h>
#include <pthread.h>
#include <sys/types.h>
#include <libmisc.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/time.h>
#include <dirent.h>
/*** Constants ***/
#define ARRAY_SIZE 30       // All arrays are static and of type int - max value is 30
#define NUM_ELEMENTS 4096   // Maximum DMA size is 16KB [(4096 * 4) = 16KB]
#define NUM_SPES 6          // Max number of SPEs that can be enabled is 6
#define CACHE_LINE_SIZE 128 // Required for file barrier_heavy.h

/*** Name of SPE thread created - simple_spu ***/
extern spe_program_handle_t simple_spu;

/**** DECLARING MAIN Control Block (CB) ****/
typedef struct {
    unsigned long *spu_source_ls;
    unsigned long *spu_destn_ls;
    long spu_dma_data_array[ARRAY_SIZE];
    long spu_source_ls_array[ARRAY_SIZE];
    long spu_destn_ls_array[ARRAY_SIZE];
    long spu_source_id_array[ARRAY_SIZE];
    long spu_destn_id_array[ARRAY_SIZE];
    long spu_destn_id_numbers[ARRAY_SIZE];
    long spu_source_id_numbers[ARRAY_SIZE];
    long periods[ARRAY_SIZE];
    long spu_num_jobs[ARRAY_SIZE];
    int spu_array_size;
    int task;
    int dma_exist;
    int nprocs;
    int spe_id_cb;
    int *bar_ptr;
    int id_spu;
    unsigned char pad[36];
} CONTROL_BLOCK_MAIN;

/*** CB structure for Local Store Address ***/
typedef struct {
    long dma_addr_array[ARRAY_SIZE];
    unsigned char pad[8];
} CONTROL_BLOCK_DATA_ADDR;

/*** Make CB address - 128-byte aligned ***/
CONTROL_BLOCK_MAIN cb __attribute__((aligned(128)));
CONTROL_BLOCK_DATA_ADDR cb_data __attribute__((aligned(128)));

/*** Cache-line sized blocks for use in barrier calls - 128-byte aligned ***/
static unsigned int bar[CACHE_LINE_SIZE / sizeof(unsigned int)]
    __attribute__((aligned(128)));
/*** Declaring Global Variables ***/
unsigned int id[NUM_SPES];
pthread_t threads[NUM_SPES];
int array_size;
int num_spes;
int num_destination_ids;

/**** Source and Destn SPU values for the schedule table ****/
long spu_source_ls[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long spu_destn_ls[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long spu_source_control[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long spu_destn_control[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long spu_dma_data[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long data_size[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long spu_dma_data_array[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long dma_addr[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long spu_source_id[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long source_ids[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long destn_ids[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long spu_destn_id[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long ppe_periods[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long num_jobs[ARRAY_SIZE] =
    {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
long dma_exist_array[NUM_SPES] = {-1, -1, -1, -1, -1, -1};
/*** SPE CONTEXT RUN FUNCTION ***/
void *ppu_pthread_function(void *arg) {
    spe_context_ptr_t ctx;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    ctx = *((spe_context_ptr_t *)arg);
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
        perror("Failed running context");
        exit(1);
    }
    pthread_exit(NULL);
}

/*** Initialize all the arrays ***/
void initialize() {
    int i;
    for (i = 0; i < array_size; i++) {
        cb.spu_dma_data_array[i] = -1;
        cb.spu_source_ls_array[i] = -1;
        cb.spu_destn_ls_array[i] = -1;
        cb.spu_source_id_array[i] = -1;
        cb.spu_destn_id_array[i] = -1;
        cb.spu_destn_id_numbers[i] = -1;
        cb_data.dma_addr_array[i] = -1;
        cb.periods[i] = -1;
        cb.spu_num_jobs[i] = -1;
    }
}

/*** To convert the Size of data from kilobytes to bytes ***/
void size_convert() {
    int i = 0;
    for (i = 0; i < array_size; i++) {
        spu_dma_data[i] = (data_size[i] * 1024);
    }
}

/*** Period conversion from millisecs to number of ticks ***/
void period_conversion() {
    int i;
    // one ms has 79800 ticks
    for (i = 0; i < array_size; i++) {
        ppe_periods[i] = (79800 / ppe_periods[i]);
    }
}

/*** Function to calculate number of jobs in transactions ***/
void num_of_jobs_calc() {
    int r;
    for (r = 0; r < array_size; r++) {
        num_jobs[r] = (unsigned int)(data_size[r] / 10);
        // we have assumed size of job to be 10KB
    }
}
/*** THE MAIN FUNCTION ***/
int main()
{
    int i = 0, l = 0, j = 0, k = 0;
    int temp_source;
    int temp_destn;
    int rc;
    unsigned int cb_array[1];
    unsigned int cb_data_addr[1];
    unsigned long garbage;
    unsigned long spe_dma_data_addr[NUM_SPES];
    unsigned long spu_source_ls_dma[NUM_SPES];
    unsigned long spe_dma_addr[NUM_SPES];
    int start_schedule[1] = {1000};
    unsigned int temp_id[NUM_SPES];
    int c_garbage = 0;
    int c_garbage_cb_lsa = 0;
    int getBarID;

    /*** Fill in CB ****
         - with addr of the data array
         - with the address of the cb
    ***/
    cb_array[0] = &cb;
    cb_data_addr[0] = &cb_data;
    /*** USER INTERFACE FOR FILLING PARAMETERS OF THE TASKS ***/
    printf("*****************************\n");
    printf("*****************************\n");
    printf("THE USER INTERFACE\n");
    printf("*****************************\n");
    printf("*****************************\n\n");
    printf("Please Enter The Following:\n");
    printf("****************************\n\n");
    printf("1. Number of SPEs (creation of contexts) (1-6):\n");
    printf("**********************************************\n");
    scanf("%d", &num_spes);
    fflush(stdin);
    printf("\n");
    fflush(stdout);

    printf("2. Array Size (no. of entries or data trans):\n");
    printf("********************************************\n");
    scanf("%d", &array_size);
    fflush(stdin);
    printf("\n");
    fflush(stdout);

    printf("3. Size of data to be DMAed\n");
    printf(" - (minimum 10K)\n");
    printf(" - value should be a multiple of 10\n");
    printf("***************************************\n");
    for (i = 0; i < array_size; i++) {
        scanf("%d", &data_size[i]);
        fflush(stdin);
    }
    printf("\n");
    fflush(stdout);

    printf("4. Enter the Source SPE ids (0-5)\n");
    printf("*********************************\n");
    for (i = 0; i < array_size; i++) {
        scanf("%d", &spu_source_id[i]);
        fflush(stdin);
    }
    printf("\n");
    fflush(stdout);

    printf("5. Enter the Destinations SPE ids (0-5)\n");
    printf("***************************************\n");
    for (i = 0; i < array_size; i++) {
        scanf("%d", &spu_destn_id[i]);
        fflush(stdin);
    }
    printf("\n");
    fflush(stdout);
    /*** Filling the DMA_EXIST array for the Destn Spus ***/
    for (i = 0; i < num_spes; i++) {
        for (j = 0; j < array_size; j++) {
            if (spu_destn_id[j] == i)
                dma_exist_array[i] = 7;
        }
    }

    /*** Initialize all the arrays used to fill CB ***/
    initialize();

    /*** Convert Size in KB to Number of elements for array ***/
    size_convert();

    /*** To calculate number of jobs in every data transaction ***/
    num_of_jobs_calc();

    /*** Do SPE context create and load it into SPE ids (0-6) ***/
    for (i = 0; i < num_spes; i++) {
        /*** Create contexts ***/
        if ((id[i] = spe_context_create(SPE_MAP_PS, NULL)) == NULL)
        {
            perror("Failed creating context");
            exit(1);
        }
        /*** load the SPEs with the program - simple_spu ***/
        spe_program_load(id[i], &simple_spu);
    }

    /*** Pthread Creation ***/
    for (i = 0; i < num_spes; i++) {
        /*** Do PPU thread create and context run ***/
        if (pthread_create(&threads[i], NULL, &ppu_pthread_function, &id[i])) {
            perror("Failed creating thread");
            exit(1);
        }
    }
    /*** Fill the Source and Destination arrays according to Schedule Table ***/
    for (i = 0; i < array_size; i++)
    {
        temp_source = spu_source_id[i];
        source_ids[i] = spu_source_id[i];
        spu_source_id[i] = id[temp_source];
        temp_destn = spu_destn_id[i];
        // required for comparison on SPE side to do mfc_get on destn ids
        destn_ids[i] = spu_destn_id[i];
        spu_destn_id[i] = id[temp_destn];

        /*** load the array with ls of the spes according to schedule table ***/
        spu_source_ls[i] = spe_ls_area_get(spu_source_id[i]);
        spu_destn_ls[i] = spe_ls_area_get(spu_destn_id[i]);
        spu_source_control[i] = (spe_spu_control_area_t *)
            spe_ps_area_get(spu_source_id[i], SPE_CONTROL_AREA);
        spu_destn_control[i] = (spe_spu_control_area_t *)
            spe_ps_area_get(spu_destn_id[i], SPE_CONTROL_AREA);
    }
    /*** Loading SPEs with its respective CB ***/
    cb.spe_id_cb = -1;
    for (i = 0; i < num_spes; i++)
    {
        k = 0;
        cb.bar_ptr = &bar;
        cb.id_spu = i;
        /*** We assume that Destination is not repeated ***/
        cb.nprocs = num_destination_ids;
        getBarID = 0;
        for (j = 0; j < array_size; j++)
        {
            cb.spu_dma_data_array[j] = spu_dma_data[j];
            cb.spu_source_ls_array[j] = spu_source_ls[j];
            cb.spu_destn_ls_array[j] = spu_destn_ls[j];
            cb.spu_source_id_array[j] = spu_source_id[j];
            cb.spu_destn_id_array[j] = spu_destn_id[j];
            cb.spu_destn_id_numbers[j] = destn_ids[j];
            cb.spu_source_id_numbers[j] = source_ids[j];
            cb.periods[j] = ppe_periods[j];
            cb.spu_num_jobs[j] = num_jobs[j];
            cb.task = i;
            // init bar params
            if (!getBarID) {
                cb.spe_id_cb++;
                getBarID = 1;
            }
        }

        /*** Fill the array size for each individual SPE ***/
        cb.spu_array_size = array_size;
        cb.dma_exist = dma_exist_array[i];

        /*** Initializing other entries in arrays to -1 ***/
        for (l = array_size; l < ARRAY_SIZE; l++)
        {
            cb.spu_dma_data_array[l] = -1;
            cb.spu_source_ls_array[l] = -1;
            cb.spu_destn_ls_array[l] = -1;
            cb.spu_source_id_array[l] = -1;
            cb.spu_destn_id_array[l] = -1;
            cb.spu_destn_id_numbers[l] = -1;
            cb.spu_source_id_numbers[l] = -1;
            cb.periods[l] = -1;
            cb.spu_num_jobs[l] = -1;
        }

        /*** Write CB to Local Storage ***/
        rc = spe_in_mbox_write(id[i], &cb_array, 1, SPE_MBOX_ANY_BLOCKING);

        /*** Wait for SPE to read CB (read garbage value on PPE side) ***/
        while (!spe_out_mbox_status(id[i]));
        spe_out_mbox_read(id[i], &garbage, 1);
        c_garbage++;
    }
    /*** To get Local Store Data Array addresses of all source SPES ***/
    for (i = 0; i < array_size; i++) {
        temp_source = source_ids[i];
        spu_source_ls_dma[i] = spe_ls_area_get(id[temp_source]);
        while (!spe_out_mbox_status(id[temp_source]));
        spe_out_mbox_read(id[temp_source], &spe_dma_data_addr[i], 1);
        /*** put it into a different array ***/
        spe_dma_addr[i] = spu_source_ls_dma[i] + spe_dma_data_addr[i];
    }

    for (i = 0; i < num_spes; i++) {
        for (j = 0; j < array_size; j++) {
            cb_data.dma_addr_array[j] = spe_dma_addr[j];
        }
        /*** Initializing other entries in arrays to -1 ***/
        for (l = array_size; l < ARRAY_SIZE; l++) {
            cb_data.dma_addr_array[l] = -1;
        }
        /*** Write CB LSA to Local Storage Area of destn SPU's only ***/
        rc = spe_in_mbox_write(id[i], &cb_data_addr, 1, SPE_MBOX_ANY_BLOCKING);
        /*** Wait for SPE to read CB (read garbage value on PPE side) ***/
        while (!spe_out_mbox_status(id[i]));
        spe_out_mbox_read(id[i], &garbage, 1);
        c_garbage_cb_lsa++;
    }

    if (c_garbage_cb_lsa == num_spes) {
        for (i = 0; i < num_spes; i++) {
            /*** Send mailbox with start schedule value to all spes ***/
            rc = spe_in_mbox_write(id[i], &start_schedule, 1,
                                   SPE_MBOX_ANY_BLOCKING);
        }
    }

    /*** Wait for SPU-thread to complete execution ***/
    for (i = 0; i < num_spes; i++) {
        if (pthread_join(threads[i], NULL)) {
            perror("Failed pthread_join");
            exit(1);
        }
    }

    /*** Destroy context ***/
    for (i = 0; i < num_spes; i++) {
        if (spe_context_destroy(id[i]) != 0) {
            perror("Failed destroying context");
            exit(1);
        }
    }
    return (0);
    /*** END OF CODE ***/
}
A.2 PPU: cFile.c
/* ------------------------------------------------------ */
/* All Rights Reserved.                                   */
/*                                                        */
/* University of Illinois at Urbana Champaign             */
/* Department of Computer Science                         */
/* Real-Time Systems Laboratory                           */
/*                                                        */
/* - Deepti Kumar Chivukula - dchivuk2@illinois.edu       */
/*                                                        */
/* BUS SCHEDULING ON THE CELL PROCESSOR                   */
/*                                                        */
/* - File Name : cFile.c                                  */
/*                                                        */
/* ------------------------------------------------------ */
/* PROLOG END TAG zYx                                     */
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <libspe2.h>
#include <pthread.h>
#include <sys/types.h>
#include <libmisc.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/time.h>
#include <dirent.h>

#define ARRAY_SIZE 30
#define NUM_ELEMENTS 4096
#define NUM_SPES 6
#define CACHE_LINE_SIZE 128

extern spe_program_handle_t simple_spu;
A.3 PPU: Makefile
# -----------------------------------------------------
# All Rights Reserved.
#
# University of Illinois at Urbana Champaign
# Department of Computer Science
# Real-Time Systems Laboratory
#
# - Deepti Kumar Chivukula - dchivuk2@illinois.edu
#
# BUS SCHEDULING ON THE CELL PROCESSOR
#
# - File Name : makefile.c
#
# ------------------------------------------------------
# PROLOG END TAG zYx

########################################################
# Subdirectories
########################################################
DIRS := spu

#########################################################
# Target
#########################################################
PROGRAM_ppu := simple

# OBJS_spu_interrupt := spu_flih.o \
        spu_interrupt.o \
        spu_slih_reg.o
# OBJS_spu_interrupt_fast := spu_handler_fast.o \
        spu_interrupt_fast.o \
###########################################################
# Local Defines
###########################################################
IMPORTS = -lspe2 spu/lib_simple_spu.a -lpthread
INSTALL_DIR = $(EXP_SDKBIN)/tutorial
INSTALL_FILES = $(PROGRAMS_ppu)

############################################################
# buildutils/make.footer
############################################################
ifdef CELL_TOP
    include $(CELL_TOP)/buildutils/make.footer
else
    include ../../../buildutils/make.footer
endif
A.4 SPU: simple_spu.c
/* ---------------------------------------------------- */
/* All Rights Reserved.                                 */
/*                                                      */
/* University of Illinois at Urbana Champaign           */
/* Department of Computer Science                       */
/* Real-Time Systems Laboratory                         */
/*                                                      */
/* - Deepti Kumar Chivukula - dchivuk2@illinois.edu     */
/*                                                      */
/* BUS SCHEDULING ON THE CELL PROCESSOR                 */
/*                                                      */
/* - File Name : simple_spu.c                           */
/*                                                      */
/* This is the SPE side code                            */
/* - includes the interrupt based scheduling framework  */
/* - contains the scheduling slot structure             */
/* - does DMA ("mfc_get") from the source SPE ids       */
/*                                                      */
/* ---------------------------------------------------- */
/* PROLOG END TAG zYx                                   */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <sys/types.h>
#include <spu_intrinsics.h>
#include <barrier_heavy.h>
#include <spu_slih_reg.h>
#include <spu_mfcio.h>
#include <malloc_align.h>
#include <libmisc.h>
#include <unistd.h>

/*** Constants ***/
#define NUM_ELEMENTS 1024 * 100
#define ARRAY_SIZE 30
#define NUM_SPES 6
#define CACHE_LINE_SIZE 128
#define MAX_COUNT 0x100000ULL
#define DECR_COUNT 79800
#define GB_to_B (1024*1024*1024)
#define KB_to_B (1024)
#define SIZE_LOCAL 10240
#define NUMBER_SLOTS 20

/*** Required to read the Time Base value ***/
spu_read_decrementer;
spu_write_decrementer;
/*** DECLARING MAIN Control Block (CB) ***/
typedef struct {
    unsigned long *spu_source_ls;
    unsigned long *spu_destn_ls;
    long spu_dma_data_array[ARRAY_SIZE];
    long spu_source_ls_array[ARRAY_SIZE];
    long spu_destn_ls_array[ARRAY_SIZE];
    long spu_source_id_array[ARRAY_SIZE];
    long spu_destn_id_array[ARRAY_SIZE];
    long spu_destn_id_numbers[ARRAY_SIZE];
    long spu_source_id_numbers[ARRAY_SIZE];
    long periods[ARRAY_SIZE];
    long spu_num_jobs[ARRAY_SIZE];
    int spu_array_size;
    int task;
    int dma_exist;
    int nprocs;
    int spe_id_cb;
    int *bar_ptr;
    int id_spu;
    unsigned char pad[36];
} CONTROL_BLOCK_MAIN;

/*** CB structure for LSA ***/
typedef struct {
    long dma_addr_array[ARRAY_SIZE];
    unsigned char pad[8];
} CONTROL_BLOCK_DATA_ADDR;

/*** Schedule structure for Tasks ***/
typedef struct {
    bool task[2]; // right now we have two tasks
    unsigned char pad[126];
} SCHEDULING_SLOT;

/*** Array of scheduling slots for schedule table ***/
SCHEDULING_SLOT scheduling_table[20];

/*** Declaring arrays and making them 128-byte aligned ***/
char data_total[NUM_ELEMENTS] __attribute__((aligned(128)));
char data_receive[NUM_ELEMENTS] __attribute__((aligned(128)));
static char ls_buf[128] __attribute__((aligned(128)));

/*** Make CB address - 128-byte aligned ***/
CONTROL_BLOCK_MAIN cb1 __attribute__((aligned(128)));
SCHEDULING_SLOT ss1 __attribute__((aligned(128)));
CONTROL_BLOCK_DATA_ADDR cb_data1 __attribute__((aligned(128)));
/∗ ∗∗ F i l l i n g t h e S c h e d u l i n g Tab le ∗∗∗ /
void s c h e d u l e t a b l e ( ) {
s c h e d u l i n g t a b l e [ 0 ] . t a s k [ 0 ] = t r u e ;
s c h e d u l i n g t a b l e [ 1 ] . t a s k [ 0 ] = f a l s e ;
s c h e d u l i n g t a b l e [ 2 ] . t a s k [ 0 ] = f a l s e ;
s c h e d u l i n g t a b l e [ 3 ] . t a s k [ 0 ] = t r u e ;
s c h e d u l i n g t a b l e [ 4 ] . t a s k [ 0 ] = f a l s e ;
s c h e d u l i n g t a b l e [ 5 ] . t a s k [ 0 ] = f a l s e ;
s c h e d u l i n g t a b l e [ 6 ] . t a s k [ 0 ] = f a l s e ;
s c h e d u l i n g t a b l e [ 7 ] . t a s k [ 0 ] = t r u e ;
    scheduling_table[8].task[0] = false;
    scheduling_table[9].task[0] = false;
    scheduling_table[10].task[0] = true;
    scheduling_table[11].task[0] = false;
    scheduling_table[12].task[0] = false;
    scheduling_table[13].task[0] = false;
    scheduling_table[14].task[0] = true;
    scheduling_table[15].task[0] = false;
    scheduling_table[16].task[0] = false;
    scheduling_table[17].task[0] = false;
    scheduling_table[18].task[0] = false;
    scheduling_table[19].task[0] = false;
    scheduling_table[0].task[1] = true;
    scheduling_table[1].task[1] = false;
    scheduling_table[2].task[1] = false;
    scheduling_table[3].task[1] = true;
    scheduling_table[4].task[1] = false;
    scheduling_table[5].task[1] = false;
    scheduling_table[6].task[1] = true;
    scheduling_table[7].task[1] = false;
    scheduling_table[8].task[1] = true;
    scheduling_table[9].task[1] = false;
    scheduling_table[10].task[1] = false;
    scheduling_table[11].task[1] = true;
    scheduling_table[12].task[1] = false;
    scheduling_table[13].task[1] = false;
    scheduling_table[14].task[1] = false;
    scheduling_table[15].task[1] = false;
    scheduling_table[16].task[1] = false;
    scheduling_table[17].task[1] = false;
    scheduling_table[18].task[1] = false;
    scheduling_table[19].task[1] = false;
}
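The per-slot assignments above can also be written as a short list of active slots per task. The following is a minimal sketch, not the thesis code: it assumes NUMBER_SLOTS is 20 (the listing indexes slots 0 to 19) and that there are two tasks, and the names slot_entry, NUM_TASKS, set_active_slots and schedule_table_task1 are hypothetical. Only task 1 is filled in, using the active slots (0, 3, 6, 8, 11) that appear in the listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMBER_SLOTS 20   /* assumption: the listing uses slots 0..19 */
#define NUM_TASKS     2   /* assumption: only task[0] and task[1] appear */

/* Hypothetical stand-in for one entry of the thesis's scheduling table. */
typedef struct {
    bool task[NUM_TASKS];
} slot_entry;

slot_entry scheduling_table[NUMBER_SLOTS];

/* Mark the given slots as active for one task; every other slot is false. */
void set_active_slots(int task, const int *slots, size_t count)
{
    assert(task >= 0 && task < NUM_TASKS);
    for (size_t s = 0; s < NUMBER_SLOTS; s++)
        scheduling_table[s].task[task] = false;
    for (size_t n = 0; n < count; n++) {
        assert(slots[n] >= 0 && slots[n] < NUMBER_SLOTS);
        scheduling_table[slots[n]].task[task] = true;
    }
}

/* Active slots of task 1, as they appear in the listing. */
void schedule_table_task1(void)
{
    static const int task1_slots[] = { 0, 3, 6, 8, 11 };
    set_active_slots(1, task1_slots,
                     sizeof task1_slots / sizeof task1_slots[0]);
}
```

Keeping the active slots in one array makes it easier to compare the table against the schedule it is meant to encode, at the cost of a small loop at start-up.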
/*** Declaring Global variables ***
**********************************/
int i=0, k=0;
int interrupt_id=0;
bool wait;
unsigned int destn_index = 0;
unsigned int source_index = 0;
unsigned int intr_hdlr_index = 0;
unsigned int ticks_index = 0;
unsigned int counter = 0;
unsigned int ticks1[NUMBER_SLOTS];
unsigned int ticks2[NUMBER_SLOTS];
long size_temp=0;          // temp variable
long size_array_temp[2];   // temp variable
int incr_size = 0;
/*** Interrupts and Timers declaration ***
*****************************************/
unsigned int decr_handler(unsigned int);
/*** Profiled Work Function ***
*******************************/
void work(unsigned long long id){
    int nprocs, spe_id;
    void *bar_ptr;
    /*** Load the Scheduling Table ***/
    bar_ptr = cb1.bar_ptr;
    nprocs = cb1.nprocs;
    spe_id = cb1.spe_id_cb;
    /*** Calling schedule table before
    *** starting interrupt handler ***/
    schedule_table();
    /*** use SPU timer library FLIH and SLIH to implement
    Interrupts ***/
    spu_slih_reg(MFC_DECREMENTER_EVENT, decr_handler); // every 1 ms
    spu_writech(SPU_WrEventMask, MFC_DECREMENTER_EVENT);
    spu_writech(SPU_WrDec, DECR_COUNT);
    /*** wait until all SPEs are done ***/
    barrier_heavy((unsigned int)bar_ptr, spe_id, lsbuf, nprocs);
    /*** Enable the interrupts ***/
    spu_ienable();
    /*** Do work until interrupt comes ***/
    while(1){
        if(wait == true){
            mfc_write_tag_mask(1 << (cb1.task));
            mfc_read_tag_status_all();
            /*** Record end tick value ***/
            // SPU timer library function used to read the channel
            // for number of ticks
            ticks2[ticks_index] = spu_readch(SPU_RdDec);
            wait = false;
            ticks_index++;
        }
        if(intr_hdlr_index == NUMBER_SLOTS+1){
            break;
        }
    }
    /*** To print the number of ticks the DMA take ***/
    if(cb1.dma_exist == 7){
        for(i=0; i<ticks_index; i++){
            printf("\n ID = %d\t ticks[%d]= %d\t",
                   cb1.id_spu, i, ticks1[i]-ticks2[i]);
            fflush(stdout);
        }
    }
    return; /* work() is declared void, so no value is returned */
}
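The listing prints ticks1[i] - ticks2[i], that is, start minus end, because the SPU decrementer counts down. The sketch below illustrates that arithmetic in portable C. It is an illustration under stated assumptions rather than part of the thesis code: the names elapsed_ticks and ticks_to_us are hypothetical, the counter is taken to be 32 bits wide, and the timebase frequency is passed in as a parameter because it depends on the platform.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two readings of a 32-bit down-counting timer.
 * Unsigned subtraction is modulo 2^32, so the result is still correct if
 * the counter wraps past zero once between the two readings. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return start - end;
}

/* Convert ticks to microseconds for a given timebase frequency in Hz.
 * The frequency is platform specific, so the caller supplies it. */
uint64_t ticks_to_us(uint32_t ticks, uint32_t timebase_hz)
{
    assert(timebase_hz > 0);
    return ((uint64_t)ticks * 1000000u) / timebase_hz;
}
```

Storing the readings as unsigned values and subtracting them directly avoids a separate wrap-around check.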
/*** THE MAIN FUNCTION ***
**************************/
int main(unsigned long long id, void *argp, unsigned int envp){
    unsigned long start_schedule;
    unsigned long *dma_data_addr;
    unsigned long garbage=1234;
    CONTROL_BLOCK_MAIN *cb_addr;
    CONTROL_BLOCK_DATA_ADDR *cbdata_addr;
    long tgt_addr;
    long tgt_incr;
    int j = 0;
    int p = 0;
    int q = 0;
    /*** Wait for control block address from PPE ***/
    cb_addr = (CONTROL_BLOCK_MAIN *)spu_read_in_mbox();
    /*** DMA over control block & wait until done ***/
    mfc_get(&cb1, (unsigned int)cb_addr, sizeof(cb1), 5, 0, 0);
    /*** Mask out tag we are interested in ***/
    mfc_write_tag_mask(1 << 5);
    /*** Wait for DMA completion ***/
    mfc_read_tag_status_all();
    /*** SPE sends to PPE garbage value after receiving CB address ***/
    spu_write_out_mbox(garbage);
    /*** Creating Array of numbers to be DMAed
    ****** between SEND and RECEIVE SPE ***/
    for(i=0; i<cb1.spu_array_size; i++){
        size_array_temp[i] = cb1.spu_dma_data_array[i];
    }
    /*** Determining the destn id index ***/
    for(p=0; p<cb1.spu_array_size; p++){
        if(cb1.spu_destn_id_numbers[p]==cb1.id_spu){
            destn_index = p;
            break;
        }
        else
            destn_index = -1;
    }
    /*** Determining the source id index ***/
    for(q=0; q<cb1.spu_array_size; q++){
        if(cb1.spu_source_id_numbers[q]==cb1.id_spu){
            source_index = q;
            break;
        }
        else
            source_index = -1;
    }
    /*** Filling the Source SPEs with SPEID
    ***** specific data that is DMAed ***/
    if(cb1.spu_source_id_numbers[source_index] == cb1.id_spu){
        for(i = 0; i < size_array_temp[source_index]; i++){
            data_total[i] = ((id & (0x000000f0))>> 4) + 1;
        }
        /*** Address of SPE data array ***/
        dma_data_addr = (unsigned long)&data_total[0];
        /*** Send mailbox with SEND SPE's
        **** data array addr to PPE ***/
        spu_write_out_mbox(dma_data_addr);
    }
    /*** Wait for control block address from PPU ***/
    cbdata_addr = (CONTROL_BLOCK_DATA_ADDR *)spu_read_in_mbox();
    /*** DMA over control block & wait until done ***/
    mfc_get(&cbdata1, (unsigned int)cbdata_addr, sizeof(cbdata1),
            5, 0, 0);
    /*** Mask out tag we are interested in ***/
    mfc_write_tag_mask(1 << 5);
    /*** Wait for DMA completion ***/
    mfc_read_tag_status_all();
    /*** SPE sends to PPE garbage value
    *** after receiving CB address ***/
    spu_write_out_mbox(garbage);
    /*** Read start schedule value from the PPE ***/
    start_schedule = spu_read_in_mbox();
    /*** Function to Call the Work Function ***/
    if(start_schedule == 1000){
        if(cb1.spu_destn_id_numbers[destn_index]==cb1.id_spu){
            /*** Touch each page of the target buffer so that ****
            ******* the page tables and TLBs are all loaded up ***/
            for(i=0; i<cb1.spu_array_size; i++){
                tgt_addr = (unsigned long)cbdata1.dma_addr_array[i];
                tgt_incr = 4096; /*** one page ***/
                size_temp = cb1.spu_dma_data_array[i];
                if(size_temp < 4096)
                    size_temp = 4096;
                else
                    size_temp = cb1.spu_dma_data_array[i];
                for(j=0; j<=size_temp; j+=tgt_incr){
                    mfc_get(&data_receive[0], (unsigned long)tgt_addr,
                            128, 5, 0, 0);
                    mfc_write_tag_mask(1 << 5);
                    mfc_read_tag_status_all(); /*** Wait for DMA to complete ***/
                    mfc_put(&data_receive[0], (unsigned long)tgt_addr, 128,
                            5, 0, 0);
                    mfc_write_tag_mask(1 << 5);
                    mfc_read_tag_status_all(); /*** Wait for DMA to complete ***/
                    tgt_addr += tgt_incr;
                }
            }
        }
        /*** SPE work ***/
        work(id);
    }
    return 0;
}
/*** Interrupt Handler ***
**************************/
unsigned int decr_handler(unsigned int status){
    status &= ~MFC_DECREMENTER_EVENT;
    /*** Resetting interrupt as soon as it enters the handler ***/
    /*** Reset Counter ***/
    spu_writech(SPU_WrDec, DECR_COUNT);
    /*** ALGORITHM FUNCTION ***
    ***************************/
    if((cb1.spu_destn_id_numbers[destn_index]==cb1.id_spu)){
        if((scheduling_table[intr_hdlr_index].task[destn_index] == 1) &&
           (intr_hdlr_index < NUMBER_SLOTS)){
            wait = true;
            /*** Record start tick value ***/
            ticks1[ticks_index] = spu_readch(SPU_RdDec);
            /*** mfc_get - do DMA from source SPE ***/
            mfc_get(&data_receive[0] + incr_size,
                    (unsigned long)(cbdata1.dma_addr_array[destn_index] +
                                    incr_size),
                    SIZE_LOCAL, cb1.task, 0, 0);
            incr_size += SIZE_LOCAL; // incr to the next 10KB of data
        }
    }
    intr_hdlr_index++;
    return (status);
}
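The slot logic of the handler can be exercised off-target with a small model. This is a hedged sketch, not the thesis implementation: handler_state and on_tick are hypothetical names, NUMBER_SLOTS is assumed to be 20, SIZE_LOCAL is assumed to be 10240 bytes (from the "10KB" comment in the listing), and the mfc_get call is replaced by a counter. Unlike the listing, the sketch tests the slot bound before indexing the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMBER_SLOTS 20      /* assumption: slots 0..19, as in the table */
#define SIZE_LOCAL   10240   /* assumption: 10 KB per slot, per the comment */

/* Hypothetical model of the state the handler updates on each tick. */
typedef struct {
    unsigned int slot;        /* plays the role of intr_hdlr_index */
    unsigned int offset;      /* plays the role of incr_size */
    unsigned int transfers;   /* DMA requests that would have been issued */
} handler_state;

/* One timer tick: issue a transfer only if this slot is assigned to us. */
void on_tick(handler_state *st, const bool *my_slots)
{
    assert(st != NULL && my_slots != NULL);
    if (st->slot < NUMBER_SLOTS && my_slots[st->slot]) {
        /* The real handler would call mfc_get() here with st->offset. */
        st->transfers++;
        st->offset += SIZE_LOCAL;
    }
    st->slot++;   /* advance even when the slot belongs to another task */
}
```

Running the model for NUMBER_SLOTS + 1 ticks mirrors the exit condition of the work loop, which breaks once the slot index passes the end of the table.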
A.5 SPU: Makefile
# -----------------------------------------------------
# All Rights Reserved.
#
# University of Illinois at Urbana Champaign
# Department of Computer Science
# Real-Time Systems Laboratory
#
#
# - Deepti Kumar Chivukula - dchivuk2@illinois.edu
#
#
# BUS SCHEDULING ON THE CELL PROCESSOR
#
# - File Name: makefile_spu.c
#
# ------------------------------------------------------
# PROLOG END TAG zYx
########################################################
# Target
########################################################
#OBJS := simple_spu0.o simple_spu1.o
PROGRAMS_spu := simple_spu # simple_spu0 simple_spu1
LIBRARY_embed := lib_simple_spu.a
OBJS_simple_spu := spu_flih.o \
                   simple_spu.o \
                   spu_slih_reg.o
# OBJS_spu_interrupt_fast := spu_handler_fast.o \
#                            spu_interrupt_fast.o
INSTALL_DIR = $(EXP_SDKBIN)/tutorial
INSTALL_FILES = $(PROGRAM_spu)
#INCLUDE = -I$(SDKPRINC)
#LDFLAGS += -L$(SDKPRLIB)
#########################################################
# Local Defines
#########################################################
#IMPORTS := -lmisc -lsputimer
#########################################################
# buildutils/make.footer
#########################################################
ifdef CELL_TOP
include $(CELL_TOP)/buildutils/make.footer
else
include ../../../../buildutils/make.footer
endif
A.6 SPU: barrier_heavy.h
/* --------------------------------------------------- */
/* All Rights Reserved. */
/* */
/* University of Illinois at Urbana Champaign */
/* Department of Computer Science */
/* Real-Time Systems Laboratory */
/* */
/* */
/* - Deepti Kumar Chivukula - dchivuk2@illinois.edu */
/* */
/* */
/* */
/* BUS SCHEDULING ON THE CELL PROCESSOR */
/* */
/* - File Name: barrier_heavy.h */
/* */
/* function: barrier_heavy(ea, id, lsbuf, total) */
/* - This function implements a specialized barrier */
/*   function that is robust and ensures that all */
/*   parties leave the barrier at as close to the same */
/*   time as possible. */
/* - The barrier uses a system memory cache-line */
/*   buffer, 'ea'. */
/* - The ea buffer contains an array of flags that are */
/*   set by each of the SPEs when they enter the */
/*   barrier. */
/* - The SPE with an id of 0 is considered the master. */
/*   It waits until all the flags are set by the */
/*   slave SPEs, and then clears the flags to release */
/*   the slave SPEs. */
/* - 'total' is the number of SPEs participating in */
/*   the barrier. */
/* */
/* Restrictions: 'ea' must point to a 16-byte aligned */
/* address in system memory. No other variables should */
/* reside in this cacheline. The 'lsbuf' must point to */
/* a 128-byte aligned address in local store memory. */
/* */
/* --------------------------------------------------- */
/* PROLOG END TAG zYx */
#ifndef barrier_heavy_h
#define barrier_heavy_h
#include <spu_mfcio.h>
static inline void barrier_heavy(unsigned int ea,
    unsigned int id, volatile void *lsbuf, unsigned int total)
{
    int i;
    unsigned int flag;
    unsigned int mask;
    volatile unsigned int *ls_ptr;
    ls_ptr = (volatile unsigned int *)lsbuf;
    /* Save the callers tag mask */
    mask = spu_readch(MFC_RdTagMask);
    spu_writech(MFC_WrTagMask, 1);
    if (id == 0) {
        /* Master SPE */
        /* Wait for all the slave SPEs to enter the barrier. */
        if (total > 1) {
            do {
                spu_mfcdma32(lsbuf, ea, 128, 0, MFC_GETLLAR_CMD);
                (void)spu_readch(MFC_RdAtomicStat);
                for (i=1, flag=1; i<(int)total; i++) flag &= ls_ptr[i];
            } while (flag == 0);
            /* Clear the flags for all the slave SPEs. */
            for (i=1; i<(int)total; i++) ls_ptr[i] = 0;
            spu_mfcdma32(lsbuf, ea, 128, 0, MFC_PUT_CMD);
            /* Read the buffer to affect an equivalent delay
             * on the master SPE.
             */
            spu_mfcdma32(lsbuf, ea, 128, 0, MFC_GETB_CMD);
            spu_mfcstat(MFC_TAG_UPDATE_ALL);
        }
    } else {
        /* Slave SPE */
        /* Write to its flag word to signal that
           it has reached the barrier */
        ls_ptr[id] = 1;
        spu_mfcdma32(&ls_ptr[id], ea + id*sizeof(unsigned int),
                     4, 0, MFC_PUT_CMD);
        spu_mfcstat(MFC_TAG_UPDATE_ALL);
        do {
            spu_mfcdma32(lsbuf, ea, 128, 0, MFC_GETLLAR_CMD);
            (void)spu_readch(MFC_RdAtomicStat);
        } while (ls_ptr[id]);
    }
    /* Restore the callers tag mask */
    spu_writech(MFC_WrTagMask, mask);
}
#endif /* barrier_heavy_h */
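The flag protocol described in the header comment can be modeled without any MFC calls. The following is a single-threaded sketch under stated assumptions, not the SDK implementation: the names all_slaves_arrived, release_slaves and MAX_SPES are hypothetical, the plain array stands in for the cache line at 'ea', and the GETLLAR reads and DMA writes of the real barrier are omitted.

```c
#include <assert.h>

#define MAX_SPES 8   /* assumption: one flag word per SPE of the CBEA */

/* Master side: returns 1 once every slave (ids 1..total-1) has set its
 * flag. Flags hold only 0 or 1, so AND-ing them detects a missing one. */
int all_slaves_arrived(const unsigned int *flags, unsigned int total)
{
    unsigned int flag = 1;
    assert(total >= 1 && total <= MAX_SPES);
    for (unsigned int i = 1; i < total; i++)
        flag &= flags[i];
    return flag != 0;
}

/* Master side: clearing the flags is what lets the waiting slaves leave. */
void release_slaves(unsigned int *flags, unsigned int total)
{
    assert(total >= 1 && total <= MAX_SPES);
    for (unsigned int i = 1; i < total; i++)
        flags[i] = 0;
}
```

Each slave in the listing spins until its own flag reads zero again, so a single write-back of the cleared line by the master releases all of them.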
A.7 SPU: spu_slih_reg.c
/* ------------------------------------------------- */
/* All Rights Reserved. */
/* */
/* University of Illinois at Urbana Champaign */
/* Department of Computer Science */
/* Real-Time Systems Laboratory */
/* */
/* */
/* - Deepti Kumar Chivukula - dchivuk2@illinois.edu */
/* */
/* */
/* */
/* BUS SCHEDULING ON THE CELL PROCESSOR */
/* */
/* */
/* - File Name: spu_slih_reg.c */
/* */
/* Example SPU second level interrupt handler */
/* (slih) manager. This example consists of an */
/* interrupt handler dispatch table */
/* (spu_slih_handlers), an interrupt handler */
/* registration routine (spu_slih_reg), and a default */
/* interrupt handler (spu_default_slih). For this */
/* implementation, a second level interrupt handler */
/* takes as its input the current event status word. */
/* The handler can assume that it was dispatched to */
/* by the most significant non-zero event bit. The */
/* second level handler is assumed to process any or */
/* all events and return a new event status back to */
/* the first level interrupt handler for further */
/* event processing. The first level interrupt */
/* handler has already acknowledged all received */
/* events before calling any slih. A slih should */
/* only perform subsequent acknowledgements if it is */
/* determined that additional events have been */
/* received while in the slih. */
/* */
/* ------------------------------------------------- */
/* PROLOG END TAG zYx */
#include <spu_slih_reg.h>
#include <spu_intrinsics.h>
#define SPU_EVENT_ID(mask) \
    (spu_extract(spu_cntlz(spu_promote(mask, 0)), 0))
/* spu_default_slih
 * ----------------
 * This function is called whenever an event occurs for which
 * no second level event handler was registered. The default
 * event handler does nothing and zeros the most significant
 * event bit indicating that the event was processed (when
 * in reality, it was discarded).
 */
static unsigned int spu_default_slih(unsigned int events)
{
    unsigned int mse;
    mse = 0x80000000 >> SPU_EVENT_ID(events);
    events &= ~mse;
    return (events);
}
/* spu_slih_handlers[]
 * ------------------
 * Here we initialize 33 default event handlers. The first
 * entry in this array corresponds to the event handler for
 * the event associated with bit 0 of Channel 0 (External
 * Event Status). The 32nd entry in this array corresponds
 * to bit 31 of Channel 0 (DMA Tag Status Update Event). The
 * 33rd entry in this array is a special case entry to handle
 * "phantom events" which occur when the channel count for
 * Channel 0 is 1, causing an asynchronous SPU interrupt,
 * but the value returned for a read of Channel 0 is 0.
 * The index calculated into this array by spu_flih() for
 * this case is 32, hence the 33rd entry. */
spu_slih_func spu_slih_handlers[33] __attribute__ ((aligned (16))) = {
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih, spu_default_slih, spu_default_slih, spu_default_slih,
    spu_default_slih,
};
/* spu_slih_reg
 * ------------
 * Registers a SPU second level interrupt handler for
 * the events specified by mask. The event mask consists of a
 * set of bits corresponding to the event status bits
 * (see channel 0 description). A mask containing multiple 1
 * bits will set the second level event handler for each of
 * the events.
 */
void spu_slih_reg(unsigned int mask, spu_slih_func func)
{
    unsigned int id;
    while (mask) {
        id = SPU_EVENT_ID(mask);
        spu_slih_handlers[id] = func;
        mask &= ~(0x80000000 >> id);
    }
}
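The registration loop relies on counting leading zeros to turn an event mask into a table index. The sketch below reproduces that idea in portable C so it can be tried without the SPU intrinsics. It is an illustration only: event_id, register_handler, clear_top_event, slih_func and handlers are hypothetical names, and a plain loop replaces the spu_cntlz intrinsic used in the listing.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_HANDLERS 33   /* 32 event bits plus one "phantom event" entry */

typedef unsigned int (*slih_func)(unsigned int);

slih_func handlers[NUM_HANDLERS];

/* Portable count-leading-zeros: index of the most significant set bit,
 * counted from the left; returns 32 for a zero mask. */
unsigned int event_id(uint32_t mask)
{
    unsigned int n = 0;
    while (n < 32 && !(mask & (0x80000000u >> n)))
        n++;
    return n;
}

/* Install func for every event bit set in mask, most significant first. */
void register_handler(uint32_t mask, slih_func func)
{
    while (mask) {
        unsigned int id = event_id(mask);
        assert(id < 32);
        handlers[id] = func;
        mask &= ~(0x80000000u >> id);
    }
}

/* Example handler: clears the most significant pending event bit. */
unsigned int clear_top_event(unsigned int events)
{
    if (events == 0)
        return 0;   /* nothing pending; avoid shifting by 32 */
    return events & ~(0x80000000u >> event_id(events));
}
```

Because the index is the number of leading zeros, bit 0 in this numbering is the most significant bit of the mask, which matches the bit numbering convention described in Section A.9.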
A.8 SPU: spu_slih_reg.h
/* ------------------------------------------------- */
/* All Rights Reserved. */
/* */
/* University of Illinois at Urbana Champaign */
/* Department of Computer Science */
/* Real-Time Systems Laboratory */
/* */
/* */
/* - Deepti Kumar Chivukula - dchivuk2@illinois.edu */
/* */
/* */
/* */
/* BUS SCHEDULING ON THE CELL PROCESSOR */
/* */
/* - File Name: spu_slih_reg.h */
/* */
/* -------------------------------------------------- */
/* PROLOG END TAG zYx */
#ifndef SPU_SLIH_REG_H
#define SPU_SLIH_REG_H 1
typedef unsigned int (*spu_slih_func)(unsigned int);
extern spu_slih_func spu_slih_handlers[];
extern void spu_slih_reg(unsigned int, spu_slih_func);
extern void spu_flih(void);
#endif /* SPU_SLIH_REG_H */
A.9 Bit Ordering and Numbering
Byte order is an important consideration in network programming, since two
computers with different byte orders may be communicating. Failure to account
for varying endianness when writing code for mixed platforms can lead to bugs
that are difficult to detect. This section explains the big-endian ordering
used by the CBE.
Storage of data and instructions in the Cell Broadband Engine is big-endian.
Big-endian ordering is shown in Figure A.1 and has the following characteristics:
• Most-significant byte is stored at the lowest address, and least-significant
byte is stored at the highest address.
• Bit numbering within a byte goes from most-significant bit (bit 0) to
least-significant bit (bit n). This differs from some other big-endian processors.
Figure A.1: Big Endian Ordering
It is important to note that neither the PPE nor the SPEs, including their
respective MFCs, support little-endian byte ordering. The DMA transfers of the
MFC are simply byte moves, without regard to the numeric significance of any
byte. Thus, the big-endian or little-endian issue becomes irrelevant to the
movement of a block of data. The byte-order mapping only becomes significant
when data is loaded or interpreted by a processor element or an MFC.
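The two conventions described in this section can be illustrated with a short, host-independent sketch. It is an example added for clarity and not part of the thesis code; the helper names store_be32 and msb0_mask are hypothetical. The first helper lays out a 32-bit value with its most-significant byte at the lowest address, and the second builds a mask for a bit numbered from the most-significant end (bit 0 is the MSB).

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian order: most-significant byte at the
 * lowest address, regardless of the host's native byte order. */
void store_be32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)value;
}

/* Mask for bit `bit` of a `width`-bit field when bit 0 is the
 * most-significant bit and bit width-1 is the least-significant bit. */
uint32_t msb0_mask(unsigned int bit, unsigned int width)
{
    assert(width >= 1 && width <= 32 && bit < width);
    return (uint32_t)1 << (width - 1 - bit);
}
```

For example, with this numbering bit 0 of a byte corresponds to the mask 0x80 and bit 7 to the mask 0x01.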
REFERENCES
[1] R. Pellizzoni, B. D. Bui, M. Caccamo, and L. Sha, “Coscheduling of cpu
and i/o transactions in cots-based embedded systems,” in Proceedings of the
29th IEEE Real-Time Systems Symposium, 2008, pp. 221–231.
[2] A. Agarwal, C. Iskander, and R. Shankar, “Survey of network on chip ar-
chitectures and contributions,” in Journal of Engineering Computing and
Architecture, vol. 3, no. 1, 2009.
[3] T. Chen, R. Raghavan, J. Dale, and E. Iwata, “Cell broadband engine archi-
tecture and its first implementation: A performance view,” in IBM Journal
of Research and Development, vol. 51, no. 5, 2005, pp. 559–572.
[4] T. W. Ainsworth and T. M. Pinkston, “Characterizing the cell eib on-chip
network,” IEEE Micro, vol. 27, no. 5, pp. 6–14, 2007.
[5] Cell Broadband Engine Programming Tutorial. IBM Corporation, 2007.
[6] Cell Broadband Engine Programming Handbook. IBM Corporation, 2006.
[7] S. K. Baruah, N. K. Cohen, C. G. Plaxton, and D. A. Varvel, “Proportionate
progress: A notion of fairness in resource allocation,” Algorithmica, vol. 15,
pp. 600–625, 1996.
[8] C. D. Locke, D. R. Vogel, L. Lucas, and J. B. Goodenough, “Generic avion-
ics software specification,” Carnegie-Mellon Software Engineering Institute,
Tech. Rep. CMU/SEI-90-TR-8, 1990.
[9] M. C. Golumbic, Algorithmic Graph Theory and Perfect Graphs Annals of
Discrete Mathematics. Amsterdam, The Netherlands: North-Holland Pub-
lishing Company, 2004, vol. 57.
[10] P. Holman and J. H. Anderson, “Adapting pfair scheduling for symmetric
multiprocessors,” Journal of Embedded Computing, vol. 1, no. 4, pp. 543–
564, 2005.
[11] H. Cho, B. Ravindran, and E. D. Jensen, “An optimal real-time scheduling
algorithm for multiprocessors,” in The 27th IEEE International Real-Time
Systems Symposium, 2006, pp. 101–110.
[12] D. Zhu, D. Mosse, and R. Melhem, “Multiple-resource periodic scheduling
problem: How much fairness is necessary?” in The 24th IEEE International
Real-Time Systems Symposium (RTSS’03), 2003, p. 142.
[13] J. Kleinberg and E. Tardos, Algorithm Design. Reading, MA: Addison-
Wesley, 2005.
[14] SPU Timer Library Programmer’s Guide and API Reference. IBM Corpo-
ration, 2007.
