Programmable Parallel Coprocessor Architectures for
Reconfigurable System-on-Chip
John A. Williams and Neil W. Bergmann
School of ITEE, The University of Queensland
Brisbane, Australia
{jwilliams;n.bergmann}@itee.uq.edu.au
Abstract 
We propose a hybrid rSoC parallel processing 
architecture consisting of a central 32-bit RISC 
microprocessor interconnected to an array of 8-bit 
microcontrollers as coprocessing nodes.  The central 
processor runs an embedded Linux operating system, 
with the coprocessor nodes mapped into a virtual file 
system, by which they can be controlled and 
reprogrammed.  The hardware and software 
architectures are detailed, and several useful 
application contexts are proposed. Supporting 
theoretical analysis is also presented. 
1. Introduction 
Performance predictability is one of the strongest
constraints in real-time system design.  Unlike in
transformational computing (e.g. data processing and
simulation systems), the feasibility of a real-time
system is measured by its ability to guarantee that
computational tasks will be completed within a
given deadline.  The difficulty of assuring real-time
performance increases dramatically with the number
of interfering tasks [1].
Reconfigurable system-on-chip (rSoC) is a 
powerful tool for real-time system design.  Custom 
architectures that offload processing from the central 
microprocessor onto customised hardware or 
coprocessing units can result in more predictable 
overall system performance. 
In this work we propose a hybrid rSoC 
multiprocessing architecture consisting of multiple 
individually programmable microcontrollers 
interfaced to a central microprocessor system, and 
consider the subsequent impact on system 
performance and design flexibility. 
The hardware consists of a 32-bit soft-core CPU,
connected in a star topology to multiple dynamically
reprogrammable microcontroller cores.  We have
implemented this architecture using the Xilinx
Microblaze processor and its smaller relative, Picoblaze,
as the main processor and coprocessor respectively.
We briefly introduce each of these in Section 2.
The software architecture consists of the uClinux
operating system (e.g. [2]) running on the
Microblaze, with the programmable state machines
mapped directly into the kernel space, and made
available to user processes via a virtual file system
mapping.  We use the term picoware to refer to code
running on the coprocessor nodes.
We suggest two specific roles for the architecture.
Firstly, by providing an off-chip interface from each
coprocessor node, they may be used as intelligent IO
processors, offloading main processor load and
acting as virtual devices.  Low bit-rate
communications such as RS232, I2C, SPI, or even
low-speed USB can be implemented on the
coprocessors.  PWM or sigma-delta modulation is
another potential use.
The second use-class is for parallelising
computational tasks characterised by small data
transfers and long computational times.  In these
cases, the computations can execute in parallel on
separate coprocessor nodes.  Small block-based
cryptographic algorithms are one such computation.
It can be argued that these IO and coprocessing
tasks should be implemented more efficiently and
compactly in customised hardware, rather than
through general purpose programmable
coprocessors.  We address this point in Section 5,
and present a theoretical analysis of the architecture,
finally followed by conclusions and directions for
future development.
2. Background
Conceptually the architecture is not tied to any
particular processor or device, however its
implementation as described in this paper has been
tested with Xilinx FPGAs and soft-processor cores.
In the following we provide supporting background
information.
2.1. Microblaze
Microblaze is a compact 32 bit RISC processor,
with 32 general purpose registers, and an orthogonal
instruction set.  It uses a 3 stage instruction pipeline,
with delayed branch capability for improved
instruction throughput.  Microblaze is specifically
targeted to logic primitives of Xilinx FPGA devices,
such as hardware multipliers and on chip block RAM
(BRAM).  Its maximum clock frequency is 125MHz
in a modern Xilinx FPGA part [3].
2.1.1. Fast Simplex Links (FSL)  
Of particular relevance to the current work is the 
FSL (Fast Simplex Link) interface.  FSL is a 
unidirectional, point-to-point bus interface [4].  
Microblaze has eight each of FSL master and slave 
ports, named respectively FSL_M/FSL_S 0 through 
7.  FSL buses themselves are implemented as 32-bit 
x 16-deep FIFOs, which helps to decouple the timing 
of the FSL master from the FSL slave.  The 
symmetry of the master/slave interfaces allows FSL-
enabled peripherals or processors to be connected in 
systolic chains or arrays. 
In the Microblaze, each FSL port is mapped 
directly and orthogonally onto the register set.  Write 
operations are performed using 
put rS, FSLn 
where rS is the register containing the source data 
(r0…r31).  Similarly, FSL reads are performed with 
get rD, FSLn 
where rD specifies a destination register. 
By default, FSL operations are blocking – writes 
to a full master FIFO, or reads from an empty slave, 
will halt the processor until the blocking condition is 
removed.  Non-blocking instructions nput/nget
are also available.  Finally, there is a control bit that 
may be optionally set and tested, giving rise to the 
cput/cget instructions (and their non-blocking 
equivalents ncput and ncget).  This allows FSL 
interactions to be distinguished as data or control 
operations. 
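The FSL behaviour described above can be illustrated with a small software model.  This is a minimal sketch only: the names fsl_fifo, fsl_try_put and fsl_try_get are hypothetical, and the boolean return value stands in for whatever status indication the real non-blocking instructions provide.  It models one 32-bit x 16-deep channel, the control bit carried with each word, and the non-blocking behaviour of nput/nget (return immediately instead of stalling).

```c
#include <stdbool.h>
#include <stdint.h>

#define FSL_DEPTH 16  /* FSL buses are 32-bit x 16-deep FIFOs */

/* One FIFO entry: a 32-bit word plus the control bit. */
struct fsl_word {
    uint32_t data;
    bool control;
};

/* Hypothetical software model of a single FSL channel. */
struct fsl_fifo {
    struct fsl_word slots[FSL_DEPTH];
    unsigned head;   /* index of the oldest entry */
    unsigned count;  /* number of entries currently queued */
};

/* Models nput/ncput: returns false instead of stalling when full. */
bool fsl_try_put(struct fsl_fifo *f, uint32_t data, bool control)
{
    if (f->count == FSL_DEPTH)
        return false;
    unsigned tail = (f->head + f->count) % FSL_DEPTH;
    f->slots[tail].data = data;
    f->slots[tail].control = control;
    f->count++;
    return true;
}

/* Models nget/ncget: returns false instead of stalling when empty. */
bool fsl_try_get(struct fsl_fifo *f, struct fsl_word *out)
{
    if (f->count == 0)
        return false;
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count--;
    return true;
}
```

A blocking put or get corresponds to retrying the matching fsl_try_* call until it succeeds, which is what stalls the processor in hardware.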
The FSL ports are a powerful interconnect point 
for coprocessors and hardware accelerators, due to 
their close integration with the processor and 
registers, and predictable timing.  Non-blocking FSL 
operations incur a fixed two cycle delay, compared 
to the shared OPB bus whose transactions may 
require four cycles or more.  Advantages of FIFO-
based communications over buses for multiprocessor 
networks have previously been argued and 
demonstrated (e.g. [5, 6]). 
2.2. Microblaze uClinux 
uClinux is a mainstream variant of the Linux 
operating system with customised memory 
management code to operate on processors lacking a 
hardware memory management unit (MMU).  
Implications of this are no virtual memory or paging, 
no memory protection, and the absence of fork()’s 
copy-on-write semantics.  Most Linux applications
run directly on uClinux, while some that use fork()
must be modified to use the uClinux-compatible
vfork() primitive.
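As an illustration of the kind of modification involved, the following is a minimal sketch of the usual vfork()-then-exec pattern.  The function name run_child is hypothetical and not taken from any uClinux source.  Because the child borrows the parent's memory until it calls exec or _exit(), it does nothing else in between.

```c
#define _DEFAULT_SOURCE  /* exposes vfork() in <unistd.h> on glibc */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a program in a child created with vfork() and return its exit
 * status, or -1 on failure.  The child only calls execvp() or _exit(). */
int run_child(char *const argv[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* reached only if execvp() failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A program that previously called fork() and then continued running application code in the child would need restructuring so that the child's work lives in a separate executable.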
In most respects, the Microblaze port of uClinux
is very similar to other ports to more conventional
processors such as the Motorola Coldfire and ARM
devices.  A standard Microblaze hardware platform
called “mbvanilla” has been developed, with
peripherals such as timers, interrupt controllers,
memory controllers, GPIOs and an Ethernet MAC
building a complete system.  Linux device drivers
have been wrapped around these cores for
interfacing with the kernel and user space
applications.
The power and novelty of running Linux on a
reconfigurable processor comes from the ability to
customise the operating system to support custom
hardware architectures.  For example, the Virtex-II
internal configuration access port (ICAP) has been
previously mapped into a Microblaze uClinux
system, resulting in dynamically self-reconfiguring
Linux systems [7].
2.2.1. Linux and Real-time Systems
Linux is not a real-time operating system.  It tends
to focus on improving average performance, and
does not offer guarantees on metrics such as interrupt
latency.  Several software based approaches exist to
address this shortcoming, most notably RTLinux [8]
and the Real Time Application Interface (RTAI) [9].
These systems implement a real-time microkernel
executing real-time threads, and simply run the
Linux kernel as a low priority process.  They also
provide communication channels between real-time
threads and processes running in Linux.
Our architecture is different, although
complementary, to these existing approaches.  Rather
than pushing real time tasks down into a microkernel
layer on the same processor, we push them out to
separate coprocessors (as picoware), executing with
guaranteed performance.
2.3. Picoblaze
Picoblaze, also known as the K (constant) Coded
Programmable State Machine (KCPSM), is a
two-operand, 8 bit microcontroller from Xilinx,
implemented directly in Xilinx FPGA primitives to
minimise logic usage and maximise performance
[10].  There are several versions available – we use
version 2 which has a 1024 x 18-bit instruction
memory, supporting conditional branches, a return
stack for nested subroutine calls, and interrupts.  IO
is via 256 addressable input and output ports, to
which arbitrary external peripherals may be
connected.  All Picoblaze instructions are completed
in 2 clock cycles.
Previous uses of the Picoblaze as a programmable
accelerator include implementation as a Field
Programmable Port Extender (FPX) module for
real-time network packet processing [11], and as a
customizable application-specific data and IO
processor [12].
3. Hardware Architecture 
In the proposed architecture, Picoblaze 
coprocessor nodes are connected in a star topology to 
a central Microblaze, using the FSL links described 
previously.  In the following we describe the 
coprocessor nodes and interconnect architecture in 
detail. 
3.1. Picoblaze Coprocessor Nodes 
The architecture of the coprocessor node is
illustrated in Figure 1, which serves as a reference
for the subsequent discussion.
A controller listens at the input FSL interface of
the node.  A node control operation is indicated by
the presence of the FSL control bit (Sect. 2.1.1).
Otherwise, the lower 8 bits of the FSL word are
pushed onto the Picoblaze’s input FIFO as data for
subsequent processing.
Currently supported node control operations are 
Picoblaze code space write, reset, and interrupt. 
Code writes are achieved by the master 
(Microblaze) sending a control command to set a 
write-address register, which is then automatically 
incremented on each subsequent opcode write.  This 
seek/sequential programming model efficiently 
supports both bulk code writes (reprogramming 
entire code memory), and small changes such as 
updating individual data items encoded in the 
Picoblaze opcodes.  It is advisable, although not 
mandatory, to hold the node in reset during code 
updates. 
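The seek/sequential programming model can be sketched as a small software model.  The paper does not give the encoding of the control commands, so this sketch models their effect with plain function calls.  The names code_port, code_seek, code_write and code_load are hypothetical, and wrapping the address at the end of the 1024-word memory is an assumption made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define PICO_CODE_WORDS 1024       /* 1024 x 18-bit instruction memory */
#define PICO_OPCODE_MASK 0x3FFFFu  /* keep the low 18 bits of an opcode */

/* Hypothetical model of the node controller's code-write state. */
struct code_port {
    uint32_t mem[PICO_CODE_WORDS];
    unsigned write_addr;  /* the write-address register */
};

/* "Seek": a control command sets the write-address register. */
void code_seek(struct code_port *p, unsigned addr)
{
    p->write_addr = addr % PICO_CODE_WORDS;  /* wrap-around is assumed */
}

/* Sequential write: store one opcode, then auto-increment the address. */
void code_write(struct code_port *p, uint32_t opcode)
{
    p->mem[p->write_addr] = opcode & PICO_OPCODE_MASK;
    p->write_addr = (p->write_addr + 1) % PICO_CODE_WORDS;
}

/* Bulk reprogramming: one seek followed by count sequential writes. */
void code_load(struct code_port *p, unsigned start,
               const uint32_t *opcodes, size_t count)
{
    code_seek(p, start);
    for (size_t i = 0; i < count; i++)
        code_write(p, opcodes[i]);
}
```

Patching a single constant embedded in an opcode then costs one code_seek() and one code_write(), while a full reload is a single code_load() from address zero.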
Two 8-bit x 16 deep FIFOs are interfaced onto the
Picoblaze’s IO space, one each for read and write.
The FIFOs can be read and written by Picoblaze in
either blocking or non-blocking modes, to mirror the
FSL semantics (Sect. 2.1.1).  A “Halt” signal was
added to the standard Picoblaze core, which forces an
extended T-state.  Halt is asserted by the node
controller if the Picoblaze attempts a blocking read
on an empty input FIFO, or a blocking write on a
full output FIFO.  The Picoblaze may test
the FIFO status, to determine whether an operation
would block.
The introduction of the FIFOs and optional
blocking/non-blocking operations to the processor
nodes supports elegant solutions to host-coprocessor
synchronization issues.  Examples are presented later
in the paper.
[Figure 1: block diagram showing the FSL master and slave interfaces, two 8-bit x 16 FIFOs, the controller (Reset / Halt / Interrupt), the Picoblaze, a dual-port BRAM (18-bit x 1024) and the 8-bit external GPIO interface.]
Figure 1. Picoblaze coprocessor node architecture and interfaces
A final feature of each coprocessor node is an
8-bit direction-programmable GPIO port.  The GPIO
would typically be mapped to external pins of the
FPGA, however it may also be connected to other
logic in the design if required.  The GPIO inputs,
outputs, and direction control registers are mapped
into the Picoblaze IO space.
Logic usage statistics for a single coprocessor
node, comprising a Picoblaze, controller, FIFOs
and GPIO are shown in Table 1.  For multi-node
systems, the limiting factors are clearly the total
number of FSL ports available, and also on-chip
Block RAM (BRAM) usage.  Each Picoblaze node
requires one full BRAM for its instruction memory.
Table 1. Logic usage – single coprocessor node
Selected Device : 2v1000fg456-4
Number of Slices:           151 out of  5120   2%
Number of Slice Flip Flops: 114 out of 10240   1%
Number of 4 input LUTs:     221 out of 10240   2%
Number of BRAMs:              1 out of    40   2%
In these experiments, a complete uClinux-capable
Microblaze system, including Ethernet MAC,
instruction/data caches and 8 Picoblaze coprocessor
nodes, fit comfortably into a Xilinx Virtex2-1000
part (approx 75% logic utilisation).  The
development board used is a standard Insight/Memec
V2MB1000 Microblaze development board.
3.2. Microblaze to Coprocessor Hardware Integration
Each coprocessor node presents one each of an
FSL master and slave interface.  These interfaces are
connected directly to a Microblaze master and slave
FSL port.  Microblaze currently supports 8 FSL
channels, and thus up to 8 coprocessors may be
connected, as illustrated in Figure 2.
Microblaze communicates with each node either
through control operations as described previously, 
or as regular data operations.  Data writes to a master
port go to the node input FIFO, while
reads access the node output FIFO.
Master/slave symmetry between read and write 
operations is preserved – just as a node may 
optionally block, so too can Microblaze, with the 
blocking get/put operations described in Sect. 2.1.1. 
In practice, it is highly inadvisable for Microblaze 
to perform blocking operations – a stalled 
coprocessor could deadlock the entire system.  This 
is particularly the case when the Microblaze is 
running a multitasking operating system like 
uClinux.  The uClinux integration of the processor 
architecture forbids instruction-level blocking.  
Instead, user processes may elect to block, but at the 
device driver level the FIFO status is polled 
periodically, rather than utilizing hardware blocking 
get/put operations. 
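The polling approach can be sketched as follows.  This is an illustrative model only, not the driver code: try_read_fn stands for a hypothetical non-blocking read primitive (in the spirit of nget), poll_once is one polling pass, and demo_source is a stand-in data source used to exercise the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical non-blocking read primitive: returns true and stores one
 * byte if data was available, false if a read would have blocked. */
typedef bool (*try_read_fn)(void *ctx, uint8_t *byte);

/* One polling pass: move whatever is available into buf without ever
 * stalling.  Returns the number of bytes transferred (possibly zero). */
size_t poll_once(try_read_fn try_read, void *ctx, uint8_t *buf, size_t cap)
{
    size_t n = 0;
    while (n < cap && try_read(ctx, &buf[n]))
        n++;
    return n;
}

/* Stand-in source for illustration: yields `remaining` consecutive byte
 * values starting at `next`, then reports "would block". */
struct demo_source {
    uint8_t next;
    size_t remaining;
};

bool demo_try_read(void *ctx, uint8_t *byte)
{
    struct demo_source *s = ctx;
    if (s->remaining == 0)
        return false;
    *byte = s->next++;
    s->remaining--;
    return true;
}
```

A process that elects to block simply waits for a later polling pass to return data, so a stalled coprocessor delays only that process rather than the whole Microblaze.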
4. Operating System Integration 
The hardware architecture just described can be 
used directly by low level Microblaze software.  The 
procedures to program, control, and communicate 
with each node are simply combinations of FSL put
and get instructions.  Table 2 outlines the primitives 
of the library developed to support low level 
coprocessor node communications and control. 
This API could also be used directly by user-
space uClinux programs.  However, we chose to map 
the architecture into the uClinux kernel, representing 
each Picoblaze as a system resource that may be 
acquired, read, and written.  We expand on this 
mapping below, before offering justification for this 
approach over a potentially lighter-weight direct 
programming model. 
4.1. The Linux Virtual File System 
Like most modern operating systems, Linux uses
a virtual file system (VFS) abstraction model,
beneath which exist physical file system
implementations.  For example, the Network File
System (NFS) presents the same interface as does a
light-weight ROMFS commonly used in embedded
Linux systems.
This has some powerful benefits – the underlying
file system is completely transparent at the
application level.  In the dynamically
self-reconfiguring Linux work mentioned previously [7],
networked dynamic self-reconfiguration was
achieved trivially by mounting a remote host system
via NFS, and processing partial bitstreams as though
they were regular local files.
VFS also gives rise to truly virtual file systems
which have no underlying physical manifestation but
instead are constructed dynamically in response to
process requests such as reads, writes, directory
listings and so on.  The best known use of this
capability is the Linux proc file system (procfs).
Normally mounted into the /proc directory, procfs
provides a window into the internal operations of the
kernel, allowing processes to inspect and in some
cases modify the operation of the kernel.
[Figure 2: the Microblaze connected to Coprocessor 0 through Coprocessor 7, each through its own FSL master and slave interface (FSL0 to FSL7) and each with a GPIO port; the Microblaze also attaches to the OPB bus.]
Figure 2. Microblaze and coprocessor node interconnection
4.2. Coprocessor array mapping into procfs
In accordance with the interpretation of the
coprocessor nodes as configurable system resources,
we choose to map these nodes directly into the
procfs.  This is achieved by writing a simple file-like
interface wrapper around the low-level API shown in
Table 2.  The procfs interface is implemented as a
loadable kernel module, which creates the virtual
directory and file structure upon initialization.
Table 2. Low-level coprocessor management API
int pico_init(struct pico_data_t *pico, int num);
int pico_get_reset(struct pico_data_t *pico);
int pico_put_reset(struct pico_data_t *pico,
        int reset_val);
int pico_interrupt(struct pico_data_t *pico);
int pico_code_read(struct pico_data_t *pico,
        off_t offset, unsigned *buffer,
        size_t count);
int pico_code_write(struct pico_data_t *pico,
        off_t offset, unsigned *buffer,
        size_t count);
int pico_data_read(struct pico_data_t *pico,
        unsigned *data, unsigned *status);
int pico_data_write(struct pico_data_t *pico,
        unsigned data, unsigned *status);
int fsl_blocked(unsigned status);
The virtual directory structure is
/proc/picoblaze/pico0 through to /pico7.  Within each
Picoblaze virtual directory exist three virtual files,
and one virtual device:
• reset – writing a non-zero value to this file
places the Picoblaze in reset.  Writing a zero
releases reset.  Reading this file returns the
current reset state.
• interrupt – writing a non-zero value triggers the 
Picoblaze’s interrupt line.  A read operation has 
no meaning, and always returns zero. 
• code – a binary virtual file that maps the 
program memory of the Picoblaze node.  This 
file may be read or written, and the seek 
operation is also defined. 
• data – a Linux character device node.  Write 
operations place data into the Picoblaze’s input 
FIFO, and reads extract data from the output 
FIFO.  This device is discussed in greater detail 
below. 
Before expanding on the details of this structure, 
it is illuminating to point out some implications of 
this mapping, using some simple examples. 
4.2.1. Programming a node 
A coprocessor node is programmed simply by 
writing binary opcode data into its /code virtual file: 
$ cat pulse_pwm.hex > /proc/picoblaze/pico0/code
4.2.2. Copying a coprocessor process 
A coprocessor “thread” executing on one node
may be duplicated onto another node:
$ cp /proc/picoblaze/pico0/code /proc/picoblaze/pico1/code
This example is not true process duplication, 
since no state information is transferred.  However, 
with appropriately coded Picoblaze software, this 
could be achieved: 
$ echo 1 > /proc/picoblaze/pico0/interrupt
$ cat /proc/picoblaze/pico0/data > /proc/picoblaze/pico1/data
In this example, the picoware is coded such that
when an interrupt is received, it outputs its current
state information.  Thus, by copying the code space
and then interrupting the first “thread”, its state
dump is redirected into the new thread, which can
then continue where the first left off.
4.2.3. Cascading two nodes into a sequence 
Two coprocessor nodes may be cascaded 
together: 
$ cat /proc/picoblaze/pico0/data > /proc/picoblaze/pico3/data
This sort of approach would be useful if a 
particular algorithm was sequentially partitioned 
between two nodes.  The intermediate outputs from 
one stage are passed on to the next for completion.
While these examples are presented in simple 
shell script, the operations can just as easily be 
performed from C programs, accessing the virtual 
files just like regular files with open(), read() and 
write() library calls. 
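As a minimal sketch of the C equivalent, the function below copies everything readable from one path to another using only open(), read() and write().  The name copy_file is hypothetical.  The O_CREAT and O_TRUNC flags mirror what a shell ">" redirection does, and they also let the sketch be tried on ordinary files without the hardware present.

```c
#include <fcntl.h>
#include <unistd.h>

/* Copy everything readable from src_path into dst_path.
 * Returns the number of bytes copied, or -1 on error. */
long copy_file(const char *src_path, const char *dst_path)
{
    char buf[256];
    long total = 0;
    int src = open(src_path, O_RDONLY);
    if (src < 0)
        return -1;
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dst < 0) {
        close(src);
        return -1;
    }
    while (total >= 0) {
        ssize_t got = read(src, buf, sizeof buf);
        if (got == 0)
            break;                 /* end of input */
        if (got < 0) {
            total = -1;            /* read error */
            break;
        }
        ssize_t off = 0;
        while (off < got) {        /* handle short writes */
            ssize_t put = write(dst, buf + off, (size_t)(got - off));
            if (put <= 0) {
                total = -1;        /* write error */
                break;
            }
            off += put;
        }
        if (total >= 0)
            total += got;
    }
    close(src);
    close(dst);
    return total;
}
```

Under these assumptions, copy_file("/proc/picoblaze/pico0/code", "/proc/picoblaze/pico1/code") corresponds to the cp example of Section 4.2.2.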
4.3. The FIFO Device Driver
The /proc/picoblaze/picoN/data device is a regular
Linux character device node, that implements kernel
level IO buffering.  This is in addition to the
hardware buffering provided by the 16-deep input
and output FIFOs on each processor node.  In its
present incarnation the Picoblaze node is not able to
generate Microblaze interrupts, thus kernel tasklets
are used to poll the physical FIFO status and transfer
data between kernel buffers and the hardware FIFOs.
The reasoning behind the additional kernel layer
of buffering is the same as for more conventional
system devices such as disk units – buffered IO can
reduce the overhead of excessive process-kernel
interaction by allowing larger chunks of data to be
transferred between processes and the kernel at a
single time.  Maintaining separate buffers for each
node also eases the problem of scheduling reads and
writes across multiple nodes.
If the coprocessor nodes were to be accessed
directly (the light-weight approach mentioned at the
beginning of this section), rather than the kernel
buffering model, significant custom software would
be required in order to efficiently schedule IO
operations to the multiple nodes.
The naïve approach – polling each device in a
round robin fashion, each time reading as much data
as was available, and writing as much data as could
fit – would be very inefficient, particularly if the data
transmission rates varied across the nodes.
We can easily trade between software and
hardware buffering at compile/synthesis time.  The
hardware FIFOs are implemented in the Virtex-II
SRL16 primitive, and can easily be cascaded
together to provide arbitrary depths (up to the
capacity of the chip).  Larger hardware buffers
would provide a measurable, although
asymptotically limited performance improvement, by
increasing the amount of data that can be transferred
from the kernel to the hardware on each transfer
cycle, and thus reducing task switching (or,
potentially, interrupt handling) overhead.
5. Analysis and discussion
We present an analysis of the architecture, first by
outlining some of its useful characteristics, from
which we propose two classes of application that can
most benefit.
5.1. Useful characteristics
The following considers some of the
characteristics of the architecture and its
implementation that are useful in the context of
real-time embedded systems.
5.1.1. Improved predictability 
As mentioned previously, one of the greatest 
challenges facing real-time system designers is the 
ability to guarantee computational deadlines.  For 
conventional microprocessor systems, this 
commonly requires significant over-engineering of 
the central processor, to cover a rare conjunction of 
events (the worst case scenario). 
The implementation described in this paper uses 
Picoblaze coprocessors clocked at 66MHz, executing 
a consistent 33 million instructions per second, 
irrespective of what the central processor or other 
coprocessor nodes are doing. 
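The 33 million instructions per second figure follows directly from the 66MHz clock and the two-clock-cycle instruction time stated in Section 2.3.  As a worked check, with D denoting a task deadline (the symbols and the 1 ms deadline are illustrative values, not from the paper):

```latex
\[
  \frac{66 \times 10^{6}\ \text{cycles/s}}{2\ \text{cycles/instruction}}
  = 33 \times 10^{6}\ \text{instructions/s},
  \qquad
  I_{\max}(D) = 33 \times 10^{6} \cdot D .
\]
\[
  D = 1\ \text{ms} \;\Rightarrow\; I_{\max} = 33\,000\ \text{instructions}.
\]
```

Because every instruction takes the same two cycles, this budget is exact rather than an average, which is what makes the deadline guarantee possible.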
Migrating real-time tasks onto these coprocessor 
nodes has two positive side-effects: 
• Task execution predictability is improved 
• Central CPU load is reduced 
Communication between central and coprocessor 
nodes must still be considered, and ideally should be 
minimised to take maximum advantage of the 
architecture. 
5.1.2. Logic reuse 
Once deployed, an embedded system may require 
only very infrequent use of some external 
communication peripheral, such as a serial interface, 
for in-the-field testing or maintenance.  In cost and 
power sensitive applications it is wasteful to dedicate 
logic area to a peripheral device that may only be 
used with a duty cycle of perhaps 1% or potentially 
much less. 
FPGA power consumption is strongly influenced 
by static leakage currents, which are present even if a 
particular logic module is not actively switching.  
The only way to achieve significant power reduction 
is to fit the application into a smaller device. 
A coprocessor node that implements rarely used IO
functions as picoware can execute
other useful tasks when not required for the
infrequent IO operations.  This is a form of run-time
logic reuse – the logic cost of the coprocessor is
amortised over the entire application execution time.
Dynamic and/or partial reconfiguration is 
commonly argued as a means of achieving run-time 
logic reuse.  While this is true in principle, in 
practice the technique is inadequately supported by 
design tools and FPGA devices.  Significant extra 
design effort is introduced to meet the floorplanning 
and inter-module signal routing requirements of 
partial-reconfiguration support.  The network of 
programmable coprocessors may not have the same
performance or flexibility as customised hardware,
however it is dramatically simpler to use.
Depending upon the degree of dynamic 
reconfiguration, the functional switching time of the 
reprogrammable architecture may also be faster.  It is 
certainly of finer granularity, with individual 
opcodes in a coprocessor program able to be 
modified. 
5.2. Roles for the architecture
Clearly, an 8-bit Picoblaze will not outperform a
dedicated 32-bit Microblaze at the same clock
frequency.  Meaningful roles for the coprocessor
network architecture will leverage the available
parallelism, and the subsequent predictability
improvements resulting from decreased load on the
central processor.
We propose two roles for the architecture.  The
first is for intelligent virtual IO devices, and the
second is for parallel computational coprocessors.
We discuss each below, and consider specific
constraints that influence the feasibility of such an
approach.
5.2.1. Intelligent peripherals
The consistent instruction throughput of the
Picoblaze and its relatively high performance (in
excess of 50 8-bit MIPS) makes it suitable for a
range of IO tasks.  The fixed instruction throughput
permits software timing loops that would be either
impossible or wasteful on the central processor.
Medium bitrate communications – Serial IO
standards like RS232, SPI, I2C, and even low speed
USB can be implemented in a coprocessor node.  In
some instances, for example SPI, the Picoblaze node
has a smaller logic footprint than an equivalent logic
core [13], resulting in an overall net logic reduction.
Low bandwidth IO controllers – an 8-bit PWM
output controller was implemented on a Picoblaze
node.  This virtual device has an output bandwidth of
16KHz at 66MHz (2056 Picoblaze instructions per
PWM cycle) – adequate for servo motor control or
low quality audio.
Intelligent IO controllers – Many legacy
embedded systems and devices, as well as modern
units like GPS modules use low bit-rate RS232 for
serial packet-based communications.  Some
miniature integrated sensor devices like current and
temperature sensors use PWM and/or I2C to initiate
sensing and communicate sensor values.  All of these
interfacing functions can be offloaded to a
coprocessor node.
5.2.2. Coprocessing
In the role of a parallel coprocessor, the most
restrictive constraint is communications bandwidth.
If Microblaze takes longer to communicate data to
the coprocessor than to perform the actual
computations, then no performance improvement is
gained.  However it may still make sense to offload
the processing, if predictability is improved.  A
candidate for this role is low rate, high complexity
encryption.  Picoblaze has been demonstrated
performing AES encryption [14], and although at
low speeds, that encryption would come at
effectively zero cost to the central processor.
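The PWM output rate quoted in Section 5.2.1 can be cross-checked from the clock rate, the two-cycle instruction time and the instruction count per PWM cycle.  The helper below is a minimal sketch with a hypothetical name, pwm_rate_hz; it only restates that arithmetic.

```c
/* PWM cycle rate of a software timing loop: instructions executed per
 * second divided by the instructions spent on each PWM cycle. */
double pwm_rate_hz(double clock_hz, unsigned cycles_per_instr,
                   unsigned instr_per_pwm_cycle)
{
    double instr_per_sec = clock_hz / cycles_per_instr;
    return instr_per_sec / instr_per_pwm_cycle;
}
```

Calling pwm_rate_hz(66e6, 2, 2056) gives a value close to 16 kHz, consistent with the figure quoted for the implemented controller.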
5.3. Theoretical analysis
In this section we present a simple mathematical 
model and analysis of the architecture.  Certain 
reasonable assumptions are necessary – the intention 
is to establish identities and inequalities that describe
the range of useful applications. 
We assume that computational tasks are 
discretised such that each task involves 
communication of, and operations on n individual 
data items, and that there are N such tasks to be 
computed.  Let  
Tc(n) = To + nTd
be the time to communicate n data items between
the central processor and a single node, where To is the
fixed overhead (e.g. system call entry), and Td is the 
time to transmit one data item. 
Next, let TPB(n) be the time for a single Picoblaze 
node to perform a computation on those n data items, 
and let TMB(n) be the time for a Microblaze to 
perform the same computation, also on n data items. 
For simplicity we assume that Microblaze and 
Picoblaze implementations use algorithms with the 
same underlying complexity (f(n)).  Then, let  
K = TPB(n)/TMB(n) (1) 
be the sequential speedup of the Microblaze over 
Picoblaze for the specific algorithm (e.g. due to word 
length differences and other processor architecture
benefits).
Finally, let P be the number of coprocessors 
available, and we assume that this is equal to N, the 
number of tasks.  That is, we do not consider the 
problem of scheduling N tasks onto P processors if N ≠ P.
Figure 3 illustrates the processing sequence of a 
central Microblaze communicating with N=3 nodes.  
The Microblaze spends time Tc(n) communicating 
data to each node, each of which takes TPB(n) to 
complete.  Finally, the communication is repeated to 
return data to the Microblaze.  In this model, the 
nodes are being used as transformational 
coprocessors, producing as much data as they are 
fed.  This is one of the worst-case scenarios for the 
architecture, since it doubles the amount of
inter-node communication.
Figure 3. Simplified parallel processing
sequence for theoretical modeling
Because of the FIFO based communication, we
assume that each Picoblaze task can start as soon as a
single item has been sent, that is after a delay of
$T_o + T_d$.  Similarly, the central processor finishes
receiving its data with a delay $T_o$ after the
corresponding Picoblaze completes processing.
Finally, we assume that no Picoblaze task
completes before the Microblaze has completed
writing data to all tasks, or approximately
$T_{PB}(n) > N \cdot T_c(n)$.  This only holds for
relatively cheap communications and small numbers
of nodes P.  In the current system we are limited to
8 nodes, so consider this to be a reasonable
approximation.
For comparison, the sequential time required were
the Microblaze to perform all of the computation is
generously approximated as the product of the
number of tasks and the time to perform one task
(i.e. assuming zero communication overheads):
$$T_S = N \cdot T_{MB}(n) \qquad (2)$$
From Figure 3, execution time for the parallel
system is
$$T_P = 2(T_o + T_d) + (N-1)(T_o + nT_d) + T_{PB}(n) \qquad (3)$$
For ease of analysis we approximate (3) as
$$T_P \approx N(T_o + nT_d) + T_{PB}(n) \qquad (4)$$
From (4) we identify three different classes of tasks:
• Communication-bound:
$$N(T_o + nT_d) \gg T_{PB}(n) \qquad (5)$$
• Compute-bound:
$$N(T_o + nT_d) \ll T_{PB}(n) \qquad (6)$$
• Mixed: neither dominates.
We consider two metrics; the first is the speedup
S, the ratio between the parallel and sequential
processing times.  It represents the overall speedup
achieved by a parallel implementation:
$$S = \frac{N \cdot T_{MB}(n)}{N(T_o + nT_d) + T_{PB}(n)} \qquad (7)$$
For compute-bound systems (6), this simplifies to
$$S \approx \frac{N \cdot T_{MB}(n)}{T_{PB}(n)} = \frac{N}{K} \qquad (8)$$
Thus, for compute-bound systems, the proposed
architecture will result in an overall speedup (S > 1) if
the number of coprocessor nodes N exceeds the
sequential speedup of a Microblaze implementation
over a Picoblaze implementation (K).  Since N is
bounded, the scalability of this speedup is limited.
For communication-bound systems the speedup
approximates as
$$S \approx \frac{T_{PB}(n)}{K(T_o + nT_d)} \qquad (9)$$
However, communication-bound implies Eqn.
(5), and thus we have
$$S \ll 1 \qquad (10)$$
This result is unsurprising: a communication-
bound system spends too much time moving data
around, and not enough time actually performing
work on that data.
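The step from Eq. (7) to Eqs. (9) and (10) can be written out explicitly. The sketch below assumes that K denotes the ratio $T_{PB}(n)/T_{MB}(n)$, as implied by Eq. (8); it is a reading of the equations above rather than an additional result.

```latex
\begin{align*}
S &= \frac{N\,T_{MB}(n)}{N(T_o + nT_d) + T_{PB}(n)}
   && \text{Eq. (7)} \\
  &\approx \frac{N\,T_{MB}(n)}{N(T_o + nT_d)}
   = \frac{T_{MB}(n)}{T_o + nT_d}
   && \text{drop } T_{PB}(n) \text{ using Eq. (5)} \\
  &= \frac{T_{PB}(n)}{K\,(T_o + nT_d)}
   && \text{substitute } T_{MB}(n) = T_{PB}(n)/K \text{, giving Eq. (9)} \\
  &\ll \frac{N}{K}
   && \text{since Eq. (5) gives } T_{PB}(n) \ll N(T_o + nT_d)
\end{align*}
```

Under this reading, the bound in Eq. (10) follows directly whenever the number of nodes N does not exceed K.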
The theoretical analysis shows that migrating 
heavily compute-bound functions onto coprocessor 
nodes results in significant offloading of the central 
processor, and confirms that communication-bound 
processes are to be avoided. 
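As a minimal sketch, the timing model of Eqs. (2), (3), (4) and (7) can be evaluated numerically. The function names, the `margin` threshold used to decide what counts as "much greater than", and the cycle counts in the usage note are illustrative assumptions, not measurements from the implemented system.

```python
def sequential_time(N, t_mb):
    """Eq. (2): the Microblaze performs all N tasks, with no communication cost."""
    return N * t_mb


def parallel_time(N, n, t_o, t_d, t_pb):
    """Eq. (3): parallel execution time read from the Figure 3 sequence."""
    return 2 * (t_o + t_d) + (N - 1) * (t_o + n * t_d) + t_pb


def parallel_time_approx(N, n, t_o, t_d, t_pb):
    """Eq. (4): simplified form of Eq. (3)."""
    return N * (t_o + n * t_d) + t_pb


def speedup(N, n, t_o, t_d, t_mb, t_pb):
    """Eq. (7): sequential time divided by approximate parallel time."""
    return sequential_time(N, t_mb) / parallel_time_approx(N, n, t_o, t_d, t_pb)


def classify(N, n, t_o, t_d, t_pb, margin=10.0):
    """Classify a task per Eqs. (5) and (6).

    `margin` is an assumed factor standing in for '>>' and '<<'.
    """
    comm = N * (t_o + n * t_d)
    if comm >= margin * t_pb:
        return "communication-bound"
    if t_pb >= margin * comm:
        return "compute-bound"
    return "mixed"


if __name__ == "__main__":
    # Hypothetical cycle counts: 8 nodes, K = t_pb / t_mb = 4 in both cases.
    heavy = dict(N=8, n=16, t_o=4, t_d=1, t_mb=10_000, t_pb=40_000)
    light = dict(N=8, n=64, t_o=4, t_d=1, t_mb=10, t_pb=40)
    for name, p in (("heavy", heavy), ("light", light)):
        kind = classify(p["N"], p["n"], p["t_o"], p["t_d"], p["t_pb"])
        print(name, kind, round(speedup(**p), 3))
```

With these assumed values the heavy task is classified as compute-bound and its speedup is close to N/K, as in Eq. (8), while the light task is communication-bound and its speedup falls below 1.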
6. Conclusions and Future Development 
We have presented an architecture for parallel 
processing of computational and IO tasks on 
programmable microcontrollers linked to a central 
microprocessor in a star topology.  We have further 
shown how this coprocessor array may be naturally 
and efficiently integrated into the Linux operating 
system executing on the central processor, and that 
such a mapping provides uniform and simple access 
to the coprocessor array. 
Performance predictability – the impact on 
central CPU load is minimised by mapping real-time 
tasks onto coprocessor nodes. 
Flexibility – a node with access to an external IO 
port may be used as a regular computational node 
when not required for IO duties.  Coprocessing tasks 
not dependent on external node connectivity may be 
executed on any available node. 
Simplified logic re-use – the simplified 
programming model offers some logic re-use 
advantages of partial reconfiguration without the 
excessive complexity. 
The proposed architecture can just as easily be 
applied to systems with more powerful coprocessor 
nodes.  Indeed, with increased logic usage the 
architecture could be inverted, placing a simple 
microcontroller at the heart of an array of powerful 
microprocessors. 
Future developments in the work will include 
experiments with different types of coprocessors, 
including custom nodes with greater support for 
numeric processing such as DSP operations, as well 
as investigations to see how the coprocessors 
themselves may be used to accelerate specific 
operating system functionality.  More broadly, we 
are investigating a wide array of single-chip 
multiprocessing architectures, and their mappings 
into the Linux operating system paradigm. 
7. Acknowledgements 
The support of the Australian Research Council is 
gratefully acknowledged.  The authors would also 
like to thank Goran Bilski for advice on the 
implementation of the ideas presented in this paper. 
8. References
[1] N. W. Bergmann, G. Brebner, and J. P. Gray,
"Reconfigurable Computing and Reactive Systems,"
in Proc. Australasian Workshop on Parallel and
Real-Time Systems (PART '00), Newcastle, Australia,
2000.
[2] A. Rubini and J. Corbet, Linux Device Drivers, 2nd
ed: O'Reilly and Associates, 2001.
[3] Xilinx, "Microblaze Processor Reference Guide,"
Xilinx, Inc., 2003, pp. 136.
[4] Xilinx, "Fast Simplex Link Channel," Xilinx, Inc.,
San Jose, CA, Product Specification DS449, 22nd
Mar. 2004.
[5] G. Kahn, "The semantics of a simple language for
parallel programming," in Proc. IFIP '74,
pp. 471-475, Amsterdam, 1974.
[6] C. Ross and W. Bohm, "Using FIFOs in
Hardware-Software Co-Design for FPGA Based Embedded
Systems," in Proc. IEEE Symposium on
Field-Programmable Custom Computing Machines, Napa,
CA, 2004.
[7] J. A. Williams and N. W. Bergmann, "Embedded
Linux as a platform for dynamically
self-reconfiguring systems-on-chip," in Proc.
International Conference on Engineering of
Reconfigurable Systems and Algorithms (ERSA '04),
Las Vegas, Nevada, 2004.
[8] M. Barabanov, "A Linux-based Real-Time Operating
System." Socorro, New Mexico: New Mexico
Institute of Mining and Technology, 1997, pp. 43.
[9] P. Mantegazza, "Dissecting RTAI," online at
http://www.aero.polimi.it/~rtai/documentation/articles/paolo-dissecting.html,
accessed 7th June 2004.
[10] K. Chapman, "PicoBlaze 8-Bit Microcontroller for
Virtex-II Series Devices," Xilinx, Inc., San Jose,
Application Note XAPP627, 4th Feb. 2003.
[11] H. Fu and J. W. Lockwood, "The FPX KCPSM
Module: An Embedded, Reconfigurable Active
Processing Module for the Field Programmable Port
Extender (FPX)," Washington University,
Department of Computer Science, Technical Report
WUCS-01-14, July 2001.
[12] U. Bidarte, A. Astarloa, A. Zuloaga, J. Jimenez, and
I. M. d. Alegria, "Core-Based Reusable Architecture
for Slave Circuits with Extensive Data Exchange
Requirements," in Proc. Field Programmable Logic
and Applications (FPL '03), Lisbon, Portugal, 2003.
[13] Xilinx, "OPB Serial Peripheral Interface (SPI),"
Xilinx, Inc., San Jose, CA, Data Sheet DS464, 29th
July 2004.
[14] Mediatronix, "Picoblaze code for Rijndael (AES-128)
block cipher," online at
http://www.mediatronix.com/examples/Rijndael-2.htm,
accessed 7th June 2004.
