A real-time asymmetric multiprocessor-reconfigurable system-on-chip architecture by Xie, Xin et al.
A Real-time Asymmetric Multiprocessor Reconfigurable System-on-Chip Architecture [i]
Xin Xie∗, John A. Williams∧, Neil W. Bergmann#
School of Information Technology and Electrical Engineering, 
The University of Queensland, Brisbane Q 4072, AUSTRALIA 
ABSTRACT 
We propose an asymmetric multi-processor SoC architecture featuring a master CPU running uClinux and multiple
loosely-coupled slave CPUs running real-time threads assigned by the master CPU.  Real-time SoC architectures often
demand a compromise between a generic platform for different applications and application-specific customizations to
achieve performance requirements. Our proposed architecture offers a generic platform running a conventional
embedded operating system, which preserves a traditional software-oriented development approach, while multiple slave
CPUs act as dedicated, independent real-time thread execution units running in parallel with the master CPU to achieve
performance requirements. In this paper, the architecture is described, including the application / threading development
environment.  The performance of the architecture with several standard benchmark routines is also analysed.
Keywords:  FPGA, Reconfigurable Logic, System-on-Chip, Asymmetric multiprocessor, Real-Time, Embedded 
Systems 
1. INTRODUCTION 
Modern embedded systems are increasingly required to deliver real-time or near real-time performance while also
satisfying tight time-to-market and interoperability requirements. Real-time performance is typically
offered by dedicated hardware and microprocessors running real-time firmware or a microkernel, while rapid
development and interoperability are more readily provided by conventional operating systems such as embedded Linux.
One approach to meeting these requirements is to virtualise the operating system (OS) by running it as a low priority
thread on top of a real-time kernel.  This approach is complicated and error-prone, and requires an ongoing maintenance
and porting effort.
Another approach to satisfying real-time requirements is to add custom hardware logic to the generic System-on-Chip
(SoC). Designing such custom logic requires developers who understand both the software and hardware sides of
development, debugging and testing, which can cause unacceptable delays in a rapidly changing market.
Instead of multiplexing multiple software environments onto a single CPU or using hardware custom logic, we propose
an asymmetric, reconfigurable System-on-Chip multi-processor architecture, implemented on commodity FPGA
hardware, using multiple 32-bit CPUs, dual-port on-chip memory and FIFO type communication links.  This enables the
embedded operating system(s) to be used unmodified, with inter-process communication provided by a device driver
abstraction over the hardware FIFO interconnect.  Benefits of this architecture include (i) a low latency solution for
real-time application threads, (ii) minimal software changes for operating system and application porting, and (iii) a
generic architecture that provides good expandability and flexibility.
We review existing multi-processor SoC research and techniques in Section 2.  Our hardware and
interconnect architecture is then presented in Section 3, and Section 4 discusses the software/threading model and
application development scenario.  In Section 5 we present experimental results from testing the architecture, and finally
in Section 6 we present our conclusions and future research plans.
                                                          
[i] This work is partly supported by the Australian Research Council
∗ xxie@itee.uq.edu.au; Phone +61-7-33654307; Fax: +61-7-33654999
∧ jwilliams@itee.uq.edu.au; Phone +61-7-33658305; Fax: +61-7-33654999
# n.bergmann@itee.uq.edu.au; Phone +61-7-33651182; Fax: +61-7-33654999
Microelectronics: Design, Technology, and Packaging II, edited by Alex J. Hariz,
Proc. of SPIE Vol. 6035, 603508, (2006) · 0277-786X/06/$15 · doi: 10.1117/12.638216
2. BACKGROUND 
In our architecture we try to combine the strengths of dedicated hardware acceleration and software-based real-time
approaches by using an asymmetric multiprocessor architecture. This section first reviews relevant work on custom
real-time hardware logic and on software approaches to real-time operation. Existing multi-processor architectures on
SoC are then discussed. The last part of this section introduces the enabling technologies for the asymmetric
multi-processor used in this research.
2.1. Real-time system-on-chip architectures 
One technique to achieve real-time performance on an FPGA-based SoC is to use a customized processor core, i.e. to
add a reactive Instruction Set Architecture (ISA) and direct signal lines to an existing CPU [1]. However, this technique
requires modification of the software compiler to recognize these features, and the bottleneck between CPU speed and
memory access remains a challenge.
In addition to CPU customization, a completely new processor dedicated to real-time operation can be used [2], which
allows hardware control modules to shift some OS tasks into a hardware implementation. This approach reduces the OS's
overhead, but does not necessarily accelerate a particular application for real-time purposes.
A more common approach in the FPGA community to tackling real-time performance is to design custom logic that executes
the critical section of a specific application while the CPU carries out the less critical sections. This approach is the
basis of many research efforts [3]. However, it requires significant effort in hardware/software co-design and therefore
requires developers experienced in both hardware and software design. Further, such custom logic can generally
only target one specific application, so non-recurring development costs increase when new applications
emerge.
From a software perspective, a real-time operating system can be built on a microkernel that offers guarantees on
metrics such as interrupt and context-switch latency. Operating systems like Linux, which is implemented with a
monolithic kernel, do not offer real-time performance. To make Linux suitable for real-time applications,
architectures like RTLinux [4] are needed, which treat the Linux kernel as a low priority process under a separate
real-time microkernel.
2.2. Multiple processors for system-on-chip 
The most common multiple processor implementations are based on a Symmetric Multi-Processor (SMP) architecture, as
widely used in desktop and server environments for High Performance Computation (HPC). There are SMP
implementations on SoC [5,6], which show that the benefits of using SMP in SoC are similar to those in an HPC
environment: exploiting thread parallelism and standardizing software execution by creating one virtual CPU from
multiple physical CPUs.
An SMP architecture typically requires extensive OS integration and hardware cache management support; both can be
expensive given an SoC's already limited resources. In real terms the performance improvement from SMP, or from the
similar concept of Simultaneous Multithreading (SMT), is also marginal due to OS-associated overhead and the growing gap
between processor speeds and memory latency [7].
In contrast to SMP implementations, which primarily target HPC, the asymmetric multiple processor
system proposed in this paper is suited to real-time SoC systems because it has different properties:
• While an SMP system's goal is to maximize average CPU performance through CPU load balancing, a real-time
system depends on worst-case performance. An asymmetric multi-processor system can assign dedicated
computing resources to real-time tasks, and hence has better worst-case performance.
• Unlike an SMP system, an asymmetric multi-processor system can have distributed, local memories for each CPU,
thus avoiding memory and bus bandwidth scaling issues.
• An OS running on an asymmetric multi-processor system need not differ significantly from an OS running on a
single processor.
We have previously proposed an asymmetric system with a master CPU and up to eight co-processors connected with
FIFOs [8]. The coprocessors were 8-bit microcontrollers with limited computational capabilities.  It was found that the
most suitable application was to use the coprocessors as predictable, real-time IO controllers or sequencers, rather than as
computational offload engines.  However, some of the core ideas and abstractions from that work are brought forward to 
our current proposal. 
2.3. uClinux on MicroBlaze 
The hardware in our proposed architecture is based primarily on vendor-provided hardware IP cores, including the Xilinx
MicroBlaze soft-core CPU, hardware FIFO channels and on-chip dual-port memory primitives. The operating system on
the master CPU is uClinux, a commonly used and well known embedded operating system.
MicroBlaze is a 32-bit RISC soft-core CPU from Xilinx, which can be customized in several respects: bus
interfaces, cache size, hardware multiplier, etc. [9]. It has three different memory bus interfaces, on top of which
various flexible connection topologies for multiple MicroBlazes can be built, depending on the application.
Graphical or text-based tools are used to specify MicroBlaze systems, including CPU and peripheral parameters and
system interconnect topology.
In our architecture, a MicroBlaze running the uClinux operating system [10] is used as the master system controller.
uClinux is a Linux operating system variant designed for CPUs without a hardware memory management unit (MMU), as is
the case for MicroBlaze. uClinux requires very few modifications to most existing Linux applications and therefore
provides a large pool of existing applications.  However, as mentioned in the previous section, uClinux is not a
real-time operating system. In our approach uClinux is used for interactive applications and for general non-real-time I/O.
The motivation for using a fully-featured, “conventional” OS like uClinux in a SoC platform is quite simple. While
certain aspects of the FPGA-based computing approach are novel with respect to classic software systems, many of the same
problems and methodologies still apply. It is our contention that instead of throwing out these conventional
methodologies (and with them the skills of thousands of system designers) it is better to integrate support for this new
class of computational device into an existing context.
In the remainder of this paper we describe our proposed asymmetric multi-processor System-on-Chip architecture, which
employs uClinux on the MicroBlaze as a master system controller, with real-time threads distributed across multiple slave
CPUs that are also implemented as MicroBlaze cores.
3. HARDWARE ARCHITECTURE FOR ASYMMETRIC MULTI-PROCESSOR SOC 
We propose an asymmetric multiprocessor architecture to suit real-time SoC system design. This architecture is based on
Xilinx-provided hardware logic cores including MicroBlaze, a shared system bus and dual-port on-chip memory. As shown
in Figure 1, it consists of one master CPU and some number N of slave CPUs. The master CPU is designed as an
independent subsystem running the uClinux embedded operating system. Each slave is also an independent subsystem
with its own local memory, intended to run a different real-time thread or section of code.  Communication among slave
CPUs, and between slaves and the master, is achieved through design-time configurable point-to-point connections.  The
architecture also supports shared memory, via a global bus. These options are discussed in more detail later in the paper.

Figure 1. Asymmetric Multi-Processor SoC Architecture Overview
Key features of our architecture include: 
• the addition of extra slave CPUs requires no kernel changes in the master CPU operating system; 
• each slave CPU has its own local memory, and can be reprogrammed by the master CPU
at run time;
• slave CPUs run their threads without interruption and with predictable timing performance, making them 
suitable for real-time tasks; 
• communication between slave CPUs and master CPU is removed from the global shared bus, reducing 
contention and improving predictability; and 
• the connection topology of master CPU and slave CPUs is flexible and versatile, suitable for a wide range of 
applications. 
In the following sections we outline the details of the master and slave subsystems respectively, and then describe the 
integration and interconnection strategies for master and slave CPUs.  
3.1. Master CPU subsystem 
The master CPU in our architecture serves the purposes of  
• a common platform for existing applications, 
• a central processing unit for common I/O peripherals attached to the global shared bus, and 
• control and reprogramming of slave CPUs. 
As in the memory hierarchy of most CPUs, MicroBlaze access to external off-chip memory is slow compared with
on-chip memory. To achieve good performance it is important to configure the master CPU with instruction and data cache
units; this is in contrast to the slave CPU, which has no cache, as discussed in the following section.
The performance of the master CPU running existing applications on uClinux is not degraded by the additional slave
CPU subsystems. If some critical, computation-intensive sections of existing applications are shifted to slave
CPUs, the load on the master CPU decreases and performance therefore actually increases compared with a
standalone master CPU.
3.2. Slave CPU subsystem 
The primary motivations for having multiple asymmetric slave CPUs are
• to benefit from the parallelism of multiple pieces of code being executed simultaneously,
• to offload computation-oriented tasks from the master CPU to slave CPUs, and
• to achieve a dedicated, independent environment for each real-time thread with predictable running
performance.
In conventional FPGA system designs, the performance gain comes primarily from the fact that certain algorithms run
faster in hardware logic than in software. However, it can also be argued that part of the gain arises because each block
of hardware logic is dedicated to one computational task and runs independently, in parallel with (or streaming alongside)
other computation tasks.
In a similar way, slave CPUs can achieve a performance improvement by running a dedicated thread in an independent
environment, assigned by the master CPU at run time. Another benefit of using slave CPUs is that
application development remains a traditional software-based process, which reduces the cost of development.
 
Figure 2.  Memory Configuration for the Slave CPU 
 
The bottleneck for software-based real-time applications usually comes from memory access latency and OS overhead
(interrupt latency, context switching, etc.). In our proposed architecture, each slave CPU has its own low latency on-chip
local memory and runs one thread exclusively [ii].
Like many other RISC CPUs, MicroBlaze uses a Harvard architecture.  In contrast to von Neumann machines, Harvard 
CPUs have independent interfaces for accessing instruction and data memory. Typical MicroBlaze systems use a single 
dual port on-chip memory or memory arbitrator for both instruction and data memory accessing.  
In our approach, instead of one dual-port memory serving both memory interfaces of the CPU, two dual-port memories are
used to form separate instruction and data memories.  One port of each memory is connected to the MicroBlaze (either
instruction or data interface), while the other port is connected to the global system bus (see Figure 2).  This has the
benefit that each slave CPU has fast, contention-free access to its instruction and data memory, while that same memory
is also accessible from the master CPU for the purposes of initialising and programming thread code or data.
It is our design intention that running multiple slave CPUs should have minimum impact on the global bus.  For this 
reason, inter-processor communications are performed via hardware FIFO links as the primary connection. 
In most circumstances, the slave CPU’s interface to global bus is optional (the dotted line in Figure 2). However, this 
interface is needed if the slave CPU requires: 
• access to any IO devices on the bus, 
• access to the program or data memory of any other slave CPU in the system (for shared memory 
multiprocessing), or 
• the capability to write to its own instruction memory.  The Harvard architecture prevents write transactions on 
the instruction-side local memory bus, so for the slave to write its own instruction memory it must do so via the 
shared bus.  In most systems this functionality is not needed. 
The slave's instruction and data memories can be sized independently.  To make the best use of the limited on-chip
memory, the text and data sizes of each slave CPU's threads are analysed to decide appropriate sizes for the instruction
and data memories.
The total number of slave CPUs depends on the logic resources available on the FPGA. Table 1 shows the resource
usage for a single slave CPU (with 8KB each of instruction and data on-chip memory) on a Xilinx Virtex2-1000 device.
The table shows that the resource a slave CPU consumes most heavily is on-chip memory. More slave CPUs can be
accommodated by allocating less instruction/data memory, according to the specific application, or by using a larger
FPGA chip.
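
As a rough illustration of this resource argument, the following C sketch divides the device totals by the per-slave usage reported in Table 1 and takes the minimum over all resource types. It is only an upper bound, under the assumptions that all slaves are identical and that the master subsystem consumes nothing (which is not the case in a real design). The struct and function names are hypothetical and are not part of the original work.

```c
/* Resource counts for a device, or for one slave CPU (figures as in Table 1). */
struct fpga_resources {
    unsigned slices;
    unsigned flip_flops;
    unsigned luts;
    unsigned brams;
};

/* Upper bound on the number of identical slaves: the scarcest resource decides.
 * Assumption: per_slave fields are non-zero and the master uses no resources. */
unsigned max_slaves(struct fpga_resources device,
                    struct fpga_resources per_slave)
{
    unsigned limit = device.slices / per_slave.slices;
    unsigned n;

    n = device.flip_flops / per_slave.flip_flops;
    if (n < limit) limit = n;
    n = device.luts / per_slave.luts;
    if (n < limit) limit = n;
    n = device.brams / per_slave.brams;
    if (n < limit) limit = n;
    return limit;
}
```

With the Table 1 figures, block RAM gives the tightest bound (40 / 8), in line with the observation that on-chip memory is the limiting resource; the resources taken by the master subsystem lower the achievable number further.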
 
                                                          
[ii] In principle there is no reason against slave CPUs running a light weight real-time threading kernel, however we do not
consider that option further in this work.
Selected Device: 2v1000fg456-4
Resource                      Used   Available   Utilization
Number of Slices               789        5120           15%
Number of Slice Flip Flops     596       10240            5%
Number of 4 input LUTs        1023       10240            9%
Number of BRAMs                  8          40           20%
Table 1. FPGA logic usage of single slave CPU on Virtex2-1000
 
               FSL read   FSL write   LMB read   LMB write
Clock Cycles   2          2           1          1
Table 2. Slave bus access cycles [11]
 
               Read          Write         Read           Write
               (cache hit)   (cache hit)   (cache miss)   (cache miss)
Clock Cycles   1             1             10             7
Table 3. Master CPU external memory access cycles [11]
 
The slave CPU does not require an instruction or data cache. The slave CPU's program code runs from on-chip memory,
which is already very fast, so adding a cache (which is built from the same on-chip memory) would not increase
performance. Caches are in fact avoided in many real-time embedded systems because they make worst-case performance
unpredictable; the absence of a cache therefore makes the slave CPU more suitable for real-time applications.
Each slave MicroBlaze subsystem is a stand-alone software execution environment and is only loosely coupled to the
master MicroBlaze subsystem. A slave MicroBlaze subsystem has no direct role in uClinux on the master MicroBlaze, so
a thread running on a slave has all of that slave's resources available to it for optimizing real-time performance.
Each slave MicroBlaze has its own local memory bus and FIFO link, providing predictable access timing as shown in
Table 2. Compared with the master MicroBlaze's off-chip memory access times shown in Table 3, slave CPUs clearly
have faster and fixed memory access times.
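
To make the comparison concrete, the short C sketch below computes the expected master read cost for a given cache hit rate from the Table 3 cycle counts and sets it beside the fixed single-cycle LMB read of Table 2. The function name and the use of a simple weighted average are illustrative assumptions, not part of the original measurements.

```c
/* Cycle counts taken from Table 3 (master, external memory)
 * and Table 2 (slave, local memory bus). */
enum {
    MASTER_READ_HIT_CYCLES  = 1,
    MASTER_READ_MISS_CYCLES = 10,
    SLAVE_LMB_READ_CYCLES   = 1
};

/* Expected master read cost, in cycles, for a cache hit rate in [0, 1].
 * Assumption: a plain weighted average of hit and miss costs. */
double master_avg_read_cycles(double hit_rate)
{
    return hit_rate * MASTER_READ_HIT_CYCLES
         + (1.0 - hit_rate) * MASTER_READ_MISS_CYCLES;
}
```

Even at a 90% hit rate the master averages roughly 1.9 cycles per read with a worst case of 10, whereas a slave read over the LMB always costs SLAVE_LMB_READ_CYCLES; for real-time work the worst case is the figure that matters.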
3.3. Interconnecting slave CPUs with master CPU 
To integrate the slave CPUs with the master CPU running uClinux, we address the following issues:
• providing three separate bus data flows for master and slave CPUs,
• communication and synchronization between the master CPU and slave CPUs,
• a slave CPU control mechanism, and
• slave CPU re-programmability.
MicroBlaze has various configurable bus interfaces: a dedicated local on-chip memory bus (LMB), a
unidirectional hardware FIFO (FSL) and a global shared bus (OPB). In our proposed architecture, all of these buses are
used to achieve an effective but flexible integration between the master CPU and slave CPUs.
3.3.1. Separated Bus Data Flows in the Master and Slave CPUs 
In our proposed architecture, slave CPUs are loosely coupled to the master CPU. To suit real-time requirements, three
clearly separated data flows, each on its own medium, are provided for the master and slave CPUs, as shown in Figure 3.
The global shared bus (OPB) connects most on-chip peripherals; it therefore carries heavy and unpredictable
asynchronous data flows. In order to run uClinux, which requires a large memory footprint, the master MicroBlaze uses
the OPB to connect to the off-chip memory interface and other necessary I/O devices. Applications with no real-time
requirements are suited to running on the master CPU.
The local memory bus (LMB) used in slave MicroBlazes is suitable for real-time thread execution. LMB is a dedicated
bus connecting the MicroBlaze to on-chip memory, with single clock cycle access for both data
load and store (Table 2). A real-time thread running on a slave CPU uses the LMB for local memory access, so short
and predictable memory access times can be expected. Except for the cases listed in Section 3.2, slave CPUs do not
use the OPB, which is already congested.
The use of hardware FIFOs (FSL) is an important feature of our proposed architecture. The application-specific
point-to-point connection network is based on hardware FIFOs to achieve a low latency and predictable communication
channel between the master CPU and slave CPUs, as discussed in Section 3.3.3.
3.3.2. Reprogramming and Controlling slave CPUs 
Reprogramming of the slave CPU gives the advantage that different real-time threads can be loaded into one slave CPU
at run time. This contrasts with the use of custom hardware logic, which is only capable of running one
fixed algorithm for an application at run time.
Reprogramming is achieved by having the master CPU overwrite the slave CPU's local memory contents externally, using
the dual-port on-chip memory available on the FPGA and the multiple bus interfaces available to the MicroBlaze. To
reprogram a slave CPU, the master CPU (1) asserts a halt signal to the slave CPU, (2) copies the slave program's
binary from the external storage device into the corresponding slave CPU's instruction and data memories, and (3)
clears the slave CPU's halt signal. These three steps are shown in Figure 4.
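
The three steps can be sketched in C as follows. This is a minimal model rather than the authors' implementation: it assumes the halt control is exposed as a word with one bit per slave (the source only states that custom logic can control up to 32 slaves), and that the slave memories are reachable through ordinary pointers standing in for their OPB-side ports. The names reprogram_slave, copy_words and halt_reg are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a word image into a memory window (stand-in for the OPB-side BRAM port). */
static void copy_words(volatile uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Reprogram one slave (index 0..31). Assumption: bit `slave` of *halt_reg
 * holds that slave in reset while set; other bits are left unchanged. */
void reprogram_slave(volatile uint32_t *halt_reg, unsigned slave,
                     volatile uint32_t *imem, const uint32_t *text, size_t text_words,
                     volatile uint32_t *dmem, const uint32_t *data, size_t data_words)
{
    uint32_t bit = (uint32_t)1 << slave;

    *halt_reg |= bit;                    /* step 1: halt the slave CPU          */
    copy_words(imem, text, text_words);  /* step 2: load the instruction memory */
    copy_words(dmem, data, data_words);  /*         and the data memory         */
    *halt_reg &= ~bit;                   /* step 3: clear the halt signal       */
}
```

Halting before the copy matters: the slave must not fetch from an instruction memory that is only partly rewritten.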
 
                                       Slave CPU N's         Slave CPU N's
                                       instruction memory    data memory
Arbitrarily assigned OPB mask number   0xN0000000            0xN0000000
LMB (local) base address               0x00000000            0x00000400
OPB (global) base address              0xN0000000            0xN0000400
Table 4. Slave CPU memory mapping scheme
 
We designed a bus address mapping technique: each dual-port memory's OPB bus address equals its LMB bus
address combined with a mask number arbitrarily assigned to each slave CPU. Table 4 shows an example. The master
CPU uses this memory mapping to reprogram each slave CPU.
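
A minimal C sketch of this mapping is given below. It assumes, following the 0xN0000000 pattern in Table 4, that the slave number N occupies the top four address bits and that the mask is combined with the local address by bitwise OR; both the bit position and the function names are assumptions made for illustration.

```c
#include <stdint.h>

/* OPB mask for slave n, following the 0xN0000000 pattern of Table 4.
 * Assumption: n fits in one hex digit and sits in address bits 31..28. */
uint32_t slave_opb_mask(unsigned n)
{
    return (uint32_t)(n & 0xFu) << 28;
}

/* Global (OPB) address of the location that slave n sees at lmb_addr.
 * Assumption: lmb_addr is below 0x10000000, so the OR cannot clash with the mask. */
uint32_t lmb_to_opb(unsigned n, uint32_t lmb_addr)
{
    return slave_opb_mask(n) | lmb_addr;
}
```

For example, under these assumptions the data memory that slave 2 sees at local address 0x00000400 appears to the master at 0x20000400.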
Apart from the data communication network (Section 3.3.3) and the reprogramming path described above, there are also
control signals that let the master CPU restart and halt slave CPUs. In our proposed architecture, there
is a separate network for these reset/halt functions. MicroBlaze has only a reset port, so in our experiment the slave
CPU's halt function is achieved by applying a constant high signal to the reset port. We also designed custom logic
that lets the master CPU control up to 32 slave CPUs.

Figure 3. Data flows in the asymmetric multi-processor architecture
Figure 4. Three steps in reprogramming slave CPU
3.3.3. FIFO link and shared memory between master and slave CPUs  
There are two communication methods for master CPU and slave CPUs: dedicated FIFO links and shared memory. Both 
options are supported in our proposed architecture as shown in Figure 1 and the actual choice will depend on the 
application.  
A FIFO link is the recommended method for communication between the master CPU and slave CPUs. MicroBlaze has a
built-in FIFO link bus, the Fast Simplex Link (FSL), which is a unidirectional, point-to-point bus interface [12]. FSL
links offer the advantages of short and predictable access cycles and natural synchronization due to FIFO semantics.
The similarity between this hardware FIFO and the software FIFO implementations found in operating systems (“pipe” and
“fifo” in UNIX terms) makes it a natural choice to map the hardware FIFOs into the uClinux environment as a software
FIFO-like device through the kernel device driver abstraction. To communicate with the threads in the slave CPUs from
the master uClinux environment, the hardware-mapped software FIFO is simply opened and then read from or written to as
required, like a normal file. We have previously discussed the motivation and implementation of the hardware FIFO and
OS integration in detail [13].
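
From the application's point of view this file-like access could look like the following C sketch, which uses only the standard POSIX open, write and close calls. The function name fifo_send_words is hypothetical, and the device node path is a parameter because the source does not name it; a path such as /dev/fslfifo0 would be an assumption.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Write n 32-bit words to the FIFO device node at `path`.
 * Returns 0 on success and -1 on any error. */
int fifo_send_words(const char *path, const uint32_t *words, size_t n)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    const unsigned char *p = (const unsigned char *)words;
    size_t left = n * sizeof *words;

    while (left > 0) {
        ssize_t w = write(fd, p, left);  /* may be a short write */
        if (w <= 0) {                    /* error, or no progress */
            close(fd);
            return -1;
        }
        p += w;
        left -= (size_t)w;
    }
    return close(fd) == 0 ? 0 : -1;
}
```

Because the driver presents the link as a file, the same code can be exercised against a regular file or a named pipe during development, before any hardware is attached.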
The master CPU has 8 read/write FSL link pairs, one of which is used for the slave CPU reset/halt control logic (see
Section 3.3.2). Thus the maximum number of slave CPUs with a direct connection to the master CPU is seven.
However, the total number of slave CPUs on the chip can be greater than this, because slave CPUs
in a streaming application do not all need a direct connection to the master CPU.
It is also possible to use shared memory as the communication mechanism. Each slave CPU has read/write access to
its own data memory through the local memory bus (LMB), while the master CPU has read/write access to the
same memory through the shared OPB bus. However, shared memory communication, whether master-to-slave or
slave-to-slave, requires bus access on the OPB, which cannot guarantee predictable access timing.  Therefore, shared
memory should be used cautiously as the communication mechanism for master/slave CPUs, especially for real-time
thread execution.
4. SOFTWARE DEVELOPMENT FOR ASYMMETRIC MULTI-PROCESSOR SOC SYSTEM 
One of the most important benefits of the asymmetric multiple processor architecture is that application code can
mostly be reused with minimal changes, since the system uses the existing OS environment, C compiler and toolset. In
the following sections we outline some development methodologies for this architecture.
4.1. Applications suitable for asymmetric multiprocessor 
The hardware configuration of the asymmetric multi-processor architecture can be adapted to different
applications: the loosely coupled master-slave CPU design allows various master-slave connection
topologies to be formed to suit the application's requirements. We propose three different configurations for connecting
the master CPU and slave CPUs.
 
 
 
 
Figure 5. Streaming configuration
Figure 6. Server-Client configuration
 
1. For streaming data applications, it is natural to use the FSL link, which is a hardware implementation of a FIFO
link. The master MicroBlaze can have up to 7 read/write FSL links, each of which can be connected to multiple
FSL-linked slave CPUs as shown in Figure 5.
2. For server-client applications, which require a direct connection between each slave CPU and the
master CPU, a maximum of 7 slave CPUs can connect to the master CPU using FSL links as shown in Figure 6.
The master CPU schedules and assigns different jobs to the slave CPUs and waits for the processed data to come
back.
3. Because each slave MicroBlaze can also have up to 8 FSL links of its own and an optional OPB bus interface for
connecting various I/O devices and custom logic, it is also possible to use a slave CPU as an intelligent I/O
agent that is logically disconnected from the master CPU while residing on the same
chip.
4.2. Application development environment on slave CPUs 
To increase system performance, thread level parallelism needs to be exploited on this multiprocessor architecture. A
conventional profiling methodology can be used to identify the critical program sections that can benefit from thread
level parallelism.  Those sections are then executed on the slave CPUs, using the slave/master CPU communication and
synchronization hardware provided by this architecture (see Section 3.3.3).
Although a slave CPU is designed to be reprogrammable at run time, the optimal memory allocation for each slave CPU
can still be estimated during the initial design stage.  As discussed in Section 3.2, on-chip memory size is the
dominant factor in the number of slave CPUs on a single chip, so a compromise needs to be made between
re-programmability and the total number of slave CPUs available.
The slave CPU can be programmed in C. Compared with C compilation for the master, the linker script is the only
thing that needs to be modified when compiling a slave CPU program. Seen from the MicroBlaze, the data-side on-chip
memory allows read/write access while the instruction side supports only read access. The linker script needs to specify
the address range and size of these two memories for the slave CPU. Each slave MicroBlaze may require a different
linker script if the slaves use different instruction/data memory sizes. In future work, we plan to generate the linker
script automatically from the hardware definition file.
4.3. FSL link driver in uClinux 
The most important piece of software for the asymmetric multiprocessor architecture handles communication between
the master CPU and slave CPUs. At the hardware level, we prefer to use FSL (FIFO) links, as discussed in previous
sections. From the operating system perspective, a device driver links the hardware and software
by providing an abstraction of hardware control inside the kernel.
The MicroBlaze instructions for the FSL link are essentially “get” and “put”, each of which has blocking and 
non-blocking versions. A blocking FSL instruction can stall the whole MicroBlaze, with no other means of releasing it, 
if data is not available on the FSL; blocking instructions are therefore avoided in the FSL device driver for uClinux.  
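The difference between the two styles can be illustrated with a small software model of a FIFO link. This is only a sketch of non-blocking semantics: the type, the function names and the depth of 16 words are assumptions made for illustration, and the code is neither the Xilinx FSL interface nor our driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 16u  /* assumed depth; the real FSL depth is a hardware parameter */

/* Software model of one FIFO link (illustration only, not the Xilinx API). */
struct fifo {
    uint32_t data[FIFO_DEPTH];
    unsigned head;   /* index of the next word to read */
    unsigned count;  /* number of words currently stored */
};

/* Non-blocking put: returns 0 on success, -1 if the FIFO is full. */
int fifo_try_put(struct fifo *f, uint32_t word)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return -1;               /* full: report failure instead of stalling */
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return 0;
}

/* Non-blocking get: returns 0 on success, -1 if the FIFO is empty. */
int fifo_try_get(struct fifo *f, uint32_t *word)
{
    assert(f != NULL && word != NULL);
    if (f->count == 0)
        return -1;               /* empty: the caller decides when to retry */
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 0;
}
```

Because both calls always return, the caller (here, a kernel driver) keeps control and can decide whether to retry, sleep or report an error. A blocking variant would instead wait inside the call until the FIFO state changes.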
To make sure that data is written to or read from the FSL link correctly, the first version of our FSL driver polled 
after each FSL read/write to verify the result of the FSL operation. However, our experiments showed that polling on 
FSL reads and writes not only wastes many CPU cycles but also greatly reduces the FSL driver’s data throughput and 
predictability. In the later version we adopted an interrupt-driven FSL device driver, which leads to lower CPU load 
and shorter latency. With the interrupt mechanism, the uClinux kernel can respond promptly to activity on the FSL link 
without increasing the kernel’s load while the FSL is not transmitting data.  
5. EXPERIMENTAL RESULTS 
5.1. Testing environment 
To test this architecture in practice, we chose MiBench as the benchmark suite. MiBench is designed specifically for 
embedded systems and covers a selection of different application domains; it is also freely available to the 
public [14]. So far we have tested three programs: bitcount, sha and ADPCM encode/decode. 
We have currently implemented the proposed architecture with the master CPU and only one slave CPU, using a Xilinx 
Virtex II 2v1000fg456-4 FPGA chip on the Insight V2MB1000 development board. The limited on-chip memory 
resources of this FPGA restrict us to one slave subsystem.  However, a single slave CPU is still sufficient to 
demonstrate the characteristics and expected performance of this system. 
5.2. Results 
We aim to test two different aspects of the system. The first is the overall speedup achieved by migrating critical 
code sections to the slave CPU.  The second is the relative improvement in system performance when the master CPU is 
under heavy load from other processing tasks. 
5.2.1. Speedup 
Running MiBench on this architecture is relatively easy using the method described in Section 4.2: we first use 
profiling tools to identify the critical section of the program, then compile the critical section as a standalone 
application for the slave CPU, and finally have the master CPU send data to and receive data from the slave CPU instead 
of running the critical section itself. The three benchmark applications, however, use different strategies for 
partitioning the load between master and slave. The speedup factor for the three test applications is shown in Figure 7.  
 
[Bar chart: speedup (vertical axis, 1 to 1.8) for bitcount, sha and ADPCM, with small and large data sets] 
Figure 7 MiBench Benchmark Results 
 
In the bitcount program, the application on the master CPU generates a random number and sends it to the slave CPU, 
where 6 different bit counting algorithms are applied. Because no further slave CPUs are available, we also let the 
master CPU do half of the computation alongside the slave CPU. This limits the speedup factor, because the time spent 
on the master CPU’s share of the computation becomes the bottleneck of the whole program. 
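To give a concrete picture of the kind of kernel being offloaded, the sketch below shows two well-known bit counting algorithms. They are written here for illustration only and are not copied from the MiBench sources; the function names are our own.

```c
#include <assert.h>
#include <stdint.h>

/* Kernighan's method: clear the lowest set bit until none remain. */
unsigned bitcount_clear_lowest(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Table lookup: count one 4-bit nibble at a time. */
unsigned bitcount_nibble_table(uint32_t x)
{
    static const unsigned char bits_in_nibble[16] = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
    };
    unsigned n = 0;
    while (x != 0) {
        n += bits_in_nibble[x & 0xFu];
        x >>= 4;
    }
    return n;
}

/* Cross-check helper: both algorithms must agree on every input. */
unsigned bitcount_checked(uint32_t x)
{
    unsigned n = bitcount_clear_lowest(x);
    assert(n == bitcount_nibble_table(x));
    return n;
}
```

Each call takes one word in and returns one small integer, so the communication cost per word is fixed while the compute cost depends on the algorithm. This is one reason the split of work between master and slave matters for this benchmark.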
In the ADPCM benchmark, the application reads the input data and compresses it by a factor of 4 before writing it to 
the output file. The master CPU streams the input file data to the slave CPU, while the slave CPU processes the data 
and sends the result back to the master CPU. ADPCM shows the smallest speedup factor of the three benchmarks because 
the master CPU must also receive data from the slave CPU after every send, which significantly increases the 
master-to-slave communication overhead.  
In the sha benchmark, the application reads a large amount of data from the input file and then generates a 20-byte 
output using the SHA hash function. This benchmark uses a regular streaming method: the master reads part of the input 
file into a software buffer, passes the data to the slave, and then reads the next portion of the input data; at the 
end, the master CPU reads the 20-byte output back from the slave CPU.  
For streaming to run at maximum speed, the processing times on the master and slave CPUs should be equal; the software 
buffer size is an important parameter in making this adjustment. The speedup factor falls below 2 in our implementation 
because the master CPU also spends time sending data to the slave CPU through the FSL link. 
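The master-side streaming loop for sha can be sketched as follows, assuming the FIFO link is reachable through ordinary POSIX file descriptors that support read() and write(). The function names and the descriptor arguments are hypothetical, this is not our benchmark code, and how the slave learns that the input has ended is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define DIGEST_LEN 20  /* SHA output size stated in the text */

/* Write exactly len bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const unsigned char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying on short reads. Returns 0 or -1. */
static int read_all(int fd, unsigned char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Stream the input to the slave one buffer at a time, then read the digest.
 * in_fd: input file; to_slave/from_slave: descriptors for the FIFO link
 * (with a real device file these could be the same descriptor).
 * Returns the number of input bytes streamed, or -1 on error.
 */
long stream_to_slave(int in_fd, int to_slave, int from_slave,
                     unsigned char *buf, size_t buf_size,
                     unsigned char digest[DIGEST_LEN])
{
    assert(buf != NULL && buf_size > 0 && digest != NULL);
    long total = 0;
    for (;;) {
        ssize_t n = read(in_fd, buf, buf_size);        /* master: file I/O */
        if (n < 0)
            return -1;
        if (n == 0)
            break;                                     /* end of input */
        if (write_all(to_slave, buf, (size_t)n) != 0)  /* master: link I/O */
            return -1;
        total += n;
    }
    if (read_all(from_slave, digest, DIGEST_LEN) != 0)
        return -1;
    return total;
}
```

The buf_size argument corresponds to the software buffer size discussed above: it sets how much file reading the master does between transfers, and therefore how well the master and slave processing times can be balanced.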
5.2.2. Impact of Master CPU load 
We chose sha to test the architecture’s performance under different operating system loads. In our experiment, sha 
depends heavily on the master CPU writing data to the slave CPU through a single FIFO channel, which can 
reduce the complexity of the analysis. Dhrystone is a common integer benchmark available on most systems; it can be 
viewed as a computationally heavy integer user process that makes no system calls to the kernel. We run 0, 1 and 2 
Dhrystone instances in the background to simulate light, medium and heavy system load under uClinux. 
Table 5 shows the timing of the different sections of sha running on one CPU and on the asymmetric multiprocessor. 
Tfile_I/O denotes the time spent reading the input file, TFSL_I/O the time spent accessing the FSL kernel driver, 
Toverhead the time added by the timing routines inside the main program, and Tcal the time spent in the main sha 
algorithm routines of the original benchmark program. Ttotal is the execution time of the whole application, excluding 
the program loading time in uClinux.  
 
Configuration      Timing      Light load   Medium load   Heavy load 
Master and Slave   Tfile_I/O   1.80s        3.83s         5.43s 
                   TFSL_I/O    4.79s        8.63s         12.70s 
                   Toverhead   0.71s        2.04s         3.41s 
                   Ttotal      7.30s        14.50s        21.54s 
Master only        Tfile_I/O   1.80s        3.69s         5.51s 
                   Tcal        11.27s       22.19s        33.87s 
                   Toverhead   0.83s        1.71s         1.98s 
                   Ttotal      13.90s       27.59s        41.36s 
Speedup                        1.9          1.9           1.9 
Table 5. sha timing under different system load 
From Table 5, it can be seen that under the different system loads the proposed architecture retains a stable speedup 
factor of around 1.9.  This speedup is slightly higher than the result shown in Figure 7.  Note that in this experiment 
the file read buffer size in the original benchmark program (master-only) was reduced to match that used in the 
master-slave version.   
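The totals in Table 5 equal the sum of the three components listed for each configuration, so the speedup row can be reproduced with a few lines of arithmetic. The sketch below is only a worked check of the table; the structure and function names are introduced here for illustration.

```c
#include <assert.h>

/* One column of Table 5: component times in seconds. */
struct run_times {
    double file_io;   /* Tfile_I/O */
    double work;      /* TFSL_I/O (master and slave) or Tcal (master only) */
    double overhead;  /* Toverhead */
};

/* Ttotal as the sum of the three measured components. */
double total_time(const struct run_times *t)
{
    assert(t != 0);
    return t->file_io + t->work + t->overhead;
}

/* Speedup = master-only total divided by master-and-slave total. */
double speedup(const struct run_times *master_only,
               const struct run_times *master_slave)
{
    double denom = total_time(master_slave);
    assert(denom > 0.0);
    return total_time(master_only) / denom;
}
```

With the Table 5 values, the ratios are 13.90/7.30 ≈ 1.90 under light load, 27.59/14.50 ≈ 1.90 under medium load and 41.36/21.54 ≈ 1.92 under heavy load, all of which round to the 1.9 reported in the table.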
We also measured the time spent in the slave CPU’s main sha algorithm routines for the same data size, excluding the 
receiving and sending of data; this time is 6.2s. It can be compared with the master CPU’s Tcal under light load, which 
is 11.27s. Apart from the fact that in the original sha the master CPU is running the OS at the same time, this speedup 
can also be attributed to the slave CPU’s faster memory access through its dedicated memory bus, which makes it an 
independent execution environment.  
The input file reading speed is consistent between the master-slave and master-only executions under the same load. 
This suggests that using the interrupt-based FSL kernel driver does not significantly affect system performance when 
its interrupt routine is scheduled by the kernel. 
The FSL driver is implemented as a device file under the kernel, which can be used in the same way as a normal Linux 
file operation. As with input file reading, the access time of the FSL driver is affected by the load on the whole 
system. Under light load, the execution times on the slave and master are similar, which yields the best result for 
streaming/parallelisation. Under medium and heavy load, FSL writing, together with input file reading, becomes the 
critical path of the whole program.  
Underneath the FSL operations, in addition to transfers from the CPU’s register file to the FSL device, most of the 
activity consists of transferring data from external memory to the FSL device, at which MicroBlaze is less efficient, 
especially when running an OS. To further improve the performance of this system, a different mechanism for reading 
and writing the FSL FIFO from the master CPU’s register file and the external memory is desirable.  
6.  CONCLUSIONS AND FUTURE RESEARCH 
Using an asymmetric multi-processor system for embedded SoC on FPGA has the following advantages: 
1. The system can be ported to most current embedded operating systems without kernel rework; development can 
therefore build on existing work with minimal changes, cutting development cost and shortening development 
time.  
2. Compared with an SMP-style system, global bus contention is reduced by using point-to-point (FIFO) 
communication between the master and slave CPUs; this offers the predictable timing that is 
essential for real-time applications. 
3. With FIFO communication, the master and slave CPUs have a natural synchronization mechanism in hardware, 
which can be mapped into the OS and used like a software FIFO. 
4. Adding more slave CPUs has minimal effect on the performance of the master CPU and of the other slave CPU 
subsystems, providing a more linear performance increase for a particular application. 
Future research on the asymmetric multi-processor architecture will focus on reducing the communication overhead 
between the master CPU and the slave CPUs. We also plan to test more benchmark applications to measure the 
computational and real-time performance improvement and the power consumption of this architecture.  
REFERENCES 
1. Salcic, Z. and P. Roop. “Customizing Processor Cores to Supply Reactivity”. in Proc. International Conference 
on Engineering of Reconfigurable Systems and Algorithms (ERSA '04). 2004. Las Vegas, Nevada. 
2. Isaacson, S. and D. Wilde. “The Task-Resource Matrix: Control for Distributed Reconfigurable Multi-Processor 
Hardware RTOS”. in Proc. International Conference on Engineering of Reconfigurable Systems and Algorithms 
(ERSA '04). 2004. Las Vegas, Nevada. 
3. Bergmann, N., P. Waldeck, and J. Williams. “A Catalogue of Hardware Acceleration Techniques for Real-Time 
Reconfigurable System on Chip”. in International Workshop on System-on-Chip for Real-Time Applications. 
Jun 2003. Calgary, Canada. 
4. Barabanov, M., “A Linux-based Real-Time Operating System”. 1997, New Mexico Institute of Mining and 
Technology: Socorro, New Mexico. p. 43. 
5. James-Roxby, P., P. Schumacher, and C. Ross. “A Single Program Multiple Data Parallel Processing Platform 
for FPGA”. in Proc. Field-Programmable Custom Computing Machines (FCCM 04). 2004. Napa, California. 
6. Hung, A., W. Bishop, and A. Kennings. “Symmetric Multiprocessing on Programmable Chips Made Easy”. in 
Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE'05). 2005. Munich, Germany. 
7. Ruan, Y., et al. “Evaluating the impact of simultaneous multithreading on network servers using real hardware”. 
in Joint International Conference on Measurement and Modeling of Computer Systems: Proceedings of the 2005 
ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. 2005. 
Banff, Alberta, Canada. 
8. Williams, J.A. and N.W. Bergmann, “Programmable Parallel Coprocessor Architectures for Reconfigurable 
System-on-Chip”. IEEE International Conference on Field Programmable Technologies (FPT04), 2004. 
9. Xilinx, MicroBlaze Processor Reference Guide. 2003, Xilinx Inc. p. 136. 
10. http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux/. 
11. Xilinx, MicroBlaze Processor Reference Guide. 2003, Xilinx Inc. p. 45-49. 
12. Xilinx, Fast Simplex Link Channel, in Product Specification DS449. 2004: San Jose, CA. 
13. Williams, J., N. Bergmann, and X. Xie. “FIFO Communication Models in Operating Systems for 
Reconfigurable Computing”. in 2005 IEEE Symposium on Field Programmable Custom Computing 
Machines (FCCM 05). 17-20 April, 2005. Napa, California, USA. 
14. Guthaus, M.R., J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, and R.B. Brown. “MiBench: A free, 
commercially representative embedded benchmark suite”. in IEEE 4th Annual Workshop on Workload 
Characterization. December 2001. Austin, TX. 
 
