Hardware/Software by Theerayod Wiangtong et al.
R
econfigurable hardware has
received increasing attention
in the past decade due to its
adaptable capability and short
design time. Instead of using
field-programmable gate arrays (FPGAs)
simply as application-specific integrated
circuit (ASIC) replacements, designers can
combine reconfigurable hardware with
conventional instruction processors in a
codesign system, providing a flexible and
powerful means of implementing computa-
tionally demanding digital signal processing
(DSP) applications. This type of codesign system is
the focus of this article.
Most traditional codesign implementations are
application specific and do not have a standard method for
implementing tasks. A hardware model is usually very different
from those used in software. These distinctive views of hardware and
software tasks can cause problems in the codesign process. For example, swapping tasks between hardware and software
can result in a totally new structure in the control circuit. In addition, many design tools leave the designers to make their
own decisions on task partitioning and scheduling, although these decisions dramatically affect the system performance
and cost. For example, partitioning in [1] has to be done manually and there is no reconfiguration at run-time.
This article presents a systematic approach to hardware/software codesign targeting data-intensive applications. We focus on 
application processes that can be represented in directed acyclic graphs (DAGs) and use a synchronous dataflow (SDF) model,
the popular form of dataflow employed in DSP systems [2], when running the processes. The codesign system is based on
the UltraSONIC reconfigurable platform, a system designed jointly at Imperial College and the SONY Broadcast Laboratory.
This system is modeled as a loosely coupled structure consisting of a single instruction processor and multiple reconfig-
urable hardware elements. We suggest a new method of constructing and handling system tasks for this real codesign sys-
tem. Both hardware and software tasks are structured in an interchangeable manner without sacrificing the benefit of
concurrency found in conventional hardware implementations. 
IEEE SIGNAL PROCESSING MAGAZINE [14] MAY 2005 1053-5888/05/$20.00©2005IEEE
[Theerayod Wiangtong, Peter Y.K. Cheung, and Wayne Luk]
Hardware/Software  
Codesign
[A systematic approach targeting 
data-intensive applications]
©
D
I
G
I
T
A
L
V
I
S
I
O
NOur design environment involves an automated partitioning
and scheduling algorithm to make a decision on where and
when tasks are implemented and run. The CPS algorithm, col-
lectively named for the three main steps of cluster, partition,
schedule, is proposed to find the minimum processing time
under a specified set of real-world conditions and constraints
(communication time, memory conflict, and bus conflict), which
are not addressed in [3]. (Note that processing time in this con-
text is the time to process all tasks in the system. It has the
same meaning as the maximum completion time, total produc-
tion time, schedule length, or makespan that can be found in
some literature.) Results from the CPS algorithm are used at the
implementation stage, where the application-independent infra-
structure is available to facilitate designers. This predesigned
infrastructure, provided as standard library modules, is responsi-
ble for the control/interface mechanism in the codesign system.
For example, the designed tasks must be encapsulated by the
standard task wrapper to be able to collaborate with the rest of
the system. To control operations such as task executions, run-
time reconfigurations, and data transfers, an automatically gen-
erated task manager program is used. This design approach
reduces design errors and supports system modularity, scalabil-
ity, and manageability for run-time reconfiguration.
THE ULTRASONIC RECONFIGURABLE PLATFORM
Our codesign environment targets UltraSONIC [4], a reconfig-
urable computing system designed to cope with the computa-
tional power and the high data throughput demanded by
real-time video applications. The architecture exploits the spa-
tial and temporal parallelism in video processing algorithms. It
also facilitates design reuse and supports the software plug-in
methodology.
The structure of the board is shown in Figure 1. The system
consists of plug-in processing element (PIPE) modules that per-
form the actual processing. The standard PIPE contains an
XCV1000E FPGA and 4 × 2 MB SRAM. The PCI bus connects
the UltraSONIC board to a host PC. Data transfers between the
UltraSONIC and the host PC are performed over this PCI bus.
(Note that the UltraSONIC main board is a universal PCI card,
meaning that it can operate at 66 or 33 MHz in a 32-or 64-b PCI
[FIG1] The architecture of the UltraSONIC reconfigurable platform.
LBC
(Local
Bus Controller)
XCV300
PCI
PCI Bus 
64-b 66 MHz
PIPE Bus - 64-b Address/Data and 2-b Control
PIPE 
16
PIPE 
3
PIPE 
2
PIPE 
1
Global Pipe Flow Bus
Pipe Flow Chain 32-b + 2-b Ctrl
32-b + 2-b Ctrl
Vertex Device XCV1000E
PIPE Engine (PE)
PIPE Router (PR)
PIPE Memory (PM)
REG
SRAM
SRAM
SRAM
SRAM
Data/Add
Data/Add
PIPE Flow Right 32-b PIPE Flow Left 32-b
Global PIPE Flow Bus (Global) 32-b
PIPE Bus (Global) 64-b
IEEE SIGNAL PROCESSING MAGAZINE [15] MAY 2005slot running at 5 or 3.3
V, controlled by the local
bus controller.) On the
board, there is one global
bus, called the PIPE bus
(PB), and two local
buses, called the PIPE
flow global (PFG) and PIPE flow chain (PFC).
Each PIPE consists of three main parts:
■ PIPE engine: handles computations specified by the user.
PE registers are used to receive parameter values from the
host during computational processes. 
■ PIPE router: responsible for data movement. Routes are
programmable via the internal PR registers. It consumes
around 10% of resources in an XCV1000E device.
■ PIPE memory: provides local buffering of data. Two independ-
ent data/address ports are provided for two equal memory
blocks (4 MB each).
SYSTEM SPECIFICATIONS AND MODELS
The reconfigurable hardware element of our codesign system is
the programmable FPGAs. Run-time reconfiguration is supported
in this reconfigurable plat-
form, which does increase
the complexity of our sys-
tem. Features that we are
concerned with in our
reconfigurable system are
as follows:
■ There are many hardware processing elements (PEs) that
are reconfigurable.
■ Reconfiguration can be performed at run-time.
■ All parameters such as the number of PEs, the number of
gates on each PE, communication time, and configuration
time are taken into account.
■ On each PE where the number of logic gates is limited,
hardware tasks may have to be divided into several temporal
groups that will be reconfigured at run-time.
■ Tasks must be scheduled without conflicts on shared
resources, such as buses or memories.
SYSTEM MODEL
As shown in Figure 1, the target system is a loosely coupled
model by nature, which means that there is no shared memory
used as a medium for transferring
data. There is only local memory on
each PE. To transfer data between
PEs, a communication channel is
directly established between both
ends that must communicate.
The target system can be modeled
as a system consisting of a single
software element and multiple recon-
figurable hardware elements [see
Figure 2(a)]. There are two main
types of buses in this system: global
and local buses. Configurations can
be completed through the global bus.
This global bus is also used as a com-
munication channel between hard-
ware and software. However,
transferring data between the hard-
ware PEs is done through the local
bus. Each PE has its own local mem-
ory for storing input and output data
for all internal tasks. In addition,
there is a well-established mecha-
nism for users to control PEs from
the host processor based on a set of
well-defined application program
interface (API) functions.
Tasks in a system process are
dynamically implemented and exe-
cuted (once step for each task) either
in the CPU or PEs, according to
precedence relationships and priori-
ties. All operations are controlled and [FIG2] The UltraSONIC system. (a) The system model. (b) An example of task implementation.
P
C
I
P
C
I
Software Reconfigurable Hardware
Global Communication Channel
Local Communication Channel
CPU
Mem
Mem Mem Mem
PE0 PE1 PEn
(a)
(b)
Local Communication Channel
Task
Manager
Task
Manager
Global Communication Channel
Local Task
Controller
and Memory
Interface
Mem Mem Mem
Task
0
Task
5
Task
12
Mem
Task
1
Task
2
Task
8
Task
10
Task
8
Reconfig Task
10
Task
7
Task
9
Task
3
Task
4
SW PE0 PE1 PEn
OUR ENVIRONMENT SUPPORTS 
AUTOMATIC PARTITIONING AND SCHEDULING
BETWEEN A HOST PROCESSOR AND A NUMBER 
OF RECONFIGURABLE PROCESSORS. 
IEEE SIGNAL PROCESSING MAGAZINE [16] MAY 2005IEEE SIGNAL PROCESSING MAGAZINE [17] MAY 2005
initiated by a task manager program running in software. There
exists a local controller in each PE to interact with the task
manager and to be responsible for all local operations such as
executing tasks, memory interfacing, and transferring data
between PEs. An example of implementing tasks in the system is
given in Figure 2(b).
TASK MODEL
The tasks that we implement in our system are assumed to con-
form to the following restrictions:
■ Software and hardware tasks are built uniformly to be per-
formed under the same control mechanism. This simplifies
system management and task
swapping.
■ Tasks implemented in each
hardware PE are coarse-grain
tasks, which may consist of one
or more functional tasks
(blocks or loops).
■ Communication between
tasks is always through local
single port memory used as
buffers between tasks.
■ Tasks for a PE may be dynam-
ically swapped in and out using
dynamic reconfiguration.
There are different types of
tasks to be specified in this system: normal, software-only,
hardware-only, and dummy tasks. A normal task is free to be
partitioned and scheduled either in hardware or software
resources. A software-only task is a task that users intentionally
implement in software without exception. Similarly, a hard-
ware-only task is implemented solely in hardware. A dummy
task is either source or sink for inputting and outputting data,
respectively, and is not involved in any computation. In our sys-
tem, we assume that inputs and outputs are initially provided
and written to the microprocessor memory; a dummy task is
therefore a software task by default.
EXECUTION CRITERIA
After tasks have been loaded into a PE and are ready to be
processed, they cannot be interrupted in the middle of the exe-
cution process; in other words, they are nonpreemptive. Task
execution is completed in three consecutive steps: read input
data, process the data, and write the results. This is done repeat-
edly until input data stored in memory is completely processed.
Thus, the communication time between the local memory and
the task (while executing) is considered to be a part of the task
execution time [5]. Also, resource conflicts must be prevented,
which means the same shared resource (such as memory or
bus) is only available for use by one task.
The main restriction of this execution model is that exactly one
task in a given PE is active at any one time. This is a direct conse-
quence of the single-port memory restriction that allows one task
to access the memory at any given time. However, multiple tasks
can run concurrently when they are mapped to different PEs. This
is an improvement over the model proposed by others in [6].
FIRING RULES
We adopt the rules by which data is processed by a task from the
synchronous data flow (SDF) computational model [7]. However,
in this work, our tasks are coarse-grained tasks represented in a
DAG and we use the PE local memory as buffers between nodes,
replacing the FIFO in [7] and [8]. As a result, tasks mapped to the
same PE must be executed sequentially to avoid memory con-
flict. If a task requires a large amount of input data, the data
must be sliced into sufficiently small units for processing to take
place. The task is fired repeatedly as read-execute-write cycles
until all data for the task has been processed. 
An example of our task execution model is shown in Figure 3.
In this example, task A, which has two incoming edges (IE1,
IE2) and two outgoing edges (OE1, OE2), is a task to be fired.
There are 100 tokens from IE1 and 50 tokens from IE2 to be
executed. The number shown inside the parentheses on each
edge is the number of data values needed for each firing itera-
tion, representing the consuming rate or producing rate of a
task. For instance, the consuming rate on IE1 of task A is four,
while the producing rate on OE1 is one.
To process data, task A first reads four tokens from IE1 fol-
lowed by two tokens from IE2. This must be performed sequen-
tially because these inputs are stored in the same single-port local
memory used as buffers on each edge. All of these input tokens
are stored inside the node before being processed. When execu-
tion is completed, one token is written out to OE1, followed by
three tokens to OE2. These steps are performed repeatedly until
all data (shown as the first number on the edge) is processed.
CODESIGN ENVIRONMENT
Figure 4 depicts the codesign environment of the UltraSONIC
system. It is divided into the front-end and the back-end
stage. The front end is responsible for system specifications,
input intermediate format, and system partitioning/schedul-
ing. The back-end involves hardware/software task design and
implementation, design verifications, control mechanisms, and
system debugging.
[FIG3] Example of firing process.
Tokens
Read (IE1) Read (IE2) Execute Write (OE1) Write (OE2)
Repeat 25 Times
A A AA A
IE1
OE1
25/(1)
100/(4) 50/(2)
75/(3)
OE2
IE2At the front end, the design to be implemented is assumed to
be described in a suitable high-level language, which is then
mapped to a DAG at coarse-grained level. Nodes and edges in the
DAG represent tasks and data dependencies, respectively. The
group of algorithms known as the CPS algorithm reads a textual
input file that includes DAG information and parameters for the
clustering, partitioning, and scheduling process. During this
input stage, users can specify the type of tasks as normal, 
software-only, hardware-only, or dummy tasks.
After obtaining the results of the partitioning and scheduling
process, which are the physical and temporal bindings for each
task, we can start the back-end design implementation phase. In
the case of hardware tasks, they may be divided into many temporal
groups that can either be statically mapped to a hardware resource
or dynamically configured during run-time.
We currently assume that software tasks are manually writ-
ten in C/C++, while hardware tasks are designed manually in a
hardware description language (such as Verilog in this work)
using a library-based approach. Once all the hardware tasks for a
given PE are available, they are wrapped in a predesigned cir-
cuit, called xPEtask, which is application independent.
Commercially available synthesis and place-and-route tools are
then used to produce the final configuration files for each hard-
ware element. Each task in this implementation method
requires some hardware overhead to implement the task frame
wrapper circuit. Therefore, our system favors partitioning algo-
rithms that generate coarse-grained tasks.
The results from the partitioning and scheduling process,
the memory allocator, the task control protocol, the API func-
tions, and the configuration files of hardware tasks are used to
automatically generate the codes for the task manager pro-
gram that controls all operations in this system (such as
dynamic configuration, task execution, and data transfer). The
resulting task manager is inherently multithreaded to ensure
that tasks can run concurrently where possible.
In the following section, the important parts of the codesign
environment, including the CPS algorithm, the task manager
program, and the infrastructure, are described.
THE CPS ALGORITHM
The CPS algorithm in the
front-end stage plays an
important role in this
codesign system. It helps
designers in task cluster-
ing, partitioning, and
scheduling, which are
known as intractable prob-
lems [9]. The CPS method
is a combination of three
heuristic algorithms: the
two-phase clustering, the
tabu search, and the list
scheduling. This combina-
tion is designed to obtain a
good result in a reasonable
time frame.
The two-phase cluster-
ing algorithm [10] is used
as a preprocessing step to
modify the granularity of
tasks and enable more task
parallelism. On average, it
has been shown to achieve
15% shorter processing
time for different task
granularities. A new,
smaller DAG with coarser-
grained tasks is then parti-
tioned and scheduled in
order to obtain the mini-
mum processing time.
The heuristic algo-
rithm, based on tabu [FIG4] The proposed codesign environment.
Design Specification in
High-Level Language
Parameters
for CPS
Displaying
Graphs
DAG
DAG
Info
Task Model,
Comm Model,
Target System
Model
Partitioning
(Tabu Search)
Scheduling
(List Scheduling)
Clustering
(Two-Phase)
Displaying
Graphs
Mapping and
Scheduling Info
Mapping and
Scheduling and
Final DAG Info
Mapping and
Communication
Info
Temporal HW Task Groups
SW Code
(Multithread
in C/C++)
SW Tasks
Target API,
Protocol
Memory
Allocation
Task Manager
Program
Generator
(C/C++)
Configuration Files of
Each Temporal HW
Group (.ucd)
VHDL
Verilog
Infrastructure
xPEtask
xPEcontrol
xPEregister
Commercial FPGA
Design Tools
(Xilinx)
SW
Debugger
HW
Debugger
Target
PCI
B
a
c
k
 
E
n
d
F
r
o
n
t
 
E
n
d
New DAG CPS
IEEE SIGNAL PROCESSING MAGAZINE [18] MAY 2005search, is used to parti-
tion tasks into software
and hardware [11]. It has
been modified for this
real system, which has a
search space of KN
(where K is the number
of processing elements
and N is the number of tasks). This exponentially increasing
search space makes an exhaustive search impractical. Although
heuristic search could yield a near-optimal solution, convergence
speed could be greatly improved by using a good initial guess. 
The list scheduler is used to order tasks, without any shared
resource conflicts, with regard to partitioning results, task
precedence, and the target system model. Our scheduling
process has a tight relationship with the partitioning process. It
is used as a cost function to examine the processing time of the
guessed solutions from the partitioner. After the processing time
information is obtained, it is sent back to guide the partitioner
to explore only promising regions. This iteration process
between partitioning and scheduling to minimize processing
time will terminate when the stop condition (such as the num-
ber of iterations specified by the designer) is met. 
THE TASK MANAGER
Two main control methods in a distributed architecture are cen-
tralized control and distributed control [12]. In this work, we
choose to implement the former due to its simplicity and good
matching to the PC host processor in our system. The centralized
control task manager program is used to orchestrate the sequenc-
ing of all hardware and software tasks, the transfer of data and the
synchronization between them, and the dynamic reconfiguration
of FPGAs in PEs when required. Because this program runs on the
software processor (a
host PC), the service
time of which is uncer-
tain and depends on sev-
eral unpredictable
factors, a time-triggered
method is not suitable
for real-time control
actions. To properly synchronize the execution of the tasks and the
communication between tasks, our task manager employs an
event-triggered protocol when running an application. However,
unlike a reactive codesign system [13], we do not regard external
real-time events as triggers. Instead, we use the termination of
each task execution or data transfer as event triggers, and the sig-
naling of such events is done through dedicated registers.
With a single CPU model, the software processor must run
simultaneously the task manager and the software tasks. A mul-
tithreaded programming technique is then employed to run
these two types of processes concurrently. A mutex (short for
mutual exclusion) is a way of communicating among threads
that are executing asynchronously.
EXAMPLE
Figure 5(a) shows an example of a DAG. After the partitioning
and scheduling process, operation sequences to run the DAG
can be obtained as shown in Figure 5(b). This information is
used to automatically generate the task manager program based
on the message-based, event-triggered protocol as described ear-
lier. In this example, the task manager first sends a message to
execute task A in the processor (SW). Consequently, the config-
uration process runs to load the temporal group of tasks B and
C into PE1. The task manager waits until task A is finished
before initiating data transfer between SW and PE1, preparing
[FIG5] (a) DAG example, (b) operation sequences of the task manager, and (c) the task manager control view.
Src
SW
PE0
PE1
Sink Sink Sink Sink
A
B C
DE
Temporal
Group 1
Temporal
Group 2
(a)
SRC
SRC
SRC
SRC
SRC
DST
DST
TRF
TRF
TRF
TM
CFG
CFG
Task
E
Task
D
Task
B,C
SW PE0 PE1
TRF
EXE A
EXE B
EXE C
EXE E
EXE D
TRF DST
A
B
C
D
E CFG
(b)
Global
Variable
for SW
Tasks
Task
A
Message
Board Message Board
Message Board
Task Manager
Check Operation
Messages on Each
PE and Decide
When to
+ run HW Tasks
+ run SW Tasks in
  Background
  (multithread)
+ Reconfig HW
  Tasks
+ Transfer Data
  Between PEs
+ Transfer Data
  Between
  SW and PE.
Registers
Start(TaskID=E)
Finish Transfer
Finish(TaskID=)
Data Transfer
Registers
Start(TaskID=C)
Finish Transfer
Finish(TaskID=B)
Data Transfer
Task
E
Task
B
Task
C
Task
Controller
Task
Controller
SW PE1
PE0
(c)
N.B. TM = Task Manager
CFG = Configuration Period
TRF = Inter-PE Data Transfer
DST
DST
OUR DESIGN ENVIRONMENT INVOLVES AN 
AUTOMATED PARTITIONING AND SCHEDULING
ALGORITHM TO MAKE A DECISION ON WHERE 
AND WHEN TASKS ARE IMPLEMENTED AND RUN. 
IEEE SIGNAL PROCESSING MAGAZINE [19] MAY 2005for task B to be executed
next. This example also
shows that there are two
hardware temporal
groups for PE1 that will
be reconfigured at run-
time.
Figure 5(c) shows the
conceptual control view of the task manager and its operations.
The task manager communicates with a local task controller on
each PE in order to assert control. A message board is used in
each PE to receive commands from the task manager or to flag
finishing status to the task manager. As can be seen, a message
indicating execution completion from task B is posted to a specif-
ic register inside PE1. The task manager program polls this reg-
ister, finds the message, and then proceeds to the next scheduled
task (in this case, task C). Using this method, tasks on each PE
run independently because the program operates asynchronously
at the system level.
LIMITATIONS
At present, the task manager program, which is based on 
an asynchronous control protocol and runs in software, is not
applicable to process
streaming data that
usually employs a
pipelined structure. We
cannot initiate several
messages to run tasks in
different pipeline stages
simultaneously from
the manager program, which is a sequential software code.
However, one possibility to extend this work is to use special
hardware nodes that can handle real-time pipelining to control
operations, rather than the task manager program [14].
DESIGN INFRASTRUCTURE: THE TASK WRAPPER
Both hardware and software tasks are designed manually based
on the results of partitioning. However, we provide an infrastruc-
ture to assist designers. A task core will reside in a standard pre-
designed structure called the task wrapper, which is responsible
for cooperating with the task manager and local controllers. The
software task wrapper is an automatically generated code that
partly resides in the task manager program. The hardware task
wrapper is a predesigned HDL code with generic parameters for
different tasks with different consumption and production rates. 
[FIG6] The hardware design structure in each PIPE.
Task A
xPE Task xPE Task
xPE Task xPE Task
xPE Task xPE Task
xPE Control xPE Control
xPE Register xPE Register
Task Req Task Ack
Controller
Memory
Interface
Task B
Task C
P
F
 
I
N
P
F
 
m
e
m
 
I
N
P
F
 
m
e
m
 
O
U
T
P
F
 
 
O
U
T
D
a
t
a
 
O
U
T
D
a
t
a
 
I
N
PIPE Bus Port A Port B
Mem Control
PIPE Router
Bus Control and
Routing
PF Left (PFC) PFG PF Right (PFC)
Page 1
Page 2
Page 3
Pagen
Local
Memory
2×4 MB
PIPE
Memory
Address
Data IN/OUT
Start X Reg
Finish X Reg
Inter PIPE Trf Reg
Inter PIPE Trf Cmd
Task Data in Reg 0
Task Data in Reg 1
Task Data in Reg 2
Task Data in Reg 3
Task Data Out Reg 0
Task Data Out Reg 1
Task Data Out Reg 2
Task Data Out Reg 3
Task Data Ctrl Reg 0
Task Data Ctrl Reg 1
Hardware Task
Hardware Overhead
(The Task Wrapper) Task Core
Circuit
Circuit
Control
(a) Use Shift-Register
(b) Use RAM Block
Control
Addr Cnt
R
A
M
 
B
l
o
c
k
R
A
M
 
B
l
o
c
k
WE
Addr WE
Addr
I
/
P
 
S
h
i
f
t
 
R
e
g
I
/
P
 
B
u
f
f
e
r
O
/
P
 
B
u
f
f
e
r
O
/
P
 
S
h
i
f
t
 
R
e
g
PIPE Engine
IEEE SIGNAL PROCESSING MAGAZINE [20] MAY 2005
TO CONTROL OPERATIONS SUCH AS TASK 
EXECUTIONS, RUN-TIME RECONFIGURATIONS, 
AND DATA TRANSFERS, AN AUTOMATICALLY 
GENERATED TASK MANAGER PROGRAM IS USED.By conforming to a set of design
rules, designers can concentrate on con-
structing the task cores without having
to worry about the interfacing or control-
ling mechanisms. This infrastructure
facilitates fast design cycles and reduces
error-prone aspects of the design process.
IMPLEMENTATION AND RESULTS
This section first provides implementa-
tion details of the proposed codesign
framework for the UltraSONIC system.
We then describe a case study for our
approach based on JPEG compression
and present some experimental results.
Figure 6 shows how the predesigned
infrastructure for hardware tasks is
implemented in the UltraSONIC PIPE.
The infrastructure consists of three
application-independent modules:
xPEcontrol, xPEregister, and xPEtask.
Based on the information on
xPEregister, xPEcontrol can control
operations of all hardware tasks resident
in xPEtask wrappers. The total hardware
overhead of this infrastructure is mod-
est. It consumes around 10% of the
XCV1000E FPGA on each PIPE.
CASE STUDY: JPEG COMPRESSION
JPEG compression has been used as a real codesign application
[15]–[16] for grey scale images, using a block size of 4 × 4 pix-
els. In this work, however, the standard JPEG compression [17]
with block size of 8 × 8 pixels for color images is implemented
using the DCT baseline method with sequential encoding.
The DAG of the JPEG compression algorithm is illustrated in
Figure 7. Tasks such as Read.BMP, RGB2YCbCr, Encoder, and
Write.JPG are implemented in software for convenience. The
other modules (Level Shifters, 2D-DCTs, and Quantizers) can be
implemented in either software or hardware. For hardware, the
designed tasks are wrapped by the xPEtask wrapper with the
RAM structure. Note that we employ a one-dimensional fast
DCT architecture for eight pixels [18]. To deal with two-dimen-
sional data blocks with size of 8 × 8, the DCT block is applied
first in the horizontal direction, then the vertical.
EXPERIMENTAL RESULTS
For the software-only solution, C++ codes of every task module
are written and run under the control of the task manager pro-
gram. Table 1 shows execution times of the software-only solu-
tion for different sizes of pictures. The values are averaged from
20 runs on the UltraSONIC host, a PC with a Pentium II proces-
sor running at 450 MHz.
When using hardware to alleviate computational jobs in this
JPEG algorithm, we can substantially reduce the processing time
by around 83% on average for given image sizes (see Table 2). To
accomplish this, the tasks, including Level Shifter, 2D-DCT, and
[FIG7] (a) DAG of JPEG compression algorithm implemented in this work. (b) The results
after clustering and partitioning.
Read.BMP
RGB2YCbCr
0
1
234 64
567
64
64
Y Cb
Level
Shifter
2D-DCT
8×8
2D-DCT
8×8
89 1 0
11
12
(a) (b)
2D-DCT
8×8
Level
Shifter
Level
Shifter
Cr
64
Quantize Quantize
Zigzag and
Huffman
Encode
Write.JPG
Quantize Quantization
Table
Huffman
Table
PIPE0
PIPE1
SW
SW
0,1
2,5,8 3,6,9 4,7,10
11,12
S
W
-
O
n
l
y
 
T
a
s
k
H
W
 
o
r
 
S
W
 
T
a
s
k
S
W
-
O
n
l
y
 
T
a
s
k
IEEE SIGNAL PROCESSING MAGAZINE [21] MAY 2005
[TABLE 1] EXECUTION TIMES OF TASKS RUNNING IN SOFTWARE.
EXECUTION TIME (ms)
IMAGES BMP READ RGB2 LEVEL 2D- WRITE JPEG
(PIXELS) SIZE .BMP YCbCr SHIFTER* DCT* QT* ENCODER JPEG SIZE
400 × 300 352 KB 35.5 130.2 12.7 834.9 67.4 193.9 4.0 11 KB
640 × 480 901 KB 90.1 360.5 32.1 2193.4 192.0 481.3 8.0 18 KB
800 × 600 1,407 KB 151.3 572.8 50.1 3437.3 299.1 835.1 8.0 55 KB
800 × 800 1,876 KB 186.1 758.0 63.7 4690.5 325.7 1121.7 11.0 63 KB 
1,024 × 768 2,305 KB 257.3 862.1 86.8 5894.4 398.0 1402.0 15.0 70 KB
PERCENTAGE – 1.14% 4.24% 1.18% 80.90% 6.08% 6.38% 0.07% –
*The average time to process one color component.Quantizer (which consume about 88% of processing time if all
tasks are run in software), are all moved into hardware. This
move results in substantial processing time reduction. The cor-
rect results of every image size are produced.
We also accomplish the implementation of the 8- and 16-
point FFT algorithms, which contain 24 and 52 task nodes,
respectively. The radix-2 butterfly node is represented as a task
in the implementation. Unfortunately, the details cannot be
exhibited in this article due to space constraints.
SUMMARY
This article introduces and demonstrates a task-based 
hardware/software codesign environment specialized for real-
time video applications. Both the automated partitioning and
scheduling environment (the predesigned infrastructure in the
form of wrappers) and the task manager program help to pro-
vide a fast and robust route for supporting demanding applica-
tions in our codesign system. The UltraSONIC reconfigurable
computer, which has been used for implementing many indus-
trial-grade applications at SONY and Imperial College, allows
us to develop a realistic system model. Many simplifying
assumptions found in previous research, such as zero com-
munication overhead and no possible resource conflicts,
become unnecessary. 
Current and future work includes improvement of our task
execution model, which uses local memory as a shared buffer for
hardware tasks on each PE. This limits the possible degree of con-
currency within a PE. The task manager should also be improved
for better concurrency. Furthermore, we plan to extend our ideas
here to cover the SONIC-on-a-chip system [19], which would
require improving our system model at different levels. 
AUTHORS
Theerayod Wiangtong received the B.Eng. degree in electron-
ic engineering from KMITL (King Mongkut’s Institute of
Technology Ladkrabang) in 1993, the M.Sc. degree in satellite
communication from the University of Surrey in 1996, and
the Ph.D. degree in digital system design (codesign) from
Imperial College, London, in 2004. He is now working as a
lecturer in the electronic department at Mahanakorn
University of Technology, Thailand. His research areas include
digital VLSI design, optimization techniques, hardware/soft-
ware codesign, reconfigurable computing, and FPGA imple-
mentations for various applications.
Peter Y.K. Cheung is the deputy head of the Electrical and
Electronic Engineering Department at Imperial College,
University of London, where he is professor of digital systems.
His research interests include VLSI architectures for DSP and
video processing, reconfigurable computing, embedded systems,
and high-level synthesis and optimization of digital systems,
particularly those containing field-programmable logic. 
Wayne Luk is a professor of computer engineering in the
Department of Computing, Imperial College, London, and leads
the Custom Computing Group. His research interests include theo-
ry and practice of customizing hardware and software for specific
application domains, such as graphics and image processing,
multimedia, and communications. Much of his current work
involves high-level compilation techniques and tools for parallel
computers and embedded systems, particularly those containing
reconfigurable devices such as field-programmable gate arrays. 
REFERENCES
[1] A. Kalavade and E.A. Lee, “A hardware/software codesign methodology for DSP
applications,” IEEE Des. Test Comput., vol. 10, no. 3, pp. 16–28, Sept. 1993.
[2] S.S. Bhattacharyya, “Hardware/software co-synthesis of DSP systems,” in
Programmable Digital Signal Processors: Architecture, Programming, and
Applications, Y.H. Hu, Ed. New York: Marcel Dekker, 2001, pp. 333–378.
[3] H. Oh and S. Ha, “Hardware-software cosynthesis of multi-mode multi-task
embedded systems with real-time constraints,” in Proc. Int. Workshop
Hardware/Software Codesign, May 2002, pp. 133–138.
[4] S.D. Haynes, H.G. Epsom, R.J. Cooper, and P.L. McAlpine, “UltraSONIC: A
reconfigurable architecture for video image processing,” in Proc. Field-
Programmable Logic and Applications, 2002, pp. 482–491. 
[5] T. Pop, P. Eles, and Z. Peng, “Holistic scheduling and analysis of mixed
time/event triggered distributed embedded systems,” in Proc. Hardware/Software
Codesign, 2002, pp. 187–192.
[6] J. Hou and W. Wolf, “Process partitioning for distributed embedded systems,”
in Proc. Hardware/Software CoDesign, 1996, pp. 70–76. 
[7] E.A. Lee and D.G. Messerschmitt., “Synchronous dataflow,” Proc. IEEE, vol. 75,
no. 9, pp. 1235–1245, Sept. 1987.
[8] E.A. Lee and T.M. Parks, “Dataflow process networks,” Proc. IEEE, vol. 83, no.
5, pp. 773–801, May 1995. 
[9] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the
Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.
[10] T. Wiangtong, P.Y.K. Cheung, and W. Luk, “Cluster-driven hardware/software
partitioning and scheduling approach for a reconfigurable computer system,” in
Proc. Field-Programmable Logic and Applications, 2003, pp. 1071–1074.
[11] T. Wiangtong, P.Y.K. Cheung, and W. Luk, “Comparing three heuristic search
methods for functional partitioning in HW-SW codesign,” in Proc. Int. J. Design
Automat. Embedded Syst., vol. 6, pp. 425–449, 2002.
[12] P. Chou, K. Hines, K. Partridge, and G. Borriello, “Control generation for
embedded systems based on composition of modal processes,” in Proc. Computer-
Aided Design, 1998, pp. 46–53.
[13] G.D. Micheli, “Computer-aided hardware-software codesign,” IEEE Micro, 
vol. 14, no. 4, pp. 10–16, Aug. 1994.
[14] M. Eisenring and M. Platzner, “An implementation framework for run-time
reconfigurable systems,” in Proc. Workshop Engineering of Reconfigurable
Hardware/Software Objects, ENREGLE, 2000, pp. 151–157.
[15] K.S. Chatha and R. Vemuri, “Hardware-software partitioning and pipelined
scheduling of transformative applications,” IEEE Trans. VLSI Syst., vol. 10, no. 3
pp. 193–208, June 2002.
[16] N. Narasimhan, V. Srinivasan, M. Vootukuru, J. Walrath, S. Govindarajan, and
R. Vemuri, “Rapid prototyping of reconfigurable coprocessors,” in Proc. ASAP,
Application Specific Systems, Architectures and Processors, 1996, pp. 303–312.
[17] G.K. Wallace, “The JPEG still picture compression standard,” IEEE Trans.
Consumer Electron., vol. 38, no. 1, pp. 18–34, 1992.
[18] B.G. Sherlock and D.M. Monro, “Fast discrete cosine transform,” ACM Trans.
Math. Softw., vol. 21, no. 1, pp. 372–378, Dec. 1995.
[19] P. Sedcole, P.Y.K. Cheung, G.A. Constantinides, and W. Luk, “A reconfigurable
platform for real-time embedded video image processing,” in Proc. Field-
Programmable Logic and Applications. 2003, pp. 606–615.
IEEE SIGNAL PROCESSING MAGAZINE [22] MAY 2005
[SP]
[TABLE 2] COMPARISONS OF PROCESSING TIME
OF JPEG COMPRESSION.
IMAGES SOFTWARE HARDWARE/SOFTWARE
(PIXELS) SOLUTION SOLUTION*
400 × 300 3.15 s 0.57 s
640 × 480 8.23 s 1.34 s
800 × 600 12.96 s 2.04 s
800 × 800 17.38 s 2.71 s
1,024 × 768 21.71 s 3.26 s
*Using only two PIPEs available in the system.