Efficient Exploration of Bus-Based System-on-Chip Architectures by Kim, Sungchan & Ha, Soonhoi
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 7, JULY 2006 681
Efficient Exploration of Bus-Based
System-on-Chip Architectures
Sungchan Kim and Soonhoi Ha, Member, IEEE
Abstract—Separation between computation and communication
in system design allows system designers to explore the communi-
cation architecture independently after component selection and
mapping decision is made. In this paper, we present an iterative
two-step exploration methodology for bus-based on-chip commu-
nication architecture for multitask applications. We assume that
the memory traces from the processing components are given.
The proposed methodology uses a static performance estimation
technique extended for multitask applications to reduce the design
space quickly and drastically and applies a trace-driven simulation
to the reduced set of design candidates for accurate performance
estimation. For the case that local memory traffics as well as
shared memory traffics are involved in bus contention, memory
allocation is considered as an important axis of the design space
in our technique. Experimental results show that the proposed
methodology achieves significant performance gain by optimizing
on-chip communication only, up to almost 100% compared with
an initial single shared bus architecture, in two real-life
examples: a four-channel digital video recorder and an equalizer
for an OFDM DVB-T receiver.
Index Terms—Communication architecture, design space explo-
ration, memory allocation, multitask, performance estimation.
I. INTRODUCTION
INSATIABLE demand for system performance makes it inevitable to integrate more and more processing elements in
a single system-on-chip (SoC) to meet the performance require-
ment. In addition, with the fast evolution of programmable hard-
ware, such as microprocessors or DSPs, and ever-increasing
complexity of its application, a large portion of an application
tends to be implemented as multitask software. Such systems
usually have complex and diverse on-chip communication traf-
fics among components. As a consequence, on-chip communi-
cation design becomes critical for successful SoC designs. In
this regard, as a new design methodology, separation between
function and architecture and between communication and com-
putation has been recently proposed [1], [23]. Adopting this
paradigm, in the proposed design methodology, we model the
system behavior as a composition of function blocks and map
the function blocks to the processing elements on the target architecture
specified separately.

Manuscript received July 1, 2005; revised January 8, 2006. This work
was supported by the National Research Laboratory Program under Grant
M1-0104-00-0015 and the IT Leading Research and Development Support
Project funded by Korean MIC.
S. Kim is with the Codesign and Parallel Processing Laboratory, Department
of Computer Science and Engineering, Seoul National University, Seoul
151-742, Korea, and also with Samsung Electronics, Yongin, Gyeonggi 446-711
Korea (e-mail: ynwie@iris.snu.ac.kr).
S. Ha is with the Codesign and Parallel Processing Laboratory, Department of
Computer Science and Engineering, Seoul National University, Seoul 151-742,
Korea (e-mail: sha@iris.snu.ac.kr).
Digital Object Identifier 10.1109/TVLSI.2006.878260

Separation between computation and communication enables
system designers to explore the communication architecture in-
dependently after component selection and mapping decision is
made. Communication architecture decision is performed after
a decision is made on which processing elements are used and
which function blocks are mapped to where. From the given
communication requirements from all processing elements, the
design space of communication architecture is explored to find
out the optimal one considering the tradeoffs between perfor-
mance, power, cost, and other design objectives. Since the de-
sign space of communication architecture is extremely wide, it
is critical to develop an efficient exploration technique, which is
the main theme of this paper.
In this paper, we restrict the network architecture to bus since
it is still the most popular network [14]–[16], [25]. The design
space we explore, however, is still very wide, since it is formed
by multiple axes such as the number of buses, bus topology
including bus bridges, component allocation, bus arbitration
scheme, operation clock frequency, data width, and so on.
Fast and accurate performance estimation is the key to a prac-
tical design space exploration methodology. However, speed
and accuracy are two conflicting goals of performance estima-
tion. A simulation-based method gives accurate estimation re-
sults but pays too heavy a computational cost to be used for
exploring the large design space. So, research based on simu-
lation method exploits only a few design axes to confine the
design space to be explored. On the other hand, static perfor-
mance estimation methods do not model dynamic effects ac-
curately enough to determine the optimal architecture. In the
proposed technique, however, we utilize the advantages of both
approaches by breaking down the exploration procedure into
two steps. In the first step, we use a static performance estima-
tion technique to quickly evaluate each candidate design point
and prune the design space drastically. The second step uses a
trace-driven simulation to accurately evaluate the design points
in the reduced design space and to determine the pareto-optimal
set of bus architectures.
We assume that processing elements communicate with each
other through a shared memory. Each processing element has a
single port for both local and shared memory accesses, as usu-
ally is the case in real systems. Then, local memory traffics as
well as shared memory traffics are also involved in bus con-
tentions: memory allocation is considered as an important axis
of the design space in our technique. On the other hand, most
previous works have only considered shared memory traffics
and have not considered memory allocation separately [5], [8].
1063-8210/$20.00 © 2006 IEEE
Fig. 1. (a) Behavior specification of an illustrative example, (b) its single shared bus implementation, and (c) and (d) dual-bus implementations with two different
mappings of shared memory segments SM_arc2 and SM_arc3.
Fig. 2. Proposed design space exploration methodology.
The remainder of this paper is organized as follows. In
Section II, we explain the overall procedure of the proposed
design space exploration technique with an illustrative ex-
ample. Section III reviews some related works. In Section IV,
we explain the performance estimation technique considering
multitask applications. Section V discusses the details of the
proposed technique with a preliminary example. Section VI
provides the overall structure of the proposed exploration
framework. We show some experimental results in Section VII
and conclude this paper in Section VIII.
II. OVERVIEW OF THE PROPOSED
EXPLORATION METHODOLOGY
For better understanding of the proposed methodology, we
use an illustrative example in Fig. 1. Initially, the system be-
havior is specified as a block diagram of four function blocks.
The arcs between function blocks show the data dependency.
For example, function blocks and can be executed only
after function block is completed. Those four function
blocks are mapped to three processing elements: and are
mapped to processing element PE0, to PE1, and to PE2,
respectively.
Fig. 1(b) represents a single shared bus implementation. Note
that one physical memory component is connected to a bus and
it contains seven logical memory segments: three local memory
segments and four shared memory segments. Memory segments
LM_PE0, LM_PE1, and LM_PE2 are local memory segments
of PE0, PE1, and PE2, respectively. The arcs between function
blocks are implemented as shared memory segments for inter-
component communication. For example, SM_arc0 associated
with arc0 in Fig. 1(a) indicates a shared memory segment for
communication between function blocks and .
The proposed exploration technique, whose overall struc-
ture is shown in Fig. 2, starts the exploration process with this
single-bus architecture that becomes the only element in the “set
of architecture candidates” initially. A set of memory traces
is one of the inputs to the proposed exploration procedure,
which includes both local and shared memory accesses from
all processing elements. After the mapping of function blocks
to processing elements is completed, the memory traces are
obtained using instruction set simulator for each processor core,
HDL simulator for ASIC parts, or IP simulators, assuming that
memory access overhead is zero. Memory traces are classified
into three categories: code memory, data memory, and shared
memory. Code and data memories are associated with local
memory accesses and shared memory with inter-component
communication. For the processor that uses a cache memory,
the traces associated with code and data memories represent
the memory accesses incurred by cache-miss. Note that the
memory trace information is never changed throughout the
exploration.
We traverse the design space in an iterative fashion as shown
in Fig. 2. The body of the iteration loop consists of three main
steps. The purpose of the first step is to quickly explore the de-
sign subspace of architecture candidates to build a reduced set
of design points to be carefully examined in the next step. With
a given set of architecture candidates, we visit all design points
by varying the priority assignment of processing elements on
each bus and other bus operation conditions. The performance
estimation technique used in this step is an extension of [10]
to consider multitask applications. We keep the design points
whose estimated performance is within 10% of the best one,
because the estimation error of the proposed static method is
bounded by 10%. The proposed estimation method is based
on the queuing model of the system where processing elements
are regarded as customers and a bus with its associated memory
is regarded as a single server. The service request rate from each
processing element is extracted from the memory traces as a
function of execution time. Since our method considers the bus
contention effect, it gives reasonably accurate estimation results
to be used as the first-cut pruning of the design space.
The second step applies trace-driven simulation to the se-
lected design points from the first step. It accurately evaluates
the performance of design points in the reduced space and de-
termines the best design point. If the performance of the best
design point is not improved from the previous iteration, we exit
the exploration loop. Otherwise, we go to the third step and re-
peat another round of iteration.
The third step generates the next set of architecture candi-
dates. From the architecture of the best design point, we explore
the design space incrementally by selecting a processing ele-
ment and allocating it to a different bus or a new one. Let us go
back to the example of Fig. 1. Since the first round starts with
one architecture candidate, single-bus architecture, it becomes
the input architecture to the third step. Suppose that PE2 is se-
lected and allocated to a new bus to make a dual-bus system.
Since all local memory segments should reside in the same bus
as the associated processing element, there are four candidate ar-
chitectures depending on where to put the shared memory seg-
ments associated with PE2: SM_arc2 and SM_arc3. Function
blocks and use shared memory segment SM_arc2 so that
it may be allocated either to Bus0 or Bus1. However, SM_arc0
that is accessed by function blocks and should remain at
Bus0 since all of its associated processing elements reside in
the same bus. Among four candidate architectures, Fig. 1(c)
and (d) shows two candidate architectures. In the case we select
PE0 and move it to a new bus, we generate 16 different can-
didate architectures since PE0 is associated with four shared
memory segments. In this way, we can generate 24 candidate
architectures for the second round of iteration by moving a pro-
cessing element into a new bus and considering all possible
shared memory segment allocation.
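
The candidate count for the example of Fig. 1 can be reproduced with a short enumeration. This is only an illustrative sketch: the segment lists for PE0 and PE2 follow the text, while the two segments listed for PE1 are an assumption chosen so that the total matches the stated 24 candidates. The names `SEGMENTS_OF` and `candidates_for_move` are hypothetical.

```python
from itertools import product

# Shared memory segments associated with each processing element in Fig. 1.
# PE0 (four segments) and PE2 (SM_arc2, SM_arc3) are stated in the text.
# ASSUMPTION: PE1 is associated with SM_arc0 and SM_arc1, which is consistent
# with the stated total of 24 candidates but is not given explicitly.
SEGMENTS_OF = {
    "PE0": ["SM_arc0", "SM_arc1", "SM_arc2", "SM_arc3"],
    "PE1": ["SM_arc0", "SM_arc1"],
    "PE2": ["SM_arc2", "SM_arc3"],
}


def candidates_for_move(pe, old_bus="Bus0", new_bus="Bus1"):
    """Enumerate shared-segment allocations when `pe` moves to `new_bus`.

    Each associated segment may stay on the old bus or follow the element,
    so an element with k associated segments yields 2**k candidates.
    """
    segments = SEGMENTS_OF[pe]
    choices = product((old_bus, new_bus), repeat=len(segments))
    return [dict(zip(segments, choice)) for choice in choices]


counts = {pe: len(candidates_for_move(pe)) for pe in SEGMENTS_OF}
total = sum(counts.values())
print(counts, total)
```

Segments that are not associated with the moved element are left out of the enumeration because they stay on their current bus.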
As the iteration goes, we record the best performance num-
bers as a function of the number of buses to obtain the pareto-op-
timal design points. If the number of buses increases, the perfor-
mance tends to increase. We exit the iteration when no perfor-
mance increase is obtained from the previous iteration.
The proposed technique does not explore the entire design
space; it is a greedy heuristic that prunes the design space
aggressively, since we select only the best architecture at the end of
each iteration. If we selected multiple ones, we could explore a wider
set of design points at the cost of longer execution time.
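
The iteration described in this section can be summarized as the following sketch. It is not the authors' implementation: the function `explore` and its four callables are hypothetical stand-ins for the static estimator, the trace-driven simulator, and the candidate generator, and the toy functions at the bottom exist only to make the example runnable.

```python
def explore(initial_arch, expand, design_points, estimate, simulate, tol=0.10):
    """Greedy two-step exploration loop (illustrative sketch).

    expand(dp)          -> next architecture candidates from the best design point
    design_points(arch) -> design points of one architecture (e.g. priorities)
    estimate(dp)        -> fast static estimate of execution time
    simulate(dp)        -> accurate execution time (trace-driven simulation)
    """
    candidates = [initial_arch]
    best_dp, best_time = None, float("inf")
    while candidates:
        # Step 1: static estimation of every design point, then pruning to
        # those within `tol` of the best estimate.
        points = [dp for arch in candidates for dp in design_points(arch)]
        est = {dp: estimate(dp) for dp in points}
        cutoff = min(est.values()) * (1.0 + tol)
        reduced = [dp for dp in points if est[dp] <= cutoff]
        # Step 2: accurate evaluation of the reduced set.
        times = {dp: simulate(dp) for dp in reduced}
        round_dp = min(times, key=times.get)
        if times[round_dp] >= best_time:
            break  # no improvement over the previous iteration
        best_dp, best_time = round_dp, times[round_dp]
        # Step 3: generate the next candidates from the best design point.
        candidates = expand(best_dp)
    return best_dp, best_time


# Toy stand-ins: a design point is (number_of_buses, penalty); performance
# stops improving beyond three buses.
def toy_points(n_buses):
    return [(n_buses, penalty) for penalty in range(3)]


def toy_simulate(dp):
    n_buses, penalty = dp
    return 100.0 / min(n_buses, 3) + penalty


def toy_estimate(dp):
    return 1.05 * toy_simulate(dp)  # pretend the estimate is 5% pessimistic


def toy_expand(dp):
    return [dp[0] + 1]


best_dp, best_time = explore(1, toy_expand, toy_points, toy_estimate, toy_simulate)
print(best_dp, best_time)
```

Keeping only one design point per iteration is what makes the search greedy; returning several design points from step 2 would widen the search at the cost of more simulation runs.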
III. RELATED WORK AND OUR CONTRIBUTION
Some researchers have considered communication architec-
ture selection simultaneously during the mapping step. Since the
communication overhead is needed for the mapping decision,
static estimation of communication architecture has been inves-
tigated. A technique was proposed to estimate the communi-
cation delay using the worst-case response analysis of real-time
scheduling [1]. Knudsen and Madsen estimated the communica-
tion overhead on a point-to-point channel taking into account the
data transfer rate variation depending on the protocol, configu-
ration, and different operating frequencies of components [13].
Nandi and Marculescu proposed a performance measure tech-
nique based on continuous-time Markov processes [11]. How-
ever, these techniques did not model the dynamic effects such as
bus contention and explored only a limited configuration space.
For exploration of communication architectures, simulation-
based estimation is widely adopted in many commercial tools
and academic research at various transaction levels [4], [12],
[22]. A simulation-based method gives accurate estimation re-
sults but pays too heavy a computational cost to be used for ex-
ploring the large design space. So, previous research based on
this method has exploited only a few design axes to confine the
design space to be explored. To overcome this difficulty, a hy-
brid approach between a static estimation and a simulation ap-
proach has been developed by Lahiri et al. [3]. They used some
static analysis to group the traces and apply a trace-driven sim-
ulation with the trace groups. Their approach is similar to ours
in that they applied some static analysis to the memory traces to
reduce the time complexity of trace-driven simulation.
Since the design space is extremely large, most previous
works focused on a small number of design axes. In Gong
et al.’s work [6], system specification refinement onto four
fixed communication architecture templates was addressed to
optimize performance. Gasteier et al. proposed a bus topology
synthesis technique at high level using the port constraints of
components and considering only static information such as bit
widths, the amount of data transfer, and so on [7]. Meeuwen et
al. presented a technique for cost-efficient interconnect archi-
tecture exploration by time-multiplexing the data transfers over
Fig. 3. Queuing model of a single bus.
a number of shared buses assuming distributed memory sys-
tems under predetermined memory allocation [8]. Meftali et al.
found performance-optimal shared memory allocation consid-
ering area of communication channel and memory subsystem
using integer linear programming (ILP) in a point-to-point
communication architecture [9]. Lahiri et al. proposed an ex-
ploration technique optimizing the component mapping to bus
and the bus protocol such as DMA block transfer size and bus
priority assignment for a given bus topology [5]. Thepayasuwan
and Doboli proposed a bus architecture synthesis technique that
minimizes the cost considering bus topology, communication
conflicts, and bus utilization using a simulated annealing [20].
Srinivasan et al. developed a technique that performs both bus
partitioning and bus frequency assignment simultaneously to
optimize power consumption and performance using a genetic
algorithm [21]. A technique using the profiled statistics of
communication traffics between cores to determine core-to-bus
assignment for a given application was proposed by Drinic et
al. [24].
Compared with these related works, the main contribution
of the proposed technique is that we explore the larger design
space by the two-step design space exploration, considering
multiple design axes such as the number of buses, bus topology,
component allocation, priority assignment, and other bus op-
erating conditions. Since the proposed exploration technique
is extensible, more design axes can easily be added. Our work
also extends the previous works in two ways. We consider local
memory accesses as well as shared memory accesses. Pre-
vious works on architecture exploration are mostly concerned
with only communication requirements between processing
elements ignoring the local memory accesses. In case local
memory accesses are also involved in bus contention, they need
to be considered. Another extension is our static performance
analysis using the queuing model capable of considering mul-
titask applications.
IV. PERFORMANCE ESTIMATION OF MULTITASK APPLICATIONS
Here, we discuss the extension of our previous performance
estimation technique to multitask applications. We first summarize
the performance estimation technique for a single task proposed
in [10]. The technique is based on the queuing model
of the target bus architecture, where processing elements are
customers while a bus and associated memory subsystem cor-
respond to a single server. The model aims at estimating an
average wait time for bus grant of each processing element by
construction of a steady-state transition diagram.
Fig. 3 shows the queuing model of a single bus architecture.
There are $N$ processing elements $PE_1, PE_2, \ldots, PE_N$
competing for the use of a bus. It is assumed that the bus arbitration
is based on the fixed priorities of processing elements.
$PE_1$ is assigned the highest priority. The bus access is assumed
nonpreemptive.
$\lambda_i$ denotes the rate at which the processing element $PE_i$
issues memory requests assuming an ideal condition of zero
memory access overhead. For function block $f$ on a processing
element, we first compute the bus request rate $\lambda(f)$,
the schedule length $sl(f)$, and the average bus access time
$s(f)$ from the memory traces' information and scheduling
result, assuming ideal but unrealistic bus conditions: no waits
for bus grant and the access time of one cycle for unit data
transfer from/to memory. The bus request rate is a ratio of
bus access counts over $sl(f)$ so that it varies according
to the function block running on $PE_i$.
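
As a minimal sketch of how these ideal-condition parameters could be derived from a trace, assume a hypothetical trace format in which each bus access is recorded by its burst length in words. The function name and the trace format are illustrative assumptions and are not taken from the paper.

```python
def ideal_bus_parameters(access_bursts, schedule_length):
    """Return (request_rate, avg_access_time) for one function block.

    access_bursts:   burst length (words) of each bus access in the trace
                     (hypothetical trace format)
    schedule_length: ideal execution time of the block, in bus cycles

    Under the ideal conditions, no request waits for the bus and each word
    takes one cycle, so the access time of a request equals its burst length.
    """
    request_rate = len(access_bursts) / schedule_length
    avg_access_time = sum(access_bursts) / len(access_bursts)
    return request_rate, avg_access_time


rate, access_time = ideal_bus_parameters([1, 4, 4, 8, 3], schedule_length=200)
print(rate, access_time)
```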
If the execution time is lengthened due to bus contention, the
effective arrival rate of requests becomes smaller than $\lambda_i$. We
denote the actual memory access rate by $\lambda'_i$, which is actually
seen on the bus. The mean service rate of a server for the request
from $PE_i$ is denoted by $\mu_i$ and its mean service time is the
reciprocal of the service rate, i.e., $1/\mu_i$. Let $L_i$ be the expected
number of requests from $PE_i$ waiting for use of the bus. It is
within the range of $[0, 1]$ if $PE_i$ does not issue the next memory
request until the current request is served, and we denote $W_i$ as
the expected waiting time of the stalled request. Then, we obtain
the following equation:
$$\lambda'_i = \lambda_i \left(1 - L_i - \rho_i\right) \tag{1}$$
where $\rho_i = \lambda'_i / \mu_i$ is the bus utilization factor of $PE_i$. Little's
Law [28] says
$$L_i = \lambda'_i W_i \tag{2}$$
We want to obtain $W_i$ from (2), which indicates the delays
incurred from bus contention. We can extract $\lambda_i$ from the memory
traces. By the memory system and the average burst length of
the memory traces, $\mu_i$ is determined statically. There remains an
unknown parameter $L_i$ in the right-hand side of (1). To obtain
this, we use a state transition diagram and its steady-state
probability. As a result, dynamic bus conflicts can be predicted
accurately. We omit the detailed explanation on the queuing model;
refer to [10] for further details.
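
The following sketch shows how the waiting time follows once the expected number of waiting requests is known. It assumes that (1) has the usual finite-source form, in which a processing element issues requests only while it is neither waiting nor being served; this form is an assumption of the sketch and should be checked against [10]. The steady-state analysis that produces the number of waiting requests is not reproduced here, so `L` is simply an input.

```python
def waiting_time(lam, mu, L):
    """Solve the assumed form of (1) together with Little's Law (2).

    lam: ideal request rate of the processing element
    mu:  service rate of the bus and memory for its requests
    L:   expected number of its requests waiting for the bus (0 <= L < 1),
         assumed to be already known from the steady-state analysis
    """
    # ASSUMPTION: lam_eff = lam * (1 - L - rho) with rho = lam_eff / mu.
    # Substituting rho and solving for lam_eff gives the closed form below.
    lam_eff = lam * (1.0 - L) / (1.0 + lam / mu)
    rho = lam_eff / mu          # bus utilization due to this element
    W = L / lam_eff             # Little's Law: L = lam_eff * W
    return lam_eff, rho, W


lam, mu, L = 0.05, 0.25, 0.2
lam_eff, rho, W = waiting_time(lam, mu, L)
print(lam_eff, rho, W)
```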
Fig. 4(a) shows an example of static schedule of a task that
consists of four function blocks, assuming ideal bus conditions.
At the beginning of the schedule, three function blocks
can be executed concurrently, each on its own processing
element. To evaluate the expected wait delay for bus access
from each processing element, the queuing system is constructed
as shown in Fig. 4(a). Through the queuing analysis, we
estimate the waiting time due to bus contentions to obtain the
estimated schedule extension as illustrated in Fig. 4(b) until any
function block completes its execution. In this example, function
block C is assumed to be finished first. We consider
the function blocks that are concurrently executable after that
time. After C finishes, the two remaining function blocks continue
execution until either one finishes next. Thus, we construct the
associated queuing model with the two processing elements that
are still running, as shown in Fig. 4(b).
The shaded regions indicate the remaining parts of the static
schedule to be estimated by the queuing analysis. Such evaluation
process is repeated until all of the function blocks are
examined.

Fig. 4. (a) Example schedule of three processing elements and the corresponding queuing model. (b) New queuing model after function block C is finished. (c) Another queuing model after function block B is finished.
Fig. 5. (a) Specification and (b) schedule of H.263 encoder and (c) the mapping of function blocks to processing elements for a four-channel DVR.
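
The interval-by-interval evaluation can be sketched as follows. The queuing analysis is replaced here by a caller-supplied `slowdown` function, which is a hypothetical stand-in: it returns the factor by which a block is stretched while a given set of blocks competes for the bus. Block names and numbers in the usage example are made up.

```python
def extend_schedule(ideal_lengths, slowdown):
    """Estimate finish times of concurrently running function blocks.

    ideal_lengths: {block: ideal execution time under ideal bus conditions}
    slowdown(block, active): factor >= 1 applied to `block` while the blocks
        in `active` run concurrently (stand-in for the queuing analysis)
    """
    remaining = dict(ideal_lengths)
    now, finish = 0.0, {}
    while remaining:
        active = frozenset(remaining)
        factor = {b: slowdown(b, active) for b in active}
        # The block with the smallest stretched remaining time finishes first.
        first = min(active, key=lambda b: remaining[b] * factor[b])
        dt = remaining[first] * factor[first]
        now += dt
        for b in active:
            # Ideal work completed by each block during this interval.
            remaining[b] -= dt / factor[b]
        finish[first] = now
        del remaining[first]
        # The next iteration rebuilds the model with the remaining blocks.
    return finish


def toy_slowdown(block, active):
    # Hypothetical contention model: 10% stretch per competing block.
    return 1.0 + 0.1 * (len(active) - 1)


finish = extend_schedule({"A": 100.0, "B": 80.0, "C": 40.0}, toy_slowdown)
print(finish)
```

In the toy run, C finishes first, then B, then A, which mirrors the order of the three queuing models in Fig. 4.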
Now we consider multitask applications. For simplicity, but
with little loss of generality, we make the following assump-
tions: all tasks are independent, any preemptable task sched-
uling policy can be used, and the scheduling overhead is neg-
ligible. In case tasks are interdependent, we still consider the
set of independent tasks that run concurrently with the function
block of interest for static analysis.
We explain the proposed technique using a four-channel dig-
ital video recorder (DVR) example throughout this paper. The
DVR receives raw bit streams from four external sources
and encodes each stream separately using an H.263 encoding
algorithm. Each channel corresponds to a task so that DVR has
four tasks, from ch0 to ch3. Fig. 5(a) and (b) shows the
specification and the schedule of an H.263 encoder, respectively,
while Fig. 5(c) represents a mapping example of function blocks
onto processing elements. Each ARM processor takes charge
of running two tasks. A task is mapped to one
ARM processor except for the function blocks ME and DCT,
which are mapped to the dedicated hardware blocks HW_ME
and HW_DCT, respectively. Memory traces are obtained from
encoding of a P-frame in QCIF-format bit stream.

Fig. 6. Function blocks of ch1, ch2, and ch3 on ARM0, ARM1, and HW_DCT that can be executed simultaneously with ME0.

Suppose we want to estimate the execution time of ME0 of
task ch0 mapped to HW_ME. To build the queuing model as de-
picted in Fig. 6, the function blocks being concurrently executed
on ARM0, ARM1, and HW_DCT with ME0 should be selected.
In ch0, no function blocks are concurrently executable with ME0
due to the execution dependency. Therefore, only ch1 can be si-
multaneously executable with ME0 in ARM0 since ME0 of task
ch0 is executed in HW_ME. However, it cannot be statically de-
termined which function block of ch1 is concurrently executable
with ME0. Moreover, any function block of two tasks ch2 and
ch3 can be executed in ARM1. If we enumerate all possible
combinations of the function blocks concurrently executable with
ME0, 150 queuing models should be investigated,
which is impractical for fast design space exploration.
Thus, we propose a heuristic approach.
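As explained in the beginning of this section, the required
parameters for constructing the queuing model are the bus request
rate and the average bus access time. To model the bus
contentions due to other tasks, we define a virtual function block
$vf(\tau, PE)$ for task $\tau$ on processing element $PE$, which has an
approximated bus request rate $\lambda(vf(\tau, PE))$ and a bus access
time $s(vf(\tau, PE))$ while ME0 is executed.
$$\lambda(vf(\mathrm{ch1}, \mathrm{ARM0})) = \frac{\sum_{f \in F} \lambda(f)\, sl(f)}{sl(\mathrm{ch1})} \tag{3}$$
$$s(vf(\mathrm{ch1}, \mathrm{ARM0})) = \frac{\sum_{f \in F} s(f)\, \lambda(f)\, sl(f)}{\sum_{f \in F} \lambda(f)\, sl(f)} \tag{4}$$
where $F$ is the set of the function blocks of ch1 mapped to ARM0
and $sl(\mathrm{ch1})$ is the schedule length of the task ch1. In other
words, $\lambda(vf(\mathrm{ch1}, \mathrm{ARM0}))$ and $s(vf(\mathrm{ch1}, \mathrm{ARM0}))$ mean an average
request rate and an average bus access time of the function
blocks of ch1 that are concurrently executable with ME0 in
ARM0, respectively. In ARM1, two virtual function blocks,
$vf(\mathrm{ch2}, \mathrm{ARM1})$ and $vf(\mathrm{ch3}, \mathrm{ARM1})$, are defined. Then, the
virtual function block with higher bus request rate is selected.
We also assume that the schedule length of all virtual function
blocks is infinite to make them longer than ME0.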
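
One plausible reading of the averaged parameters is sketched below: the request rate of a virtual function block is the total number of bus accesses of the task's blocks on the processing element divided by the task's schedule length, and its access time is the access-count-weighted mean. This reading, the function names, and all numbers are assumptions made for illustration.

```python
def virtual_block(blocks, task_schedule_length):
    """Average the bus parameters of one task's blocks on one element.

    blocks: list of (request_rate, schedule_length, avg_access_time) tuples,
            one per function block of the task mapped to the element
    """
    accesses = [rate * sl for rate, sl, _ in blocks]  # access count per block
    total = sum(accesses)
    vf_rate = total / task_schedule_length
    vf_access_time = sum(n * s for n, (_, _, s) in zip(accesses, blocks)) / total
    return vf_rate, vf_access_time


def pick_worst(virtual_blocks):
    """Among several tasks on one element, keep the highest request rate."""
    return max(virtual_blocks, key=lambda vf: vf[0])


# Made-up numbers for two tasks sharing one processor.
vf_ch2 = virtual_block([(0.02, 100.0, 4.0), (0.01, 200.0, 2.0)], 500.0)
vf_ch3 = virtual_block([(0.05, 100.0, 1.0)], 500.0)
selected = pick_worst([vf_ch2, vf_ch3])
print(vf_ch2, vf_ch3, selected)
```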
For more general formulation, we define two terms
and that are the task including function block and the
processing element executing function block , respectively. If
two function blocks and are to be executed concurrently
according to the static task schedule, we denote it by .
Suppose that function block is executed on processing el-
ement , i.e., . In order to build the queuing
system of Fig. 4, is defined as the virtual function block
of processing element , which is assumed to be running con-




and . Thus, among all tasks running on ,
we choose the worst case that has the highest bus request rates.
If task is selected to build the virtual function block of by
(5), the average bus access time of virtual function
block in ideal bus conditions is
(6)
Fig. 7. (a) Initial schedule of four-channel DVR and (b) its single-bus imple-
mentation.
where .
With those queuing parameters of the processing element, the
queuing system is constructed to estimate the average wait time
for bus grant of function block .
V. GENERATION OF COMMUNICATION
ARCHITECTURE CANDIDATES
Here, we explain how the proposed technique explores the
design space of a four-channel DVR example in Fig. 5.
Suppose that each H.263 encoder is mapped to a separate
ARM processor except all ME and DCT blocks: they are
mapped to a motion estimation (ME) hardware component
and a discrete cosine transform (DCT) hardware component,
respectively. Thus, four H.263 encoders share two hardware
components. Fig. 7(a) and (b) shows the scheduling and
mapping result of a four-channel DVR and its single-bus implementation,
respectively. This example system has 17 memory
segments: five local memory segments and 12 shared memory
segments. In Fig. 7(b), for instance, shared memory segment
“MC0, ME0” is associated with communication between func-
tion blocks MC0 and ME0.
A. Bus Topology Exploration
Performance improvement of communication architecture
can be achieved by scattering communication traffics into
multiple buses to reduce bus contention and to maximize
concurrency. For this purpose, we select a processing element
and allocate it to a new bus or to another existing bus. Suppose
that we select ARM0 in the single-bus architecture of Fig. 7.
The only way to change the allocation of ARM0 is to create
a new bus, Bus1. Then, its associated shared memory segments
“MC0, ME0” and “DCT0, Q0” can be allocated to either Bus0
or Bus1 to generate four architecture candidates as shown in
Fig. 8. It is important to note that a bus bridge is introduced
between two buses for inter-bus communication.
Further communication traffic reduction can be obtained by
removing local memory accesses of each processing element
from shared buses, as illustrated in Fig. 9. Processor ARM0
can access its local memory segment LM_ARM0 without contending
for the use of Bus1. Since the local memory segment in
Fig. 8 is not a cache, the requests from a processing element go
out to either the bus or the local memory according to its
destination address. A drawback of such separation is area overhead
caused by adding a local bus to each processing element.
However, it may be desirable for power saving due to the use
of more memory components of smaller size as well as reduced
bus-switching activities [26], [27]. We leave it as future work to
consider power consumption as another design objective.

Fig. 8. Creating a new bus for processing element ARM0 and allocating its associated shared memory segments.
Fig. 9. If the local memory segment of processor ARM0 is separated from shared bus Bus1, communication traffics incurred by local memory accesses can be removed from Bus1, which may lead to further performance improvement.
TABLE I
NUMBER OF GENERATED ARCHITECTURES FROM THE SINGLE-BUS ARCHITECTURE OF FOUR-CHANNEL DVR BY CHANGING ALLOCATION OF SHARED MEMORY SEGMENTS
Table I shows the size of design space by moving each pro-
cessing element from the single-bus architecture in Fig. 7(b).
In total, 528 architecture candidates are generated. The same
number of architecture candidates is also generated for the
architectures that use local buses for local memory segments.
Consequently, more than one thousand architecture candidates
should be investigated by examining the bus topology only.
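
The total of 528 is consistent with the following count, given here as a hedged check rather than as the content of Table I. It assumes that each of the four ARM processors is associated with two shared memory segments (as stated for ARM0) and that the ME and DCT hardware components are each associated with eight of the 12 shared memory segments, a split inferred from the total and not stated in the text:

$$\underbrace{4 \cdot 2^{2}}_{\text{four ARM processors}} + \underbrace{2 \cdot 2^{8}}_{\text{ME and DCT hardware}} = 16 + 512 = 528, \qquad 2 \cdot 528 = 1056 > 1000.$$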
B. Bus Parameterization
Once the bus topology and memory allocation are deter-
mined, bus protocol of each architecture candidate should be
configured. Bus protocol includes priority assignment, oper-
ation clock frequency, bus data-width, and so on. Of these,
priority assignment should be treated with care since it may
cause significant performance variation, as observed in [5] and
[18]. Although an exhaustive search method guarantees an
optimal result, it is prohibitively expensive. For example, the
first architecture, arch0, of Fig. 8 has six bus masters in Bus0
and two bus masters in Bus1 including bus bridges. Since 6! = 720
priority assignments for Bus0 and 2! = 2 assignments for
Bus1 are possible, 720 × 2 = 1440 assignments should be investigated
in an exhaustive search method to get the optimal priority
assignment. To overcome this difficulty, we devised a priority
assignment heuristic where a higher priority is bestowed to the
processing element with more memory accesses and to the more
critical processing element.
The amount of data transfer per unit time $BW(f_i)$, i.e.,
bandwidth, represents the memory access characteristics of
function block $f_i$. The criticality $Crit(f_i)$ of function block
$f_i$ is the sum of the schedule length of function blocks on the
longest execution path starting from block $f_i$. The bandwidth
and the criticality of a function block are computed from the
memory traces and the function block scheduling, respectively.
In our heuristic, we define the rank $Rank(f_i)$ of function block
$f_i$ as the product of criticality $Crit(f_i)$ and bandwidth $BW(f_i)$.
Also, the rank $Rank(PE_j)$ of processing element $PE_j$ is the sum of
the ranks of function blocks that are executed in $PE_j$. Thus, we
get the following formula:
$$ Rank(PE_j) = \sum_{f_i \in F(PE_j)} Crit(f_i) \cdot BW(f_i) \tag{7} $$
where $F(PE_j)$ is the set of function blocks mapped on $PE_j$. The
higher the rank of a processing element is, the higher priority the
processing element is assigned. After this initial assignment, we
perform a simple annealing process by swapping the priorities
of two processing elements on the same bus.
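The initial assignment described above can be sketched in a few lines of Python. This is only an illustration of the ranking rule (criticality times bandwidth, summed per processing element); the function name `initial_priorities`, the tuple layout, and all numbers are hypothetical and do not come from the experiments.

```python
from collections import defaultdict

def initial_priorities(blocks):
    """blocks: iterable of (pe_name, criticality, bandwidth), one per function block.

    Returns the processing element names ordered from highest to lowest priority.
    """
    rank = defaultdict(float)
    for pe, criticality, bandwidth in blocks:
        # Rank of a function block = criticality * bandwidth;
        # rank of a PE = sum over the function blocks mapped onto it.
        rank[pe] += criticality * bandwidth
    # Higher rank gets higher priority; ties are broken by name (an assumption).
    return sorted(rank, key=lambda pe: (-rank[pe], pe))

# Hypothetical numbers, for illustration only.
blocks = [
    ("PE0", 100.0, 0.20),
    ("PE0", 40.0, 0.10),
    ("PE1", 80.0, 0.50),
    ("PE2", 30.0, 0.30),
]
order = initial_priorities(blocks)
print(order)
```

Here PE1 ends up first because its single block combines high criticality with high bandwidth, even though PE0 hosts the most critical block.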
Fig. 10. Proposed exploration flow.
When the priority assignment of a bus is investigated, the
assignments of the other buses are assumed to be fixed. For
instance, in arch0 of Fig. 8, we start by varying the priorities of
processing elements in Bus0 first. The performance estimation
proposed in the previous section is applied to every combination
obtained by swapping the priorities of two processing elements.
Therefore, 30 architecture candidates are investigated when
considering Bus0, and the best result is selected as the priority
assignment of Bus0. Then, we move to Bus1 and repeat the same
procedure to find an improved solution, which results in six
priority assignments. Therefore, the total number of assignments
to be explored by the heuristic amounts to 36. The proposed
assignment heuristic compares well with an exhaustive search
method while exploring a significantly reduced design space,
smaller by 80 times in this example. We validate its
efficiency by experimental results in Section VII.
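The bus-by-bus tuning step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `estimate` stands in for the static performance estimation, `tune_bus` and `tune_all` are hypothetical names, and the toy cost model exists only to make the example runnable. The sketch visits each unordered pair of masters once and does not try to reproduce the candidate counts quoted above.

```python
from itertools import combinations

def tune_bus(priorities, bus, estimate):
    """Try every pairwise priority swap on one bus; other buses stay fixed.

    priorities: dict bus_name -> list of masters, highest priority first.
    estimate:   callable(priorities) -> estimated execution time (lower is better).
    """
    best = {b: list(order) for b, order in priorities.items()}
    best_time = estimate(best)
    for i, j in combinations(range(len(priorities[bus])), 2):
        trial = {b: list(order) for b, order in priorities.items()}
        trial[bus][i], trial[bus][j] = trial[bus][j], trial[bus][i]
        t = estimate(trial)
        if t < best_time:
            best, best_time = trial, t
    return best, best_time

def tune_all(priorities, estimate):
    """Tune the buses one after another, keeping the best result of each."""
    current, best_time = priorities, estimate(priorities)
    for bus in priorities:
        current, best_time = tune_bus(current, bus, estimate)
    return current, best_time

# Toy stand-in for the static estimator (hypothetical cost model):
# heavy masters are cheaper when they sit at a high priority position.
weights = {"a": 1, "b": 5, "c": 3, "x": 1, "y": 2}
def toy_estimate(p):
    return sum(pos * weights[m] for order in p.values() for pos, m in enumerate(order))

start = {"Bus0": ["a", "b", "c"], "Bus1": ["x", "y"]}
tuned, tuned_time = tune_all(start, toy_estimate)
print(tuned, tuned_time)
```

A single pass of swaps does not guarantee the optimum, which is why the result is compared against exhaustive search in the experiments.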
Bus clock frequency and bus data-width depend on the
memory used. For brevity, in this paper, we assume that all
buses in an architecture candidate are synchronized with a
single global clock and its frequency is set to be the reciprocal
of memory access time for one word. Bus data-width follows
the data-width of memory.
VI. OVERALL STRUCTURE OF THE EXPLORATION FRAMEWORK
Fig. 10 summarizes the main procedure of the proposed tech-
nique: Select_Architecture. This procedure requires three in-
puts: the initial architecture Initial_Arch to begin the explo-
ration, the schedule information of system specification Sched,
and the memory traces Mem_Trace of processing elements. The
while statement of line 3 defines the main iteration loop of ex-
ploration.
Select_Architecture consists of three parts. The first part
is the first architecture-pruning step, starting from line 5.
Initially, the set of architecture candidates contains only one
element, Initial_Arch. In the first for loop, from line 6 to line 9, the diverse
priority assignments selected from the proposed priority assign-
ment heuristic are assessed by the proposed static estimation
method, and we obtain the best performance of each architec-
ture candidate.
Then, after the best performance values of all architecture
candidates are sorted in ascending order, the architecture
that has the shortest execution time, Best_Exe_Time, is chosen.
Since our static estimation method was observed to have a 10%
error bound in preliminary examples, architecture candidates
whose estimated performance differs from Best_Exe_Time by less
than 10% may actually perform better than the architecture with
Best_Exe_Time. Thus, this error range is used to reduce the
design space: the parameter ESTIMATION_ERROR is set to 0.1.
In the for loop from lines 11 to 18, if the performance difference
of an architecture candidate from Best_Exe_Time is greater than
ESTIMATION_ERROR, it is pruned from the design space. Note that
the more accurate the static estimation technique is, the narrower
the design space that must be investigated more precisely in the
second pruning step.
In case the reduced design space is still too large, we may
want to restrict the maximum number of architecture candidates
to be explored in the second step. Therefore, we define
the MAX_ARCH parameter and ensure that at most MAX_ARCH
architecture candidates remain in the reduced design space
(lines 19–25).
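The first pruning step can be sketched as below. The names ESTIMATION_ERROR and MAX_ARCH follow the text; the function `prune`, the dictionary layout, and the choice to measure the difference relative to Best_Exe_Time are assumptions made for this illustration.

```python
ESTIMATION_ERROR = 0.1  # error bound of the static estimation, as stated in the text
MAX_ARCH = 20           # value used in the experiments of Section VII

def prune(estimates, error=ESTIMATION_ERROR, max_arch=MAX_ARCH):
    """estimates: dict arch_name -> best statically estimated execution time.

    Returns the surviving architecture names, fastest first.
    """
    if not estimates:
        return []
    # Sort in ascending order of estimated execution time.
    ranked = sorted(estimates, key=lambda a: (estimates[a], a))
    best_exe_time = estimates[ranked[0]]
    # Assumption: the difference is taken relative to Best_Exe_Time.
    kept = [a for a in ranked
            if (estimates[a] - best_exe_time) / best_exe_time <= error]
    # Cap the number of candidates passed on to trace-driven simulation.
    return kept[:max_arch]

# Hypothetical estimates in bus clock cycles.
survivors = prune({"a": 100, "b": 105, "c": 109, "d": 111, "e": 200})
print(survivors)
```

Candidates "d" and "e" are dropped because their estimates exceed the best one by more than the error bound, so the static estimator cannot plausibly have misranked them.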
The second part of the procedure applies trace-driven simu-
lation to the selected architecture candidates from the first step.
We use an in-house cycle-accurate trace-driven simulator at this
step. Since the estimated performance from the trace-driven
simulation is very accurate, we compare the performances of
all candidate architectures and choose the best architecture.
Then, the performance of the best architecture is compared with
that of the previous iteration. If no performance improvement
is achieved from the current iteration or the number of shared
buses reaches the number of processing elements, we exit the
iteration loop and terminate the procedure.
The last part of procedure Select_Architecture generates
new architecture candidates incrementally from the best
architecture chosen in the second part (line 39). How the
architecture candidates are generated is explained in the previous
section. When we estimate the performance, we record the best
performance value for each number of buses used to obtain the
Pareto-optimal set of bus architectures.
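The three parts of the procedure fit together roughly as in the following sketch. It is not a transcription of Fig. 10: every callable passed in is a hypothetical stand-in (static estimator, trace-driven simulator, candidate generator, bus counter, pruning rule), and recording the per-bus-count best from the simulated values is an assumption of this sketch.

```python
def select_architecture(initial_arch, num_pes, static_estimate, simulate,
                        generate, num_buses, prune):
    """Iterative two-step exploration with hypothetical callables:

    static_estimate(arch) -> time under the best priority assignment found
    simulate(arch)        -> time from trace-driven simulation
    generate(arch)        -> candidate architectures derived from arch
    num_buses(arch)       -> number of shared buses in arch
    prune(estimates)      -> candidates kept after the first step
    """
    candidates = [initial_arch]
    best_arch, best_time = None, float("inf")
    best_per_bus_count = {}
    while candidates:
        # Part 1: static estimation, then pruning of the design space.
        estimates = {arch: static_estimate(arch) for arch in candidates}
        survivors = prune(estimates)
        # Part 2: accurate evaluation of the survivors.
        simulated = {arch: simulate(arch) for arch in survivors}
        for arch, t in simulated.items():
            n = num_buses(arch)
            if n not in best_per_bus_count or t < best_per_bus_count[n][1]:
                best_per_bus_count[n] = (arch, t)
        iter_arch = min(simulated, key=simulated.get)
        if simulated[iter_arch] >= best_time:
            break  # no improvement over the previous iteration
        best_arch, best_time = iter_arch, simulated[iter_arch]
        if num_buses(best_arch) >= num_pes:
            break  # as many shared buses as processing elements
        # Part 3: generate the next candidates from the current best.
        candidates = generate(best_arch)
    return best_arch, best_time, best_per_bus_count

# Toy run: an "architecture" is just its number of buses (hypothetical times).
times = {1: 100, 2: 60, 3: 70, 4: 50}
result = select_architecture(
    1, 4, static_estimate=times.get, simulate=times.get,
    generate=lambda n: [n + 1] if n < 4 else [],
    num_buses=lambda n: n, prune=lambda est: list(est))
print(result)
```

In the toy run the loop stops at three buses because the third iteration brings no improvement, even though a four-bus design would have been faster; this greedy stop is inherent in the termination rule described above.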
VII. EXPERIMENTAL RESULTS
This section provides the experimental results on the static
estimation technique for multitask extension, on the priority
assignment heuristic, and on the proposed two-phase explo-
ration methodology. All experiments were conducted on a
Xeon 2.8-GHz workstation running Linux.
A. Validation of the Proposed Performance
Estimation Technique
We compared the static estimation result from the proposed
multitask extension of the previous queuing analysis for the
four-channel DVR system example with a trace-driven simulation.
The trace-driven simulation used in this experiment adopts
a simple bus protocol, in which advanced features such as
address/data bus pipelining, split transactions, multiple
outstanding masters, and so on are not modeled. The trace-driven
simulator schedules tasks using rate-monotonic scheduling
with fixed priorities [19]. However, no scheduling overhead is
considered in the simulator.
TABLE II
RESULTS OF THE EXPLORATION FOR A FOUR-CHANNEL DVR
Fig. 11. Difference between the sum of bus access time and wait time for bus
grant between the proposed static estimation and the trace-driven simulation for
each processing element on a four-channel DVR.
Table II presents the results of architecture exploration
for a four-channel DVR, continued until no more performance gain
is obtained. In Table II, the second column, “Number of
architectures,” shows the number of generated architecture
candidates for the associated number of buses during the
exploration. The last row indicates the average time taken to
estimate an architecture candidate in the first exploration step.
It takes less than one second, which shows the effectiveness of the
proposed technique. In the third column, “Estimated execution
time,” the best performance obtained from the trace-driven
simulation is recorded in bus clock cycles. No performance gain
is obtained with more than two buses, since the overhead of
crossing bus bridges tends to exceed the benefit of distributing
communication traffic by splitting a bus.
Fig. 11 shows the difference in the sum of bus access time
and waiting time for bus grant between the proposed static
estimation and trace-driven simulation. Each bar corresponds to
the maximum difference over the explored architectures with
the same number of buses for a four-channel DVR. The estimation
error compared with the trace-driven simulation is about
28% in the worst case. Since the ratio of bus access time to
the entire execution time does not exceed 30%, the estimation
error on the entire execution becomes less than 10%. Furthermore,
the average estimation error is about 6%. Through the
experiments, we verify that the proposed estimation technique
can be used successfully for reducing the design space for
multitask applications.
Fig. 12. (a) Specification of the equalizer for OFDM DVB-T receiver and (b)
its schedule.
TABLE III
EFFICIENCY OF THE PRIORITY ASSIGNMENT HEURISTIC COMPARED WITH AN
EXHAUSTIVE METHOD: EQUALIZER FOR THE OFDM DVB-T RECEIVER
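The bound on the overall estimation error quoted above can be checked with a short calculation. As a sketch, assume (this is not stated in the text) that only the bus-related part $T_{bus}$ of the total execution time $T$ is misestimated, with relative error $e_{bus}$; these symbols are introduced here for illustration only.

$$
\frac{|\Delta T|}{T} = e_{bus} \cdot \frac{T_{bus}}{T} \le 0.28 \times 0.30 = 0.084 < 0.10
$$

Under that assumption, the worst-case 28% error on bus access and waiting time corresponds to at most about 8.4% on the entire execution, which is consistent with the 10% bound used for ESTIMATION_ERROR.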
B. Validation of the Priority Assignment Heuristic
We compared the efficiency of the proposed priority assign-
ment heuristic with the exhaustive assignment for architecture
candidates during exploration of two examples: the equalizer
subsystem of an OFDM DVB-T receiver and a four-channel
DVR system, starting from the single-bus architecture. In a
DVB-T receiver, the equalizer is used for correcting the am-
plitude distortion of received signals [17]. Function blocks of
the equalizer are mapped onto five ARM940T processors and
are scheduled in a pipelined fashion to make all processors run
concurrently, as shown in Fig. 12(b). Due to an excessively
long run time of the exhaustive search, we only considered
single-bus and dual-bus implementations for performance
comparison. Comparison results are summarized in Tables III
and IV for each example, respectively.
For each bus topology, the numbers of priority-assignment
combinations investigated by the exhaustive search and by the
heuristic are given in the rows “# of total assignments per
architecture” and “# of average assignments per architectures,”
as a total and an average, respectively. Compared with the
exhaustive search, the proposed heuristic assignment reduces
the search space significantly, by about 12 times in a single-bus
implementation and about 19 times in dual-bus implementations.
We evaluated the performance of all combinations of
priority assignment by the exhaustive method and then recorded the
values normalized with respect to the best, whose performance
is set to 1. The smaller the value is, the better the performance
is. It should be noted that a wrong priority assignment leads to
30% performance degradation in the worst case. This confirms the
importance of optimal priority assignment.
TABLE IV
EFFICIENCY OF THE PRIORITY ASSIGNMENT HEURISTIC COMPARED
WITH AN EXHAUSTIVE METHOD: FOUR-CHANNEL DVR
The column “Initial/Tuning” reports the results of the
proposed heuristic. The first value is obtained from the initial
assignment, while the second one is obtained after the annealing
process. As shown in the tables, the initial assignment does not
always guarantee acceptable results. Even when the quality of the
initial assignment alone is not good, it can be raised close to the
optimum by the annealing process. In both examples, the heuristic
results deviate from the optimum by 1% at most for
various bus architectures. In the case of the single-bus
implementation of the four-channel DVR, the optimum was found by the
heuristic. This shows that the proposed heuristic is effective in
finding an optimized priority assignment as well as in reducing the
search space drastically.
C. Validation of the Proposed Exploration Methodology
The proposed two-phase exploration technique is applied to
the previous examples. The maximum number of architectures
to be evaluated in the second pruning step, MAX_ARCH in
Fig. 10, was fixed to 20. Table V presents the results of
exploration for each example system. We do not separate local
memory accesses to the local buses in this experiment.
Exploration considering local buses will be discussed in the next
experiment to investigate the additional performance improvement
due to local buses. In each set of Table V, the first column, “# of
arch,” shows the number of generated architecture candidates
having the associated number of buses during the whole
exploration. The number of processing elements of a system is equal
to the maximum number of buses that an architecture candidate
may have.
TABLE V
RESULTS OF EXPLORATION FOR THE EXAMPLE SYSTEMS
The column “Speed up” shows the performance improvement
of the best architecture among the architecture candidates com-
pared with the initial single-bus architecture. Performance im-
provement tends to be saturated near the end of the exploration.
It is noteworthy that the maximum performance of example sys-
tems is about 50% to 100% better than that of the single-bus
architecture. Since such improvement comes from optimization
of only communication architecture and memory allocation, it
confirms the usefulness of the proposed technique.
The four rows from the bottom present the total number of
architecture candidates explored, the architecture pruning ratio
achieved by the static performance estimation, the average time
taken for the static performance estimation of an architecture
candidate, and the total elapsed time for the exploration,
respectively. In the case of the four-channel DVR, the pruning
ratio is close to 100%. The entire set of architecture candidates
includes those generated by priority assignments as well as by
moving processing elements and shared memories. As reported in the
second row from the bottom, each architecture candidate is
evaluated in less than one second on average. The total execution
time in the last row includes the trace-driven simulation.
The performance variation of each system according to the
number of buses is shown in Fig. 13. For all systems, the dual-bus
system has the widest performance variation, meaning that a
wrong mapping of processing elements or a wrong memory
allocation could lead to significant performance degradation. For
example, in the four-channel DVR and System3, the worst
performance of the dual-bus architecture is even inferior to that of
the single-bus architecture. However, as more buses are used, the
variation becomes smaller, since multiple buses exploit the
concurrency of memory accesses well enough to compensate for the
performance degradation due to a wrong mapping of processing
elements and memory allocation. If we take the best
performance for each number of buses, we obtain the Pareto-optimal
set of bus architectures.
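Collecting the best performance per bus count is a small bookkeeping step, sketched below. The function names and the sample numbers are hypothetical. The second function adds a dominance filter that is not described in the text: it assumes that, at equal or worse execution time, an architecture with fewer buses is preferred.

```python
def best_per_bus_count(results):
    """results: iterable of (num_buses, execution_time) for explored architectures.

    Returns a dict mapping each bus count to its best (smallest) execution time.
    """
    best = {}
    for n, t in results:
        if n not in best or t < best[n]:
            best[n] = t
    return best

def pareto_front(best):
    """Drop bus counts that are no faster than some smaller bus count."""
    front, fastest = [], float("inf")
    for n in sorted(best):
        if best[n] < fastest:
            front.append((n, best[n]))
            fastest = best[n]
    return front

# Hypothetical normalized execution times (1.0 = initial single-bus architecture).
results = [(1, 1.0), (2, 0.7), (2, 1.1), (3, 0.75), (4, 0.6)]
best = best_per_bus_count(results)
front = pareto_front(best)
print(best, front)
```

In this made-up data the three-bus point is removed by the filter because the best two-bus architecture is already faster.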
Fig. 13. Performance variation of the example systems during exploration,
varying the number of buses. In each graph, horizontal and vertical axes repre-
sent the number of buses used and the normalized execution time, respectively.
(a) Four-channel DVR. (b) Equalizer for DVB-T receiver.
TABLE VI
PERFORMANCE IMPROVEMENT OBTAINED BY CONSIDERING LOCAL
BUS EXPLORATION
As the last experiment, we examined how much performance
improvement can be obtained by using dedicated local buses for
local memory access, as discussed in Section V-A. The same
environment and configurations as in the previous experiment are
used again for the equalizer and DVR examples. Contrary to the
previous experiment, local buses of processing elements are
explored to get further performance improvement. Tendencies
similar to those shown in Table V and Fig. 13 were observed. Now,
we focus on how much performance improvement is obtained
and summarize the results in Table VI. For each example,
performance values for three types of architecture are reported: the
initial single-bus architecture, the best architecture without local
buses, and the best one with local buses. Performance values
are provided both as normalized values and in bus clock cycles. The
performance of the initial single-bus implementation is set to 1. The
performance improvement with local buses is about 100%, i.e.,
it becomes two times faster than the initial single-bus architecture
for both examples. It is about 20% better than the architecture
without local buses.
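The relation between normalized execution times and the quoted improvements can be written out explicitly. This is a sketch that assumes improvement is measured as a ratio of execution times, which matches the statement that a 100% improvement means two times faster; $T_{single}$, $T_{shared}$, and $T_{local}$ are illustrative symbols for the execution times of the initial single-bus architecture, the best architecture without local buses, and the best one with local buses.

$$
\frac{T_{single}}{T_{local}} - 1 \approx 1.0
\quad\Longleftrightarrow\quad
\frac{T_{local}}{T_{single}} \approx 0.5,
\qquad
\frac{T_{shared}}{T_{local}} - 1 \approx 0.2
$$

Read this way, a normalized execution time of about 0.5 in Table VI corresponds to the reported 100% improvement.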
VIII. CONCLUSION
In this paper, we have presented an iterative two-step ex-
ploration technique of bus-based on-chip communication ar-
chitectures and memory allocation. At each iteration, the first
step quickly reduces the large design space drastically by using
an efficient static performance estimation method based on a
queuing model. In the second step, the reduced design space is
explored using a trace-driven simulation to choose the best ar-
chitecture candidate. Experimental results with two examples,
a four-channel DVR and the equalizer subsystem for OFDM
DVB-T receiver, and three randomly generated examples val-
idated the efficiency and the viability of the proposed technique
to explore the wide design space.
The main contribution of the proposed technique is that
we explored the larger design space by the two-step design
space exploration, considering multiple design axes such as the
number of buses, bus topology, component allocation, priority
assignment, and other bus operating conditions. Since the
proposed exploration technique is extensible, more design axes
can easily be added. We also extended the previous works to
consider local memory accesses and multitask applications in
the proposed static performance estimation.
The main overhead of the proposed methodology is the building
of a trace-driven simulator and a new queuing model for static
performance analysis for each new bus standard of interest. Even
though only the performance metric has been investigated in this
paper, the proposed methodology is extensible to consider other
metrics such as power consumption, which is currently under
development. Other future work includes the extension of the
proposed methodology to off-chip systems, i.e., board-level systems,
and to network-on-chip architectures.
ACKNOWLEDGMENT
The authors would like to acknowledge the useful comments
and suggestions of the anonymous reviewers who helped im-
prove the quality of this paper. The ICT and ISRC at Seoul Na-
tional University and IDEC provided research facilities for this
study.
REFERENCES
[1] K. Keutzer, S. Malik, R. Newton, J. Rabaey, and A. Sangiovanni-Vin-
centelli, “System-level design: Orthogonalization of concerns and plat-
form-based design,” IEEE Trans. Comput.-Aided Des. Integr. Circuits
Syst., vol. 19, no. 12, pp. 1523–1543, Dec. 2000.
[2] T. Yen and W. Wolf, “Communication synthesis for distributed em-
bedded systems,” in Proc. Int. Conf. Comput.-Aided Des., Nov. 1995,
pp. 288–294.
[3] K. Lahiri, A. Raghunathan, and S. Dey, “System-level performance
analysis for designing system-on-chip communication architecture,”
IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 20, no. 6,
pp. 768–783, Jun. 2001.
[4] P. Lieverse, P. van der Wolf, K. Vissers, and E. Deprettere, “A method-
ology for architecture exploration of heterogeneous signal processing
systems,” J. VLSI Signal Process. Syst., vol. 29, no. 3, pp. 197–207,
Nov. 2001.
[5] K. Lahiri, A. Raghunathan, and S. Dey, “Design space exploration
for optimizing on-chip communication architectures,” IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst., vol. 23, no. 6, pp.
952–961, Jun. 2004.
[6] J. Gong, D. D. Gajski, and S. Bakshi, “Model refinement for
hardware-software codesign,” in Proc. Eur. Des. Test Conf., Mar. 1996, pp.
270–274.
[7] M. Gasteier, M. Munch, and M. Glesner, “Generation of interconnect
topologies for communication synthesis,” in Proc. Des. Autom. Test Eur.,
Feb. 1998, pp. 36–43.
[8] T. van Meeuwen, A. Vandecappelle, A. van Zelst, and F. Catthoor,
“System-level interconnect architecture exploration for custom
memory organizations,” in Proc. Int. Symp. Syst. Synthesis, Oct. 2001,
pp. 13–18.
[9] S. Meftali, F. Gharsalli, F. Rousseau, and A. A. Jerraya, “An
optimal memory allocation for application-specific multiprocessor
system-on-chip,” in Proc. Int. Symp. Syst. Synthesis, Oct. 2001, pp.
19–24.
[10] S. Kim, C. Im, and S. Ha, “Schedule-aware performance estimation
of communication architecture for efficient design space exploration,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 5, pp.
539–552, May 2005.
[11] A. Nandi and R. Marculescu, “System-level power/performance anal-
ysis for embedded systems design,” in Proc. Des. Autom. Conf., Jun.
2001, pp. 599–604.
[12] J. A. Rowson and A. Sangiovanni-Vincentelli, “Interface based de-
sign,” in Proc. Des. Autom. Conf., Jun. 1997, pp. 178–183.
[13] P. V. Knudsen and J. Madsen, “Communication estimation for hard-
ware/software codesign,” in Proc. Int. Symp. Hardware/Software
Codesign, Dec. 1998, pp. 55–59.
[14] On-chip CoreConnect bus architecture, IBM [Online]. Available:
http://www.chips.ibm.com/products/coreconnect/index.html
[15] ARM Advanced Microcontroller Bus Architecture (AMBA), ARM [Online].
Available: http://www.arm.com/products/solutions/AMBAHomePage.html
[16] Sonics, Integration Architectures [Online]. Available:
http://www.sonicsinc.com
[17] F. Frescura, S. Pielmeier, G. Reali, G. Baruffa, and S. C. Cacopardi,
“DSP based OFDM demodulator and equalizer for professional
DVB-T receivers,” IEEE Trans. Broadcast., vol. 45, no. 3, pp.
323–332, Sep. 1999.
[18] F. Poletti, D. Bertozzi, L. Benini, and A. Bogliolo, “Performance anal-
ysis of arbitration policies for SoC communication architectures,” J.
Des. Autom. Embedded Syst., vol. 8, pp. 189–210, Jun./Sep. 2003.
[19] C. Liu and J. Layland, “Scheduling algorithms for multiprogramming in
a hard real-time environment,” J. ACM, vol. 20, no. 1, pp. 46–61, Jan.
1973.
[20] N. Thepayasuwan and A. Doboli, “Layout conscious bus architecture
synthesis for deep submicron systems on chip,” in Proc. Des. Autom.
Test Eur., Feb. 2004, pp. 10108–10115.
[21] S. Srinivasan, L. Li, and N. Vijaykrishnan, “Simultaneous partitioning
and frequency assignment for on-chip bus architectures,” in Proc. Des.
Autom. Test Eur., Mar. 2005, pp. 218–223.
[22] X. Zhu and S. Malik, “A hierarchical modeling framework for
on-chip communication architectures,” in Proc. Int. Conf. Comput.-Aided
Des., Nov. 2002, pp. 663–671.
[23] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and
A. Sangiovanni-Vincentelli, “Addressing system-on-a-chip intercon-
nect woes through communication-based design,” in Proc. Des. Autom.
Conf., Jun. 2001, pp. 667–672.
[24] M. Drinic, D. Kirovski, S. Meguerdichian, and M. Potkonjak, “La-
tency-guided on-chip bus network design,” in Proc. Int. Conf. Comput.-
Aided Des., Nov. 2000, pp. 420–423.
[25] Wishbone System-on-Chip Interconnection Architecture for Portable
IP Cores, Silicore and OpenCores [Online]. Available:
http://www.opencores.org
[26] W. B. Jone, J. S. Wang, H. Lu, I. P. Hsu, and J. Y. Chen, “Segmented
bus design for low-power systems,” IEEE Trans. Very Large Scale In-
tegr. (VLSI) Syst., vol. 7, no. 1, pp. 25–29, Mar. 1999.
[27] C.-T. Hsieh and M. Pedram, “Architectural power optimization by bus
splitting,” in Proc. Des. Autom. Test Eur., Mar. 2000, pp. 612–616.
[28] S. Stidham, “A last word on L = λW,” Oper. Res., vol. 22, pp.
417–421, 1974.
Sungchan Kim received the B.S. degree in material
science and engineering, the M.S. degree in computer
engineering, and the Ph.D. degree in electrical engi-
neering and computer science from Seoul National
University, Seoul, Korea, in 1998, 2000, and 2005,
respectively.
He is presently with Samsung Electronics, Yongin,
Gyeonggi, Korea. His research interests include hard-
ware/software codesign, analysis of performance and
power consumption, and architecture optimization of
SoC for multimedia applications.
Soonhoi Ha (M’94) received the B.S. and M.S. de-
grees in electronics engineering from Seoul National
University, Seoul, Korea, in 1985 and 1987, respec-
tively, and the Ph.D. degree in electrical engineering
and computer science from the University of Cali-
fornia, Berkeley, in 1992.
He was with Hyundai Electronics Industries Cor-
poration from 1993 to 1994 before he joined the fac-
ulty of the School of Electrical Engineering and Com-
puter Science, Seoul National University, where he is
currently a Professor. His primary research interests
are various aspects of embedded system design including hardware/software
codesign and design methodologies.
