System Synthesis of Synchronous Multimedia Applications by Qu, Gang et al.
Gang Qu, Malena Mesarina, and Miodrag Potkonjak
Computer Science Department, University of California, Los Angeles, CA 90095
{gangqu, malena, miodrag}@cs.ucla.edu
Abstract
Modern system design is increasingly driven by applications such as
multimedia and wireless sensing and communications, all of which have
intrinsic quality of service (QoS) requirements such as throughput,
error rate, and resolution. One of the most crucial QoS guarantees that
a system has to provide is the timing constraints among the interacting
media (synchronization) and within each medium (latency). We have
developed the first framework for system design with timing QoS
guarantees on latency and synchronization. In particular, we address
how to design a system-on-chip with minimal silicon area that meets the
timing constraints. We propose a two-phase design methodology. In the
first phase, we select an architecture that serves the needs of
synchronous low-latency applications well. In the second phase, for a
given processor configuration, we use our new scheduler to minimize the
storage requirements. We have developed scheduling algorithms that
solve the problem optimally for a priori specified applications. The
algorithms have been implemented and their effectiveness demonstrated
on a set of simulated MPEG streams from popular movies.
1 Introduction
Multimedia applications have intrinsic requirements on deadlines for
processing the incoming data (latency) and on the coherent playout of
different types of data (e.g., synchronization among text, image,
audio, and video, or among multiple video/audio streams). The timing
relationship among the interacting media (synchronization) and within
each medium (latency) (footnote 1) is one of the most important metrics
for the quality of service (QoS) provided by a system that supports
such applications, and it must be satisfied at presentation
(footnote 2) time. For example, the lip-sync of audio and video usually
requires 25 or 30 synchronization points per second. Cen et al. [1]
provide lip synchronization in an MPEG player by simultaneously
displaying audio and video frames with the same sequence number. Qiao
and Nahrstedt [7] design a fine-grain lip-sync algorithm that first
estimates the audio playback and video decoding times and then adopts a
selective dropping policy for each type of frame (I, P, or B).
Synchronization has been discussed in both of the recently proposed
multimedia standards [3, 6].

Footnote 1: These are sometimes referred to as inter-media
synchronization and intra-media synchronization, respectively.
Footnote 2: By presentation, we mean the delivery of various media to
the user, for example, the display of text and images or the dynamic
playout of audio, video, and animations.
System design traditionally focuses on optimizing objectives such as
power, cost, area, and performance. As embedded CPU cores become
increasingly popular in VLSI systems and multiple embedded cores are
integrated on a single silicon die, system designers have to use
real-time design techniques to meet the design constraints. Memory
hierarchies, in particular caches and on-chip memory, play a very
important role in achieving high performance in modern RISC embedded
cores. One may use a fast CPU and large caches to improve performance;
however, this requires a large silicon area and, for a fixed silicon
area, restricts the on-chip memory. Research on real-time scheduling
suggests that a proper scheduler with certain knowledge of the upcoming
applications requires less storage [2, 5]. How to provide QoS
guarantees has not received the attention it deserves in the system
design community. In this paper, we address the problem of system
design with both these traditional optimization targets and QoS
guarantees, in particular how to take synchronization and latency
requirements into account during system-on-chip (SoC) design.
We propose a two-phase design methodology: (i) selection of the
hardware configuration and (ii) storage minimization via task
scheduling. Different processor cores, combined with different sizes of
I-cache and D-cache, deliver different performance. In the hardware
configuration selection phase, we exclude a combination if it occupies
more area than another one without producing better performance. For
each of the remaining hardware configurations, we determine the minimal
storage required to satisfy the QoS guarantees by finding an optimal
schedule. The resulting systems are then evaluated, and the best one is
chosen based on the optimization targets. We develop an off-line
pseudo-polynomial scheduling policy that is provably optimal in
minimizing the storage under the timing constraints.
A Motivational Example
We use a small example to illustrate the importance of the scheduling
discipline. Suppose there are two applications, A and B, to be
processed on a single processor. Each application consists of a
sequence of tasks, each with a memory storage requirement, a CPU time
requirement, and a latency constraint, as shown in Table 1.
Arrival Time          0    1    2    3    4    5
Storage          A   10    2   30   10    1   11
Requirement      B    1   20    3   30   10    8
Latency          A    3    3    3    7    7    7
Constraint       B    4    4    4    5    5    5
Table 1. Latency (in units of CPU time) and storage requirement (in
units of memory) for the task sequences of two applications. The i-th
task of an application arrives at time i and has a deadline equal to
the sum of its arrival time and latency.
For simplicity, we assume each task takes exactly 1 unit of CPU time to
execute. The processor can start executing a task on its arrival and
frees the memory occupied by the task as soon as it is finished. Tasks
in the same application are served in first come first serve (FCFS)
order. Let $t^A_i$ and $t^B_i$ be the finishing times of the i-th tasks
of A and B. We say A and B are k-synchronized if
$|t^A_i - t^B_i| \le k$ for all i. We want to schedule the tasks such
that no deadline is missed, a pre-defined level of synchronization is
achieved, and the memory requirement is minimized.
To solve the problem, we first construct a storage requirement table
(Figure 1), where entry $(i, j)$ indicates the total storage
requirement at the end of time $i + j$ when $i$ CPU units have been
assigned to B and $j$ CPU units to A. An entry marked by "X" indicates
that at least one deadline is missed. For example, from Table 1 we know
that tasks $A_0$, $A_1$, and $B_0$ have to be finished by time 4;
therefore, any scheduler that reaches entry (0,4), (3,1), or (4,0) will
fail to satisfy all the latency constraints.
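The storage table and the deadline marks can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and it assumes that the task arriving at time i + j is already counted in entry (i, j). The exhaustive search at the end simply enumerates all 924 possible slot assignments for this small example.

```python
from itertools import combinations

# Data copied from Table 1 (list index = task number = arrival time).
MEM = {"A": [10, 2, 30, 10, 1, 11], "B": [1, 20, 3, 30, 10, 8]}
LAT = {"A": [3, 3, 3, 7, 7, 7], "B": [4, 4, 4, 5, 5, 5]}
N = 6  # tasks per application


def storage(i, j):
    """Storage held at time i + j when B has finished i tasks and A has finished j."""
    arrived = min(i + j, N - 1) + 1  # assumption: the arrival at time i + j is counted
    return sum(MEM["A"][j:arrived]) + sum(MEM["B"][i:arrived])


def misses_deadline(i, j):
    """True if some unfinished task has a deadline (arrival + latency) <= i + j."""
    now = i + j
    late_a = any(t + LAT["A"][t] <= now for t in range(j, N))
    late_b = any(t + LAT["B"][t] <= now for t in range(i, N))
    return late_a or late_b


def best_peak(k):
    """Smallest peak storage over all deadline-feasible, k-synchronized schedules."""
    best = None
    for a_slots in combinations(range(2 * N), N):  # time slots given to A
        a_slots = set(a_slots)
        i = j = 0
        peak = storage(0, 0)
        finish = {"A": [], "B": []}
        feasible = True
        for t in range(2 * N):
            if t in a_slots:
                j += 1
                finish["A"].append(t + 1)
            else:
                i += 1
                finish["B"].append(t + 1)
            if misses_deadline(i, j):
                feasible = False
                break
            peak = max(peak, storage(i, j))
        in_sync = feasible and all(
            abs(fa - fb) <= k for fa, fb in zip(finish["A"], finish["B"])
        )
        if in_sync:
            best = peak if best is None else min(best, peak)
    return best
```

Under these assumptions, `best_peak(3)` evaluates to 74 and `best_peak(2)` to 84, which agrees with the storage figures reported in Table 2.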
A scheduler is a path from the upper left corner (0,0) to the lower
right corner. At any entry, the scheduler moves either one step to the
right or one step down, assigning the next CPU time unit to A or B,
respectively. The earliest deadline first (EDF) policy [5] always
selects the task with the earliest deadline. In EDF1, a tie is broken
so as to minimize the number of context switches; in EDF2, a tie is
broken in favor of the task that occupies more memory. In this example,
both EDF1 and EDF2 serve the two applications with a minimal storage
requirement of 93 and achieve 3-synchronization, as shown by the arrow
paths in Figure 1.

Figure 1. Four possible schedulers (solid arrow: EDF1, dotted arrow:
EDF2, bold italic font: 3-sync, circled: 2-sync).
Our off-line optimal algorithm uses dynamic programming to find the
minimal storage requirement at every time instant and then extracts one
scheduler. In this case, an optimal scheduler is the path consisting of
the entries in bold italic font, which uses only 74 memory units and
achieves the same level of synchronization. 2-synchronization is also
possible, as represented by the circled entries. A comparison of the
above 4 schedulers is given in Table 2. One can easily see that the
scheduling policy affects the QoS, and that better synchronization can
be achieved at the expense of extra storage and context switches.
                         EDF1   EDF2   3-sync   2-sync
storage                    93     93       74       84
synchronization             3      3        3        2
# of context switches       4      6        2        5
Table 2. Comparison of the 4 schedulers.
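The EDF1 and EDF2 columns of Table 2 can be reproduced with a small simulation. The sketch below is illustrative only: the helper names are hypothetical, each task is assumed to take one time unit, and storage is accounted as in the storage requirement table (the task arriving at time i + j is counted in entry (i, j)).

```python
# Data copied from Table 1 (list index = task number = arrival time).
MEM = {"A": [10, 2, 30, 10, 1, 11], "B": [1, 20, 3, 30, 10, 8]}
LAT = {"A": [3, 3, 3, 7, 7, 7], "B": [4, 4, 4, 5, 5, 5]}
N = 6


def storage(i, j):
    """Storage held at time i + j when B has finished i tasks and A has finished j."""
    arrived = min(i + j, N - 1) + 1
    return sum(MEM["A"][j:arrived]) + sum(MEM["B"][i:arrived])


def edf(tie_break):
    """Run EDF with tie rule 'EDF1' (avoid a switch) or 'EDF2' (larger memory first)."""
    done = {"A": 0, "B": 0}
    order, last = [], None
    for _ in range(2 * N):
        ready = [x for x in "AB" if done[x] < N]
        deadline = {x: done[x] + LAT[x][done[x]] for x in ready}
        earliest = min(deadline.values())
        tied = [x for x in ready if deadline[x] == earliest]
        if len(tied) == 1:
            pick = tied[0]
        elif tie_break == "EDF1":
            pick = last if last in tied else tied[0]
        else:
            pick = max(tied, key=lambda x: MEM[x][done[x]])
        done[pick] += 1
        order.append(pick)
        last = pick
    return "".join(order)


def metrics(order):
    """Return (peak storage, synchronization level, context switches) of a schedule."""
    i = j = 0
    peak = storage(0, 0)
    finish = {"A": [], "B": []}
    for t, x in enumerate(order):
        finish[x].append(t + 1)
        if x == "A":
            j += 1
        else:
            i += 1
        peak = max(peak, storage(i, j))
    sync = max(abs(a - b) for a, b in zip(finish["A"], finish["B"]))
    switches = sum(1 for p, q in zip(order, order[1:]) if p != q)
    return peak, sync, switches


edf1 = metrics(edf("EDF1"))
edf2 = metrics(edf("EDF2"))
three_sync = metrics("AAABBBBBBAAA")  # one schedule that reaches 74 units
```

With these assumptions, `edf1` is `(93, 3, 4)`, `edf2` is `(93, 3, 6)`, and `three_sync` is `(74, 3, 2)`, matching the first three columns of Table 2.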
2 Background and Problem Formulation
2.1 Architecture and Hardware Model
Figure 2 shows a typical application specific system-on-
chip which consists of microprocessor core(s), instruction
cache, data cache, hardware accelerators, control blocks,
on-chip memory, etc. Several factors combine to influence
the system performance: processor performance, I-cache
and D-cache miss rates and miss penalty, and clock speed.
In particular, the system performance is computed using the following
formula for cycles per instruction (CPI):
$$\mathrm{CPI} = \frac{f}{\mathrm{MIPS}} + \left(\text{Miss Rate}_{\text{I-Cache}} + \text{Miss Rate}_{\text{D-Cache}}\right) \times \text{Miss Penalty},$$
where $f$ is the system clock frequency and MIPS is million
instructions per second.
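As a quick illustration of the CPI formula, the following sketch evaluates it for one core. The function name and the miss-rate and miss-penalty values are assumptions chosen for illustration (miss rates are taken per instruction and the penalty in cycles); only the clock and MIPS figures come from Table 4.

```python
def cycles_per_instruction(clock_mhz, mips, miss_rate_i, miss_rate_d, miss_penalty):
    """CPI = f / MIPS + (I-cache miss rate + D-cache miss rate) * miss penalty."""
    base_cpi = clock_mhz / mips  # cycles per instruction of the core itself
    return base_cpi + (miss_rate_i + miss_rate_d) * miss_penalty


# LSI TR4101 from Table 4: 81 MHz, 30 MIPS.
# The miss rates (2% and 3%) and the 10-cycle penalty are made-up example values.
ideal = cycles_per_instruction(81, 30, 0.0, 0.0, 10)
with_misses = cycles_per_instruction(81, 30, 0.02, 0.03, 10)
```

Here `ideal` is the core's base CPI of about 2.7, and the hypothetical miss rates add about 0.5 cycles per instruction on top of it.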
Caches typically found in current embedded multimedia
systems range from 4KB to 32KB. Although larger caches
correspond to higher hit rates, they occupy a larger silicon
area. Since higher cache associativity results in significantly

Figure 2. A typical core-based application-specific system-on-chip.

We experimented with 2-way set-associative caches, but they did not
dominate in any single case. Cache line size was a variable in our
experimentation. Its variation corresponds to the following trade-off:
a larger line size results in less hardware and area, but a higher
cache miss penalty. We use CACTI [9] as a cache delay estimation tool
with respect to the main cache parameters: size, associativity, and
line size. A sample of the cache model data is given in Table 3.
Cache                   Cache Line Size
Size       8B        16B       64B       128B      512B
4KB       7.6656    7.0208    6.4892    6.5615     -
8KB       8.3447    7.8065    6.9916    6.9057    9.3963
16KB      9.3041    8.5829    7.6205    7.5426    9.7992
32KB     10.4131    9.4499    8.591     8.6902   10.0449
Table 3. Minimal cycle time (ns) for direct-mapped caches with variable
line sizes.
Data on microprocessor cores have been extracted from manufacturers'
datasheets and the CPU Info Center web site [8]. A sample of the
collected data is presented in Table 4. The table lists the operating
frequency, MIPS performance, technology, and area of each embedded
microprocessor core. Given a fixed choice of processor core and caches,
we can calculate the execution time of a given task. A long execution
time implies that a large memory is needed to store the tasks that have
arrived but have not yet been executed.
Processor Core     Clock    MIPS   Technology   Area
                   (MHz)           (μm)         (mm^2)
ARM7 LPower          27      24    0.6           3.8
LSI TR4101           81      30    0.35          2
LSI CW4001           60      53    0.5           3.5
LSI CW4011           80     120    0.5           7
Motorola 68000       33      16    0.5           4.4
PowerPC403           33      41    0.5           7.5
DSP Group, Oak       80      80    0.6           8.4
NEC, R4100           40      40    0.35          5.4
Toshiba, R3900       50      50    0.6          15
StrongARM           233     266    0.35          4.3
Table 4. The performance and area data for sample processor cores.
2.2 Application and Quality of Service Model
We assume that applications are received over a reliable end-to-end
connection. Each application consists of a set of tasks; each task has
an arrival time, a latency, an execution time (for a given hardware
configuration), a storage requirement, and a synchronization
specification with the tasks in other applications. Formally, the j-th
task $A_{ij}$ of the i-th application $A_i$ has the following
parameters:
• $t_{ij}$: the arrival time
• $\tau_{ij}$: the execution time with a given hardware configuration
• $l_{ij}$: the latest time to finish $A_{ij}$ after its arrival
• $m_{ij}$: the memory requirement
• $(n^k_{ij}, s^k_{ij})$: the synchronization of $A_{ij}$ and the
  $n^k_{ij}$-th task in the k-th application, i.e., the finish times of
  $A_{ij}$ and $A_{k n^k_{ij}}$ cannot differ by more than $s^k_{ij}$
  time units (footnote 3).
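The task parameters listed above map naturally onto a small record type. The following is a minimal sketch with hypothetical field names; it is one possible encoding, not a data structure taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Task:
    arrival: int    # arrival time of the task
    exec_time: int  # execution time on the chosen hardware configuration
    latency: int    # latest time to finish, measured from the arrival
    memory: int     # storage occupied until the task has been executed
    # sync[k] = (n, s): finish within s time units of task n of application k
    sync: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    @property
    def deadline(self) -> int:
        """Absolute deadline: arrival time plus latency."""
        return self.arrival + self.latency


def sync_satisfied(finish: int, other_finish: int, s: int) -> bool:
    """True if two finish times differ by at most s time units."""
    return abs(finish - other_finish) <= s


# Third task of application A in Table 1, required to finish within
# 3 time units of task 2 of a second application (index 1, hypothetical).
a2 = Task(arrival=2, exec_time=1, latency=3, memory=30, sync={1: (2, 3)})
```

In this encoding `a2.deadline` is 5, the sum of the arrival time and latency given in Table 1.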
On the service side, we assume that tasks within the same application
are processed in first come first serve (FCFS) fashion, and that there
is a charge for a context switch between different applications. The
memory occupied by a task can be freed as soon as the task has been
executed. The execution time of a task depends on the hardware
configuration; for example, a fast processor core and a large cache
with a low miss rate provide a short execution time.
2.3 Problem Formulation and Key Results
We formulate the problem as follows:
Given a set of applications with their computation, stor-
age, latency and synchronization requirements, determine
a system-on-chip (i.e., the type of processor core, sizes of
I-cache, D-cache and on-chip memory) with the minimal
silicon area such that all the application requirements are
satisfied.
We developed a dynamic programming-based algorithm that finds, in
pseudo-polynomial time, the minimal on-chip storage requirement and a
feasible scheduler that services the applications within their timing
constraints (latency and synchronization). The algorithm assumes a
priori knowledge of the data streams and that tasks within the same
application are scheduled following the FCFS policy. However, every
task can have its individual latency and synchronization requests, and
we do not assume that the computation load is proportional to the data
size. Finally, the algorithm is also applicable when a context switch
penalty is explicitly specified.
We define a dominance relationship among the possible SoC
configurations and select the one that requires minimal silicon area
from all the non-dominated configurations. This methodology is valuable
in making early design decisions on silicon area allocation among
processor, cache, memory, and other components.
Footnote 3: Alternatively, we may use tables or matrices to specify the
synchronization among tasks from different applications.
3 Global Synthesis Flow for QoS Guarantees
In this section, we describe the global flow of the proposed
synthesis system and explain the function of each

[Figure 3 label: Pool of processor configurations (cores, I-cache, D-cache, etc.)]
Figure 3. Global flow of the synthesis approach.
The goal is to choose the configuration of processor, I-cache, and
D-cache, and to determine a task schedule with minimal storage once the
hardware configuration is fixed. To accurately predict the system's
performance for the target applications, we employ an approach that
integrates optimization, simulation, modeling, and profiling tools. The
synthesis technique considers each non-dominated microprocessor core
and competitive cache configuration, and selects the hardware setup
that requires minimal silicon area and meets all the QoS requirements
of the applications.
Figure 3 depicts the global flow of the proposed synthe-
sis approach. Starting from a pool of processor cores, I-
cache, and D-cache configurations, we identify all the non-
dominated hardware configurations based on the character-
istics of the given applications. Then for each such system
setup, coupled with the detailed information of the applica-
tions, we determine the minimal storage requirement and a
task schedule to fulfil the QoS demand. Finally we conduct
the system performance estimation, and select the one that
optimizes our design goal.
4 Synthesis Techniques
4.1 Resource Allocation
The objective in this phase is to find an area-efficient sys-
tem configuration since area is our primary optimization tar-
get.
We conduct an exhaustive search over all the processor cores, I-caches
(ranging from 512B to 32KB), D-caches (ranging from 4KB to 32KB), and
cache line sizes (from 8B to 512B). For each combination, we estimate
the system performance and area. One processor type dominates another
if it uses less area and results in the same or better system
performance. The non-dominated system configurations are kept, and task
scheduling is performed on these configurations to identify the most
area-efficient design.
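The dominance filter can be sketched as follows. This is an illustration only: it uses the raw core area and MIPS figures of Table 4 as stand-ins for system area and performance, ignoring caches and process technology, so the surviving set is not the paper's result. The names `Config`, `dominates`, and `non_dominated` are hypothetical.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    area_mm2: float
    performance: float  # larger is better (here: MIPS from Table 4)


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it uses less area and performs the same or better."""
    return a.area_mm2 < b.area_mm2 and a.performance >= b.performance


def non_dominated(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


CORES = [
    Config("ARM7 LPower", 3.8, 24),
    Config("LSI TR4101", 2.0, 30),
    Config("LSI CW4001", 3.5, 53),
    Config("LSI CW4011", 7.0, 120),
    Config("Motorola 68000", 4.4, 16),
    Config("PowerPC403", 7.5, 41),
    Config("DSP Group, Oak", 8.4, 80),
    Config("NEC, R4100", 5.4, 40),
    Config("Toshiba, R3900", 15.0, 50),
    Config("StrongARM", 4.3, 266),
]

survivors = [c.name for c in non_dominated(CORES)]
```

On this simplified input the filter keeps the LSI TR4101, LSI CW4001, and StrongARM cores. The quadratic pairwise comparison is adequate for a pool of this size.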
For each competitive hardware configuration, since the silicon area for
storage is proportional to the size of the on-chip memory, our goal is
to find the minimal amount of storage that meets the latency and
synchronization constraints for a given set of applications. Once the
storage requirement is determined, we can estimate the system
performance and, in particular, calculate the total silicon area.
Finally, a task scheduler is required to schedule the tasks such that
neither a deadline miss nor a storage overflow occurs. We argue that
this cannot be done unless the hardware configuration is fixed, because
the execution time of a task varies with the hardware configuration.
4.2 The Basic Storage Minimization Algorithm
We describe our storage minimization algorithm for the simplest case in
Figure 4, where we have only two applications $A_1$ and $A_2$. Each
application has a task arriving at the end of each time unit, requiring
1 unit of execution time on the given hardware configuration and
$m_{ij}$ units of storage; furthermore, there are no latency or
synchronization constraints (i.e., $t_{ij} = j$, $\tau_{ij} = 1$,
$l_{ij} = \infty$, and $s^k_{ij} = \infty$ for all $j \ge 0$ and
$i = 1, 2$).
Input: $m_{ij}$, the storage requirements for $i = 1, 2$; $0 \le j < T$.
Output: $M$, the minimal memory required, and
        a task schedule with $M$ units of memory.
Procedure:
1. Build the instant memory requirement table IMR:
   $IMR_{i,j} = IMR_{i-1,j} + m_{1,i+j} + m_{2,i+j} - m_{2,i-1}$
   $          = IMR_{i,j-1} + m_{1,i+j} + m_{2,i+j} - m_{1,j-1}$   (*)
2. Build the aggregate memory requirement table AMR:
   $AMR_{i,j} = \max\{IMR_{i,j}, \min\{AMR_{i-1,j}, AMR_{i,j-1}\}\}$   (**)
3. Find a path in table AMR from entry (0,0) to (T,T) without
   crossing any entry larger than $AMR_{T,T}$.
   3.1 start from entry (T,T) in table AMR
   3.2 while (entry (0,0) has not been reached)
       { mark the current entry
         move up or to the left, whichever has an entry $\le AMR_{T,T}$
       }
4. Report this path and $M = AMR_{T,T}$.
Figure 4. The dynamic programming-based algorithm for finding the
minimal memory requirement and a feasible schedule.
Equation (*) computes the memory requirement AT time instant
$k = i + j$ when $i$ slots have been assigned to application $A_2$ and
$j$ slots to $A_1$. This is the total size of the tasks that have
arrived but have not yet been executed,
$\sum_{l=j}^{i+j} m_{1l} + \sum_{l=i}^{i+j} m_{2l}$, according to the
FCFS discipline. Notice that the instant memory requirement is
path-independent: $i$ slots and $j$ slots have been assigned to
applications $A_2$ and $A_1$ respectively, but it does not matter to
which application each specific slot has been assigned.
Equation (**) finds the minimal memory requirement UP TO time instant
$k = i + j$. It has to be large enough to store the unfinished tasks
($\ge IMR_{i,j}$) and to guarantee a feasible path to entry $(i, j)$
from either above ($\ge AMR_{i-1,j}$) or the left ($\ge AMR_{i,j-1}$).
In step 3, any marked entry has its left entry, the entry above it, or
both with value $\le AMR_{T,T}$, which is the minimal storage
requirement. This is guaranteed by equation (**). Once the AMR table is
built (footnote 4), the minimal memory requirement is given by
$AMR_{T,T}$, and a feasible scheduler (a path from (0,0) to (T,T) in
the AMR table) can be found backwards in $2T$ steps as in step 3. The
complexity of this algorithm is $O(T^2)$ in both time and space.
4.3 Modifications for QoS Guarantees
In this section, we briefly discuss how to modify the
above algorithm to meet the QoS guarantees (e.g. latency,
synchronization) for general applications (e.g. individual
arrival time, latency, execution time) when there is a charge
for context switching.
latency: As we have seen in Figure 1, adding an (individual) latency
constraint simply decreases the amount of computation for building the
IMR and AMR tables. For example, if the first task of application $A_1$
has to be finished by time 4, there is no need to compute entries
$(i, 0)$ for all $i \ge 4$.
synchronization: Like latency, synchronization constraints reduce the
number of entries to be filled in both the IMR and AMR tables. For
instance, if we want a solution that is 1-synchronized, it is
sufficient to fill only the entries $(i, i)$, $(i-1, i)$, and
$(i+1, i)$.
execution time: Recall that in the IMR table, entry $(i, j)$ is the
memory required to store the tasks that have arrived but have not yet
been finished. So when tasks require different execution times, we only
free the storage for the tasks in $A_1$ that can be finished in $j$
time units and those in $A_2$ that can be finished in $i$ time units.
arrival time: This case is similar to the case when tasks have
individual execution times.
N applications: If there are N applications instead of only 2, we have
to build an N-dimensional table to find the optimal solution. This of
course increases the complexity of the algorithm.
context switch: If there is a charge for context switching, a context
switch corresponds to a turn on the path from the upper left corner to
the lower right corner. A path (scheduler) with the minimal number of
turns (context switches) can be found similarly by dynamic programming.

Footnote 4: One can easily combine (*) and (**) to build the AMR table
directly. Here we use an intermediate table IMR to explain our
approach. In both cases, the space and time complexity is $O(T^2)$.
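The context-switch case can be illustrated with a second dynamic program that counts turns. The sketch below is one possible formulation, not the paper's: it assumes a boolean table `allowed[i][j]` marking the entries a schedule may visit (here, the entries of the Table 1 example that miss no deadline and need at most 74 units of storage), and all names are hypothetical.

```python
import math


def min_context_switches(allowed):
    """Fewest turns on a path from (0,0) to (T,T) that visits only allowed entries."""
    T = len(allowed) - 1
    DOWN, RIGHT = 0, 1  # direction of the last step taken to reach an entry
    best = [[[math.inf, math.inf] for _ in range(T + 1)] for _ in range(T + 1)]
    for i in range(T + 1):
        for j in range(T + 1):
            if (i, j) == (0, 0) or not allowed[i][j]:
                continue
            if i > 0 and allowed[i - 1][j]:      # arrive by a step down
                prev = best[i - 1][j]
                first = (i - 1, j) == (0, 0)
                best[i][j][DOWN] = 0 if first else min(prev[DOWN], prev[RIGHT] + 1)
            if j > 0 and allowed[i][j - 1]:      # arrive by a step to the right
                prev = best[i][j - 1]
                first = (i, j - 1) == (0, 0)
                best[i][j][RIGHT] = 0 if first else min(prev[RIGHT], prev[DOWN] + 1)
    answer = min(best[T][T])
    return None if answer == math.inf else answer


# Example: Table 1 data, storage bound 74, deadlines as arrival + latency.
MEM_A, MEM_B = [10, 2, 30, 10, 1, 11], [1, 20, 3, 30, 10, 8]
LAT_A, LAT_B = [3, 3, 3, 7, 7, 7], [4, 4, 4, 5, 5, 5]
N, BOUND = 6, 74


def storage(i, j):
    arrived = min(i + j, N - 1) + 1
    return sum(MEM_A[j:arrived]) + sum(MEM_B[i:arrived])


def on_time(i, j):
    now = i + j
    return all(t + LAT_A[t] > now for t in range(j, N)) and all(
        t + LAT_B[t] > now for t in range(i, N)
    )


allowed = [
    [storage(i, j) <= BOUND and on_time(i, j) for j in range(N + 1)]
    for i in range(N + 1)
]
turns = min_context_switches(allowed)
```

On this example `turns` is 2, the number of context switches listed for the 74-unit schedule in Table 2.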
5 Experimental Results
We test the proposed algorithms on MPEG video streams. Standard MPEG
encoders generate three types of compressed frames: I frames
(intra-pictures), P frames (predicted pictures), and B frames
(bi-directionally predicted pictures). On average, I frames are the
largest in size (since they are self-contained), followed by P frames
and B frames. Krunz and Tripathi [4] present a comprehensive model for
MPEG video streams. In particular, the frame sizes of the different
frame types are simulated by three different sub-models, which are
intermixed according to the group-of-pictures pattern. Statistically,
the generated MPEG streams fit the empirical video and are sufficiently
accurate in predicting the queueing performance for real video streams.
We simulate four video streams using the parameters provided in [4];
the information on the generated frames is reported in Table 5. (The
size of I-frames has a relatively large standard deviation because it
is modelled as the sum of two random components.)
Movie                  Number of    I-frame size     P-frame size     B-frame size
                       Frames       μ_I     σ_I      μ_P     σ_P      μ_B     σ_B
Wizard of Oz            41,700     15.18   13.61     4.82    0.64     3.91    0.27
Star Wars              174,960      8.68    5.51     3.93    0.58     2.81    0.52
Silence of the Lambs    39,972      6.53    2.86     2.59    0.86     1.98    0.70
Goldfinger              40,104      9.77    6.60     4.57    0.51     3.26    0.38
Table 5. Simulation of the MPEG streams (μ: mean frame size, σ:
standard deviation).

[Figure 5 plot: movies 1 (Wizard of Oz), 2 (Star Wars), 3 (Silence of the Lambs), 4 (Goldfinger); series: No-Sync, 8-Sync, 4-Sync, 2-Sync.]
Figure 5. Normalized storage requirement vs. synchronization. In movies
3 and 4, 2-synchronization cannot be achieved.
The algorithm in Figure 4 finds the minimal storage requirement for a
set of a priori specified applications. For each of the above MPEG
video movies, we find the storage requirement of an EDF scheduler. In
EDF, a tie is broken randomly (footnote 5). Our off-line algorithm has
been applied four times: with no synchronization, 2-sync, 4-sync, and
8-sync. The off-line optimal storage requirements are normalized with
respect to that of the EDF policy, as shown in Figure 5. The key
feature of these solutions is that they have synchronization
guarantees, and the trend is clear: better synchronization needs more
storage. In all cases, the off-line EDF policy, which achieves 12- to
15-synchronization, requires more storage.

[Figure 6 legend: I: I(4KB,8B), D(4KB,8B); II: I(4KB,16B), D(4KB,8B); III: I(4KB,16B), D(4KB,16B); IV: I(4KB,16B), D(4KB,64B); V: I(4KB,64B), D(4KB,64B).]
Figure 6. System performance (cycles per instruction) vs. the silicon
area (mm^2) for different hardware configurations.
Different processor cores use different amounts of silicon area and
deliver different performance (see Table 4). We investigate each
processor core with I-cache and D-cache setups (sizes from 4KB to 32KB)
and cache line setups (sizes from 8B to 512B). For the sample MPEG
frames, we conclude that the LSI TR4101 core with a 4KB I-cache and a
4KB D-cache has the best performance in terms of silicon area. Details
are reported in Figure 6.
6 Conclusion
In this paper, we address the problem of how to design a system-on-chip
with minimal silicon area that meets the QoS requirements of real-time
applications. We select timing constraints (synchronization and
latency) as the measure of QoS and propose an algorithm that determines
the minimal storage and a feasible schedule for a given hardware
configuration to provide QoS guarantees for given applications. We
propose a two-phase design methodology of hardware configuration
selection and storage minimization. For a fixed hardware configuration,
our storage minimization algorithm provides the optimal solution that
meets all the QoS requirements. We show that better synchronization can
be achieved at the cost of more storage. Experiments on simulated MPEG
movies demonstrate that our scheduler saves storage over off-line EDF
policies and provides synchronization guarantees.

Footnote 5: Actually, we experimented with two EDF policies, in which
the tie is won by the task with the largest memory and by the one with
the least execution time, respectively. However, they provide solutions
with a negligible difference in storage requirement.
References
[1] S. Cen, C. Pu, R. Staehli, C. Cowan, and J. Walpole. A Distributed
Real-Time MPEG Video Audio Player. Proceedings of the Fifth
International Workshop on Network and Operating System Support for
Digital Audio and Video, pp. 151-162, 1995.
[2] H-J. Chen and T.D.C. Little. Storage allocation policies for
time-dependent multimedia data. IEEE Transactions on Knowledge and
Data Engineering, Vol. 8, No. 5, pp. 855-864, 1996.
[3] International Organization for Standardization. Information
Technology Coding of Multimedia and Hypermedia Information. ISO/IEC
13522-1. October 1994.
[4] M. Krunz and S.K. Tripathi. On the characterization of VBR MPEG
streams. ACM International Conference on Measurement and Modeling of
Computer Systems (SIGMETRICS 97), pp. 192-202, 1997.
[5] C.L. Liu and J.W. Layland. Scheduling algorithms for
multiprogramming in a hard-real-time environment. Journal of the ACM,
Vol. 20, No. 1, pp. 47-61, 1973.
[6] B.D. Markey. HyTime and MHEG. Digest of Papers, Thirty-Seventh IEEE
Computer Society International Conference, pp. 25-40, 1992.
[7] L. Qiao and K. Nahrstedt. Lip synchronization within an adaptive
VOD system. Proceedings of the International Society for Optical
Engineering, pp. 170-181, 1997.
[8] http://infopad.eecs.berkeley.edu/CIC/.
[9] S.J.E. Wilton and N.P. Jouppi. CACTI: an enhanced cache access and
cycle time model. IEEE Journal of Solid-State Circuits, Vol. 31, No. 5,
pp. 677-688, 1996.
