Application-specific workload shaping in resource-constrained media players by RAMAN BALAJI
APPLICATION-SPECIFIC WORKLOAD SHAPING IN
RESOURCE-CONSTRAINED MEDIA PLAYERS
BALAJI RAMAN
Master of Science, NUS
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
January 2009
Acknowledgments
Samarjit Chakraborty, my graduate advisor and guru, accepted me as his
PhD student, proposed this thesis topic, and was substantially involved in my
research, writing, and presentation. Samarjit’s empathy towards students, his
tolerance for my annoying demands, and his patience with my tortoise pace
deserve a standing ovation from heaven. Samarjit taught me to acquire
excellence as a habit, and to reject mediocrity, especially in writing. His
countless pieces of advice on both technical and non-technical matters resonate in
my everyday academic life.
Wei Tsang Ooi, my co-advisor and mentor, taught me and trained me in the
fundamental skills that a research student should possess. This thesis benefited
from Wei Tsang’s insistence on clarity in writing, correctness in results,
and simplicity in style. His emphasis on research ethics was such that those
rules are hammered into my head. Wei Tsang spent innumerable
hours in meetings and in reviewing my writing. These countably infinite hours
do not include the hours he spent on devising small courses on writing,
reading and presentation, and pondering on my research topics on his own.
Never tiring of these labors, and being an excellent listener, he offered great
career advice that suited me.
Tulika Mitra, my master’s thesis advisor, paved the way for my doctoral
studies. I enjoyed our weekly meetings, where I learned why and how to put
in the effort to think and concentrate on a research problem. I practice the
discipline and the integrity that Tulika taught, conveyed through her own
actions. Apart from all this advice, I benefited greatly from Tulika’s teaching
on diligence in writing, especially when presenting related work.
I had the good fortune that Paolo Ienne gave me an opportunity to do an
internship at EPFL. The intense intellectual discussion on my thesis helped
me to a great extent to write my thesis after my internship. Paolo presented
my thesis work in an important international forum, and explained its impact
to the relevant audience. His advice on my career had a significant, positive
impact on my applications for postdoctoral jobs.
I thank the numerous reviewers of my publications, who pointed out sev-
eral improvements, and gave concrete suggestions. In particular, I thank my
thesis committee members, Weng Fai Wong, Wang Ye, and Andy Pimentel.
Many people gave generously of their time, and helped me with the administration.
I thank Loo Line Fong for responding to me promptly at critical
times. I thank her as well for administrative support during my student years
at NUS. I thank Chan Tim Fook, Embedded Systems laboratory in-charge,
who provided me with all the computational resources I needed. I thank the
following friends who helped me to communicate with staff at NUS when I
came to France: Ankit Goel, Ashwin Nanjappa, and Deepak Gangadharan.
Chantal Schneeberger, administrative staff at EPFL, went out of her way
to help during my internship in Lausanne, Switzerland.
My friends provided the needed rest and relaxation in the form of plays
and movies. I thank Chanakya, Subramanian, and Sudharsanan for counseling
me at difficult times, for loaning money when needed, and for providing
company when deadlines required working past midnight. I cherished the
company of Ramkumar, Senthilnathan, Unmesh, Chandra, Vijaykumar, Pan
Yu, Linh, Kathy, Yanhong, Satish, Cheng Wei, Ma Lin, and other friends.
I am profoundly grateful to my parents, who were tolerant when I was too busy
for trips to India, who stayed with me in Singapore for many months, who
responded with useful advice and counseling every week, and who energized
me during my vacations in India. As though that were not enough, my father
bore with me when I discussed all the technical details of my research
work, and my mother sounded persuaded when I reasoned why I am a student
for so many years.
I am indebted to my sister Sudha Raman, whose confidence and success
are infectious, and whose encouragement provided me with the essential moral
support needed for my stay in Singapore. She provided me with partial
financial support for attending conferences, and when my stipend arrived
late. Sudha showed lots of patience whenever I stressed out over studies,
and vented at home. Sudha, from childhood, led me in my personal and
academic life. While I will choose a different venue to completely state her
positive influence on me, in brief:
I dedicate this thesis to my sister Sudha Raman.
I thank you all and God.
Table of Contents
1 Introduction
1.1 What is Workload Shaping?
1.2 Shaping Techniques
1.3 Thesis
2 Background
2.1 Analytical Model: A Bird’s Eye Review
2.2 Tuning Scheduler Parameters
2.2.1 Methodologies
2.3 Preliminaries
2.3.1 Our System Model
3 Buffering for Smoothing
3.1 Buffering Vs Workload
3.1.1 Basic Intuition
3.1.2 Related Work
3.2 Frequency Estimation
3.3 Delay Redistribution
3.3.1 Motivation
3.3.2 Relation to Previous Work
3.3.3 Illustrative Example
3.3.4 Problem statement
3.3.5 Playout Delay Redistribution
3.3.6 Buffer Size Estimation
3.3.7 Results
4 Buffering for Multiple Applications
4.1 Motivation
4.1.1 Our Contribution
4.1.2 Reference works
4.2 Illustrative Example
4.2.1 Problem Statement
4.3 Dynamic Buffering
4.3.1 Schedulability Analysis
4.4 Discussion
5 Buffering with Stochastic Guarantees
5.1 Basic Idea
5.2 Motivation
5.3 Illustrative Example
5.4 Minimizing Buffering
5.4.1 Buffer Underflow
5.5 Numerical Evaluation
5.5.1 Minimum playout delay
5.5.2 Validation
6 Future Work and Conclusions
6.1 Modeling Processor Waiting Time
6.2 General Stochastic Framework
6.2.1 A motivating example
6.3 Final Remarks
List of Figures
1.1 Shaping Techniques for Multimedia players.
2.1 Dimensions of SoC Design.
2.2 System Model
3.1 Our system model and technique. FIFO buffers connect PEs in pipeline. An application is partitioned and mapped onto the different PEs that run tasks concurrently. Buffer size reduces on redistributing playout delay.
3.2 Buffer fill levels with initial playout delay: (a) very small, (b) large, and (c) redistributed.
3.3 System Model
3.4 Initial playout delay values as minimum required processor frequency drops and stabilizes.
3.5 Change in buffer fill levels with redistributing playout delay.
3.6 Playout delay estimation w.r.t. processing requirement of tasks (VLD and IQ) running in PE 1.
4.1 Setup for dynamic workload shaping.
4.2 Dynamically controlling the playout buffer fill level as two applications are being scheduled.
4.3 Buffering time versus workload for a low bit rate and low resolution video stream.
4.4 A schedulable system.
4.5 Schedulable regions for different flow.
4.6 A non-schedulable system.
4.7 Schedulable regions of a periodic task (p = 600 ms, e = 80 × 10^6 cycles).
4.8 Schedulable region for a setup consisting of a periodic task along with an MPEG-2 decoder decoding a low bit rate and low resolution video stream.
5.1 Processing requirement reduces with large initial delay. The production rate is high when playout starts after small delay.
5.2 Delay value reduces on relaxing buffer constraints. The output stream at times cannot catch up with consumption and playout buffer underflows.
5.3 Correlation among playout delay, buffer size, and buffer underflow. Increase in playout delay (and buffer size) decreases buffer underflow.
5.4 Playout buffer underflow over time. The variability in underflow substantially reduces with large increase in playout delay.
5.5 Meeting desired stochastic constraints. The probability that the playout buffer underflows is no more than the stochastic bounding function.
5.6 The cumulative distribution of processor frequency. Processor cycles/second allocated to the video decoding task and therefore the playout buffer underflow are probabilistic.
5.7 Accuracy of analytical model. Minimum playout delay estimated using mathematical model is close to the delay values obtained from simulation.
6.1 Multimedia SoC model.
6.2 Case a: Buffer underflow due to processor latency, Case b: Play-out constraint met with increase in processor share for decoding.
6.3 Model of communication
6.4 System architectures and models used for analysis in previous works. Memory latency modeled for architectures with off-chip memory, shared memory, and FIFO (right to left).
Abstract
Much research in system-level design for multimedia devices is based on anal-
ysis with system models, but how insightful are they? System simulation is
the prime technique used in computer architecture and embedded system
design to explore potential design solutions and validate design choices. Unfortunately,
simulation seldom gives real insight or strong guarantees on
the dynamic behavior of a system. On the other hand, existing analytical
models cannot capture some important attributes of multimedia systems.
Consequently, the analysis with such mathematical models is not beneficial
for efficient system design. A useful analysis with either simulation or an-
alytical models should provide resource saving techniques. These methods
can exploit the key characteristic features of the multimedia streams. The
fluid nature in arrivals and inconstant processing requirements of data items
are multimedia’s inherent characteristic features. But, these characteristic
features are predictable. So, the foreseeable properties could be studied to
yield techniques that can significantly save on-chip resources.
This thesis proposes techniques to shape multimedia workload so as to
effectively utilize on-chip resources such as processor and memory. These
shaping techniques attempt to solve the problem of providing guarantees
for high-quality media output with minimal on-chip resources. The research
approach is to use analytical models that accurately capture the variable
characteristics in arrival and execution of items in multimedia streams. Such
mathematical models, after analysis, yield deep insights for tuning certain application
parameters. Using this parameter tuning, it is possible to reshape
variable media workloads to reduce processing and storage requirements. The
central tenet of this parametric tuning is to adapt the workload such that
only average or minimum processor cycle time required for every multimedia
data item is provided, and not the maximum.
Our results show that choosing the appropriate initial playout delay (af-
ter which the video starts) can lead to effective processor utilization. This
delay parameter is typically arbitrarily chosen. Instead, we propose to esti-
mate the value of the parameter such that it is sufficient to provide average
cycle time required for every data item. This delay, however, could be large
and can lead to huge buffer sizes. Hence we propose two ways to reduce
the buffer sizes: (1) in a multi-processor setup, this delay parameter can
be redistributed to different processors, i.e., apart from the output device,
the processors also start after some delay; and (2) allowing tolerable loss in
quality. Both these methods show substantial reduction in buffer size. Our
model estimates the delay parameter in all of the above-mentioned
techniques.
Our mathematical framework is well suited to media streams in that
it can express variability effortlessly and quickly explore cost-quality trade-offs.
These essential attributes of our model brought out the
benefits of workload shaping. An important advantage of the workload fitting
techniques comes from the stochastic models: relaxing the constraints that guarantee
full output quality yielded significant reductions in processing and memory
requirements.
List Of Publications and Talks
Published
• Balaji Raman and Samarjit Chakraborty. Application-specific workload
shaping in multimedia-enabled personal mobile devices. ACM
Transactions on Embedded Computing Systems, 7(2):10, February
2008.
• Balaji Raman, Samarjit Chakraborty, Wei Tsang Ooi, and Santanu
Dutta. Reducing data-memory footprint of multimedia applications by
delay redistribution. In Proceedings of the ACM/IEEE annual conference
on Design automation (DAC), pages 738-743, June 2007.
• Balaji Raman and Samarjit Chakraborty. Application-specific workload
shaping in multimedia-enabled personal mobile devices. In Proceedings
of the international conference on Hardware/software codesign
and system synthesis (CODES+ISSS), pages 4-9, New York, October
2006 (nominated for best-paper award, among top-2 papers).
• Balaji Raman, Samarjit Chakraborty, and Wei Tsang Ooi. Meeting
CPU constraints by delaying playout of multimedia tasks. In Proceedings
of the international workshop on Network and operating systems
support for digital audio and video (NOSSDAV), pages 165-170,
New York, June 2005.
Workshop Talks
• Analytical Models of Communications of MPSoCs, International Fo-
rum on Application-Specific Multi-Processor SoC (MPSOC), Aachen,
Germany, June 2008. (an overview of my research was presented by
Dr. Paolo Ienne, among top-5, most-relevant talks.)
• Analytical Models of Communications for SoC Multimedia Design,




Mobile devices are pervasive, and listening to music and watching
videos on these media players has become commonplace. Although VLSI
technology is advancing at an incredible rate, the processing and storage
requirements of multimedia applications are still a dominant factor in the
cost of a portable media device.
Naturally, system designers want to reduce processor capacity and mem-
ory size, and this is achieved, typically, with slight degradation in output
quality. On the other hand, researchers try to improve processor utilization
(with scheduling) and buffer management, while providing guarantees on the
desired output. This research often involves using analytical or simulation
models, and the accuracy of these models determines the benefits of pro-
posed ideas; indeed, not capturing the inherent characteristics of multimedia
applications can lead to losing valuable insights.
This thesis demonstrates that there is much room for improvement
in portable device design. Our results show significant reduction in
processing and memory requirements, with no loss in output quality. The
insights that led to these resource savings came primarily from modeling
the sequence of data items in multimedia streams, before and after processing. In
addition, this thesis proposes a model in which the constraints on quality
can be relaxed. This analytical framework enables an informed trade-off
between tolerable loss in output quality and device cost. Together, as explained
shortly, we term our techniques ‘workload shaping’.
The following sections of this chapter define workload shaping and
introduce three shaping techniques. Then follows a brief discussion on the
novelty of the proposed research. The secondary objective of this first chapter
is to establish the thesis goal and the research approach for the problem
stated. Finally, the contributions and the organization of the document are
presented.
1.1 What is Workload Shaping?
To define shaping, we must first understand the System-on-Chip (SoC) in a
media player. Thus, we begin with an overview of the components of an SoC
in portable players, and their main functions.
An SoC contains one or more processing elements, some buffer memory,
and interfaces between memories and processors. Figure 1.1 shows this: the
input and playout buffers are memories, and the processing element is linked
to these buffers. Below, we look at how these elements function while processing
a multimedia stream. The advantages of capturing the characteristics of a
multimedia stream will then become clear.



















































Figure 1.1: Shaping Techniques for Multimedia players.
The input buffer temporarily stores the data items from a multimedia
stream. The multimedia application executing on the processor fetches
data from the input buffer and stores its output in the playout buffer. The output
device displays items from the playout buffer at a constant rate. For example,
a video decoding application decompresses the input stream and the decoded
items are displayed at the required rate (say, 30 fps). The workload can then
be described as follows.
The load on the processor is to complete processing a certain number of
data items per unit time, and the work the processor does is to provide the
multimedia task with a sufficient number of processor cycles per unit time so
that the given load can be handled. It has been shown that different data items
take a varying number of processor cycles to execute. Therefore,
the load, and consequently the work, varies over time. Note that the requirement
that a certain number of data items be processed per unit time
is constant; what varies is the number of processor cycles required to execute
that pre-specified number of items. To provide guarantees on
the output requirement, there is one naive, albeit inefficient, method of
handling the workload variability.
If the processor allocates a constant number of cycles per unit time to the
multimedia task, then the processor capacity required to provide a guarantee
is higher than the average cycle requirement over all data items: to always satisfy the
requirement on output, every data item has to be allocated the worst-case
number of processor cycles required for processing an item. This ensures that,
irrespective of which data item is being processed, it is always completed
within the desired time, thus guaranteeing display. Clearly, the processor is
ineffectively utilized: the variability in execution requirement is very high for
multimedia items, so only a few data items require the maximum number of
processor cycles, while the others require close to the average. If the variability,
however, is known a priori, then the processor can work for just the necessary
and sufficient time on the given load.
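The gap between the two allocation policies can be illustrated with a small calculation. The following is a minimal sketch, assuming a hypothetical trace of per-item cycle demands (the numbers are invented for illustration and are not taken from the thesis) and the 30 fps output rate used as an example above.

```python
# Hypothetical per-item cycle demands, in millions of cycles (illustrative only).
cycles_per_item = [2, 3, 2, 9, 2, 3, 2, 1]
items_per_second = 30  # required output rate, e.g. 30 fps

worst_case = max(cycles_per_item)
average = sum(cycles_per_item) / len(cycles_per_item)

# Processor capacity (millions of cycles per second) needed under each policy.
worst_case_capacity = worst_case * items_per_second
average_capacity = average * items_per_second

# Fraction of a worst-case-provisioned processor that is actually used on average.
utilization = average_capacity / worst_case_capacity

print(worst_case_capacity, average_capacity, round(utilization, 3))
```

With this trace, provisioning for the worst case reserves three times the capacity that the stream needs on average, which is the inefficiency the shaping techniques aim to remove.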
This thesis proposes techniques that exploit the variability to shape the
workload such that it is sufficient to allocate the average cycle time required
for every multimedia data item, and not the maximum. Workload variability,
besides causing ineffective processor utilization, can also lead to huge memory
requirements. The reason given above for ineffective utilization of the
processor does not carry over directly to memory: allocating the worst-case
processor cycle requirement to every item in the stream does not by itself
imply that a large space is needed to store those items. How variability in the
workload leads to a large data-memory footprint is discussed in the following
section. The techniques proposed, the reader should note, target both processor
utilization and buffer memory requirements. Next, we briefly discuss these
workload shaping techniques.
1.2 Shaping Techniques
Below, we explain the shaping techniques, emphasizing the benefits that
shaping provides in terms of resource utilization. It will become clear that
the advantages of the proposed techniques stem primarily from the model’s
accuracy, that is, from capturing the sequence of multimedia items in the stream.
The mathematical framework used to represent input and output multimedia
streams is intrinsically well suited to modeling the inherent variability of
multimedia workloads. (The calculus used to construct these models is
described in detail in the subsequent chapter.)
The three shaping techniques are: (1) smoothing, (2) squeezing, and (3)
slashing. The first of these techniques, smoothing, shows the importance of
tuning a key application parameter, namely the playout delay, that is, the
initial delay after which the video is displayed. We now describe why the
playout delay parameter has to be tuned and how it is done.
Smoothing: Our results show that with an appropriate playout delay for a
stream, it is sufficient to provide the multimedia task with the average cycles
required per unit time, rather than the maximum (Raman et al., 2005). The
number of data items to be processed per unit time is given, and the average
number of cycles required per data item is known. Thus we can compute the
average processor cycles required per unit time.
Clearly, delaying playout saves processor resources, and the gains are
found to be significant: there is a large difference between the maximum
number of cycles required for a data item and the average, and the number of data
items requiring worst-case processor cycles in a stream is much lower
than the number of items that require close to the average. In other words, there
is high variability in the processor cycle requirement among data
items in the multimedia stream. This initial buffering of processed items
before playout has, in effect, smoothed the work that the processor does
on the given load; the processor cycles reserved for the multimedia task do
not vary over time.
Typically, the playout delay is arbitrarily chosen. Instead, this thesis
proposes an analytical framework with which the delay can be precisely estimated.
The computed delay corresponds to the scenario in which the
saving in processing requirements is maximal, that is, it is sufficient
to provide the media task with just the average cycles that it requires per
unit time. The inherent variability in the multimedia workload is captured
by the analytical model, and this leads to a precise computation of the
playout delay.
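To make the role of the playout delay concrete, here is a minimal trace-based sketch. It is not the analytical framework of the thesis (which is described in a later chapter); it assumes a hypothetical trace of per-item cycle demands, a processor running at a constant frequency, and input data that is always available. The function name `min_playout_delay` and all numbers are illustrative assumptions.

```python
from fractions import Fraction
from itertools import accumulate


def min_playout_delay(cycles_per_item, frequency, playout_rate):
    """Smallest delay d such that item i is decoded by time d + i / playout_rate.

    Assumes items are decoded back to back at a constant frequency
    (cycles per second) and that input is always available.
    """
    finish_times = [Fraction(total, frequency) for total in accumulate(cycles_per_item)]
    # How late each item would be if playout started at time zero.
    lateness = [finish - Fraction(i, playout_rate) for i, finish in enumerate(finish_times)]
    return max(Fraction(0), max(lateness))


# Hypothetical trace: cycle demands in millions, frequencies in millions of cycles/s.
cycles = [2, 3, 2, 9, 2, 3, 2, 1]
print(min_playout_delay(cycles, frequency=90, playout_rate=30))   # average provisioning
print(min_playout_delay(cycles, frequency=270, playout_rate=30))  # worst-case provisioning
```

In this toy trace, running at the average rate (90) needs a delay of 7/90 s, while running at the worst-case rate (270) needs only 1/135 s; the larger delay is the price paid for the threefold reduction in reserved processor capacity.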
Buffering is a powerful technique for reducing processing requirements
for multimedia, but it is hampered by the need for large on-chip memory. Interestingly,
the reason a large buffer is required is again workload
variability, together with the variability in the arrival of data items. The required
buffer size is usually calculated as follows: the maximum fill level of the buffer
over time is noted, and that is the buffer size. The arrival of data items at the
input buffer and the writing of data items to the output buffer vary over
time, varying the fill levels of the input and playout buffers. Hence, due to the
high variability in the multimedia workload and in the arrival of input data
items, we require a large buffer. In addition, if there is a significant initial
playout delay, during which items are stored and the buffer is not emptied,
an even larger buffer is needed; the fill level of the buffer during initial
buffering may be the maximum.
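The "maximum fill level" rule can be sketched for the playout buffer alone. The example below is an illustrative assumption, not the estimation procedure of the thesis: it uses a hypothetical cycle trace, a constant processor frequency, always-available input, and counts an item as occupying a slot until strictly after its playout instant. The name `max_fill_level` is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate


def max_fill_level(cycles_per_item, frequency, playout_rate, delay):
    """Peak number of decoded items held in the playout buffer.

    Item i is written when its decoding finishes and is removed at
    time delay + i / playout_rate.
    """
    finish_times = [Fraction(total, frequency) for total in accumulate(cycles_per_item)]
    playout_times = [delay + Fraction(k, playout_rate) for k in range(len(finish_times))]
    peak = 0
    for i, now in enumerate(finish_times):
        consumed = sum(1 for p in playout_times if p < now)
        peak = max(peak, (i + 1) - consumed)  # fill level right after item i is written
    return peak


cycles = [2, 3, 2, 9, 2, 3, 2, 1]  # hypothetical, millions of cycles
print(max_fill_level(cycles, 90, 30, Fraction(7, 90)))   # minimum lossless delay
print(max_fill_level(cycles, 90, 30, Fraction(20, 90)))  # a larger, arbitrary delay
```

For this trace the buffer needs 3 slots at the minimum delay and 6 slots at the larger delay, which illustrates why an arbitrarily chosen, overly large playout delay inflates the memory requirement.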
To reduce the storage requirements, this thesis proposes another technique
that uses the playout delay. This, too, is a smoothing technique in that all processing
elements, including the one nearest to the playout buffer, are considered;
the processor work for the given variable load is smoothed irrespective
of its position in the pipeline of processing elements and memories (for example,
in a multi-processor SoC). In the smoothing technique discussed for a
single-processor SoC, the output device starts after a certain delay, and the
processor starts without any delay.
Instead, if the processor itself starts after a certain delay, which is a small
fraction of the actual playout delay, then our results show that the total buffer
size required is reduced. This is explained as follows. The variability in the
buffer fill level, as described earlier, is the reason for large buffer requirements. In
a pipeline of processing elements and memories, if the buffer fill variability
propagates from one buffer to the next, the size of each buffer increases, and
hence the total buffer requirement (the sum of all buffer sizes) is large.
But if the processor starts after a certain delay, this variability in buffer fill
stops propagating, reducing the memory requirements (Raman et al., 2007).
The delay after which the processor should start can be computed exactly
using the mathematical framework. In the case where there are multiple
processors, each processing element runs a part of the multimedia
application. The delay associated with each processor then corresponds to
the variability of the task that the processor runs.
Squeezing: The squeezing technique proposes scheduling mechanisms to
effectively utilize processor bandwidth for multimedia tasks and other periodic
tasks executed concurrently on a processor. Since the processor cycles
allocated to the multimedia task over a time interval are adjusted such that
other tasks can fit in, we term this technique squeezing. Consider a
situation in which the multimedia task running on the processor consumes
most of the processor bandwidth, so that no other task can run. Thus an
incoming periodic task has to be shed because meeting the deadlines of both
the periodic and the multimedia task is infeasible. We now explain how, with
a slight increase in buffer space, the multimedia task and the periodic task
can run concurrently and still meet their deadlines.
With a slight increase in buffer space, the multimedia task can pre-decode
some data items before the periodic task starts executing. Note that this
requires a slightly higher processing capacity than the processing resources
normally allocated (which correspond to the average processor cycle
requirement per unit time). Once some extra data items are decoded,
the execution of the periodic task is started. This is made possible by reducing
the processor share of the multimedia task to less than normal. During
this period, when the multimedia task is running at a lower speed
than normal, the extra items that were previously produced are
consumed. After some pre-specified time, the periodic task is suspended and
the multimedia task is given a higher processor share. This cycle of
lowering and raising the processor share of the multimedia task is repeated until
the execution of the periodic task is complete. Using our model in
this setup enables the designer to decide all scheduling parameters a priori.
Clearly, modeling the variability in the workload helps to estimate
the processing requirements for decoding the extra stream objects. In
addition, the time required to fill the playout buffer in excess and the time
required to drain the buffer can also be estimated. During the buffer fill,
the periodic task has not started or is suspended, and during the
buffer drain the periodic task is executing. Hence the period and the deadline
of the periodic task lie within one buffer fill-and-drain cycle. The deadline of
the multimedia task, which is to display the multimedia stream at the required
rate, is met, and the deadline of the periodic task is also met. The analysis
using the mathematical framework thus enhances the schedulability of these
concurrent tasks.
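The fill-and-drain cycle described above can be approximated with simple rate arithmetic. The sketch below is a deliberately simplified fluid approximation that uses only the average cycle demand per item and ignores the variability that the thesis models; the function name `fill_drain_cycle`, its parameters, and the numbers are all hypothetical.

```python
from fractions import Fraction


def fill_drain_cycle(capacity, low_share, avg_cycles_per_item, playout_rate, extra_items):
    """Return (fill_time, drain_time, cycles left for the periodic task per cycle).

    Fill phase: the multimedia task gets the whole processor (capacity).
    Drain phase: it gets low_share, and the periodic task gets the remainder.
    """
    fill_surplus = Fraction(capacity, avg_cycles_per_item) - playout_rate    # items/s gained
    drain_deficit = playout_rate - Fraction(low_share, avg_cycles_per_item)  # items/s lost
    if fill_surplus <= 0 or drain_deficit <= 0:
        raise ValueError("shares must bracket the average requirement")
    fill_time = extra_items / fill_surplus
    drain_time = extra_items / drain_deficit
    periodic_cycles = (capacity - low_share) * drain_time
    return fill_time, drain_time, periodic_cycles


# Hypothetical numbers: cycles in millions, rates per second, two extra buffered items.
print(fill_drain_cycle(capacity=120, low_share=60, avg_cycles_per_item=3,
                       playout_rate=30, extra_items=2))
```

Under these assumptions each cycle consists of 1/5 s of filling and 1/5 s of draining, during which 12 million cycles are available to the periodic task; a designer would compare these figures against the task's period and execution requirement.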
Slashing: Towards maximizing resource utilization, the slashing technique
takes a different approach altogether. The workload is reduced, or cut, in
this technique, and hence we term it slashing. While smoothing and
squeezing provide guarantees on display quality, they
always require that the full output quality be met. If the constraints
on the output are relaxed, however, there can be significant resource savings.
Studies have also shown that multimedia applications can tolerate a certain
loss in quality, and that this deterioration is not perceivable up to some
extent. Such quality degradations have previously been exploited to save
on-chip resources, although without guarantees on the design, and SoCs
were built to handle only average-case scenarios. Instead, our technique proposes
a framework in which loss in quality can be represented and guarantees
on throughput can be obtained. To illustrate this, along similar lines to the
previous two techniques, we tuned the playout delay parameter with relaxed
constraints. We now explain in detail the benefit of accepting some loss in
quality in exchange for smaller delays.
Consider the playout delay estimated using our mathematical framework.
This is the minimum delay required such that there is no loss in quality
and the processing requirement is minimal. No loss in quality
corresponds to the case where the buffer never underflows. Now if we relax
this buffer underflow constraint, that is, the buffer can underflow at times, it
corresponds to choosing a smaller delay than actually required. With smaller
delay, the output requirement is not met; the buffer underflows, meaning that
the consumption of items is at a faster rate than the production. The playout
delay, in slashing, however, could be smaller than required. This is because
the buffer can underflow to some extent and the loss in quality due to this is
acceptable. But then what is the benefit in lowering the delay? A legitimate
question. Recap that the initial playout delay is in fact the one that deter-
mines the buffer size, and hence any reduction in delay consequently leads
to smaller buffer. The amount of reduction in the playout delay from the
value required for no loss in quality can be precisely computed, again using
the models.
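The trade-off between playout delay, buffer size and loss can be illustrated
with a small simulation. The sketch below is not the analytical model of this
thesis; it is a hypothetical discrete-time playout buffer in which the function
name `simulate_playout`, the trace, and the rule that late items are discarded
are all assumptions made for illustration. It reports the peak backlog, which
stands in for the buffer size, and the number of items that miss their display
slot.

```python
def simulate_playout(produced, rate, delay):
    """Simulate a playout buffer in unit time steps (illustrative only).

    produced[t] : items written into the buffer during step t (made-up trace)
    rate        : items the display reads per step once playout has started
    delay       : number of steps the display waits before its first read
    Items that arrive after their display slot are assumed to be discarded.
    Returns (peak_backlog, missed_items).
    """
    written = 0   # cumulative items written into the playout buffer
    due = 0       # cumulative items the display has asked for so far
    peak = 0      # largest backlog seen just before a read
    missed = 0    # items that were not in the buffer when they were due
    for step, count in enumerate(produced):
        written += count
        peak = max(peak, written - due)
        if step >= delay:
            first_due = due
            due += rate
            # Of the items due in this step, those not yet written are missed.
            missed += max(0, due - max(written, first_due))
    return peak, missed


# Hypothetical production trace: 16 items over 8 steps, display rate 2 per step.
TRACE = [2, 2, 2, 1, 3, 2, 2, 2]
no_loss = simulate_playout(TRACE, rate=2, delay=1)   # (4, 0)
slashed = simulate_playout(TRACE, rate=2, delay=0)   # (2, 1)
```

For this made-up trace, starting playout one step earlier halves the peak
backlog at the cost of one missed item out of sixteen.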
Several efforts have used the same framework that we use to estimate buffer
size and processor requirements (Liu et al., 2004; Wandeler et al., 2005), but
there has been little effort to use the models to utilize resources such as
the processor and memory effectively. Techniques based on other models also
exist, but either they do not provide any guarantees on the output or their
models do not accurately capture the variable characteristics (Nandi and
Marculescu, 2001). In the following section, the goal of this thesis, the
problem tackled, and the research approach used are discussed.
1.3 Thesis
Having introduced the motivation and title terminology in the previous sec-
tions, we are now ready to describe the thesis itself, both its content and
form.
In summary, the motivation of this thesis is as follows. There are several
opportunities for better design of portable media players. While providing the
desired output quality, on-chip resources have to be minimal and therefore
effectively utilized. The design of media players, given their multitude of
constraints and unique needs, is better handled using analytical frameworks.
If the mathematical framework captures the inherent characteristics of the
multimedia application, it can lead to efficient designs. Finally, flexibility
of the analytical model is desirable, in particular in accounting for the soft
real-time nature of the application. So, what then is the goal of this thesis?
Goal and Problem: The primary goal of this thesis is to tune certain
application parameters, which can act as resource managers, so as to
effectively utilize the on-chip resources. In this thesis, we offer insights
into shaping media application workloads using such design parameters (e.g.,
playout delay) so as to significantly reduce the on-chip resource
requirements. The shaping techniques exploit an inherent characteristic of
multimedia streams, namely their variability.
The problem, though, lies in predicting the resource requirements accurately:
if the input data items arrive in variable sizes and executing them takes
varying time, then the required storage and processing capacity is variable,
too. Fortunately, these characteristics are predictably variable. Hence we
model this variability.
Research Approach: Our methodology is a combination of system simulation and
analysis of mathematical models. The inputs to the models are traces obtained
from simulation: not a complete simulation of the entire system, but a
functional simulation of the individual components in a SoC (such as the
processor). For example, an instruction-set processor simulator is used to
provide the processor cycles required for executing data items in a
multimedia stream.
The models, constructed after a one-time simulation, provide bounds on the
arrived and processed data items over any time interval. These bounds are, for
example, the maximum and minimum number of data items that arrive over any
time interval of 1 second. The mathematical framework is a calculus (based on
an algebra with min and max operators) that, given bounds on arrival and
processor capacities as inputs, provides bounds on the output. Thus output
constraints such as display rates can be formulated in terms of the input and
the service provided in terms of processor capacities. This way of modeling
then enables calculation of the minimum service and the maximum storage
required.
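As a rough illustration of how such bounds can be read off a trace, the sketch
below slides a window over per-slot arrival counts and records the smallest and
largest number of items seen in any window of a given length. The function
name `arrival_bounds` and the trace values are hypothetical, and the resulting
bounds are only valid for the trace that was analyzed.

```python
from itertools import accumulate


def arrival_bounds(arrivals, window):
    """Return (min, max) items arriving in any `window` consecutive slots.

    arrivals[i] is the number of items that arrived in time slot i
    (a made-up trace here; in practice it would come from simulation).
    """
    if not 0 <= window <= len(arrivals):
        raise ValueError("window must lie between 0 and the trace length")
    prefix = [0] + list(accumulate(arrivals))   # prefix[i] = items in slots < i
    sums = [prefix[i + window] - prefix[i]
            for i in range(len(arrivals) - window + 1)]
    return min(sums), max(sums)


# Hypothetical trace of items per slot.
ARRIVALS = [3, 0, 2, 5, 1, 0, 4, 2]
lower_2, upper_2 = arrival_bounds(ARRIVALS, 2)   # bounds for any 2-slot window
```

Evaluating the function for every window length gives a lower and an upper
staircase curve, which play the role of the minimum and maximum arrival
bounds described above.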
Contributions: Listed below are the key contributions of this thesis; they
are primarily the insights obtained towards saving on-chip resources.
• the observation that an increase in playout delay decreases the processor
cycles required to meet a certain output rate has given opportunities
to precisely estimate this delay and save processing resources;
• to reduce the memory requirements incurred in delaying playout, the
redistribution has provided significant gains, especially in
multi-processor setups;
• with a slight increase in buffer space, the schedulability of the
multimedia and the periodic task is enhanced;
• a new stochastic framework to model tolerable loss in quality is a
significant advancement in terms of reducing on-chip resources;
• finally, a major limitation in the existing model has been removed,
namely, in modeling the processor-to-memory latency.
Organization: Following this chapter, the thesis consists of three main
parts. Chapter 3, on buffering for smoothing, details the technique in which
the playout delay parameter is tuned to reduce the processing requirement
of multimedia applications. The scheduling part in Chapter 4 then explains
how different tasks can run concurrently with a multimedia application when
the scheduling of these tasks is otherwise not feasible. Finally, the
methodology to provide stochastic guarantees that exploit the soft real-time
nature of multimedia applications is introduced in Chapter 5.
In this chapter, we established the aim of the thesis, which is the effective
utilization of on-chip resources in devices running multimedia applications.
Towards achieving this goal, three workload shaping techniques are proposed
in this thesis: smoothing, squeezing, and slashing. The insights from each of
these techniques, if applied to system design, could save significant
resources and provide guarantees on the output quality of the multimedia
applications. The mathematical theory behind these shaping techniques
accurately captures the stream characteristics, in particular the arrival and
execution variability. The resource-saving insights that we observed exploit
this variable nature of multimedia streams, and the analytical model that
captures this variability tunes appropriate design parameters to implement
those insights.
Chapter 2
Background
The previous chapter introduced the main goal of the thesis: to develop
workload shaping techniques so as to effectively utilize processing and
memory resources on-chip. In pursuing this goal, the preceding chapter also
stated the research approach, that is, to predict variability in the
multimedia workload. It was argued that the variation in the number of data
items arriving and the variation in the execution requirement of the data
items are the major factors that lead to huge requirements in processing and
memory resources. This chapter provides the reader with an understanding of
other existing research approaches, which mainly fall into two topics:
estimating the on-chip resource requirements for multimedia processing, and
proposing ideas to reduce the on-chip resource needs.
The primary aim of this chapter is to understand the existing literature
at two levels: (1) a broad perspective on the existing performance modeling
techniques for SoC design, and (2) a close look at the methods proposed for
scheduling and buffer management for multimedia devices. The former, the
broader view, highlights the contribution of this thesis in that the
mathematical framework proposed is appropriate for multimedia applications.
The latter, a close-up study, shows the effectiveness of deriving
resource-saving insights from mathematical models rather than from other
techniques such as simulation.
In what follows, we first present the broader view on existing mathematical
frameworks for SoC design. Secondly, we zoom in on the techniques for tuning
scheduler parameters for effective processor utilization and memory
management. Thirdly, we present the fundamentals of the mathematical model
proposed in this thesis. In discussing the basics of the model, the MPEG2
application is described. The application details provide information on how
multimedia streams are modeled.
2.1 Analytical Model: A Bird’s Eye Review
First, we look at the initial classification of methods for SoC design and
discuss the pros and cons of the approaches. Existing approaches for SoC
design can be broadly classified as follows: (1) analytical models, and (2)
simulation. The main disadvantage of simulation-based techniques is that
they are slow for any design that involves a large number of iterations (for
example, designs that involve identifying several design parameters).
Moreover, simulation techniques do not provide any special insights that can
lead to resource savings. The advantages and disadvantages of simulation and
mathematical modeling are summarized in Table 2.1. More importantly, for
designs with throughput requirements, cycle-accurate simulation techniques
only provide guarantees on the correctness of the results, but do not provide
guarantees on throughput.
Figure 2.1: Dimensions of SoC Design.
Now we detail other dimensions of SoC design. Figure 2.1 sketches various
existing design methodologies. Having discussed the pitfalls of
simulation-based techniques, we now note the advantages of the alternative
methodology, namely performance modeling techniques. System-level design
using a mathematical framework allows fast exploration of design parameters.
This thesis shows that valuable insights can be obtained from such analysis,
and that these insights provide significant resource savings. Moreover,
guarantees on throughput can be formulated in mathematical frameworks. With
this understanding of the benefits of mathematical models, we now look at
the second level of classification in performance modeling.
In general, analytical models can be divided into deterministic and
probabilistic frameworks. Mostly, deterministic models are for worst-case
analysis.
Table 2.1: Architecture Simulation vs. Mathematical Modeling
Characteristic: Speed
  Simulation: Slow: for iterating over several design parameters (e.g., buffer size).
  Mathematical Modeling: Fast: for estimating parameters with relative accuracy to simulation (e.g., playout delay).
Characteristic: Insights
  Simulation: Trends: relationship between two design parameters could be observed, e.g., buffer size vs. processor frequency.
  Mathematical Modeling: Efficient Design: rigorous analysis could yield ideas towards saving significant resources, e.g., reducing buffer size with playout delay.
Characteristic: Accuracy
  Simulation: Cycle-accurate: an instruction set simulator could model functionality of the processor and memory.
  Mathematical Modeling: Time-driven: a high-level model could capture the data flow on a given architecture at a desired time granularity.
Characteristic: Re-usability
  Simulation: Specific: system-level simulators could be re-used until the chip design of the architecture.
  Mathematical Modeling: General: mathematical model can be reused for various designs whose architectures are similar.
Characteristic: Limitations (design stage used)
  Simulation: Middle: use of simulator tools early in the design stage often requires designers to use the same design without room for big changes.
  Mathematical Modeling: Early: the mathematical models cannot replace detailed simulation, so the use of analytical modeling and system-level design is limited to the early design stage.
For soft real-time systems, however, a probabilistic framework is appropriate.
The existing probabilistic mathematical frameworks, though, only analyze
average-case or most-probable scenarios. We have so far seen two levels of
classification in methods for SoC design: (1) whether they are
simulation-based or analytical, and (2) within mathematical modeling, whether
they are deterministic or probabilistic. Now we look at the third level.
The basis of this classification is the granularity of the application, more
precisely, whether it is modeled as tasks or as events. Further, event-based
techniques can again be divided into those that model standardized events and
those that model general events. Here we discuss in detail some of the
popular analytical models. The reader will be able to map these models to the
classification discussed. Analytical methods that have gained attention are:
(1) synchronous data flow graphs, (2) stochastic automata networks, and (3)
event adaptation functions.
Synchronous Data Flow graphs (SDF): These models are currently used in
industry to analyze multimedia systems, in particular to represent data flows
in DSP kernels (Stuijk et al., 2006). The primary benefit of this model is
that it naturally captures the concurrent behavior of the application in a
multi-processor architecture. SDFs are usually analyzed to derive static
schedules (i.e., during application compile time). The timing behavior of the
application is thus predicted, so that throughput constraints and storage
space requirements of a multimedia system can be studied.
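A first step in deriving a static SDF schedule is commonly to solve the
balance equations, which state that on every edge the tokens produced per
schedule iteration equal the tokens consumed. The sketch below computes the
smallest positive integer firing counts for a small, connected graph. The
function name `repetition_vector`, the actor names and the token rates are
hypothetical and are not taken from the cited work.

```python
from fractions import Fraction
from math import gcd


def repetition_vector(actors, edges):
    """Solve q[src] * produced == q[dst] * consumed for every edge.

    edges: list of (src, dst, tokens_produced, tokens_consumed).
    Returns the smallest positive integer firing count per actor.
    """
    rate = {actors[0]: Fraction(1)}      # firing rates relative to actor 0
    pending = [actors[0]]
    while pending:
        current = pending.pop()
        for src, dst, produced, consumed in edges:
            if src == current:
                other, value = dst, rate[current] * produced / consumed
            elif dst == current:
                other, value = src, rate[current] * consumed / produced
            else:
                continue
            if other not in rate:
                rate[other] = value
                pending.append(other)
            elif rate[other] != value:
                raise ValueError("inconsistent rates: no periodic schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    scale = 1                            # least common multiple of denominators
    for value in rate.values():
        scale = scale * value.denominator // gcd(scale, value.denominator)
    return {actor: int(rate[actor] * scale) for actor in actors}


# Hypothetical chain: A produces 2 tokens, B consumes 3; B produces 1, C consumes 2.
EDGES = [("A", "B", 2, 3), ("B", "C", 1, 2)]
firings = repetition_vector(["A", "B", "C"], EDGES)   # {"A": 3, "B": 2, "C": 1}
```

With these firing counts every edge returns to its initial token count after
one iteration, which is what allows a finite schedule to be repeated and the
buffer space on each edge to be bounded at compile time.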
Stochastic Automata Networks (SAN): A SAN is a network of automata
(Zamora et al., 2007). For example, an MPSoC could be modeled as a SAN where
each automaton describes the state of an application running on a single
processor. The edges in the automaton represent the communication. The edges
between the nodes in the SAN represent the transition rates. The analysis of
SANs rests on Markovian models. The problem of state-space explosion is
avoided since the transition matrix is neither stored nor generated.
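To give a feel for how a global transition matrix can be avoided, the sketch
below considers the simplest case, assumed here for illustration: two
automata that do not interact, so that the global generator is the Kronecker
sum of the two local generators. The product of a vector with the global
generator is then computed from the small matrices alone and checked against
the explicitly built matrix. All names and rate values are hypothetical; this
is not the algorithm of the cited work, and interacting automata need
additional terms.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a[k][i] * b[l][j]
             for i in range(len(a[0])) for j in range(len(b[0]))]
            for k in range(len(a)) for l in range(len(b))]


def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def explicit_product(x, q1, q2):
    """x * (Q1 kron I + I kron Q2), building the global matrix (for checking)."""
    n1, n2 = len(q1), len(q2)
    left, right = kron(q1, identity(n2)), kron(identity(n1), q2)
    flat = [value for row in x for value in row]
    size = n1 * n2
    out = [sum(flat[r] * (left[r][c] + right[r][c]) for r in range(size))
           for c in range(size)]
    return [out[i * n2:(i + 1) * n2] for i in range(n1)]


def structured_product(x, q1, q2):
    """The same product, using only the two local matrices."""
    n1, n2 = len(q1), len(q2)
    return [[sum(x[k][j] * q1[k][i] for k in range(n1)) +
             sum(x[i][l] * q2[l][j] for l in range(n2))
             for j in range(n2)]
            for i in range(n1)]


# Hypothetical local generators (rows sum to zero) and an arbitrary test vector,
# stored as X[state of automaton 1][state of automaton 2].
Q1 = [[-1, 1], [2, -2]]
Q2 = [[-3, 3], [4, -4]]
X = [[1, 2], [3, 4]]
```

The structured version touches two 2x2 matrices instead of one 4x4 matrix; the
gap widens quickly as more automata are added, since the global matrix grows
with the product of the local state counts.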
Event Adaptation Functions (EAF): EAFs are used to represent a
heterogeneous MPSoC whose processing elements and components have different
behavior (Richter and Ernst, 2002a). For example, one component (say, an
ASIC) may generate periodic output, while another component (say, a
processor) may generate output with a period and jitter. The outputs of these
two components are inputs to another component. To analyze such systems, EAFs
provide functions that couple local event models using buffering and
time-triggering, which is how systems with heterogeneous components are
designed in practice. Schedulability, timing, and buffer memory requirements
can be studied with EAFs.
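The difference between a purely periodic stream and a periodic stream with
jitter can be made concrete by counting how many events may fall into a time
window. The sketch below uses a commonly quoted form of these bounds for the
periodic-with-jitter model; the formulas are stated here as an assumption
rather than taken from this thesis, and the function names `eta_plus` and
`eta_minus` and the parameter values are hypothetical. Integer time units are
used to keep the arithmetic exact.

```python
def eta_plus(delta, period, jitter):
    """Assumed upper bound on events in any window of length delta."""
    if delta <= 0:
        return 0
    # Integer form of ceil((delta + jitter) / period).
    return -(-(delta + jitter) // period)


def eta_minus(delta, period, jitter):
    """Assumed lower bound on events in any window of length delta."""
    # Integer form of max(0, floor((delta - jitter) / period)).
    return max(0, (delta - jitter) // period)


# Hypothetical streams: period 10 without jitter, and period 10 with jitter 4.
periodic = (eta_minus(25, 10, 0), eta_plus(25, 10, 0))
jittered = (eta_minus(10, 10, 4), eta_plus(10, 10, 4))
```

Jitter widens the gap between the two bounds, which is the kind of mismatch
between components that buffering or time-triggering has to absorb.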
The mathematical model we propose differs from other existing models in
the following ways:
• The arrival of items in our model is not limited to standard input
models such as periodic, Poisson, and so on.
• Our model captures the variability in the processing of data items such
that this variable nature of the media stream can itself be exploited
for efficient system design.
• The analytical framework that we present can be applied at any level of
granularity, i.e., each data item in the stream can be a bit, a macroblock,
or a frame.
• In contrast to the average-case analysis of the existing probabilistic
models, we show that guarantees on the output requirement can still be
provided with our stochastic mathematical framework.
• Our framework applies to any kind of multimedia streaming application.
In other words, it does not rely on the specific characteristics of
the application.
2.2 Tuning Scheduler Parameters
This section takes a closer look at the techniques for scheduling and buffer
management in SoC design for multimedia applications. As mentioned before,
multimedia applications constitute sizable workloads in today’s handheld
devices such as PDAs, MP3 players and video players. The characteristic
feature of media workloads is that they show high data-dependent variability
(Hughes et al., 2001; Liu et al., 2004), and this variability makes it
difficult to predict the resource demands of multimedia applications. Hence,
in multimedia devices, it is challenging for the RTOS, which is designed for
general-purpose real-time systems, to allocate resources gracefully and
efficiently to various tasks. The job of the RTOS is complicated primarily
by: (1) resource conflicts between the real-time media and other
non-real-time applications, and (2) unpredictability in the processing
resources required by the media workload (Patil and Audsley, 2005).
There is therefore a continuing interest in the OS research community in
designing application-specific operating systems (Plagemann et al., 2000), so
that in devices with limited processing capacity and buffer space, resources
can be allocated with knowledge of the application requirements. This effort
in building application-specific OSs has opened a new research direction in
designing embedded systems, RTOS co-design, which deals with exploring
methodologies to build a customized RTOS along with the hardware/software
elements. In this thesis, we address building one customizable portion of the
RTOS, namely the scheduler.
In the real-time systems literature, very few techniques directly address the
problem of building a scheduler or choosing scheduler parameters (Maxiaguine
et al., 2004), so analytical models for designing schedulers are yet to be
extensively studied. In designing schedulers for embedded systems,
application and architecture models should assist system design architects in
realizing efficient static/dynamic scheduling techniques. Since designers
currently use off-the-shelf RTOSs, they do not use models for evaluating
their system or for tuning the design-time parameters. Hence, for designing
customized schedulers for RTOSs in multimedia embedded systems, we propose a
mathematical framework. We illustrate the usability of the model in
evaluating scheduling policies, and its efficiency in predetermining
scheduler parameters, such as the size of the playout buffer (read by the
real-time video/audio output devices).
2.2.1 Methodologies
In this section, we study some recent work that proposes techniques to
model and evaluate an RTOS for embedded systems design. We first classify
previous studies based on the level of abstraction used in designing the RTOS
with other hardware/software modules in system design. Only a few techniques
in the past focus on abstract models of the RTOS, and even those techniques
do not fully utilize the potential of formal models in evaluating several
parameters of the components (such as schedulers) in the RTOS.
Co-simulation: Gerstlauer et al. (Andreas Gerstlauer, 2003) built an
RTOS model on top of the existing system-level design language SpecC. The
RTOS model allows the designer to model the dynamic behavior of multitasking
systems at higher abstraction levels. The main tasks of the RTOS, such as
system management, task management, event handling and time modeling, are
modeled using special routines, which extensively use existing SpecC
primitives and libraries. Using the routines defined for the RTOS interface,
the application models (task, synchronization) are appropriately refined. In
order to evaluate the complete system design, a co-simulation is performed
using refined system-level design models integrated with the RTOS models.
Moigne et al. (Moigne et al., 2004) model the behavior of the RTOS and its
timing properties using SystemC primitives. The model reflects the scheduling
policy and the preemptive/non-preemptive modes of the RTOS behavior. After
simulation, the models report the overhead incurred in making a scheduling
decision and in context switching among tasks.
He et al. (He et al., 2005) built a configurable RTOS model on top
of the SystemC framework. The main contribution of this work is that a
timed simulation can be performed with the configured RTOS. Timing parameters
such as the time required to create a task, hardware interrupt latency, and
software interrupt enable/disable costs are obtained from OS benchmarks, and
this timing information is embedded into the RTOS models as delay
annotations. Using these delay annotations, the authors implemented novel
algorithms to predict the next OS-wide event and estimate its time-stamp, and
so achieve an accurate timed simulation of the RTOS. Other similar techniques
exist (Chevalier et al., 2004; Yoo et al., 2002).
Performance Models: Madsen et al. (Madsen et al., 2003) proposed
a framework to study the effects of running multi-threaded applications on
a multiprocessor system-on-chip (SoC) platform with an abstract RTOS. The
tasks in the application are modeled as finite state machines comprising
states such as idle, ready, running and preempted. The execution order of
the tasks (scheduling model), the synchronization among the tasks and the
resources requested by the tasks are modeled based on the finite state machine
of the task. Thus the modeling framework is composed of independent basic
RTOS service models, namely the scheduling, synchronization and resource
allocation models.
Paul et al. (Paul et al., 2003) introduce an approach to designing
schedulers, called model-based scheduling, for programmable heterogeneous
multiprocessors (PHM). The custom schedulers in a PHM, unlike the schedulers
in a traditional OS, dynamically decide on the next thread to be executed or
the next packet to be sent based on the way the applications will execute on
the underlying hardware. Also, some functionality of the schedulers in a PHM
is derived statically from design-time models, which are developed together
with other design elements in the PHM. Model-based scheduling allows
customization of schedulers to make dynamic, data-dependent scheduling
decisions, which in turn leads to optimized PHM performance.
Implementation Techniques: Unlike performance models, numerous research
efforts propose OS software generation, synthesis and RTOS implementation
techniques (Desmet et al., 2000; Gauthier et al., 2001; Cho et al., 2005;
Patil and Audsley, 2005). Patil and Audsley implemented an
application-specific OS using the reflection mechanism (Patil and Audsley,
2005). In particular, they proposed a reflective scheduler, which modifies
the current schedule of the tasks based on the reified data (data that the
application sends to the scheduler). Cho et al. (Cho et al., 2005) implement
a static scheduler for multi-processor SoCs with a predefined schedule of
tasks and communications, and use it to explore the trade-offs between a
centralized and a distributed scheduling mechanism. A methodology to automate
the construction of an OS specific to an embedded system's software, and the
automatic targeting of the software to the OS, is detailed by Gauthier et al.
(Gauthier et al., 2001).
We now discuss the practical utility of our technique and briefly list some
of its merits. We also discuss some limitations of our technique and highlight
possible directions for future work. Currently, SoC designers evaluate
their designs by directly building a generic SoC simulation model, such as
transaction-level models using SystemC (Rutten et al., 2002; Dutta et al.,
2001). They then reuse this simulation template in all their future designs.
The main drawbacks of this approach are: (1) it is inefficient to simulate
all possible designs using the template, and (2) the template framework is
not flexible enough to try out new and different designs, and designers are
reluctant to build a new template. High-level models of some components of an
RTOS and other hardware/software elements would ease complex SoC design and
lead to efficient, high-performance designs. We present one such model in
this thesis.
Our technique can also be viewed as a step preceding detailed simulation.
Our methodology can thus provide inputs (design parameters) to the simulation
templates that designers typically use during the design phase. In
particular, the RTOS in embedded devices should exploit the characteristics
of the application, which leads us to the topic of RTOS co-design. In this
context of designing an application-specific RTOS, previous work has focused
mostly on software generation, whereas in this work we present high-level
models to evaluate parameters of certain components of the RTOS, such as
schedulers. The inherent limitation of any such high-level model is its
abstract characterization of the application and of the behavior of the
system under design. Hence it is essential to strike a balance between
providing models useful for the actual design of the system and formulating
modeling frameworks with less detail than their simulation counterparts.
As mentioned earlier, high-level models for components of an RTOS have
not been extensively studied. We therefore want to follow this line of
research and build a complete model suite for all possible components of an
RTOS, such as a resource allocator and a synchronizer. Some previous work
(Madsen et al., 2003) has followed a similar methodology but has tightly
coupled the task models to the RTOS component models. The task models are
very generic and hence do not exploit characteristic features of the
application. Some studies have modeled the RTOS behavior completely as a
state machine (He et al., 2005).
We intend to preserve the system model and the stream models that we
present, and hope to mathematically evaluate parameters corresponding to
other components of the RTOS, such as the synchronizer and the allocator. Our
immediate goal, however, is to construct models for evaluating
application-specific hierarchical schedulers for heterogeneous SoCs. It would
also be interesting to use high-level models to study the trade-off between
centralized and distributed customized schedulers for complex MPSoCs. In this
section, we identified where this thesis work fits in. With this overall
picture, we now detail the real-time calculus system model so as to
illustrate its advantages over other existing methods.
2.3 Preliminaries
In this section, we first provide details on the mathematical framework and
on the multimedia application used in our case studies, namely MPEG2
decoding. Secondly, the SoC architecture, with components such as the
processor and memory, is presented. The mapping of the multimedia application
onto this architecture is also discussed. These details help in understanding
how the mathematical framework is constructed for the SoC
application/architecture model. For example, modeling the arrival of
multimedia stream objects at the input buffer is shown.
MPEG2 Decoding: In this section, we discuss the following topics on the
MPEG2 application: the reasons for choosing MPEG2 decoding as the multimedia
application, and the processing and memory requirements of a multimedia
stream. We chose MPEG2 decoding for the following reasons: (1) the reference
implementation for MPEG2 decoding was widely available, (2) the MPEG2
reference implementation was also optimized for speed and for multimedia
architectures (such as MMX), and (3) the application partitioning and mapping
to the hardware architecture are well studied in the literature.
Now we describe some of the application details. The multimedia stream
we consider for analysis consists of a sequence of data items, also known as
stream objects. They could be, for example, macroblocks, frames, pictures,
and so on. The granularity of the multimedia stream that we use in our
analysis is a macroblock. A macroblock consists of bits, and each macroblock
can consist of a different number of bits. For example, in a video decoding
application, the input stream is a compressed sequence of macroblocks, and
after processing it is a sequence of decompressed macroblocks. The number of
macroblocks varies for each frame in a compressed video stream, whereas the
number of macroblocks is constant for each frame in a decompressed video
stream. The required output rate is usually specified in terms of frames to
be decompressed per unit time.
As mentioned earlier, the processing and memory requirements are high
for decoding multimedia streams. This is due to the variability in the number
of bits that each macroblock consists of and the variability in the amount of
processing each macroblock needs (the requirements in processing and memory
vary greatly).
Min-Plus Algebra: Before we discuss the real-time calculus model, we
first present the min-plus algebra, and then follow in the next section
with the actual details of the system model. Min-plus algebra models
discrete event systems that do not involve concurrency. Discrete event
systems (e.g., an MPSoC) are essentially man-made systems that consist of
finite resources (processors), shared by several users (jobs), working
towards a common goal (parallel computation). The dynamics present in such a
system are synchronization and concurrency. Min-plus algebra uses linear
models, which are sets of equations. These equations contain variables that
can be added together and multiplied by coefficients, which are a part of the
data of the model.
For example, consider computing the departure and arrival times of a train.
To estimate the departure time (as it is synchronized with the arrival time
of all other trains in the same station), we take the max (or maximum) of the
arrival times of all other trains (of interest) at the station. To estimate
the arrival time of the train at the station, we sum the departure time and
the traveling time. Thus, in this system, using the operators max and
addition, we computed the values we require and described the synchronization
behavior. Compared to conventional algebra, these operations substitute max
for addition and addition for multiplication; the min-plus algebra is
obtained in the same way, with min in place of max.
Figure 2.2: System Model
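The train example can be written as a matrix-vector product in which max
takes the place of addition and addition takes the place of multiplication.
The sketch below is only an illustration: the two-station network, the travel
times and the function name `maxplus_matvec` are hypothetical.

```python
def maxplus_matvec(matrix, vector):
    """Max-plus product: result[i] = max over j of (matrix[i][j] + vector[j])."""
    return [max(a + x for a, x in zip(row, vector)) for row in matrix]


# TRAVEL[i][j] is the (made-up) travel time from station j to station i.
# A train leaves station i only after the trains from every station j have
# arrived, so the next departure times are the max-plus product of TRAVEL
# with the current departure times.
TRAVEL = [[2, 5],
          [3, 3]]
first = [0, 0]                            # first departures from both stations
second = maxplus_matvec(TRAVEL, first)    # [5, 3]
third = maxplus_matvec(TRAVEL, second)    # [8, 8]
```

Written this way, the synchronized timetable evolves by a relation that is
linear in the max-plus sense, even though it is not linear in conventional
algebra.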
2.3.1 Our System Model
In this section, we present the basic framework to model the variabilities in
the arrival of input items and the variabilities in the processing requirements
of the items.
Figure 2.2 shows our system model, which consists of a processor with
an internal buffer, a playout buffer and a playout (or output) device. After
decoding the input stream, the processor writes the data in the playout buffer
which is consumed by the playout device (e.g., a video display or a speaker)
at a fixed rate.
We assume that the input bit stream to be decoded is fed into the internal
buffer at a constant rate of r bps. Further, for the sake of simplicity, we will
consider a stream to be made up of a sequence of stream objects. A stream
object might be a macroblock in the case of video decoding or a granule in
the case of audio decoding tasks. Now, given a media clip to be decoded,
let x(t) denote the number of stream objects arriving in the internal buffer
over the time interval [0, t]. Due to the variability in the number of bits
constituting a stream object, the function x(t) varies with the media clip.
We define two functions $\alpha^l(\Delta)$ and $\alpha^u(\Delta)$ to bound the
variability in the arrival process of the stream objects into the internal
buffer of the processor. These two functions are defined as
$$\alpha^l(\Delta) \le x(t + \Delta) - x(t) \le \alpha^u(\Delta) \qquad (2.1)$$
for all $t$ and $\Delta \ge 0$, where $\alpha^l(\Delta)$ and $\alpha^u(\Delta)$
denote the minimum and maximum number of stream objects that can arrive in
the internal buffer within any time interval of length $\Delta$, respectively.
To compute αl(∆) and αu(∆), we introduce two functions φl(k) and φu(k).
The former denotes the minimum number of bits constituting any k consec-
utive stream objects in an audio bit stream, and the latter denotes the cor-
responding maximum number of bits. These two functions can be obtained
by analyzing a number of media clips that are representative of the clips to
be processed by the target decoder.
Given the functions φl(k) and φu(k), it is possible to compute the
pseudo-inverses of these two functions, $(\phi^l)^{-1}(n)$ and $(\phi^u)^{-1}(n)$,
which return the maximum and minimum number of stream objects that can be
constituted by n bits, respectively. Since we assume the input bit stream
arrives in the internal buffer at a constant rate of r bits/sec, we have
$$\alpha^l(\Delta) = (\phi^u)^{-1}(r\Delta) \quad \text{and} \quad \alpha^u(\Delta) = (\phi^l)^{-1}(r\Delta). \tag{2.2}$$
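As an illustration of Equation (2.2), the following minimal sketch derives the bit-count bounds, their pseudo-inverses and the arrival bounds from a list of stream-object sizes. The function names, the sample sizes and the definition of the pseudo-inverse as "the largest k whose bound does not exceed n" are our assumptions for this example; in practice the bounds would be taken over many representative clips rather than one short trace.

```python
def phi_bounds(sizes, k):
    """Return (phi_l(k), phi_u(k)): min and max bits in any k consecutive objects."""
    if k == 0:
        return 0, 0
    windows = [sum(sizes[i:i + k]) for i in range(len(sizes) - k + 1)]
    return min(windows), max(windows)


def pseudo_inverse(sizes, n, bound):
    """Largest k with phi(k) <= n, where bound=0 selects phi_l and bound=1 phi_u.

    The result is capped at the trace length, since the trace is finite.
    """
    k = 0
    while k < len(sizes) and phi_bounds(sizes, k + 1)[bound] <= n:
        k += 1
    return k


def alpha_bounds(sizes, r, delta):
    """Return (alpha_l(delta), alpha_u(delta)) for a constant input rate r (bits/sec)."""
    n = r * delta  # bits that arrive in any interval of length delta
    alpha_l = pseudo_inverse(sizes, n, bound=1)  # phi_u inverse: guaranteed objects
    alpha_u = pseudo_inverse(sizes, n, bound=0)  # phi_l inverse: at most this many
    return alpha_l, alpha_u


# Hypothetical object sizes in bits (not measured data).
sizes = [100, 300, 200, 400]
print(alpha_bounds(sizes, r=100, delta=5))
```

With 500 bits arriving in the interval, at least one and at most two objects of this hypothetical trace can be complete, which is what the sketch reports.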
Let y(t) be the number of stream objects written into the playout buffer
over the time interval [0, t], for all t ≥ 0. Let the service provided by the
processor at frequency f be represented by the function β(∆). Similar to
αl(∆), β(∆) represents the minimum number of stream objects that are
guaranteed to be processed (if available in the internal buffer) within any
time interval of length ∆. It can be shown (Liu et al., 2004) that y(t) ≥
(αl ⊗ β)(t), ∀t ≥ 0, where ⊗ is the min-plus convolution operator.¹ Hence,
for the constraint y(t) ≥ C(t), ∀t ≥ 0 to hold, it is sufficient that the following
inequality holds:
(αl ⊗ β)(t) ≥ C(t), ∀t ≥ 0 (2.3)
It is known from the duality between ⊗ and ⊘ that, for any three functions
f, g and h, h ≥ f ⊘ g if and only if g ⊗ h ≥ f (Boudec and Thiran,
2001), where ⊘ is the min-plus deconvolution operator.² Using this result on
inequality (2.3) we obtain:
β(t) ≥ (C ⊘ αl)(t), ∀t ≥ 0 (2.4)
Note that β(t) in inequality (2.4) is defined in terms of the number of
stream objects that need to be processed within any time interval of length
t. To obtain the equivalent service in terms of processor cycles, we can use
the function γu(k).
¹ The min-plus convolution operator ⊗ is defined as follows: for any two functions f
and g, $(f \otimes g)(t) = \inf_{0 \le s \le t} \{f(t - s) + g(s)\}$.
² The min-plus deconvolution operator ⊘ is defined as follows: for any two functions f
and g, $(f \oslash g)(t) = \sup_{s \ge 0} \{f(t + s) - g(s)\}$.
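The two operators can be evaluated numerically once the functions are sampled on a uniform time grid. The sketch below is a minimal illustration under that assumption: index t of each list holds the function value at the t-th grid point, and the supremum in the deconvolution is truncated at the end of the sampled horizon, so values near the end of the horizon are only approximations. The function names and sample curves are ours.

```python
def minplus_conv(f, g):
    """(f conv g)[t] = min over 0 <= s <= t of f[t - s] + g[s]."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def minplus_deconv(f, g):
    """(f deconv g)[t] = max over s >= 0 of f[t + s] - g[s].

    The maximum is taken only over samples inside the horizon (t + s < n).
    """
    n = min(len(f), len(g))
    return [max(f[t + s] - g[s] for s in range(n - t)) for t in range(n)]


# Two hypothetical cumulative curves sampled at t = 0, 1, 2, 3.
f = [0, 1, 2, 3]
g = [0, 2, 4, 6]
print(minplus_conv(f, g))
print(minplus_deconv(g, f))
```

A lower bound on the required service, as in inequality (2.4), can then be obtained by passing sampled versions of C and αl to `minplus_deconv`.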
We can characterize the variability in the number of processor cycles
required to process any stream object using two functions γl(k) and γu(k).
Both these functions take the number of stream objects k as an argument.
γl(k) returns the minimum number of processor cycles required to process any
k consecutive stream objects and γu(k) returns the corresponding maximum
number of processor cycles.
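Finally, we assume that the playout buffer is read out by the output device
at a constant rate of c stream objects/sec, after a playout delay (or buffering
time) of d seconds. Let the function C(t, d) be the number of stream objects
consumed by the output device over the time interval [0, t]. Then
$$C(t, d) = \begin{cases} 0 & \text{if } t \le d \\ c\,(t - d) & \text{if } t > d. \end{cases} \tag{2.5}$$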
In this chapter, we showed that our approach could be complementary to
other existing methods for system-level design; for example, a mathematical
analysis could be followed by a detailed simulation for more accurate
results. Moreover, techniques from other existing analytical approaches could be
borrowed to model system-design features. So far, concurrent behavior and
feedback loops are yet to be modeled in real-time calculus. We could look
at other existing models to learn how they have modeled such feedback and
concurrent behavior. Some previous related work has already shown that
combined analysis methods also yield significant benefits. Finally, the
approach we used to deduce key insights could also be applied when designing
systems using other analytical models.
Chapter 3
Buffering for Smoothing
In the previous chapters we discussed the research approach to solve the
problem, namely, to find resource saving techniques for effective utilization
of processing and storage resources available on-chip. Towards this, in this
chapter we propose the first technique to reduce the processing requirements
for multimedia applications running on a system-on-chip.
The technique proposed in this chapter reduces the processing requirement
of multimedia applications by tuning the playout delay parameter of
the portable device. We show that, using the real-time calculus approach,
we can accurately characterize the worst-case execution behavior of
multimedia applications, and that the worst-case scenario can be smoothed out
to average behavior using the playout delay parameter.
The next section briefly discusses the intuition behind
this technique. We then present the system model and the variability
characterization curves. Using this system model, we show how buffering reduces
the workload and how the playout delay parameter can be computed precisely.
Later, in the context of a multiprocessor system-on-chip, we show how
the delay parameter can be used to reduce not only the processing requirement
but also the data memory requirement.
3.1 Buffering Vs Workload
3.1.1 Basic Intuition
Before presenting the model, we would first like to explain the intuition be-
hind our scheme. The playout rate associated with the application imposes
certain real-time constraints on it. When the playout delay is negligible,
such constraints translate to an upper bound on the time that can be spent
in processing each data item. Since the number of processor cycles required
by each data item is variable, the minimum processor bandwidth is deter-
mined by the item which requires the maximum number of processor cycles.
When the playout of the application is delayed (i.e. the processed data
items are buffered before being played out), the minimum required processor
bandwidth decreases. For any given delay, the “amount” of decrease is pro-
portional to the variability in the execution requirement of the stream. With
a sufficiently large playout delay, the minimum required processor bandwidth
corresponds to the average processor cycle requirement per data item (Raman
et al., 2005).
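The intuition above can be checked numerically with a deliberately simplified model, which is our assumption for this sketch and not the full framework of the following sections: input data is always available, and any k consecutive items must be processed within d + k/c seconds, where γu(k) is the maximum number of cycles needed by any k consecutive items. The trace values and names below are hypothetical.

```python
def min_frequency(cycles, c, d):
    """Smallest f (cycles/sec) with f * (d + k / c) >= gamma_u(k) for every k."""
    n = len(cycles)
    prefix = [0.0]
    for x in cycles:
        prefix.append(prefix[-1] + x)
    best = 0.0
    for k in range(1, n + 1):
        # gamma_u(k): maximum cycle demand of any k consecutive items.
        gamma_u = max(prefix[i + k] - prefix[i] for i in range(n - k + 1))
        best = max(best, gamma_u / (d + k / c))
    return best


# Hypothetical trace: one expensive item followed by four cheap ones, repeated.
cycles = [4e6, 1e6, 1e6, 1e6, 1e6] * 40
c = 25.0  # playout rate in items/sec

for d in (0.0, 0.04, 0.2):
    print(f"delay {d:.2f} s -> minimum frequency {min_frequency(cycles, c, d):.3e} Hz")
```

With no delay, the bound is set by the most expensive item (c times the maximum per-item cycle count). As the delay grows, the bound drops towards the average demand of this trace (c times the mean per-item cycle count). Because the trace is finite, the bound keeps falling for very large delays, since the whole clip could then be decoded before playout starts.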
Further, many multimedia tasks have variable input-output rates, i.e.,
the number of input data items consumed to produce one processed data
item at the output is variable (Varatkar and Marculescu, 2002, 2004). For
example, the variable length decoding task in an MPEG decoder consumes a
variable number of bits to produce one partially decoded macroblock. This
provides additional possibility for reducing the required processor frequency
by buffering the decoded frames before playout.
3.1.2 Related Work
In this section we discuss some of the previously proposed techniques for
estimating processor cycle requirements of multimedia tasks, with the aim of
either improving CPU utilization or minimizing energy consumption. We also
point out the major differences between these techniques and our proposed
framework.
All dynamic frequency or voltage scaling algorithms rely on the fact that
lowering the processor’s clock frequency and/or supply voltage reduces its
power consumption. However, accurately predicting the variation in the
workload generated by a multimedia task, so that the processor’s frequency
can be changed accordingly, is a difficult problem (Choi et al., 2004;
onghwan Son, 2001). In (onghwan Son, 2001), the resource requirement for the
current workload in the case of MPEG decoding is predicted from the frame
drop and delay encountered during the immediately previous time interval.
Further, the time required to decode any MPEG frame is predicted using
a frame classification technique. On the other hand, in (Choi et al., 2004),
the decoding process is classified into two parts: frame dependent and frame
independent decoding. The decoding times for both these parts are estimated
and the processor’s voltage is scaled accordingly. In contrast to these
approaches, our work relies on an offline analysis of a multimedia stream to
accurately characterize the variability in the workload generated by it. It
does not rely on any run time prediction of the processor cycle requirements
of the stream.
A number of papers also exploit buffering techniques to smooth out the
variability in multimedia workloads, so that dynamic power management
techniques can be used (Cai and Lu, 2004; Im and Ha, 2004, 2003; Lu et al.,
2002). In (Im and Ha, 2004), a job scheduler delays the execution of soft real-
time tasks, by buffering the streams processed by these tasks, and uses the
generated slack to process tasks with stringent real-time constraints. How-
ever, this technique does not provide any insights into what would be an
optimal buffering time that would lead to energy savings and at the same
time allow continuous playback of a multimedia task. Our framework, on
the other hand, can be used to identify an optimal playout delay, such that
increasing this delay does not lead to further savings in processor cycle re-
quirements and decreasing this delay significantly increases the processor
cycle requirements of a task.
Lately, a number of energy aware scheduling techniques have been pro-
posed in the literature (Dudani et al., 2002; hui Wang et al., 2001; Jejurikar
and Gupta, 2004). A few recent papers have also addressed the problem
of energy savings in the specific context of multimedia applications (Hughes
et al., 2001; Tamai et al., 2004). Our work differs from these in the following
ways:
• The framework is only concerned with computing the buffering time or
the playout delay of each stream. It is independent of the scheduling
policy used to schedule these streams. This framework is supposed to
work at a level below the scheduler which would typically be used to
schedule the multiple streams being processed by the device.
• As mentioned earlier, the framework that we present in this chapter can
be extended to the case of dynamic voltage-frequency scaling. However,
we do not explore this option here.
In the domain of computer networks, several techniques have been pro-
posed for exploiting the playout delay of multimedia applications in the con-
text of link scheduling (Moon et al., 1998; Ramjee et al., 1998b). For example,
in (Moon et al., 1998), tight bounds on optimum average audio playout delay
based on tradeoffs between per-packet loss and delay were estimated. Using
these bounds, a history dependent adaptive packet playout delay adjustment
algorithm was proposed. Our work is concerned with computing a fixed
playout delay, rather than dynamically adjusting it at run time. Further, our
technique is more relevant in the context of playing stored audio and video.
Hence, we did not exploit any network-related parameters such as packet loss
and delay.
3.2 Frequency Estimation
We now present our analytical framework to compute the minimum processor
frequency in order to continuously playback a given media file, with a playout
delay d.
Now, given the input bit rate r, the functions φl(k), φu(k), γl(k) and
γu(k) characterizing the possible set of media clips to be decoded, and the
function C(t), we can compute the minimum processor frequency f to sustain
the playout rate of c stream objects/sec. This is equivalent to saying that the
playout buffer never underflows. Let y(t) denote the total number of stream
objects written into the playout buffer over the time interval [0, t], then this
constraint is equivalent to requiring that y(t) ≥ C(t) for all t ≥ 0. To obtain
the equivalent service in terms of processor cycles, we can use the function
γu(k) defined above. The minimum service that needs to be guaranteed by
the processor to ensure that the playout buffer never underflows is given by:
$$\gamma^u(\beta(t)) = \gamma^u\big((C \oslash \alpha^l)(t)\big) = \gamma^u\big(C(t) \oslash (\phi^u)^{-1}(rt)\big) \tag{3.1}$$
processor cycles for all t ≥ 0. Hence, the minimum frequency at which the
processor should be run to sustain the specified playout rate is given by:
$$\min\{f \mid f t \ge \gamma^u(\beta(t)),\ \forall t \ge 0\}$$
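Once the cycle demand γu(β(t)) has been sampled at a set of time points, the minimum frequency is simply the largest ratio of demand to elapsed time. The following sketch assumes such a sampled demand curve is already available; the function name and the sample values are hypothetical.

```python
import math


def frequency_from_demand(times, demand_cycles):
    """Smallest f such that f * t >= demand(t) at every sampled time t."""
    f = 0.0
    for t, w in zip(times, demand_cycles):
        if w <= 0:
            continue  # no cycles demanded up to this time
        if t <= 0:
            return math.inf  # work demanded in zero time cannot be served
        f = max(f, w / t)
    return f


# Hypothetical demand curve: nothing is required during the first 0.1 s.
times = [0.0, 0.1, 0.2, 0.3, 0.4]
demand = [0.0, 0.0, 2e6, 6e6, 9e6]
print(frequency_from_demand(times, demand))
```

A positive demand at t = 0 yields an infinite frequency, which corresponds to a playout delay that is too small for any finite processor speed.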
3.3 Delay Redistribution
3.3.1 Motivation
Many system-on-chip platform architectures targeted towards the multimedia
domain consist of multiple processing elements (PEs) connected by FIFO
buffers in a pipelined fashion (e.g., Eclipse and Viper from Philips) (Rutten
et al., 2002). Each such PE executes a part of an application (e.g., variable
length decoding in an MPEG-2 decoder application) and runs concurrently
with the other PEs. An important design constraint in such set-ups is to
ensure that none of the FIFO buffers overflow and, in addition, that the playout
buffer never underflows. To satisfy the overflow constraint, a PE stalls when
the buffer it is writing to fills up. To prevent playout buffer underflow, the
PE that writes to the buffer is usually clocked at a slightly higher frequency
than the clock of the output device that reads the playout buffer.
Although a combination of these two techniques ensures both the buffer
overflow and underflow constraints, its application involves additional stalling
circuitry and the use of at least two different clock domains. To avoid these
overheads, a common design tactic is to use a sufficiently large playout delay
(i.e. the delay after which the output device starts reading the playout buffer)
to avoid possible playout buffer underflows, in combination with large buffers
to avoid buffer overflows. However, given the high variability in the execution
requirements of most multimedia applications, the amount of buffer space
required can be significant if one aims at a design with a buffer targeting the
worst-case application 3.1.
In this chapter, we propose a technique where the playout delay is not
associated solely with the output device. Rather, this delay is redistributed
over the multiple PEs in the pipeline. In other words, each PE starts reading
its input buffer after a certain amount of time, or delay, has elapsed since
the preceding upstream PE started reading its input buffer. We show that
such delay redistribution can significantly reduce the total buffer
requirements, without increasing the total playout delay of the application being
executed. Given that on-chip buffers occupy a large fraction of the chip area,
our technique is useful for reducing chip area. At the same time, it involves
no extra implementation overheads.
Figure 3.1: Our system model and technique. FIFO buffers connect PEs in
pipeline. An application is partitioned and mapped onto the different PEs that
run tasks concurrently. Buffer size reduces on redistributing playout delay.
Our delay redistribution scheme takes into account the variability in the
execution requirements of an application and also the burstiness in the on-
chip traffic. The manner in which the delay is distributed also depends on
the partitioning of the application onto the multiple PEs. As a rule of
thumb, we propose that the total playout delay be redistributed among the
different PEs in the architecture based on the ratio of the variability of the
tasks running on the respective PEs. In other words, a PE running a task
with high variability in its execution requirement should be associated with
a higher delay. Although this basic technique is fairly intuitive, most current
designs associate all the playout delay solely with the output device.
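The rule of thumb can be sketched as a proportional split of the total delay. The text does not fix a particular variability measure, so the per-PE numbers below are placeholders (one possible choice, assumed here, would be the spread between the maximum and minimum cycles per stream object), and the function name is ours.

```python
def redistribute_delay(total_delay, variabilities):
    """Split total_delay among PEs in proportion to their task variability."""
    total = sum(variabilities)
    if total == 0:
        # No variability anywhere: fall back to an even split.
        return [total_delay / len(variabilities)] * len(variabilities)
    return [total_delay * v / total for v in variabilities]


# Hypothetical variability measures for two PEs and a 500 ms total delay.
print(redistribute_delay(500, [3, 1]))
```

The shares always sum to the total playout delay, so the redistribution does not change the delay perceived by the user.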
Our contribution in this chapter is in conceiving the delay redistribu-
tion technique for multimedia pipelines and in formulating a mathematical
framework that can guide a system designer in applying this scheme. We
also present experimental results (based on an MPEG-2 decoder) to quan-
titatively show the reductions in the buffer size required after applying our
technique. The buffer sizes we estimate are validated with the results ob-
tained from simulation-based techniques.
3.3.2 Relation to Previous Work
Previous efforts (Panda et al., 2001; Yang et al., 2006; Ko and Bhattacharyya,
2005; Murthy and Bhattacharyya, 2004; Sarshar and Wu, 2004; Han et al.,
2006; Stuijk et al., 2006) have specifically been directed towards optimizing
on-chip memory in system-on-chip architectures designed for embedded mul-
timedia systems. Most of the previous papers attempted to reduce mem-
ory requirements of synchronous data-flow (SDF) graphs which are used
for specifying compute-intensive kernels of DSP applications. Murthy and
Bhattacharyya (Murthy and Bhattacharyya, 2004) proposed buffer merging
to reduce memory requirements of SDF graphs. Buffer merging is achieved
through sharing buffers that two different processes use. After analyzing
the lifetime of actors (nodes specifying application code blocks in SDFs), it
is determined whether two different processes containing these actors can
potentially share buffers. Stuijk et al. (Stuijk et al., 2006) computed the
Pareto-optimal points that give the minimum storage space needed to execute
a graph under given throughput constraints. We take a completely new
approach in that we study the on-chip traffic characteristics of the applica-
tion and exploit the playout delay parameter associated with the multimedia
application to reduce the buffer size. Similar approaches have been followed
in the domain of computer networks to counter the burst in network traffic
so as to effectively utilize network resources. In comparison, our work is con-
cerned with fixed playout delay, rather than dynamically adjusting it at run
time (Ramjee et al., 1998b). Further, our technique is more relevant in the
context of playing stored audio and video. Hence, we did not exploit network
related parameters such as loss and delay.
3.3.3 Illustrative Example
In this section, we first explain the mapping of the application to the multi-
processor system-on-chip architecture platform. Then, for this set-up (shown
in Figure 3.1), the advantages of the delay redistribution technique over
others are illustrated with some scenarios.
An MPEG-2 decoder is the multimedia application we chose to map to the
system-on-chip platform. The sub-tasks of the application, namely, variable
length decoding (VLD), inverse quantization (IQ), inverse discrete cosine
transform (IDCT), and motion compensation (MC) are bound to the two
on-chip processing elements (PE 1 and PE 2 in Figure 3.1). PE 1, running
VLD and IQ, reads the input buffer, where the stream objects (or items) of
the encoded input video arrive. PE 2, executing IDCT and MC, reads the
partially processed stream objects (output of PE 1) from the intermediate
buffer. The output device, after an initial delay d, reads the decoded stream
objects (that PE 2 stored) from the playout buffer at a constant rate and
displays the video. To show how our technique reduces the required on-chip
memory size, we present three different scenarios. Figure 3.2 sketches the fill
levels of the on-chip FIFO buffers in those scenarios. Scenario (c) is where
we apply our delay redistribution technique, and scenarios (a) and (b) are
existing techniques. Let us now discuss them in further detail.
Figure 3.2: Buffer fill levels with initial playout delay: (a) very small, (b) large,
and (c) redistributed.
Scenario (a), where the playout buffer underflows, occurs when the output
device reads the playout buffer with no or only a small initial playout delay.
The execution of some tasks of multimedia applications (e.g., VLD in MPEG-2)
shows high data-dependent variability. The PEs running these variable
tasks (e.g., PE 1, which runs VLD in our set-up) write to their
output buffers at a variable rate. In the example discussed here, the output
buffer of PE 1 is the intermediate buffer and its fill level fluctuates over
time (shown in Figure 3.2). This phenomenon of buffer fill level oscillation
propagates to the playout buffer resulting in a cascading effect: varying fill
levels at one buffer cause the fill level to vary at the next buffer in the
pipeline, and this effect continues until it reaches the playout buffer. The
output device, however, reads items from the playout buffer at a constant
rate. Hence, it is possible that the playout buffer underflows resulting in a
loss in quality of the displayed video. Also, due to the cascading effect, the
intermediate buffer and the playout buffer might overflow if the buffers are
not large enough. To avoid such overflows and underflows – as mentioned in
Section 3.3.1 – the PEs have to stall (avoids overflows) and run at a higher
clock frequency domain (avoids underflow at playout buffer). This technique
has large overheads (e.g., in terms of stalling circuitry). In the next scenario,
we will see how the initial playout delay absorbs the variabilities arising due
to task execution (and due to variable event arrivals).
The display device starts reading from the playout buffer after a consid-
erable initial playout delay in Scenario (b), and the playout buffer does not
underflow. Only after the initial delay the output device starts consuming
items from the playout buffer. Hence, during the delay, stream objects ac-
cumulate in the playout buffer. If this initial delay is appropriately chosen,
then the variabilities occurring at the output of PE 1 will not propagate to
the playout buffer (see playout buffer fill level in Figure 3.2 (b)). Since there
are no variabilities in the fill level of the playout buffer, it never underflows.
Effectively, we evaded the cascading effect. However, bursts at the
intermediate buffer remain (see intermediate buffer fill levels in Figure 3.2).
Now, what if the intermediate buffer, like the playout buffer, is read by a
consumer that starts late? That is, if we start PE 2 after a certain delay,
then the variabilities at the output of PE 1 are absorbed at the
intermediate buffer itself.
The benefits of this delay redistribution are discussed next.
In Scenario (c), PE 1 initially starts to decode items. Then after a delay
of d1, PE 2 starts. Finally, after a delay d, the display device starts reading
stream objects from the playout buffer. In Figure 3.2(c), we see no fluctua-
tions in the intermediate buffer and the playout buffer, and the fill level at
the playout buffer substantially reduces compared to Scenario (b). Note that
the delay after which PE 2 should start (d1) has to be chosen appropriately,
because the total buffer size required will not reduce if the intermediate
buffer grows too much after the delay redistribution. For a slight increase
in the intermediate buffer size, our results show that the total maximum
buffer size required reduces after redistribution, since the stream objects
at the intermediate buffer, unlike those at the playout buffer, are
partially compressed and therefore require less memory.
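The trade-off between Scenarios (b) and (c) can be illustrated with a toy discrete-time simulation. This is only a sketch under strong assumptions of ours: time advances in unit steps, PE 1 writes a given number of items per step, PE 2 moves at most `rate2` items per step once it has started, the display reads `c` items per step after `d` steps, and a playout-buffer item is assumed to occupy four times the memory of an intermediate-buffer item. All numbers are made up.

```python
def simulate(produced, rate2, c, d1, d):
    """Return (peak intermediate fill, peak playout fill, underflow flag), in items."""
    inter = play = 0
    max_inter = max_play = 0
    underflow = False
    for t, p in enumerate(produced):
        inter += p                          # PE 1 writes p items
        max_inter = max(max_inter, inter)
        if t >= d1:                         # PE 2 starts after d1 steps
            moved = min(rate2, inter)
            inter -= moved
            play += moved
            max_play = max(max_play, play)
        if t >= d:                          # display starts after d steps
            if play < c:
                underflow = True
            play -= min(c, play)
    return max_inter, max_play, underflow


# Hypothetical bursty output of PE 1 and hypothetical per-item sizes.
produced = [2, 0, 2, 0, 2, 0, 2, 0]
INTER_ITEM_SIZE, PLAY_ITEM_SIZE = 1, 4

for d1 in (0, 3):
    mi, mp, uf = simulate(produced, rate2=1, c=1, d1=d1, d=4)
    memory = mi * INTER_ITEM_SIZE + mp * PLAY_ITEM_SIZE
    print(f"d1={d1}: intermediate={mi}, playout={mp}, underflow={uf}, memory={memory}")
```

In this toy run the total playout delay is the same in both cases and the playout buffer never underflows. Starting PE 2 later shifts the waiting items from the playout buffer to the intermediate buffer, where each item is assumed to be smaller, so the total memory drops. The intermediate buffer grows more here than in the measured results discussed later, so the sketch shows the mechanism rather than its magnitude.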
3.3.4 Problem Statement
We solve the following problem using the analytical framework presented in
the next section. Consider our system model shown in Figure 3.3, and assume
a streaming application that is partitioned into tasks and mapped to the PEs
of the system. A constant bit rate input stream arrives at the input buffer,
PE 1 partially processes it, and writes it to the intermediate buffer. PE 2
reads items from the intermediate buffer, completely processes the stream,
and writes it to the playout buffer. Finally, the display device consumes
items from the playout buffer at a constant rate. Given the frequency at
which PE 2 runs (f2) and the minimum frequency at which PE 1 should run
(f1), the corresponding minimum playout delay (d) is estimated such that the
playout buffer never underflows. If the output device starts consuming items
after the delay d, the playout buffer is guaranteed never to underflow. Then,
preserving the playout buffer underflow guarantee, the maximum playout
buffer size and intermediate buffer size required after delay redistribution
should be estimated. It has to be shown that the total maximum buffer size
required reduces after delay redistribution.
Figure 3.3: System Model
3.3.5 Playout Delay Redistribution
In this section, first we present our analytical framework to estimate the
minimum playout delay required such that PE 1 and PE 2 run at the minimum
processor frequency required to meet the display rate. Later, we estimate
the total maximum buffer size required (sum of the intermediate and the
playout buffer sizes). Now, given the input bit rate r, the functions
($\phi^{l/u}(k)$, $\gamma^{l/u}_{1/2}(k)$) characterizing the possible set of media
clips to be decoded, and the
function C(t, d), we can compute the minimum processor frequencies f1 and
f2 to sustain the playout rate of c stream objects/sec. This is equivalent to
requiring that the playout buffer never underflows.
Let y(t) denote the total number of stream objects written into the play-
out buffer over the time interval [0, t]. Then the playout buffer underflow
constraint is equivalent to requiring that y(t) ≥ C(t, d) for all t ≥ 0.
Let the service provided by PE 2 at frequency f2 be represented by the
function β2(∆). Similar to $\alpha^l_2(\Delta)$, β2(∆) represents the minimum number
of stream objects that are guaranteed to be processed (if available in the
intermediate buffer) within any time interval of length ∆. It can be shown
that $y(t) \ge (\alpha^l_2 \otimes \beta_2)(t)$, ∀t ≥ 0, where ⊗ is the min-plus
convolution operator.
Hence, for the constraint y(t) ≥ C(t, d), ∀t ≥ 0 to hold, it is sufficient that
the following inequality holds
$$(\alpha^l_2 \otimes \beta_2)(t) \ge C(t, d), \quad \forall t \ge 0. \tag{3.2}$$
By applying the duality result to inequality (3.2), we obtain
$$\beta_2(t) \ge (C \oslash \alpha^l_2)(t, d), \quad \forall t \ge 0. \tag{3.3}$$
Note that β2(t) in Inequality (3.3) is defined in terms of the number of
stream objects that need to be processed within any time interval of length
t. To obtain the equivalent service in terms of processor cycles, we can use
the function $\gamma^u_2(k)$ defined above. The minimum service that needs to be
guaranteed by PE 2 to ensure that the playout buffer never underflows is
$$\gamma^u_2(\beta_2(t)) = \gamma^u_2\big((C \oslash \alpha^l_2)(t, d)\big), \quad \forall t \ge 0 \tag{3.4}$$
processor cycles. Hence, the minimum frequency at which PE 2 should run
to sustain the specified playout rate is given by
$$\min\{f_2 \mid f_2 t \ge \gamma^u_2(\beta_2(t)),\ \forall t \ge 0\}. \tag{3.5}$$
From the above equation, if PE 2 is run at frequency f2, with a playout delay
d (see Equation (2.5)), then it is guaranteed that the playout buffer will never
underflow. Now, let us compute the minimum processor frequency for PE 1.
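Considering that the playout delay is redistributed, the service that PE 2
must guarantee is
$$\beta_2(t) = \begin{cases} 0 & \text{if } t \le d_1 \\ (C \oslash \alpha^l_2)(t, d) & \text{if } t > d_1, \end{cases} \tag{3.6}$$
with d > d1.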
PE 2 is idle during the initial delay d1. After the initial delay, PE 2
produces items at a speed that sustains the display rate. To estimate the
minimum playout delay required d, we will set d1 to 0.
In the above equation, $\alpha^l_2(t)$ captures the minimum number of items that
arrive at the buffer in front of PE 2. So, $\alpha^l_2(t)$ lower bounds the number of
items that PE 1 should produce, which can be written as
$$(\alpha^l_1 \otimes \beta_1)(t) \ge \alpha^l_2(t), \quad \forall t \ge 0. \tag{3.7}$$
Applying the duality result to inequality (3.7), we obtain
$$\beta_1(t) \ge (\alpha^l_2 \oslash \alpha^l_1)(t), \quad \forall t \ge 0. \tag{3.8}$$
Similar to Equation (3.5), let f1 be the minimum processor frequency at
which PE 1 should be run to guarantee an output of at least $\alpha^l_2(\Delta)$. Then,
f1 can be computed as
$$\min\{f_1 \mid f_1 t \ge \gamma^u_1(\beta_1(t)),\ \forall t \ge 0\}. \tag{3.9}$$
The frequencies of PE 1 and PE 2, f1 and f2 respectively, depend on the
playout delay d, which is a parameter of the consumption function C(t, d) in
Inequality (3.3). So, we have to estimate the playout delay required to run the
PEs at the minimum processor frequencies computed in Equations (3.5) and
(3.9). We denote by f(d) the minimum frequency of the processing element
corresponding to the playout delay d.
As shown in Figure 3.4, we define three different playout delays
corresponding to the processor frequencies f1 and f2: the initial, stabilization
and maximum playout delays. The initial playout delay, di, is the largest
playout delay below which the minimum processor frequency (for any PE in the
pipeline) required to decode the stream is infinite. The maximum playout
delay, dm, is the smallest playout delay above which there is no decrease in
the minimum processor frequency (for all PEs in the pipeline) required to
decode the stream, and can be written as
min{dm | fi(dm + δ) = fi(dm), ∀δ ≥ 0, ∀i}. (3.10)
The stabilization delay, which lies between the initial and the maximum
delay, is the minimum playout delay required to run PE 1 and PE 2 at
frequencies f1 and f2, and is defined as follows.
Definition 3.3.1 Stabilization playout delay (ds) is the delay value at which
the minimum processor frequency (of all PEs in the pipeline) stabilizes to a
value close to f(dm). It is defined for a given ε as follows:
min{ds | fi(ds) − fi(dm) ≤ ε, ∀i}. (3.11)
In other words, PE 1 and PE 2 could run at a minimum processor frequency
f1 and f2 respectively. But to sustain the playout rate, the display device
should start consuming items from the playout buffer after the stabilization
playout delay. Hence from the above equation, given the minimum processor
frequencies f1 and f2, the minimum playout delay required can be estimated.
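Given the minimum frequencies of each PE sampled on an increasing grid of playout delays, Equations (3.10) and (3.11) reduce to two simple searches. The sketch below assumes such samples are available and that "no further decrease" can only be judged within the sampled grid; the function names and the numbers are illustrative, not measurements.

```python
def maximum_delay(delays, freqs_per_pe):
    """Smallest sampled delay after which no PE's frequency changes (Eq. 3.10)."""
    for j, d in enumerate(delays):
        if all(all(x == f[-1] for x in f[j:]) for f in freqs_per_pe):
            return d


def stabilization_delay(delays, freqs_per_pe, eps):
    """Smallest sampled delay with f_i(d) - f_i(d_m) <= eps for all PEs (Eq. 3.11)."""
    m = delays.index(maximum_delay(delays, freqs_per_pe))
    for j, d in enumerate(delays):
        if all(f[j] - f[m] <= eps for f in freqs_per_pe):
            return d


# Hypothetical samples: delays in ms, frequencies in MHz.
delays = [100, 200, 300, 400, 500]
f1 = [100, 70, 45, 40, 40]
f2 = [460, 460, 460, 460, 460]

print(maximum_delay(delays, [f1, f2]))
print(stabilization_delay(delays, [f1, f2], eps=5))
```

A larger ε trades a slightly higher processor frequency for a shorter playout delay, which is the design choice that the stabilization delay captures.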
3.3.6 Buffer Size Estimation
To show that the buffer size reduces after redistributing the delay, we first
have to compute the maximum intermediate and playout buffer sizes required.
The following constraints must be satisfied when estimating the buffer sizes:
(i) the playout buffer should never underflow, and (ii) the intermediate and
playout buffers should never overflow.
Figure 3.4: Initial playout delay values as minimum required processor frequency
drops and stabilizes.
To ensure that the playout buffer never underflows, as we showed in the
previous section, the processing elements have to run at the minimum
processor frequencies f1 and f2 (Equations (3.5) and (3.9)). The output
device must start consuming items from the playout buffer after the
stabilization delay d. Now, let us compute the maximum playout buffer and the
intermediate buffer sizes required such that these buffers never overflow.
The playout buffer stores decoded stream objects that are to be consumed
by the display device. We know that in the time interval [0, t], the number of
stream objects processed is y(t), and the number of stream objects consumed
is C(t, d). Hence the playout buffer size required is given by
B(d) = max {y(t) − C(t, d)}, ∀t ≥ 0. (3.12)




)(t) and hence the maximum playout buffer
size required such that the buffer never overflows is given by




)(t)− C(t, d)} , ∀t ≥ 0. (3.13)
In the above equation, the arrival of items at the playout buffer is variable,
but the consumption rate is constant. The delay d, after which the playout
device starts to consume items from the buffer, smooths the variability in
the fill levels of the playout buffer.
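For a single recorded output trace, the playout buffer requirement of Equation (3.12) can be evaluated directly. The following sketch assumes that y(t) has been sampled every `dt` seconds; unlike the bound derived above, it covers only that one trace and not the worst case over all clips. The names and values are hypothetical.

```python
def consumption(t, d, c):
    """C(t, d) from Equation (2.5): objects consumed by the display up to time t."""
    return 0.0 if t <= d else c * (t - d)


def playout_buffer(y, dt, c, d):
    """Return (peak fill level, True if the buffer never underflows)."""
    levels = [y_t - consumption(i * dt, d, c) for i, y_t in enumerate(y)]
    return max(levels), min(levels) >= 0


# Hypothetical cumulative output of the last PE, sampled once per second.
y = [0, 0, 1, 3, 4, 6]
print(playout_buffer(y, dt=1.0, c=1.0, d=0.0))
print(playout_buffer(y, dt=1.0, c=1.0, d=2.0))
```

Without a delay, this trace underflows because the display asks for items before they are decoded. With a delay of two seconds, the fill level stays non-negative and the peak level gives the buffer size needed for this trace.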
Similarly, the maximum intermediate buffer required is computed as fol-
lows











, ∀t ≥ 0. (3.14)
From the above equation, we see that both the arrival and the consumption
of items at the intermediate buffer occur at variable rates. The variability
in the fill level of the intermediate buffer due to the arrival of items
can be smoothed if PE 2 starts consuming items after an appropriate
delay d1. This delay d1 is redistributed from the playout delay d, and in the
next section we show numerically that the total maximum buffer size required
reduces because of this redistribution.
3.3.7 Results
Before presenting the results, we first explain some choices we made in
these experiments. (a) application: Our mathematical framework can
model many streaming applications other than video decoding, e.g., audio
decoding, gaming applications, and so on. We chose the MPEG-2 decoder
specifically because its task partitioning and mapping to the system-on-chip
platform are well studied in the literature. The MPEG-2 reference implementation
was obtained from the libmpeg2 (2006) library, as its source code was optimized
for speed. (b) inputs: Today’s multimedia systems such as set-top boxes,
gaming devices, etc., have to support high definition display devices (e.g.,
HDTV). So, to study the resource requirements of these types of high-end
systems, we chose to experiment with high-bit-rate and high-resolution videos.
However, we also did experiments with low-bit-rate and low-resolution clips and
observed the buffer savings obtained using our technique. (c) simulators:
In order to compute the γ curves, as mentioned before, we need a one-time
simulation using a processor simulator. Familiarity with the tool and the
ease of checking the correctness of the cycle values obtained led us to choose a
simple processor configuration, namely sim-safe in the SimpleScalar tool set.
To validate the estimated buffer sizes obtained from our mathematical
framework, we had to use a full system simulator, and SystemC was the obvious
choice as it is free and widely used.
Frequency versus playout delay: We first compute the total initial
playout delay required such that the processor frequencies required for
running the tasks in PE 2 and PE 1 reduce to their minimum. To find this
delay, we first substitute into Equation 3.9 the processor frequency that
the video decoding tasks in PE 2 need, and then use Equation 3.9 to compute
the total playout delay required. Figure 3.6 shows the processor frequency
f1 versus delay for tasks running in PE 1 for two different values of f2. We
can now estimate the total playout delay to be the maximum delay such that
the processor frequency for tasks running in PE 1 reduces to its minimum.
So, if the output device starts after this initial playout delay, the
processor frequencies of the tasks running in PE 1 and PE 2 drop to the
lowest possible values. In Figure 3.6, for f2 = 460 MHz, f1 reduces from
100 MHz to 40 MHz as the total delay increases from 100 ms to 500 ms. Hence
d is 500 ms. In our experiments, we chose numerous processor frequency
values for PE 2 (i.e. for f2). We show the results for f2 = 460 MHz and
f2 = 500 MHz because the average frequency at which PE 2 should run to
execute the tasks mapped to it is 460 MHz. From Figure 3.6 it is evident
that even if f2 is increased beyond the average frequency required for PE 2,
the minimum required processor frequency f1 is still 40 MHz. Hence there is
no advantage in running PE 2 at a frequency higher than its average. On the
other hand, when f2 is reduced below 460 MHz, we observed (results not
shown) that the minimum required processor frequency f1 computed from
Equation 3.9 is larger than 40 MHz.
Buffer size reduction: Having found the total playout delay corresponding
to the minimum processor frequencies required for the processing elements,
we now present results that show the memory reduction obtained by delay
redistribution. For the total playout delay d, the maximum playout buffer
and intermediate buffer sizes required are estimated (Equations 3.13 and
3.14). The buffer sizes are estimated with and without redistributing the
playout delay. The estimated maximum buffer sizes from our analytical
framework closely match the simulation results. We present SystemC-based
simulation results and show how much buffer space we could save after delay
redistribution. We used the video clip “cact” (obtained from (Tektronix,
1996)) in this simulation. The main result of this chapter is shown in
Figure 3.5(a) and Figure 3.5(b). The total buffer size savings (including
the intermediate buffer) we obtained after delay redistribution for this
clip is 70%. In Figure 3.5(a), the simulation results for d1 = 0 ms and
d1 = 450 ms are shown. Recall that d1 is the delay after which PE 2 starts
processing. As can be seen in this figure, the playout buffer fill level
reduces substantially after redistributing the delay. Figure 3.5(b) shows
the fill level of the intermediate buffer over time, and we see that after
redistributing the delay, the fill level of the intermediate buffer
increases. Note that despite the increase in the fill level of the
intermediate buffer, the total buffer size in terms of bits does not
increase. In fact, it reduces, since the partially decoded macroblocks in
the intermediate buffer occupy less memory than the decoded macroblocks in
the playout buffer.
[Figure 3.5, panel (a) Playout Buffer: curves for d1 = 450 ms and d1 = 0 ms]
Figure 3.5: Change in buffer fill levels with redistributing playout delay.
Note on decoder source code instrumentation: We instrumented the libmpeg2
decoder source code to obtain the bits corresponding to partially decoded
macroblocks. After VLD and IQ, the DCT blocks corresponding to each
macroblock have several zero and non-zero coefficients. Instead of storing
each block in matrix format in the intermediate buffer, we store only the
location and the value of each non-zero coefficient. We needed at most 16
bits to store a coefficient and 6 bits to store its location, leading
to a more efficient storage.
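The storage scheme described in this note can be sketched as follows. This
is only an illustration under stated assumptions: the helper names are
hypothetical, the block is assumed to be 8 × 8 (so that 6 bits address all
64 positions), and the dense matrix format is assumed to spend 16 bits on
every entry. It is not the instrumentation code used in the experiments.

```python
COEFF_BITS = 16     # maximum bits per coefficient value (from the text)
LOCATION_BITS = 6   # bits per location: 2**6 = 64 positions in an 8x8 block


def sparse_encode(block):
    """Return (location, value) pairs for the non-zero coefficients."""
    flat = [v for row in block for v in row]
    return [(i, v) for i, v in enumerate(flat) if v != 0]


def sparse_decode(pairs, size=8):
    """Rebuild the dense size x size block from (location, value) pairs."""
    flat = [0] * (size * size)
    for loc, val in pairs:
        flat[loc] = val
    return [flat[r * size:(r + 1) * size] for r in range(size)]


def sparse_bits(pairs):
    """Bits needed to store the pairs with fixed-width fields."""
    return len(pairs) * (COEFF_BITS + LOCATION_BITS)


def dense_bits(size=8):
    """Bits for the matrix format (assumption: COEFF_BITS for every entry)."""
    return size * size * COEFF_BITS


# Hypothetical block with three non-zero coefficients after VLD and IQ.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 120, -7, 3
pairs = sparse_encode(block)
print(pairs, sparse_bits(pairs), dense_bits())
```

Under these assumptions the example block needs 66 bits instead of 1024, and
the sparse form stays smaller as long as a block has at most 46 non-zero
coefficients.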
[Figure 3.6, legend: f2 = 460 MHz, f2 = 500 MHz]
Figure 3.6: Playout delay estimation w.r.t. processing requirement of tasks
(VLD and IQ) running in PE 1.
In this chapter we presented an analytical framework with which it is
possible to compute the trade-offs between playout delay and the minimum
required processor frequency for processing multimedia streams. This is
achieved by exploiting the high variability in multimedia workloads. The
framework can account for both the data-dependent variability in the
processor cycle requirements of stream objects and the variability in the
input-output rates of multimedia streams. We also proposed a novel technique
to reduce the on-chip memory size required for stream processing on
multiprocessor system-on-chip architectures. The playout delay associated
with the display device in a multimedia embedded system is redistributed to
the on-chip processing elements that are connected in a pipeline to the
output device. This delay redistribution reduces the maximum on-chip memory
required because the variabilities in the buffer fill levels (due to event
arrival and task execution at the PEs) are stopped from propagating to
subsequent buffers. We presented a mathematical framework with which we
could estimate the total playout delay required and the delay to be
redistributed. We validated our results using simulation, and we obtained up
to 70% savings in buffer size after applying our technique.




CHAPTER 4. BUFFERING FOR MULTIPLE APPLICATIONS
In the previous chapter, we illustrated how buffering reduces the processing
requirement and the storage requirement. In this chapter we see how the
variability characteristics of a multimedia application can be used to let
several applications share the resources so that the resources are utilized
effectively.
In line with the previous chapter, this chapter establishes how the
processor workload and the periodic task overload can be squeezed in such a
way that the processor is effectively utilized. The squeezing technique
allows the playout buffer to be filled slightly in excess and determines how
much the processor frequency has to be increased to meet the deadline
requirements of the periodic task.
Using the system model proposed in the previous chapter, we first state the
problem. Then we explain in detail the schedulability analysis technique
proposed in this work. In the sections that follow, we present results that
clearly show the advantages of the proposed techniques.
4.1 Motivation
Today, most personal mobile devices (e.g., cell phones and PDAs) are multimedia-
enabled and support a variety of concurrently running applications such as
audio/video players, word processors and web browsers. Media-processing
applications are often computationally expensive and most of these devices
typically have 100 – 400 MHz processors. As a result, the user-perceived
application response times are often poor when multiple applications are
concurrently fired. In this chapter we show that by using application-specific
dynamic buffering techniques, the workload of these applications can be suit-
ably “shaped” to fit the available processor bandwidth. Our techniques are
analogous to traffic shaping which is widely used in communication networks
to optimally utilize network bandwidth. Such shaping techniques have re-
cently attracted a lot of attention in the context of embedded systems design
(e.g., for dynamic voltage scaling). However, they have not been exploited
for enhanced schedulability of multiple applications, as we do in this
chapter (Raman and Chakraborty, 2006).
The last few years have seen a huge proliferation of personal mobile de-
vices such as PDAs, cell phones, portable audio/video players and gaming
devices (Nytimes, 2005, 2007; Nieh and Lam, 2003). Many of these devices
now support multiple functionalities, have an operating system running on
them and allow multiple applications to be run concurrently. For example,
a common use scenario for a PDA is to play an audio or a video clip, and at
the same time use a Datebook application. These applications often exhibit
poor response times when run on the 100 – 400 MHz processors found in
most personal mobile devices. Clearly, this problem will become more acute
as user demand grows: more applications are run concurrently, applications
are becoming increasingly rich in graphics/animation, and audio/video
content is increasingly used on mobile devices.
Although a lot of existing work from the processor scheduling domain
(especially in the context of multimedia applications) addresses this prob-
lem (Goyal et al., 1996; Yuan and Nahrstedt, 2003; Im et al., 2001; Brandt
et al., 2003; Banachowski et al., 2004), in this work we look at it from a
different perspective. We show that by using dynamic buffering techniques,
the workload of different concurrently running applications can be appropri-
ately shaped to fit the available processor bandwidth. Although buffering is
a well-established technique to smooth out the variabilities associated with
continuous media streams, it has predominantly been used in the context of
streaming media (i.e. when the audio/video data is downloaded from the
network before being processed) (Thiran et al., 2001; Ramjee et al., 1998a).
The goal here is to smooth out the burstiness in the arrival process of a
stream due to different network conditions.
As mentioned in the previous chapter, however, much less importance is
attached to buffering in the case of stored media. Here, the playout delay
used is typically very small and is chosen in an ad hoc fashion, without taking
into account the characteristics of the application or the media stream being
processed. More importantly, the playout delay is chosen independently of
the other tasks running on the processor. Further, current processor
scheduling techniques do not exploit buffering as a means for shaping the
workload associated with the tasks being scheduled. We show that by
appropriately addressing these three issues, the schedulability of multiple
concurrently running tasks can be substantially enhanced.
Figure 4.1: Setup for dynamic workload shaping.
4.1.1 Our Contribution
Towards this, we present a mathematical model that can be used to an-
alyze applications processing continuous media streams, and setups where
such applications concurrently run with other applications processing more
static input data. More specifically, we show that the choice of the playout
delay has a significant impact on the workload generated by such applica-
tions. Given the available processor bandwidth, our model can be used to
calculate the minimum playout delay using which an application may be sup-
ported by this available bandwidth. We also show that the buffer fill level
can be dynamically changed at run time to periodically free up sufficient pro-
cessor bandwidth to support other concurrently running applications. The
parameters of such dynamic scheduling and buffer management policies are
application specific and can be calculated using our model. Determining
these parameters by trial and error and repeatedly validating them using
simulation is expensive in terms of the simulation time involved and is also
highly error prone (Gries, 2004; Lahiri et al., 2001a). On the other hand, our
model requires each application to be simulated only once, and in isolation,
using representative input data (or audio/video clips) to determine the val-
ues of certain parameters characterizing the application. These parameters
then serve as input to our mathematical model and are used to determine
scheduling and buffer management policies when multiple applications run
concurrently. Simulating the execution of each application in isolation is
considerably easier than a full system simulation where multiple applications
run together and get preempted by the scheduler.
4.1.2 Reference works
The basic idea we exploit in this chapter is similar to traffic shaping, which
is a well-established technique in the communication networks domain (Le
Boudec, 2002; Elwalid and Mitra, 1997; Georgiadis et al., 1996). A traffic
shaper is used to buffer network packets from an incoming packet stream
and delay them so that the outgoing stream from the shaper conforms to
a pre-defined traffic specification. Such shaping is used to smooth out the
burstiness in the packet stream, thereby preventing such burstiness from
accumulating as the stream passes from one network node to the next. This
results in improved network bandwidth utilization and reduces global buffer
requirements. We, on the other hand, exploit buffering to smooth out the
variabilities in the execution requirements of media processing applications.
The aim is to shape the workload arising from such applications, so that
it can be served at an “average” rate by appropriately choosing the initial
playout delay or buffering time. Further, we dynamically change the amount
of buffering associated with a running application to free up sufficient pro-
cessor bandwidth in response to new tasks fired by a user. The novelty of
our work stems from the analogy we establish between our representation of
the variability in the execution requirements of an application, and existing
models for quantifying burstiness in network traffic.
Recently, shaping techniques have also attracted a lot of interest within
the embedded systems domain (Cai and Lu, 2005; Chiasserini and Rao, 2001;
Heithecker and Ernst, 2005; Hu and Lu, 2005; Poellabauer and Schwan, 2004;
Manolache et al., 2006; Wandeler et al., 2006). Shaping has been used as a
means for aggregating the workload of media-processing applications (e.g.,
video decoders) to create idle periods. Such idle periods are then exploited
to shut off a processor or scale down its operating voltage/frequency to
save power (see Cai and Lu (2005); Chiasserini and Rao (2001); Hu and Lu
(2005); Poellabauer and Schwan (2004) for applications of this scheme in
different setups). Very recently, shaping has also been used in the
particular context of designing multi-processor System-on-Chip (SoC)
architectures. It has been shown that shaping on-chip traffic leads to
reduced on-chip global buffer requirements and improves overall system
performance and predictability (Heithecker and Ernst, 2005; Wandeler et al.,
2006). Similar techniques have also been shown to be useful when applied to
Networks-on-Chip (NoC) architectures, where again they result in improved
worst-case response times and global buffer requirements (which in turn lead
to reduced chip area and power consumption) (Manolache et al., 2006). Our
work in this chapter follows this line of research and specifically shows that
application-specific buffering can be used to shape the workload of real-time
applications processing continuous media streams, and such shaping can be
exploited to significantly improve the overall schedulability of a pre-defined
set of applications.
Figure 4.2: Dynamically controlling the playout buffer fill level as two
applications are being scheduled.
4.2 Illustrative Example
Figure 4.1 is a high-level view of our setup. It shows a processor running
two tasks, an MPEG decoder and a Datebook application that is commonly
supported on PDAs. The input to the decoder is a compressed video stream
that arrives at the input buffer b. This buffer is read by the decoder and
the decoded video stream is written into the playout buffer B, which in
turn is read by the playout (or display) device at a pre-defined rate (e.g., 25
frames/sec). Throughout this chapter we assume that the tasks running on
the processor are scheduled using some proportional-share scheduler (Duda
and Cheriton, 1999; Goyal et al., 1996) which allocates a specified amount of
CPU bandwidth to each task. Our goal in this chapter is to show how such
a scheduler (see footnote 1) can exploit dynamic buffering to enhance the schedulability of
a set of applications. Towards this, we modify the scheduler to dynamically
change the processor share allocated to each task at run time. These shares,
as we show in this chapter, are determined by the fill-level of the playout
buffer B, the execution demands of the running tasks (which we characterize
using a mathematical model in Section 4.3), and the set of tasks in the ready-
queue of the scheduler (see Figure 4.1).
To illustrate our scheme, let us consider the scenario shown in Figure 4.2.
We assume that our processor has an effective bandwidth of fmax MHz
available for running user applications (i.e. after supporting operating
system tasks). At time t = 0, the user triggers the MPEG decoder application,
which is started immediately. However, the playout from the buffer B only
starts at t = tpd, which is the playout delay. In other words, the playout or
display device starts reading the buffer B only from time t = tpd onwards.
With tpd as the playout delay, the decoder occupies a processor bandwidth
of fav MHz. At time t = t′ the user triggers the Datebook application,
which requires a processor bandwidth of fdb. However, it turns out that
fmax − fav < fdb and hence the Datebook task cannot be executed
(immediately). In response to this, the scheduler increases the processor
bandwidth allocated to the decoder task from fav to fhi (where fhi ≤ fmax).
With this bandwidth, the average decoding rate is higher than the
consumption rate of the output device from the playout buffer B. This causes
the fill-level of B to increase until time t = t′ + tfill, when the
bandwidth allocated to the decoder is reduced to flo (< fav). flo is chosen
such that the freed bandwidth (i.e. fmax − flo) is sufficient to support the
Datebook task. Hence, this task starts at t = t′ + tfill and continues till
t = t′ + tfill + tdrain. During the time interval
[t′ + tfill, t′ + tfill + tdrain) the fill-level of B continuously
decreases because the bandwidth flo is not sufficient to sustain the playout
rate demanded by the output device. t′ + tfill + tdrain is the earliest time
(see footnote 2) at which B is fully drained. At this time, the processor
bandwidth allocated to the decoder is again increased to fhi for the next
tfill time units and this cycle is repeated till the Datebook task is
terminated by the user.
Note that this scheme will work if tfill is relatively small compared to
tdrain. For many applications such as Datebook, which involves interactive
text processing and input from a user, this is indeed the case and the (small)
periodic time intervals during which the task is suspended are tolerable.
Footnote 1: Our scheme can be applied to other scheduling disciplines as
well. But for simplicity of exposition, we restrict ourselves to
proportional-share schedulers in this chapter, especially because such
schedulers seem to be the most natural choice for the proposed scheme.
Footnote 2: The rate at which the decoded video is written into the buffer B
is variable, because of the data-dependent variability in the decoding time
of each video frame/macroblock. Hence, it might take longer for B to become
empty.
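The fill/drain cycle of this example can be approximated with a simple fluid
model. The sketch below is only an illustration: it assumes a constant
average decoding cost per stream object (thereby ignoring the data-dependent
variability mentioned in footnote 2), and the function name and all numbers
are hypothetical.

```python
def drain_time(f_hi, f_lo, t_fill, cycles_per_object, playout_rate):
    """Longest drain phase (s) the buffer sustains after filling for t_fill s.

    Assumption: every stream object costs cycles_per_object cycles, so the
    decoder produces f / cycles_per_object objects per second.
    """
    fill_rate = f_hi / cycles_per_object - playout_rate   # gained at f_hi
    drain_rate = playout_rate - f_lo / cycles_per_object  # lost at f_lo
    if fill_rate <= 0 or drain_rate <= 0:
        raise ValueError("need f_hi above and f_lo below the average demand")
    return fill_rate * t_fill / drain_rate


# Hypothetical numbers: 10e6 cycles per object and 25 objects/s give an
# average demand of 250 MHz; the decoder alternates between 400 and 200 MHz.
t_drain = drain_time(f_hi=400e6, f_lo=200e6, t_fill=0.2,
                     cycles_per_object=10e6, playout_rate=25.0)
print(t_drain)  # roughly 0.6 s of drain for 0.2 s of fill
```

In this toy setting a fill phase of 0.2 s buys a drain phase of about 0.6 s,
i.e. tfill is small compared to tdrain, which is the regime in which the
scheme is argued to be tolerable for interactive tasks.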
4.2.1 Problem Statement
To formally analyze this setup, we model the Datebook application as a
periodic task, with an execution requirement, period and deadline. Such
a model is general enough to capture a wide variety of applications. The
decoder application, on the other hand, processes a continuous media stream
and is modeled differently. The performance constraint that needs to be
satisfied in this case is that the output device should always be able to read
a decoded video frame from the playout buffer B. Given the playout rate of
the output device, this translates to the constraint that the playout buffer
should never underflow.
Hence, our problem reduces to a schedulability analysis problem where a
system designer has to estimate whether the media-processing task can satisfy
its buffer underflow constraint and the periodic task its deadline constraint.
However, in contrast to classical schedulability analysis problems, here it
leads to the following question:
Given the execution demands of the two tasks (which we for-
mally model in Section 4.3) does there exist tfill, tdrain and flo,
such that the buffer underflow and the deadline constraints are
satisfied?
For setups where the answer to this question is “yes”, the designer would
additionally want to know all possible values of tfill, tdrain and flo which lead
to a schedulable system. In the following sections we show how to address
this problem.
In summary, we would like to emphasize that our scheme attempts to
shape (lower-bound) the output from the playout buffer B to closely match
the consumption pattern of the output (display) device. This in turn shapes
the workload of the media-processing application to create slacks which are
used to accommodate a periodic task.
4.3 Dynamic Buffering
In this section we use the model proposed above to develop the schedulability
test outlined in Section 4.2. Recall from our example in Section 4.2 that such
a schedulability test amounts to computing feasible values of the parameters
tfill, tdrain and flo.
We first formulate the two constraints outlined in Section 4.2, i.e. (i) the
playout buffer associated with the media-processing task should not under-
flow, and (ii) that the periodic task should meet its deadline. Recall that our
scheduling strategy involves a cyclic repetition of two stages:
Stage 1. Once the periodic task is triggered (say at time t′), the processor band-
width allocated to the media-processing task is increased to fhi to fill
up the playout buffer B. For simplicity, we assume that fhi is equal to
fmax (which is the available bandwidth for running user applications).
During this stage, the periodic task is suspended (if fhi < fmax then it
runs with a processor bandwidth of fmax−fhi). This stage lasts during
the time interval [t′, t′ + tfill).
Stage 2. Now the media-processing task receives a bandwidth of flo and the
remaining bandwidth of fmax − flo is allocated to the periodic task.
This stage lasts during the interval [t′ + tfill, t′ + tfill + tdrain).
To satisfy the buffer underflow constraint, the fill level of B should be
greater than or equal to zero at time t′ + tfill + tdrain. In order to
mathematically formulate this constraint, let βhi be the service provided to
the media-processing task when it is allocated the processor bandwidth fhi:
\[
\beta_{hi}(t) = (\gamma^u)^{-1}(f_{hi} \cdot t)
\]
where (γu)^{-1} is the pseudo-inverse of γu; it takes as an argument a
certain number of processor cycles and returns the minimum number of stream
objects that are guaranteed to be processed using these cycles. Similarly,
let βlo denote the service corresponding to the processor bandwidth flo.
The change in the fill-level of B over the time interval
[t′, t′ + tfill + tdrain) can now be lower bounded by (i.e. the fill-level
of B cannot decrease by more than this amount):
\[
\bigl( (\alpha^l \otimes \beta_{hi})(t_{fill}) - c \cdot t_{fill} \bigr)
+ \bigl( (\alpha^l \otimes \beta_{lo})(t_{drain}) - c \cdot t_{drain} \bigr)
\tag{4.1}
\]
With a slight abuse of notation, let us refer to the above expression as
Eq. (4.1). The first term of this sum captures the change in fill-level over
the interval [t′, t′ + tfill) and the second term corresponds to the interval
[t′ + tfill, t′ + tfill + tdrain). Clearly, if Eq. (4.1) is greater than or
equal to zero, then the playout buffer underflow constraint is satisfied.
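To make the structure of Eq. (4.1) concrete, the following sketch evaluates
it on a unit time grid using a direct min-plus convolution. It assumes that
the first term of the sum mirrors the second one with βhi and tfill in place
of βlo and tdrain, as the description of the two terms suggests. The curves
are hypothetical linear functions chosen only to keep the arithmetic easy to
follow; they are not measured arrival or service curves.

```python
def min_plus_conv(f, g, n):
    """(f ⊗ g)(n) = min over 0 <= s <= n of f[s] + g[n - s] (unit grid)."""
    return min(f[s] + g[n - s] for s in range(n + 1))


def fill_level_change(alpha_l, beta_hi, beta_lo, c, n_fill, n_drain):
    """Lower bound on the fill-level change over one fill/drain cycle."""
    fill_term = min_plus_conv(alpha_l, beta_hi, n_fill) - c * n_fill
    drain_term = min_plus_conv(alpha_l, beta_lo, n_drain) - c * n_drain
    return fill_term + drain_term


# Hypothetical curves, in stream objects per time step.
STEPS = 10
alpha_l = [10 * t for t in range(STEPS + 1)]  # input is never the bottleneck
beta_hi = [5 * t for t in range(STEPS + 1)]   # service at bandwidth f_hi
beta_lo = [1 * t for t in range(STEPS + 1)]   # service at bandwidth f_lo
C_PER_STEP = 3                                # playout consumption rate

print(fill_level_change(alpha_l, beta_hi, beta_lo, C_PER_STEP, 4, 3))  # 2
print(fill_level_change(alpha_l, beta_hi, beta_lo, C_PER_STEP, 3, 4))  # -2
```

With these toy curves, filling for four steps and draining for three leaves
a non-negative bound (no underflow), whereas filling for three steps and
draining for four does not.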
Next, we formulate the second constraint, i.e. the periodic task should
meet its deadline. Let the execution requirement, period and deadline of this
task be e, p and d, with p ≥ d. Clearly, this task is schedulable if:
tfill + tdrain ≤ p (4.2)
and
(fmax − flo) · tdrain ≥ e (4.3)
Hence, our task set is schedulable if there exist tfill, tdrain and flo for
which Eq. (4.1) ≥ 0 and Eqs. (4.2) and (4.3) are satisfied. Unfortunately,
this system of constraints cannot be solved to obtain a closed-form solution
for tfill, tdrain and flo. Hence, we compute Eq. (4.1) for all possible
values of tfill, tdrain and βlo and then identify the combinations of
(tfill, tdrain, βlo) for which Eq. (4.1) ≥ 0 and Eqs. (4.2) and (4.3) are
satisfied. Such combinations then constitute
schedulable solutions.
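The exhaustive search described above can be sketched as a plain grid
enumeration. In the sketch below, the function that evaluates Eq. (4.1) is
passed in as a parameter; the stand-in used here is a fluid approximation
based on an assumed average decoder demand, scaled to cycles, and is not the
curve-based expression used in this chapter. All names and numbers are
hypothetical. Time is in ms and bandwidth in cycles per ms.

```python
def schedulable_points(eq41, f_max, e, p, t_values, f_lo_values):
    """Enumerate (t_fill, t_drain, f_lo) meeting Eq. (4.1) >= 0, (4.2), (4.3)."""
    points = []
    for f_lo in f_lo_values:
        for t_fill in t_values:
            for t_drain in t_values:
                no_underflow = eq41(t_fill, t_drain, f_lo) >= 0
                within_period = t_fill + t_drain <= p           # Eq. (4.2)
                enough_cycles = (f_max - f_lo) * t_drain >= e   # Eq. (4.3)
                if no_underflow and within_period and enough_cycles:
                    points.append((t_fill, t_drain, f_lo))
    return points


F_MAX = 400_000  # cycles per ms (400 MHz), hypothetical
F_AV = 250_000   # assumed average decoder demand in cycles per ms


def fluid_eq41(t_fill, t_drain, f_lo):
    """Stand-in for Eq. (4.1): surplus cycles gained minus cycles lost."""
    return (F_MAX - F_AV) * t_fill + (f_lo - F_AV) * t_drain


points = schedulable_points(fluid_eq41, F_MAX, e=40_000_000, p=500,
                            t_values=range(100, 501, 100),
                            f_lo_values=(0, 100_000))
for point in points:
    print(point)
```

For these toy parameters the search returns five schedulable combinations;
a finer grid, or the curve-based evaluation of Eq. (4.1), can be substituted
without changing the enumeration itself.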
4.3.1 Schedulability Analysis
Recall the example described in Section 4.2. To estimate the processor
bandwidth fav occupied by the MPEG-2 decoder for different values of the
playout delay tpd, we use the following equation for frequency estimation.
The minimum processor bandwidth that the scheduler needs to allocate to the
media-processing task to sustain its playout rate is given by:
\[
f_{av} = \min \{ f \mid f t \ge \gamma^u(\beta)(t), \ \forall t \ge 0 \}
\tag{4.4}
\]
In other words, if the scheduler allocates this bandwidth then it can be guar-
anteed that the playout buffer B will never underflow, provided the output
device starts consuming stream objects after a delay of tpd time units.
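The dependence of this bandwidth on the playout delay can be illustrated
with a simplified calculation. The sketch below is not an implementation of
Eq. (4.4) itself. It assumes stored media (input always available), a
decoder that starts at t = 0, and a display that consumes object k at time
tpd + (k − 1)/c, and it bounds the cycles of the first k objects by a
workload curve computed from a short hypothetical trace. The function names
and the trace are assumptions.

```python
def gamma_u(cycles, k):
    """Upper workload curve: max total cycles of any k consecutive objects."""
    return max(sum(cycles[i:i + k]) for i in range(len(cycles) - k + 1))


def min_bandwidth(cycles, playout_rate, t_pd):
    """Smallest constant f (cycles/s) that has object k ready at its playout
    instant t_pd + (k - 1) / playout_rate, for every k (requires t_pd > 0)."""
    return max(
        gamma_u(cycles, k) / (t_pd + (k - 1) / playout_rate)
        for k in range(1, len(cycles) + 1)
    )


# Hypothetical per-object cycle demands: expensive objects between cheap ones.
trace = [4_000_000, 1_000_000, 1_000_000, 4_000_000, 1_000_000, 1_000_000]

for t_pd in (0.04, 0.08, 0.2):
    print(t_pd, min_bandwidth(trace, playout_rate=25.0, t_pd=t_pd) / 1e6, "MHz")
```

For this short trace the required bandwidth falls as the playout delay
grows, because the expensive objects can be decoded during the initial
buffering period. Since the trace is finite, the value keeps falling with
very long delays instead of settling at an average.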
Figure 4.3 shows how fav decreases with increasing tpd, starting with
tpd = 10 ms. With no other tasks running on the processor, tpd is typically
chosen to be a relatively small value. However, Figure 4.3 shows the
potential for decreasing the bandwidth allocated to the decoder by
increasing the fill-level of the playout buffer B.
Figure 4.3: Buffering time versus workload for a low bit rate and low
resolution video stream.
Let us now consider a setup where the MPEG-2 decoder concurrently runs
with a periodic task on a 510 MHz processor (bandwidth available for user
applications). The periodic task is characterized by a period of 500 ms, which
is also equal to its deadline, and has an execution requirement of
100 × 10^6 cycles. Hence, fhi = 510 MHz. We would like to estimate if this
system is schedulable, i.e. whether there exist feasible tfill, tdrain and
flo. Note from Figure 4.3 that with a sufficiently long playout delay, the
MPEG-2 decoder occupies approximately 500 MHz of the processor bandwidth.
Since the periodic task has a period (and deadline) of 500 ms and an
execution requirement of 100 × 10^6 cycles, it requires 200 × 10^6 processor
cycles every second. In other words, it occupies a processor bandwidth of
200 MHz. Hence, the two tasks clearly cannot be run concurrently. However,
as we show below, they may still be scheduled using our dynamic buffering
scheme.
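The arithmetic behind this observation can be checked with a few lines of
Python. The function name is hypothetical; the numbers are those quoted in
the paragraph above.

```python
def periodic_bandwidth(e_cycles, period_s):
    """Average processor bandwidth (cycles/s) required by a periodic task."""
    return e_cycles / period_s


f_max = 510e6      # bandwidth available for user applications
f_decoder = 500e6  # approximate decoder demand with a long playout delay
f_periodic = periodic_bandwidth(100e6, 0.5)

print(f_periodic / 1e6)                 # 200.0 (MHz)
print(f_decoder + f_periodic <= f_max)  # False: static sharing does not fit
```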
Figure 4.4 shows the value of Eq. (4.1) for this setup for different values
of tfill and tdrain with flo set to zero. Note that the vertical axis (z-axis) of
this plot corresponds to the number of (excess) decompressed macroblocks
in the playout buffer after one cycle of tfill + tdrain. The region of this plot
that is above the plane z = 0 corresponds to values of tfill and tdrain for
which the playout buffer does not underflow. This region is labeled as the
feasible region. The region below z = 0 corresponds to values of tfill and
tdrain for which the playout buffer underflows (i.e. its fill level decreases after
the tfill+ tdrain cycle). This region is labeled as the infeasible region. Clearly,
we are interested in the feasible region. However, this region (or a subset
of it) also has to satisfy the constraints given by Eqns. (4.2) and (4.3) for
the periodic task to be schedulable. The subset of the feasible region that
satisfies these two equations is labeled as the schedulable region.
Figure 4.5 shows the schedulable regions for three different values of flo.
The lower-most surface corresponds to flo = 0 MHz, the middle surface cor-
responds to flo = 50 MHz and the topmost corresponds to flo = 100 MHz.
All other parameters, such as fhi and those describing the periodic task,
remain the same as before. Note from Figure 4.5 that the system is
schedulable for all three of these values of flo. However, the schedulable
values of tfill and tdrain change with different values of flo. It may be
noted that different values of tfill and tdrain are also associated with
different scheduling overheads and buffer requirements. Our model offers the
possibility of quickly visualizing the design space for selecting the
appropriate scheduler parameters.
Figure 4.5: Schedulable regions for different values of flo.
We next consider the case where the available processor bandwidth is
300 MHz. Hence, fhi is now equal to 300 MHz. The parameters of the
periodic task remain unchanged and flo = 0 as in the previous case. The
feasible and infeasible regions for this case are shown in Figure 4.6. There
is no schedulable region in this scenario because the maximum processor
bandwidth available for both decoding the media stream and executing the
periodic task is not sufficient. The feasible region corresponds to points
where the buffer does not underflow. However, the remaining processor
capacity is not sufficient to run the periodic task (i.e. Eq. (4.3) is not
satisfied).
Figure 4.6: A non-schedulable system.
In Figure 4.7, we modify the period and the execution requirement of
the periodic task to 600 ms and 80 × 10^6 cycles respectively (the previous
values were 500 ms and 100 × 10^6 cycles). Further, we set flo = 0 MHz
and fhi = 510 MHz. Compared to the setup shown in Figure 4.4, here the
periodic task has a larger period and a lower execution requirement and as a
result exerts a lower demand on the processor. Hence, it may be noted from
Figure 4.7 that compared to Figure 4.4, the area of the schedulable region
has increased.
Figure 4.7: Schedulable regions of a periodic task (p = 600 ms,
e = 80 × 10^6 cycles).
Finally, Figure 4.8 shows the schedulable region for a setup consisting
of a periodic task (with a period of 500 ms and execution requirement of
10 × 10^6 cycles) and an MPEG-2 decoder decoding a low bit rate and low
resolution video (resolution of 352 × 240 pixels and bit rate of 1500 kbps).
The display rate is maintained at 30 fps as before. Here, fhi, the maximum
processor frequency, is set to 120 MHz. It may be noted from Figure 4.3
that for this class of clips, as the playout delay is increased, the processor
bandwidth required for decoding a clip reduces and finally stabilizes at
around 120 MHz. Hence, this is also the average processor bandwidth (fav)
required to decode any clip from this class. As expected, this frequency is
lower than what is required to decode any high bit rate and high resolution
clip.
Clearly, the total available processor bandwidth of 120 MHz – when used
in a straightforward manner – is not sufficient for supporting both the video
decoding application and the periodic task even when video streams are
buffered over sufficiently long time intervals. This is because the video de-
coder occupies the entire available processor bandwidth. However, as shown
in Figure 4.8, our dynamic buffering/shaping technique helps in accommodat-
ing the periodic task, which would not have been possible to run otherwise.
Figure 4.8: Schedulable region for a setup consisting of a periodic task along
with an MPEG-2 decoder decoding a low bit rate and low resolution video
stream.
4.4 Discussion
In this section, we clarify some of the modeling and experimentation deci-
sions related to our proposed scheme. In particular, we discuss some of the
restrictions imposed by the scheme (e.g., applications to be added to the mo-
bile device should be pre-profiled to determine their execution requirements
and real-time constraints). Further, we also highlight some of the overheads
associated with this scheme (e.g., the additional buffers required) and clarify
why such overheads might be tolerable.
Modeling/Technique
1. Buffer space required to store extra macroblocks belonging to the decoded
video stream: The proposed scheme relies on dynamically changing the
amount of buffering associated with a media stream. In particular, as
illustrated in the previous sections, a video stream is buffered to create
the necessary slack for supporting a periodic task. This clearly requires
extra buffer space compared to an implementation where the video
stream is played out with a small, constant playout delay. However,
the amount of extra buffer space required is not significant, as explained
below.
The execution demand of the periodic task is assumed to be several
times lower than the processing demand of the video decoding task.
As mentioned in Section 4.2, the proposed scheme is only applicable to
setups where the processing demand of the periodic task is relatively
low. Note from the example in Section 4.2 that prior to executing the
periodic task, excess macroblocks are produced to ensure that the play-
out buffer never underflows. However, since the periodic task runs only
for small intervals of time before the video decoding task is resumed,
combined with the fact that the execution demand of the periodic task
is assumed to be relatively low, the additional buffer space required to
store the excess macroblocks is not substantial. Further, recall that
given a problem specification, our model is used to identify the values
of the parameters tfill, tdrain and flo that lead to a schedulable system.
Note from Section 4.3.1 that given any particular setup, there can be
several (or no) values of these parameters that lead to a schedulable
system, with each set of values being associated with a specific buffer
requirement. Our proposed model allows a system designer to iden-
tify the parameter values which lead to a schedulable system with the
minimum buffer requirement.
2. Change in processor frequency during context switch (between the media
processing and the periodic task): We would like to emphasize that the
processor is always run at a constant frequency. It is only the processor
share – available to the video decoding and the periodic task – that is
changed from time to time based on the chosen values of tfill, tdrain and
flo.
3. Other scheduling policies and task sets: Our proposed scheme can be
applied on top of any scheduling policy, in a hierarchical manner. For
simplicity of illustration, and because it is the most natural choice,
we have explained this scheme using a proportional-share scheduler.
Further, in our experiments we only considered setups with one media
processing task and one periodic task. It is easy to see that the scheme
is also applicable to setups consisting of multiple such tasks. The
resulting analysis, however, becomes computationally more expensive.
4. Adding new tasks/applications to an existing system: Determining the
parameters tfill, tdrain and flo used by the proposed dynamic buffering
scheme clearly requires the execution requirements of the tasks and
their real-time constraints as input. Hence, the tasks to be scheduled
need to be profiled before they are added to the mobile device. This
clearly imposes certain restrictions on how new applications may be
added by the user to a pre-designed or pre-configured system.
The proposed technique assumes that the applications to be used in
the device are known during design time. The user has the option of
installing and using any of the applications that the vendors provide
(which are assumed to have been profiled by the vendors to determine
their workload and timing constraints). This assumption is reasonable
because portable device manufacturers typically provide a controlled
environment so that the OS can be tuned for the applications that
the device could potentially run. This restriction has a lot of benefits
in terms of device usage efficiency and performance of the applications
in contrast to providing a generic OS and full flexibility in terms of
the applications that may be added. The users, however, are still able
to add new applications to a pre-configured system, provided these
applications are supported (or certified) by the vendor who has designed
the operating system and configured the scheduler.
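Item 1 above notes that several parameter sets (tfill, tdrain, flo) may lead to a schedulable system, each with its own buffer requirement, and that the designer can pick the one with the smallest buffer. A minimal sketch of that selection step follows. The two callables stand in for the schedulability test and the buffer-size computation of the model, which are not reproduced here; the toy versions and all numeric values are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# (t_fill in ms, t_drain in ms, f_lo in MHz); the names mirror tfill, tdrain, flo.
Params = Tuple[float, float, float]


def min_buffer_parameters(
    candidates: Iterable[Params],
    is_schedulable: Callable[[Params], bool],
    buffer_needed: Callable[[Params], float],
) -> Optional[Params]:
    """Return the schedulable candidate with the smallest buffer, or None."""
    feasible = [p for p in candidates if is_schedulable(p)]
    if not feasible:
        return None  # no parameter values lead to a schedulable system
    return min(feasible, key=buffer_needed)


# Toy stand-ins (hypothetical): NOT the schedulability test or buffer model of the thesis.
def toy_is_schedulable(p: Params) -> bool:
    t_fill, t_drain, f_lo = p
    return t_drain >= 100 and f_lo <= 50


def toy_buffer_needed(p: Params) -> float:
    t_fill, t_drain, f_lo = p
    return t_fill + 2 * t_drain


grid = [(tf, td, fl) for tf in (200, 400) for td in (50, 100, 150) for fl in (0, 50, 100)]
best = min_buffer_parameters(grid, toy_is_schedulable, toy_buffer_needed)
print(best)
```

An exhaustive scan of a parameter grid is used here purely for simplicity; any search that enumerates the feasible region would serve the same purpose.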
Experimental Evaluation
1. MPEG-2 case-study: Our proposed framework can be used for schedul-
ing any streaming application (e.g., audio and video encoding and de-
coding) provided they offer the possibility of being buffered to free up
sufficient processor bandwidth to schedule other tasks. We chose an
MPEG-2 decoder specifically because this application has been widely
studied in the literature. Moreover, it exhibits a significant variabil-
ity in its execution demand and hence is well-suited for illustrating
our proposed scheme. Lastly, the code for MPEG-2 decoders is also
widely available (we used the reference implementation obtained from
the libmpeg2 (libmpeg2, 2006) library and optimized this source code
for improved decoding speed).
Our proposed technique can be exploited to enhance the schedulability of
media processing applications when concurrently run with other applications
that can be modeled as periodic tasks. The underlying idea was to accurately
model the variability in the processing requirements of media-processing
applications and appropriately use buffering to periodically free up a portion of
the processor's bandwidth to support other tasks. Our results can be useful
for designing and tuning application-specific schedulers for personal mobile
devices which run a restricted set of applications. As a part of future work,
Chapter 5
Buffering with Stochastic Guarantees
The previous two parts of this thesis introduced the smoothing and the
squeezing techniques, which effectively utilize on-chip resources for processing
multimedia streams. The techniques proposed so far are deterministic
approaches that provide throughput guarantees. In this part we take a
different approach and provide stochastic guarantees.
Since multimedia applications are soft real-time, hard guarantees need not
be provided, and probabilistic guarantees can save considerable resources.
In the following sections, we first use a motivating example to describe the
set-up and to show how the playout delay, and with it the buffer space, reduces
when stochastic guarantees are used. Further, we show how the playout buffer
underflow constraint can be characterized using techniques from stochastic
network calculus. In this part, we derive the closed-form equation for the
playout delay parameter.
90 CHAPTER 5. BUFFERING WITH STOCHASTIC GUARANTEES
5.1 Basic Idea
In media players, if the playback starts after an appropriate initial delay (i.e.,
after buffering), then the display constraint (e.g., 30 fps for video) is always
met. A considerable playout delay, however, requires large on-chip buffers.
Designs that require such expensive memories are undesirable in
resource-constrained portable players. We show that when the output
constraints are slightly relaxed, with no perceivable loss in display quality, the
required initial delay reduces substantially (for example, from 160 milliseconds
to 40 milliseconds). Towards this, we propose a mathematical framework with
which the playout delay can be estimated. Perhaps more importantly, unlike
existing mathematical models, our framework allows the acceptable loss to be
specified and provides guarantees that the desired display quality is achieved.
Experimental results obtained with an MPEG player case study indicate that,
with relaxed constraints, the output adheres to the expected quality.
Consequently, the playout delay reduces, providing large savings in playout
buffer size.
5.2 Motivation
In this section, we motivate the need for constructing stochastic constraints
to model buffer underflow.
Recall the media player set-up, with a processing element, an input
memory, and a playout buffer. The number of processing cycles required for
each stream object (e.g., a frame) is variable. So, if the processing element
that executes the application runs at some constant speed, then the number
of stream objects produced per unit time is variable. But the display device
consumes stream objects at a constant rate. Therefore, it is possible that the
playout buffer is empty at times, leading to buffer underflows (i.e., the items
produced are insufficient for consumption). If the buffer should never underflow,
then the display must start after an initial playout delay. We explain this
using Figure 5.1.
Figure 5.1: Processing requirement reduces with large initial delay. The
production rate is high when playout starts after small delay.
Figure 5.1 shows two scenarios in which the output device starts consuming
stream objects after a small and after a large delay (seen as the shift in
consumption). Assume that we are given an input stream for processing (not
shown in the figure). The staircase lines show the cumulative number of stream
objects produced with two different processor speeds. The slope of the dashed
lines gives the production rate. As shown in Figure 5.1, when the playout starts
after a small or near-zero delay, the required production rate is high. This is
because, to guarantee that the buffer never underflows for a given stream, we
must consider the worst-case scenario in which a stream object, or a sequence of
them, takes the maximum number of processor cycles to execute. The processor must
produce at a high rate so that, even in the worst case, enough items are
produced to meet the playout rate. With a considerable initial delay, previous
work in this direction has shown that the processor cycles/second requirement
reduces to the average cycles/second required to process the stream
(refer Chapter 3). While buffering is a powerful technique for reducing pro-
Figure 5.2: Delay value reduces on relaxing buffer constraints. The out-
put stream at times cannot catch-up with consumption and playout buffer
underflows.
Now assume that we are required to find the minimum playout delay such that
the playout buffer is guaranteed never to underflow (i.e., the display quality
is always met). In this chapter, we propose to relax the output constraints and
then determine the minimum playout delay, so as to reduce the required playout
buffer size. Consider that we are given a class of input streams (for example,
streams with the same bit rate and resolution) and that we are required to find
the initial playout delay such that the buffer never underflows. In Figure 5.2,
the output stream ym (ymax in the figure) lower bounds all other streams y1, y2,
and y3. With playout delay d, none of the output streams underflows the playout
buffer, whereas with delay dmin, the streams ym and y3 underflow the playout
buffer. The delay value dmin is chosen such that, for any stream that is bounded
by ym, the playout buffer underflows only at times. The streams bounded by ym
satisfy the underflow constraint if ym satisfies the constraint. As explained,
we require a large delay to guarantee that there is no loss in quality due to
buffer underflow. Instead, we propose to relax the buffer underflow constraint.
The amount of acceptable buffer underflow can be specified using stochastic
constraints.
Studies in the literature (Apteker et al., 1995; Wijesekera et al., 1999;
Wijesekera and Srivastava, 1996) have shown that such slight deadline
misses, or losses in quality, are often not perceivable during display. Assume
that we are also given a bound on the probability that any stream in a given
set of streams underflows. For example, the requirement might be that any
given stream in the class does not underflow 95% of the time (i.e., underflow
can occur only 5% of the time). Our results indicate that, with such relaxed
buffer constraints, the required initial delay reduces significantly, saving
buffer space. In Section 5.3, we illustrate the above concept with experimental
data. While the delay reduction is fairly intuitive, the problem is
to estimate it.
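To make the 95% example concrete, the sketch below counts the fraction of time steps at which a stream underflows for a few playout delays. It assumes a unit time grid, a display that consumes one stream object per step once the delay has elapsed, and a made-up cumulative output trace y; none of these numbers come from the experiments.

```python
def consumed(s: int, d: int, rate: int = 1) -> int:
    """Cumulative stream objects requested by the display at step s, given delay d."""
    return rate * max(0, s - d)


def underflow_fraction(y: list, d: int) -> float:
    """Fraction of steps at which consumption exceeds cumulative production y[s]."""
    misses = sum(1 for s, produced in enumerate(y) if consumed(s, d) > produced)
    return misses / len(y)


# Hypothetical cumulative output: production stalls around steps 3 and 11-12.
y = [0, 1, 2, 2, 3, 5, 6, 7, 8, 9, 10, 10, 10, 13, 14, 15, 16, 17, 18, 19]

fractions = {d: underflow_fraction(y, d) for d in (0, 1, 2)}
print(fractions)
```

With this trace, a delay of one step already keeps underflow within 5% of the steps, whereas ruling out underflow entirely needs a delay of two steps.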
Towards this, we propose a mathematical framework with which, for
a given set of streams, the minimum playout delay required is estimated
such that the desired display quality is achieved. In Section 5.4, we formally
define the probabilistic function and derive the minimum playout delay. To the
best of our knowledge, our framework is the first to provide an analysis that
allows quality loss whilst providing guarantees for designing multimedia systems.
Present analytical modeling techniques are inadequate for designing
multimedia-processing system-on-chips, either because the models do not
provide any guarantee on output quality, or because their analyses are for
worst-case scenarios and therefore often lead to cost-inefficient designs.
To illustrate these aspects, we report in this work how the analysis has
been applied to a real case study, namely, the MPEG player. The accuracy
of the results obtained is cross-validated with simulator results. The
experimental set-up and results are detailed in Section 5.5.
5.3 Illustrative Example
In this section, using a concrete example, we show and explain the relation
between playout delay, buffer size, and buffer underflow.
We took a video stream of resolution 352 × 240 that was MPEG-2 compressed
to a bit rate of 1.5 Mbps, to be displayed at 30 frames per second.
A complete simulation set-up was built (explained in detail in Section 5.5)
and the video stream was decoded and played out after different initial delay
values. During each experiment, the maximum playout buffer underflow and the
playout buffer size required (such that there is no overflow) were noted. Two
important observations were made as the playout delay was increased (see
Figure 5.3):
1. buffer size increased, and
2. maximum buffer underflow decreased.
Figure 5.3: Correlation among playout delay, buffer size, and buffer underflow
(horizontal axis: playout delay in milliseconds). Increase in playout delay (and
buffer size) decreases buffer underflow.
First, an increase in playout delay increases the maximum playout buffer
size, because the variability in the buffer fill level, which often determines
the maximum playout buffer fill level, is smaller than the initial buffer fill
level built up during the delay.
Second, an increase in the playout delay decreases the playout buffer
underflow (i.e., the number of stream objects by which the buffer underflows
decreases substantially). The buffer underflows because of variability in the
arrival of stream objects and in their execution. Hence, with a substantial
initial delay, the variability in the arrival and the execution of the items
can be smoothed out. Consequently, the underflow in the buffer also reduces.
Figure 5.4 shows that with an increase in delay the variability reduces.
Figure 5.4: Playout buffer underflow over time, for d = 80 ms and d = 100 ms.
The variability in underflow substantially reduces with a large increase in
playout delay.
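The two observations of this section can be reproduced qualitatively with a small simulation. This is only a sketch under simplifying assumptions (unit time steps, one stream object consumed per step after the delay, a made-up output trace); it is not the SystemC set-up of Section 5.5.

```python
def playout_stats(y, d, rate=1):
    """Return (max buffer fill, max underflow) for cumulative output y and delay d."""
    max_fill = 0
    max_underflow = 0
    for s, produced in enumerate(y):
        # Buffer level = produced so far minus requested so far; negative means underflow.
        level = produced - rate * max(0, s - d)
        max_fill = max(max_fill, level)
        max_underflow = max(max_underflow, -level)
    return max_fill, max_underflow


# Hypothetical cumulative number of decoded stream objects per time step.
y = [0, 1, 2, 2, 3, 5, 6, 7, 8, 9, 10, 10, 10, 13, 14, 15, 16, 17, 18, 19]

stats = {d: playout_stats(y, d) for d in range(4)}
for d, (fill, under) in stats.items():
    print(f"delay={d}: max fill={fill}, max underflow={under}")
```

As the delay grows from 0 to 3 steps, the maximum fill level rises from 0 to 3 objects while the maximum underflow falls from 2 objects to none, which mirrors the trend described for Figure 5.3.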
5.4 Minimizing Buffering
In this section, we derive the minimum playout delay required such that the
output rate is met with acceptable loss. We show that with relaxed con-
straints on playout, the required initial playout delay substantially reduces,
consequently, the playout buffer size is minimized.
5.4.1 Buffer Underflow
We first state the constraint to be satisfied for continuous playback of the
stream such that the actual output rate is met. If the playout buffer never
underflows, then the output stream is played out with no loss in quality.
That is, over any time interval, the playout buffer always holds at least as
many stream objects as are consumed by the output device.
Experimental results show that the initial playout delay increases the
maximum playout buffer size required. The playout delay, however, can be
reduced if we allow the playout buffer to underflow at times. As detailed in
Section 5.2, we slightly relax the playout constraint so that we can choose a
smaller initial playout delay value. A playout buffer underflow of more than b
stream objects can be written as,
C(s, d) − y(s) > b, ∀s ≥ 0, ∀b ≥ 0. (5.1)
Equation 5.1 captures buffer underflow that is exactly equal to zero or
greater than zero. Note that if the buffer underflow is negative, then there are
excess items in the buffer, that is, there is no underflow. Equation 5.1
does not capture this scenario, in which the buffer underflow is negative (i.e.,
there are more stream objects in the playout buffer than required). Also, note
that Equation 5.1 refers to a concrete output stream y. Now, we compute the
lower bound for this output stream y using known parameters.
The upper and lower bounds on the number of items arriving and the
number of items that are guaranteed to be processed over any time interval
are given. These bounds are represented as arrival and processing curves (i.e.
α and β respectively). Using these known parameters, we compute the lower
bound of the output stream as follows (here ym is the minimum number of
stream objects that are produced in the interval [0, t]),
ym(t) ≥ (α^l ⊗ β^l)(∆), ∀t ≥ 0, ∀∆ ≥ 0. (5.2)
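The operator ⊗ in Equation 5.2 is the min-plus convolution, (f ⊗ g)(∆) = min over 0 ≤ λ ≤ ∆ of f(λ) + g(∆ − λ). A minimal sketch on a unit time grid is given below; the two curves are hypothetical and are assumed to be expressed in the same unit (stream objects), which may differ from how the curves are obtained in the earlier chapters.

```python
def min_plus_conv(f, g):
    """Min-plus convolution of two curves sampled at integer interval lengths."""
    n = min(len(f), len(g))
    return [min(f[s] + g[t - s] for s in range(t + 1)) for t in range(n)]


# Hypothetical lower arrival curve: at least (delta - 1) objects arrive in any
# interval of length delta >= 1.
alpha_l = [0, 0, 1, 2, 3, 4]
# Hypothetical lower service curve with the same shape.
beta_l = [0, 0, 1, 2, 3, 4]

y_lower = min_plus_conv(alpha_l, beta_l)
print(y_lower)
```

In this toy case the two one-step latencies add up, so the lower bound on the output only starts rising after two steps.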
Now, if we compute the buffer underflow (Equation 5.1) using the lower
bound on the output stream (Equation 5.2) then the following condition
holds,
C(s, d)− ym(s) ≥ C(s, d)− y(s), 0 ≤ s ≤ t. (5.3)
From Equation 5.3 and Equation 5.1, we write,
C(s, d)− ym(s) > b, ∀b ≥ 0, 0 ≤ s ≤ t. (5.4)
Let the buffer underflow event be a random event. For example, if the
processor cycles per second allocated to the video decoding task are
probabilistic (due to another task consuming a random number of cycles), then
the buffer underflow event is also random. The buffer underflow event can be
random for other reasons as well, including randomness in the arrival of the
stream objects to be processed and in their execution requirements.
The focus of this chapter is to estimate the playout delay in a scenario
where the buffer can underflow. Hence, for simplicity, we have assumed that the
processor cycles allocated to the video decoding task are random and that the
arrival of stream objects and their execution requirements are deterministic.
A potential future direction is to estimate the minimum playout delay
when all of the above-mentioned causes of buffer underflow are treated as
probabilistic. Such an effort would lead to a general stochastic framework
for designing systems-on-chip for multimedia applications.
So, in summary, we have assumed that there are one or more tasks that
run concurrently with video decoding. These concurrent tasks consume a
random number of processor cycles. Hence, the processor capacity available for
video decoding is probabilistic. Let fmax be the maximum processor cycles
per second and let F be the random variable denoting the processor cycles
allocated to the video decoding task. Then the cumulative distribution of
the cycles allocated to the video decoding task is given by P{F = fi}, for all
fi, where 0 ≤ fi ≤ fmax.
Our goal in this work is to estimate the minimum playout delay required
such that the desired stochastic guarantees are met. Given that the buffer
underflow is random, we denote the probability of the event (i.e., the
underflow) as follows,
P{C(s, d) − ym(s) > b}, ∀b ≥ 0, 0 ≤ s ≤ tmax, (5.5)
where tmax is the maximum analysis interval.
In other words, we have defined a finite sample space of events happening
over the time interval [0, tmax]. More precisely, the total sample space consists
of the points in the time interval [0, tmax] where the buffer underflow is
zero or more. Equation 5.5 represents the subset of the sample space
where the buffer underflow is greater than b stream objects.
The designer can specify the tolerable loss in the quality
of the video display using statements such as: Case 1- the buffer should
never underflow more than two consecutive frames in displaying 30 frames
sequentially; Case 2- the buffer should never underflow more than 17 frames
(in total) in displaying 100 frames sequentially. These types of constraints
can be written as follows for Case 1 (in the equation below, g(b) is a stochastic
bounding function, whose properties we elaborate later in this section, and
p represents the number of time units that have elapsed):
P{C(s, d) − ym(s) > b} ≤ g(b), (5.6)
where 0 ≤ s ≤ tmax, s = (2/30) ∗ p, ∀ 0 ≤ p ≤ (tmax ∗ 30)/2.
The time instance s corresponds to the point where 30 frames should
be displayed (in this case, 30 frames take one second). So, when
p = 1, we have s = 2/30, and at that point in time it is tolerable to have a
buffer underflow of at most 2 frames in 30 frames. Here b corresponds to the
stream objects of 2 frames and g(b) = 0. g(b) is a stochastic bounding function,
which bounds the probability that the buffer underflow is more than b
stream objects.
Hence, in Case 1, if b corresponds to the stream objects of 2 frames, then
g(b) = 0 for every s (i.e., at points where 30 frames ideally should be
produced). That is, the probability that the buffer underflows by more than b
stream objects (corresponding to 2 frames) is zero.
Similarly, Case 2 can be written as a variation of Equation 5.6 in terms of
the definitions of b, s, and p. For generality, that is, for all b ≥ 0,
we consider that g(b) gives the upper bound on the desired probability. Later
in this chapter, we show that interesting conclusions can be drawn when this
generality is assumed and the minimum playout delay is estimated. For now,
we want the reader to understand that g(b) is a decreasing function of b.
This is because, as mentioned before, the total sample space in Equation 5.5
consists of all events where the buffer underflow is greater than or exactly
equal to zero. The sample space denoted by Equation 5.5 for the value b = 0 is
equal to the total sample space. Therefore, g(0) = 1 always. For increasing
values of b, the sample space denoted by Equation 5.5 shrinks and is a subset
of the sample space for any lower value of b. Hence g(b) is decreasing.
The probabilistic buffer underflow event depends on a random event,
namely, the cycles per second allocated to the video decoding task. Hence, to
compute the probability that the buffer underflows by more than b stream
objects, we have to first compute the conditional probability that the buffer
underflow is greater than b stream objects given a certain frequency. In other
words,
P{C(s, d) − ym(s) > b | F = fi}, (5.7)
∀ 0 ≤ fi ≤ fmax, ∀b ≥ 0, where 0 ≤ s ≤ tmax.
The total probability theorem states that, given n mutually exclusive
events A1, ..., An whose probabilities sum to unity,
P(B) = P(B|A1)P(A1) + ... + P(B|An)P(An), (5.8)
where B is an arbitrary event, and P(B|Ai) is the conditional probability of B
given Ai. To compute the minimum playout delay (d), we write the
playout buffer underflow probability with respect to the stochastic bounding
function as follows,
P{C(s, d) − ym(s) > b} ≤ g(b), 0 ≤ s ≤ tmax, ∀b ≥ 0. (5.9)
From the total probability theorem we have,
Σ_i P{[C(s, d) − ym(s)] > b | F = fi} ∗ P{F = fi} ≤ g(b), (5.10)
where 0 ≤ fi ≤ fmax, 0 ≤ s ≤ tmax, and ∀b ≥ 0.
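The left-hand side of Equation 5.10 is the total-probability expansion of the underflow probability over the possible frequencies, i.e., the sum over i of P{underflow > b | F = fi} ∗ P{F = fi}. The following sketch evaluates that sum for one fixed pair (s, b). All probabilities and frequency values are hypothetical placeholders.

```python
def total_underflow_probability(cond_prob: dict, freq_prob: dict) -> float:
    """Sum over frequencies f of P{underflow > b | F = f} * P{F = f}."""
    # The frequency events must be mutually exclusive and exhaustive.
    assert abs(sum(freq_prob.values()) - 1.0) < 1e-9
    return sum(cond_prob[f] * p for f, p in freq_prob.items())


# Hypothetical P{F = f_i}, with frequencies given in MHz.
freq_prob = {100: 0.2, 110: 0.3, 120: 0.5}
# Hypothetical P{underflow > b | F = f_i} for one fixed s and b.
cond_prob = {100: 0.5, 110: 0.1, 120: 0.0}

p_total = total_underflow_probability(cond_prob, freq_prob)
print(round(p_total, 4))
```

The resulting value would then be compared against g(b) for the same b.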
Problem Statement: To deduce the minimum playout delay required
(d) such that the desired output quality, specified using the function g(b), is
guaranteed. The given input stream is bounded by the arrival curves (α^u, α^l).
The processor cycles per second for the video decoding task fit the given
cumulative distribution P{F = fi}, for all fi, where 0 ≤ fi ≤ fmax. We now list
the input parameters required to estimate the playout delay:
• Available processor cycle cumulative distribution for decoding, P{F =
fi}, for all 0 ≤ fi ≤ fmax,
• Consumption rate of the output device (c),
• Probabilistic bounding function g(b),
• Upper bound on the arrival of stream objects, α^u,
• Maximum analysis interval tmax (or t),
• Maximum processor frequency fmax, and
• Minimum processor frequency (f)¹ corresponding to the stabilization
delay (ds). (Note that, to optimize both the frequency and the delay, we first
compute the minimum possible processor frequency and then compute the
minimum playout delay.)
¹The minimum processor frequency is given as: min{f | ft ≥ γ^u(β)(t)}, ∀t ≥ 0
(refer Chapter 3).
In the following section, we show that, using exhaustive search, we can
find the playout delay value such that Equation 5.10 is satisfied. In other
words, for a given stochastic bounding function g(b), we find the minimum
playout delay value.
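The exhaustive search can be sketched as a scan over candidate delays in increasing order, returning the first one for which the underflow probability stays below g(b) for every b. In the sketch below, both the underflow-probability model and g are hypothetical stand-ins, chosen only so that the search has something to find; they are not the functions used in our experiments.

```python
import math
from typing import Callable, Iterable, Optional


def min_playout_delay(
    candidate_delays_ms: Iterable[float],
    underflow_prob: Callable[[float, int], float],
    g: Callable[[int], float],
    b_max: int,
) -> Optional[float]:
    """Smallest candidate d with underflow_prob(d, b) <= g(b) for all 0 <= b <= b_max."""
    for d in sorted(candidate_delays_ms):
        if all(underflow_prob(d, b) <= g(b) for b in range(b_max + 1)):
            return d
    return None  # no candidate satisfies the bound


# Hypothetical bounding function (an exponential decay in b).
def toy_g(b: int) -> float:
    return math.exp(-0.05 * b)


# Hypothetical model: the underflow probability shrinks with both delay and b.
def toy_underflow_prob(d_ms: float, b: int) -> float:
    return math.exp(-0.02 * d_ms - 0.04 * b)


d_min = min_playout_delay(range(0, 201, 10), toy_underflow_prob, toy_g, b_max=90)
print(d_min)
```

Because the candidates are scanned in increasing order, the first delay that passes the check is also the minimum among the candidates; the resolution of the answer is limited by the spacing of the candidate grid.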
5.5 Numerical Evaluation
In this section, we describe the two simulation set-ups: (1) a SystemC
simulation for validating the results of the analytical model, and (2) a MATLAB
implementation of our analytical model. There are two main observations from
the SystemC and MATLAB experiments: (1) the delay value reduces as the
maximum buffer underflow increases and, consequently, the buffer size reduces;
and (2) the minimum playout delay estimated using the analytical model
for a given maximum buffer underflow value is accurate with respect
to the SystemC simulation.
In the following section, we first describe the simulation set-up required
for obtaining the input parameters of the mathematical model.
5.5.1 Minimum playout delay
In this section, we estimate the minimum playout delay required.
We show that the minimum playout delay required is such that the
stochastic bounding function g(b) is greater than the probability that the
buffer underflows. The cumulative distribution function for various frequency
values is shown in Figure 5.6. As mentioned in the problem statement in
Section 5.4.1, the cumulative distribution function is an input to our
mathematical model. This minimum playout delay is smaller than the delay
that would otherwise be required if the buffer were never to underflow. In
Figure 5.5, the probability that the buffer underflows is plotted. The
probability curves are shown for the stochastic bounding function and for the
probability that the buffer underflows, for a maximum underflow of 100 stream
objects. Using the mathematical model we proposed, we found that for a delay
value of 151.4 ms, the stochastic bounding function is greater than the
probability that the buffer underflows by b stream objects (b varies from 0 to
406).
The objective in showing this result is to illustrate that, using the exhaustive
search method, we were able to find the minimum playout delay required.
Note that, in the case study in the previous part of this chapter, we
described how the playout buffer size varies for different playout delays
and buffer underflows.
We chose an exponential decay function as the stochastic bounding
function in order to have a simple function with which we can motivate
the advantage of using stochastic constraints to model multimedia system
properties: allowing a slight relaxation of the buffer underflow constraints,
with a very small probability, can reduce the required buffer space substantially.
, 0 ≤ b ≤ bmax. (5.11)
Figure 5.5: Meeting desired stochastic constraints. The probability that the
playout buffer underflows is no more than the stochastic bounding function.
Figure 5.6: The cumulative distribution of processor frequency. Processor
cycles/second allocated to the video decoding task and therefore the playout
buffer underflow are probabilistic.
The exponential decay function varies with the maximum number of stream
objects that can underflow (i.e., bmax).
5.5.2 Validation
The simulation set-up was used to validate the results obtained using the
mathematical model. We fixed the maximum buffer underflow and estimated
the minimum playout delay required, both using the mathematical model and from
the SystemC simulation.
The SystemC model we used in this set-up is similar to the one explained
in the previous chapters. The playout delay is iterated over different values,
and the probability that the buffer underflows by more than b stream objects is
evaluated (from the SystemC simulation results). For each delay value, which
we pre-select before the start of the simulation, we check whether the bounding
function g(b) is satisfied, that is, whether the probability that the buffer
underflows by more than b stream objects is less than g(b) for all b. Note that,
unlike for the playout delay values, the simulation need not be repeated for
each possible b value. Finally, we choose the minimum playout delay for which
the stochastic bounding function is satisfied. We found the results of our model
to be very close to the simulation results (see Figure 5.7).
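The per-delay check against g(b) can be sketched as follows: from the underflow values observed in one simulation run, estimate the empirical probability of exceeding each b and compare it with the bound. The observations and both bounding functions below are hypothetical; in the actual set-up the samples would come from the SystemC simulation.

```python
import math


def exceedance_probability(underflows, b):
    """Empirical P{underflow > b} over a list of observed underflow values."""
    return sum(1 for u in underflows if u > b) / len(underflows)


def satisfies_bound(underflows, g, b_max):
    """True if the empirical exceedance probability is <= g(b) for every b up to b_max."""
    return all(exceedance_probability(underflows, b) <= g(b) for b in range(b_max + 1))


# Hypothetical observations: underflow (in stream objects) at 100 sampling instants.
observed = [0] * 90 + [1] * 6 + [2] * 3 + [3] * 1


def g_loose(b):
    return math.exp(-b)


def g_tight(b):
    return 0.05 * math.exp(-b)


ok_loose = satisfies_bound(observed, g_loose, b_max=3)
ok_tight = satisfies_bound(observed, g_tight, b_max=3)
print(ok_loose, ok_tight)
```

A single run yields the exceedance estimate for every b at once, which is why only the delay needs to be iterated.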
Figure 5.7: Accuracy of analytical model. Minimum playout delay estimated
using mathematical model is close to the delay values obtained from simulation.
We showed that, with stochastic constraints, the playout delay can be reduced
considerably. Multimedia players do not require hard real-time
guarantees; soft real-time guarantees are sufficient. Exploiting this
fact, we have provided bounds on the allowed quality of the display. A playout
delay is chosen to satisfy these stochastic constraints. With this
playout delay, the required playout buffer is reduced to a great extent. Thus,
large savings in memory are obtained using stochastic constraints.
In conclusion, this chapter shows that stochastic guarantees are in fact the
constraints to be used when estimating processing requirements and memory
requirements for processing multimedia.
Chapter 6
Future Work and Conclusions
In the previous chapters, we understood how workload shaping techniques
when employed can effectively utilize resources on-chip. The real-time cal-
culus based model of the application and the architecture form the basis
from which the parameters (e.g., playout delay) defining the shaping tech-
niques are estimated. The analytical model used in the previous chapters
is sufficient to support the main theme of this thesis, however, the model is
constructed with certain assumptions. This chapter reports on our ongoing
work to remove those assumptions that limit the framework from being used
to model a complete architecture, and to fully capture the characterstics of
the application.
In the sections following this introduction, we present two lines of ongoing
work: in the first, we model an architectural feature (processor stalling), and in
the second, we account for the randomness in a multimedia set-up (probabilistic
arrivals to the input buffer):
• Memory Latency: The framework proposed in this thesis assumes that
the processor never stalls, but in reality the on-chip buffer is small and
fills up quickly. The resulting read and write latency, during which the
processor stalls, should be modeled.
• Stochastic Framework: In Chapter 5, we allowed playout
buffer underflow to be probabilistic, that is, we allowed the processor
to provide a probabilistic service to the media stream. This stochastic
behavior can be extended to the items arriving at the input buffer.
Then, with respect to our mathematical framework, the input, the output,
and the service could all be modeled as probabilistic. Such a stochastic
framework would provide opportunities for low-cost system design.
Later in this chapter, we discuss why it is currently difficult for us to
model an SoC using real-time calculus. To this end, we list the features that we
want to model and state the practical difficulties (i.e., in terms of modeling
complexity) of adding a new feature or element to the model. The chapter
concludes the thesis by arguing that the variability of the multimedia workload
can be exploited to save on-chip resources.
6.1 Modeling Processor Waiting Time
In this section, we motivate the need to include waiting time as
a parameter in our mathematical model. In the previous chapters of this
thesis, we used the real-time calculus model without accounting for processor
stalling. We show in this section how our existing model (see Figure 6.1) could
be modified to include point-to-point communication between processing elements
and memory units (see Figure 6.3). We discuss the advantage of real-time















Figure 6.1: Multimedia SoC model.
In our SoC set-up (see Figure 6.1), design parameters (such as the pro-
cessing requirement) are estimated such that the buffer underflow constraint
is guaranteed to be satisfied. However, small on-chip buffers often fill up, so
the processing element writing to the buffer stalls. This processor latency, if
significant, should be modeled when estimating the processing requirement.
Our analytical model, based on real-time calculus, could capture the la-
tency of processor reads and writes to the memory, for any given intercon-
nect architecture, such as dedicated point-to-point, bus, and network-on-chip.
By adding elements to the existing model, we could study whether the
interconnect architecture has an impact once the communication latency
due to processor stalls is included in the model.
Consider the model we used in the previous chapters, and assume that it allows
processor stalling. Recall that we formulated an estimate for the minimum
processor frequency. In the following, we show how the minimum processor
frequency varies under this new assumption.
Figure 6.2 sketches two scenarios: case (a), where the processing element
allocates video decoding the minimum cycles required to meet the play-out








































[Figure 6.2 panels. Case A: chosen frequency cannot meet the play-out rate (deadline miss). Case B: increased frequency compensates for decreased buffer size and meets the play-out rate (all deadlines met). Annotations: initial play-out delay; item 1 consumed; consumption at constant rate.]
Figure 6.2: Case a: Buffer underflow due to processor latency, Case b: Play-out constraint met with increase in
processor share for decoding.
rate, and case (b), where the processing element allocates more
than the required processor share. Each case in Figure 6.2 shows the timeline
of events at the processor and the play-out buffer, and the graphs beneath the
timeline show processor utilization and buffer fill level.
In case (a), the processor waits after executing item 2, because the play-
out buffer is full. Once the output device reads items from the play-out buffer,
the processor writes item 2 and then starts to execute item 3. The deadline
for item 3, however, is missed; that is, the output device cannot display item
3 at the required time.
Note the waiting period and the buffer underflow situations in the graphs
for case (a). The processor share for video decoding is increased in case (b).
Consequently, deadlines are met, as item 3 is processed in time. The buffer
never underflows (see graphs for case (b)). Thus, the required play-out rate
can only be guaranteed with an estimate of processor frequency including the
waiting time of the processor.
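
The two cases of Figure 6.2 can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the names `simulate` and `Trace`, the cycle demands, the frequencies, and the rule that a late item leaves the buffer as soon as it is written are all introduced here and are not part of the framework of this thesis. Items are indexed from 0 in the code, so item 3 of the figure is index 2.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    write_times: list  # time at which each decoded item enters the play-out buffer
    stall_times: list  # time the processor waited for free buffer space, per item
    missed: list       # True where an item was written after its display deadline


def simulate(cycles, frequency, buffer_size, playout_delay, period):
    """Decode items in order on one processor feeding a bounded play-out buffer.

    Item i is due at playout_delay + i * period. The processor cannot start
    the next item until the current one has been written to the buffer.
    """
    write_times, stall_times, missed, removal_times = [], [], [], []
    now = 0.0
    for i, demand in enumerate(cycles):
        finish = now + demand / frequency
        deadline = playout_delay + i * period
        # A slot is free only once item i - buffer_size has left the buffer.
        free_at = removal_times[i - buffer_size] if i >= buffer_size else 0.0
        write = max(finish, free_at)
        write_times.append(write)
        stall_times.append(write - finish)
        missed.append(write > deadline)
        # Assumption: a late item is displayed (and removed) on arrival.
        removal_times.append(max(deadline, write))
        now = write  # the processor is blocked until the write completes
    return Trace(write_times, stall_times, missed)


# Hypothetical workload: two light items followed by one heavy item.
cycles = [20, 20, 220]

# Large buffer, no stalling: frequency 10 just meets every deadline.
no_stall = simulate(cycles, frequency=10, buffer_size=3, playout_delay=6.0, period=10.0)

# Case (a): same frequency, but a one-item buffer forces the processor to wait.
case_a = simulate(cycles, frequency=10, buffer_size=1, playout_delay=6.0, period=10.0)

# Case (b): a higher frequency absorbs the waiting time.
case_b = simulate(cycles, frequency=12, buffer_size=1, playout_delay=6.0, period=10.0)

print(no_stall.missed, case_a.missed, case_b.missed)
```

With these hypothetical numbers, the frequency that suffices when the processor never stalls leads to a missed deadline for the third item once the buffer holds only one item, because the processor waits two time units after decoding the second item. Raising the frequency removes the miss, which is the effect sketched in case (b).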
To estimate the processor latency, we propose a communication model
that captures the communication pattern among the system
components, namely the processing elements, memories, and so on (the
communication model is shown in Figure 6.3). The nodes in the model are
components that initiate and receive communication with other
components in the system. Each such communication is represented by a directed edge
between the components involved. The labels on the directed edge represent
the bandwidth and the latency of the communication. For example, the processor
waiting for the memory unit to write into it can be represented as the latency
between the processor and the memory (see the directed edge from memory
to processor in Figure 6.3). The communication model is general in the sense
[Figure 6.3 labels: each directed edge carries a bandwidth (Bw) and a latency (l1 to l4); the nodes include memories and the output device.]
Figure 6.3: Model of communication
To estimate the bandwidth and latency, however, we will need detailed
models based on the interconnect architecture. The values of the bandwidth
and latency could then be used to compare the design parameters, such as
the processor frequency and buffer sizes. Hence, different interconnect
architectures could be compared with respect to the parameters of their
communication model, namely the bandwidth and latency. The same
real-time calculus model proposed in the previous chapters of this thesis could
then be used to model the SoC together with an interconnect architecture.
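
One possible way to represent such a communication model is a directed graph whose edges carry a bandwidth and a latency label. The Python sketch below is a minimal illustration and not the model of this thesis: the class names `Link` and `CommunicationModel`, the units, the numeric values, and in particular the per-hop cost `latency + size / bandwidth` are assumptions made here for the sake of the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    bandwidth: float  # bytes per microsecond (hypothetical unit)
    latency: float    # microseconds before the transfer starts (hypothetical unit)


@dataclass
class CommunicationModel:
    links: dict = field(default_factory=dict)  # (source, destination) -> Link

    def connect(self, source, destination, bandwidth, latency):
        """Add a directed edge labeled with its bandwidth and latency."""
        self.links[(source, destination)] = Link(bandwidth, latency)

    def transfer_time(self, path, size):
        """Time to move `size` bytes hop by hop along `path`.

        Assumed cost model: each hop costs its latency plus size / bandwidth.
        """
        total = 0.0
        for source, destination in zip(path, path[1:]):
            link = self.links[(source, destination)]
            total += link.latency + size / link.bandwidth
        return total


# Two hypothetical interconnects described by the same kind of model.
point_to_point = CommunicationModel()
point_to_point.connect("PE1", "buffer", bandwidth=4.0, latency=1.0)
point_to_point.connect("buffer", "PE2", bandwidth=4.0, latency=1.0)

shared_bus = CommunicationModel()
shared_bus.connect("PE1", "buffer", bandwidth=2.0, latency=3.0)
shared_bus.connect("buffer", "PE2", bandwidth=2.0, latency=3.0)

path = ["PE1", "buffer", "PE2"]
print(point_to_point.transfer_time(path, size=64))
print(shared_bus.transfer_time(path, size=64))
```

Because both interconnects are reduced to the same two edge labels, they can be compared without referring to their internal structure, which is the sense in which the communication model is intended to be general.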
There have been several efforts to estimate the latency of processor accesses
to memory, in particular, formal approaches that determine the worst-case memory
access latency (Henriksson et al., 2007; Schliecker et al., 2006; Moonen
et al., 2007). Existing analytical methods mostly model processor stalling time
using concrete architecture templates in which memory latency is a key
design parameter. The following list briefly summarizes the context in
which memory latency is modeled in the existing literature (the architecture
set-ups A, B, and C shown in Figure 6.4 are discussed below).
• Set-up A shows an off-chip memory and processing units, namely, the
ARM, Trimedia, Scalar and display controller. The memory latency





























































[Figure 6.4 panel labels: A, B, C.]
Figure 6.4: System architectures and models used for analysis in previous works. Memory latency modeled for
architectures with off-chip memory, shared memory, and FIFO (right to left).
sors (e.g., ARM) and the dedicated hardware units (e.g., scalar). The
communication among the processing units through the shared off-chip
memory is minimal. Each individual unit, however, accesses the memory
frequently. This memory access latency is modeled using
basic flow models of network calculus (Henriksson et al., 2007).
• In Set-up B the memory is on-chip and access is through an intercon-
nect architecture. First, there could be memory access conflicts if there
are tasks running in different processing units and accessing the same
memory. Second, a task is stalled, even with the highest priority, if
another task running in the same processing unit is accessing the mem-
ory. Finally, the processor latency could be due to the communication
architecture. The memory access latency in multiprocessor set-ups is
modeled using event adaptation functions (Schliecker et al., 2006).
• The architecture template in Set-up C shows local FIFO memories
and the communication assists, which transfer data between FIFOs
in different processing units. The communication assists decouple the
computation in the processor from communication so as to reduce the
processor stall time. The processors only have to write to the FIFOs.
Such a system with computation and communication separated also
improves the predictability of the design (Moonen et al., 2007).
Our work could fill a gap in existing approaches to modeling memory
latency: when estimating memory latency, we propose to include the
time the processing element waits for space in the memory or buffer. The
on-chip buffer is typically small (for example, the FIFO in Set-up C can only
hold 2 tokens), and multimedia platforms contain an on-chip buffer
that the processing units access frequently (Set-up A and Set-up B do not
show it).
Despite the differences in the architectures that the above-mentioned
approaches use (Henriksson et al., 2007, and others), there is an underlying
similarity: they model the latency due to communication between processor
and memory and between processing units. Because past methodologies
model processor busy time for specific architecture
set-ups, there is a chance that some vital parameters affecting application
performance are omitted.
A simple but crucial difference between architecture-based models (Wieferink
et al., 2005; Zhu and Malik, 2007) and the communication model is that the
latter allows many possible architectures to be compared against each
other. Such communication models should be general enough that
they can be used for analysis irrespective of the underlying mathematical
theory, yet specific enough that candidate architectural designs
can be analyzed. The generality of the communication model we propose
is the characteristic feature that distinguishes our work from previous
mathematical models (Lahiri et al., 2001b; Zamora et al., 2007, and others
previously mentioned).
We have motivated the importance of including processor waiting time
as a design parameter, and the need for a communication model for SoC
design. This section showed how communication could be modeled with
small modifications to the framework we used in the previous chapters of
this thesis. Our current efforts in this research direction focus on estimating
the processor waiting time and finding the corresponding minimum processor
frequency required.
6.2 General Stochastic Framework
We modeled the buffer underflow as probabilistic so as to relax the constraints
on the display requirements of the multimedia system (see Chapter 5). The
rest of the mathematical framework, however, remained largely deterministic, for
example in how it models the arrival of items at the input buffer. We propose a
stochastic framework that would allow arrivals to the input buffer
to be probabilistic. In this section, we show by example the benefits of
such a stochastic framework, which in general can capture the probabilistic
behavior of the application (e.g., the arrival of stream objects and their
processor cycle requirements) and of the system or architectural elements
(e.g., the processor share given to the multimedia task).
The illustrative example below first takes an SoC design that guarantees
that the display requirement is always met, provided the requirement on
arrivals is met. Next, it shows that the requirement on arrivals need not be
met if some relaxation of the display requirement is allowed. The idea is
to show that, with no perceivable loss in output quality, designs with minimal
on-chip resources (e.g., a design for low bit-rate video) could support high-end
features such as a high bit-rate video stream.
6.2.1 A motivating example
Our analysis shows that, by allowing the actual display quality to
deteriorate slightly, high-quality input video could be processed with minimal
resources. More specifically, for a given input video quality (e.g., the video
bit-rate is a quality parameter), we estimate design parameters such as
processor frequency and buffer size such that the required display rate is met. Then
we estimate the maximum video bit-rate that the SoC (designed for lower input
quality) could process with an acceptable loss in display quality. We describe
this set-up in the following.
First, we list the values of the parameters of interest. The bit-rate of the video
stream is a constant that gives the average number of bits per second that arrive
in the internal buffer (see Figure 6.1). In this example, we took the video stream
cact.m2v (Tektronix, 1996) with a bit-rate of 1500 kbps. The required display rate
is 30 frames/second and the frame resolution is 352 x 240. The PE running
the multimedia task has to run at a constant speed such that the display
rate is met. The frequency at which the PE runs is the average processor
frequency required to decode the stream. For this experiment, we found
that the average processor frequency required is 114 MHz. The input buffer
required is the maximum buffer needed such that there is no underflow and
the playout buffer required is the maximum buffer size such that there is no
underflow/overflow.
Note that in the previous experiment we had a stringent display requirement:
30 frames/second should always be displayed. Let us assume that
it is acceptable if the display requirement is met 90% of the time (i.e., at
times the number of frames produced per second is less than 30). We
increase the bit-rate of the input video to 1600 kbps, but we preserve all other
parameters (output video resolution, processor frequency, and buffer sizes)
used in the previous experiment. This is because we intend to show that, with
the resources required for an input video of bit-rate 1500 kbps, we can decode
a higher bit-rate video. As expected, the relaxed output constraint is
satisfied even though the bit-rate of the video stream is increased. Thus,
by allowing a negligible loss in output quality, we have saved resources: the
buffer size and processor frequency that a deterministic guarantee would
require for 1600 kbps are significantly higher than those required for 1500 kbps.
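
The relaxed display requirement used in this example can be stated as a simple check on a trace of frames produced per second. The Python sketch below is a hedged illustration: the function names and the 20-second trace are hypothetical and are not measurements from the experiment; only the 30 frames/second rate, the 90% threshold, and the two bit-rates come from the text.

```python
def mean_bits_per_frame(bit_rate_bps, frame_rate):
    """Average number of bits that must be decoded per displayed frame."""
    return bit_rate_bps / frame_rate


def fraction_met(frames_per_second, required_rate=30):
    """Fraction of one-second intervals in which the display rate was met."""
    met = sum(1 for frames in frames_per_second if frames >= required_rate)
    return met / len(frames_per_second)


def satisfies(frames_per_second, required_rate=30, min_fraction=0.9):
    """True if the display rate is met in at least `min_fraction` of intervals."""
    return fraction_met(frames_per_second, required_rate) >= min_fraction


# Average decoding load per frame at the two bit-rates (1 kbps = 1000 bits/s).
print(mean_bits_per_frame(1_500_000, 30))
print(mean_bits_per_frame(1_600_000, 30))

# Hypothetical trace: 20 seconds of playback, two of which fall short of 30 frames.
trace = [30] * 9 + [28] + [30] * 9 + [29]

print(satisfies(trace, min_fraction=1.0))  # strict requirement: always 30 frames/second
print(satisfies(trace, min_fraction=0.9))  # relaxed requirement: met 90% of the time
```

For this hypothetical trace the strict requirement fails while the relaxed one holds, since 18 of the 20 intervals reach 30 frames. Moving from 1500 kbps to 1600 kbps raises the average load from 50,000 bits to roughly 53,333 bits per frame at 30 frames/second.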
In the experiment discussed in this section, our framework could esti-
mate first the processor requirement and the buffer sizes required for 1500
kbps video such that 30 frames/second could be displayed. Then with the
resource estimations as input parameters, and given the relaxed constraint
on the output quality, the maximum bit-rate video that could be supported
is estimated (without any increase in processor capacity or the buffer size).
Our work in this direction aims to provide a mathematical framework that
estimates, for an acceptable loss in output quality, the maximum bit-rate that could
be supported by an SoC that was originally designed to provide a
deterministic guarantee for a low input quality video. The case study shows the
primary benefit of the framework, which is to design low-cost SoCs for
processing multimedia. The case study also suggests that our model is general
and could be leveraged for analyzing designs for both worst-case and average-case
scenarios.
As we mentioned earlier in Chapter 5, although existing mathematical
frameworks for system-level design cut design time tremendously, the
models generally account for either worst-case (Thiele et al., 2000; Richter and
Ernst, 2002b) or average-case scenarios (Zamora et al., 2007). In designing
multimedia-processing systems-on-chip, worst-case based designs over-provision
the on-chip resources, whereas average-case analyses do not provide
the guarantees that are desirable from the designer's perspective. To address
these shortcomings, we need a general stochastic framework that allows
designers to specify an acceptable loss in quality. Such a general framework should
be capable of capturing all possible stochastic behavior of the application and
the architecture.
6.3 Final Remarks
In this section, we conclude the thesis. We first list some current
limitations of our mathematical model. Second, we discuss the difficulties faced in
constructing the model. Finally, we revisit the contributions of
this thesis.
We presented in Section 6.1 an existing limitation in modeling memory
latency. In Section 6.2, a possible extension to the mathematical model is
discussed. Below, we present some other important extensions to our current
model.
1. The system architecture model used in this thesis considers that the
multimedia stream arrives directly to the input buffer, and the display
device reads directly from the playout buffer. A complete model should
consider the starting point of the data flow from the input device such
as a CD, external memory, and so on. The end point of the data stream
in the full model would be the display memory. To this end, we are
currently studying related work that models the input device itself.
2. The application model we proposed does not account for splits or joins in
the data flow. Our model incorporates a simple feed-forward sequential
flow. Multimedia applications, however, do contain such splits
and joins in the stream. Nevertheless, the model proposed in this
thesis applies when application blocks merge as they are mapped to an
architecture.
3. Our current interconnect architecture model assumes a point-to-point
network. Bus and network-on-chip architectures are possible
extensions.
The primary conclusion of this thesis is that on-chip resources can be
utilized effectively with workload shaping techniques. We showed
that the processor and memory requirements can be lowered significantly
when shaping techniques such as smoothing, squeezing, and slashing
inform system design, so that the insights of these techniques
can be used throughout the lifetime of the multimedia device.
Buffering for smoothing (presented in Chapter 3) is a technique in which,
with a precisely computed playout delay, it is possible to provide just
the minimum amount of processing resources needed to meet the guarantees of
output devices. Buffering at the playout buffer smooths the variability due to
the arrival and processing of items. The buffer space required for the processor
to run with minimal resources, however, is large. To handle this
issue of large buffer requirement, we propose to distribute the playout delay
across different processors.
When the playout delay is distributed, each processor in the pipeline starts after
a delay relative to the preceding processor. The playout delay after which
the output device starts to display items remains the same; that is, the delay
value is such that each processor in the pipeline requires only the minimum
processor frequency for processing items. This delay redistribution reduces the
total memory requirement because the variability in the buffer fill levels does
not propagate along the pipeline. We explained this phenomenon in Chapter 3.
Buffering for multiple applications (presented in Chapter 4) considered periodic
tasks along with audio and video tasks. Squeezing took advantage
of the predictability of the varying nature of the multimedia task. These
predictions (in terms of processor requirement) allowed other tasks to run
using the processor bandwidth that became available when the processor share of
the media task was lowered. Periodically, the media tasks were provided with almost
full processor capacity, which produced excess macroblocks. These excess
macroblocks are the items that the output device consumes during the periods
when other tasks use a larger processor share. Thus, multiple applications
run together while their execution requirements are still met.
The inherent nature of a multimedia system is that the items arriving
at the input buffer for processing do not conform to any standard model such
as periodic, periodic with jitter, and so on. Hence, we also modeled the variability
when characterizing a family of multimedia streams. Still, we
expect the arrival of items and the processing requirements of
items to conform to certain bounds. Essentially, with a deterministic bound on
the arrivals (and processing requirements), either we are pessimistic about the
number of items arriving at the input buffer or we fail to capture
a large set of streams. The latter point means that our arrival and processor
service models exclude even streams that mostly conform to the
bounds.
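
The limitation of a deterministic bound can be seen on a toy example. The Python sketch below is illustrative only: it computes an empirical upper arrival curve in the usual real-time calculus sense (the largest number of arrivals in any window of a given length) from a discrete-time trace, and then measures how many windows of a second stream stay within that bound. The function names and both traces are hypothetical.

```python
def prefix_sums(arrivals):
    """Running totals, so that any window sum is a difference of two entries."""
    sums = [0]
    for count in arrivals:
        sums.append(sums[-1] + count)
    return sums


def upper_arrival_curve(arrivals, max_window):
    """curve[k] = largest number of items seen in any k consecutive time slots."""
    sums = prefix_sums(arrivals)
    curve = [0]
    for k in range(1, max_window + 1):
        windows = range(len(arrivals) - k + 1)
        curve.append(max(sums[t + k] - sums[t] for t in windows))
    return curve


def conforming_fraction(arrivals, curve):
    """Fraction of windows, over all lengths covered by `curve`, within the bound."""
    sums = prefix_sums(arrivals)
    total = within = 0
    for k in range(1, len(curve)):
        for t in range(len(arrivals) - k + 1):
            total += 1
            within += (sums[t + k] - sums[t]) <= curve[k]
    return within / total


# Hypothetical reference stream used to derive the deterministic bound.
reference = [2, 1, 2, 1, 2, 1, 2, 1]
bound = upper_arrival_curve(reference, max_window=3)

# Hypothetical second stream: identical except for a single burst of 4 items.
bursty = [2, 1, 2, 1, 4, 1, 2, 1]

print(bound)
print(conforming_fraction(reference, bound))
print(conforming_fraction(bursty, bound))
```

Under a deterministic bound, the second stream is rejected outright because a single burst violates the curve, even though most of its windows conform. A stochastic framework would instead ask whether the fraction of violating windows is acceptable.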
As a first step towards addressing this, we relaxed the constraint in our
mathematical model that the playout buffer should never underflow. In
Chapter 5, we numerically evaluated the minimum playout delay required such
that buffer underflow stays within a tolerable limit. The work highlighted how much
could be saved in terms of resources and processing requirements with
such relaxations of the deterministic buffer underflow constraint.
These insights were possible because the system design is
mathematically modeled using real-time calculus; the mathematical framework
preserves the inherent characteristic of a multimedia stream, that is,
the variability in the arrival and execution of stream objects. The
primary advantage of using such mathematical models is the fast exploration
and analysis of several design parameters of the SoC.
Currently, our effort is focused on moving from the deterministic
setting to the stochastic framework. This research direction is promising,
and in the future we envision developing a system design software tool
based on the mathematical framework we developed. By then, the mathematical
framework would incorporate all the important extensions pointed out in
this chapter. We believe such a tool would benefit the system design
community by enabling complex systems to be designed in a cost-effective manner.
Bibliography
Andreas Gerstlauer, Haobo Yu, and Daniel D. Gajski. RTOS modeling for system
level design. In DATE’03: Proceedings of the Design, Automation and Test
in Europe Conference and Exhibition, pages 130–135, Munich, Germany,
March 2003.
Ronnie T. Apteker, James A. Fisher, Valentin S. Kisimov, and Hanoch Neish-
los. Video acceptability and frame rate. IEEE MultiMedia, 2(3):32–40,
Spring 1995.
Scott Banachowski, Timothy Bisson, and Scott A. Brandt. Integrating best-
effort scheduling into a real-time system. In Proceedings of the Real-Time
Systems Symposium (RTSS), pages 139–150, Washington, December 2004.
IEEE.
Jean-Yves Le Boudec and Patrick Thiran. Network calculus: a theory of
deterministic queuing systems for the internet. Springer-Verlag, New York,
2001.
Scott A. Brandt, Scott Banachowski, Caixue Lin, and Timothy Bisson. Dy-
namic integrated scheduling of hard real-time, soft real-time and non-
real-time processes. In Proceedings of the Real-Time Systems Symposium
(RTSS), pages 396–405, Washington, December 2003. IEEE.
Le Cai and Yung-Hsiang Lu. Dynamic power management using data buffers.
In Proceedings of the conference on Design, automation and test in Europe
(DATE), volume 1, pages 526–531, Washington, DC, February 2004.
Le Cai and Yung-Hsiang Lu. Energy management using buffer memory for
streaming data. IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, 24(2):141–152, 2005.
Jerome Chevalier, Olivier Benny, Mathieu Rondonneau, Guy Bois,
El Mostapha Aboulhamid, and Francois-Raymond Boyer. Languages for
system specification: Selected contributions on UML, systemC, system Ver-
ilog, mixed-signal systems, and property specification from FDL’03, chap-
ter Space: a hardware/software systemC modeling platform including an
RTOS, pages 91–104. Kluwer Academic Publishers, 2004. ISBN 1-4020-
7990-7.
Youngchul Cho, Sungjoo Yoo, Kiyoung Choi, Nacer-Eddine Zergainoh,
and Ahmed Amine Jerraya. Scheduler implementation in MpSoC de-
sign. In Proceedings of the conference on Asia South Pacific design
automation (ASP-DAC 2005), pages 151–156, Shanghai, China, January
2005.
Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. Off-chip latency-
driven dynamic voltage and frequency scaling for an MPEG decoding. In
Proceedings of the annual conference on design automation (DAC), San
Diego, California, June 2004.
Dirk Desmet, D. Verkest, and Hugo De Man. Operating system based soft-
ware generation for systems-on-chip. In Proceedings of the annual confer-
ence on Design automation (DAC), pages 396–401, Los Angeles, California,
June 2000.
Kenneth J. Duda and David R. Cheriton. Borrowed-virtual-time (BVT)
scheduling: supporting latency-sensitive threads in a general-purpose
scheduler. In Proceedings of the Symposium on Operating System Prin-
ciples (SOSP), pages 261–276, New York, December 1999. ACM.
Ajay Dudani, Frank Mueller, and Yifan Zhu. Energy-conserving feedback
EDF scheduling for embedded systems with real-time constraints. In
Proceedings of the joint conference on Languages, Compilers and Tools for
embedded systems: software and compilers for embedded systems (LCTES-
SCOPES), Berlin, Germany, June 2002.
Santanu Dutta, Rune Jensen, and Alf Rieckmann. Viper: A multiprocessor
SoC for advanced set-top box and digital TV systems. IEEE Design and
Test, 18(5):21–31, September/October 2001.
Anwar Elwalid and Debasis Mitra. Traffic shaping at a network node: Theory,
optimum design, admission control. In Proceedings of the Annual Joint
Conference of the Computer and Communications Societies (INFOCOM),
pages 444–454, Washington, April 1997. IEEE.
Carla Fabiana Chiasserini and Ramesh R. Rao. Improving battery
performance by using traffic shaping techniques. IEEE Journal on Selected Areas
in Communications, 19(7):1385–1394, July 2001.
Lovic Gauthier, Sungjoo Yoo, and Ahmed Amine Jerraya. Automatic genera-
tion and targeting of application specific operating systems and embedded
systems software. In DATE’01: Proceedings of the Design, Automation
and Test in Europe Conference and Exhibition, pages 679–685, Munich,
Germany, March 2001.
Leonidas Georgiadis, Roch Guérin, Vinod Peris, and Kumar N. Sivarajan.
Efficient network QoS provisioning based on per node traffic shaping.
IEEE/ACM Transactions on Networking, 4(4):482–501, February 1996.
Pawan Goyal, Xingang Guo, and Harrick M. Vin. A hierarchical CPU sched-
uler for multimedia operating systems. In Proceedings of the Symposium
on Operating Systems Design and Implementation (OSDI), pages 107–121,
New York, October 1996. ACM.
Matthias Gries. Methods for evaluating and covering the design space during
early design development. Integration, The VLSI Journal, 38(2):131–183,
2004.
Sang-Il Han, Xavier Guerin, Soo-Ik Chae, and Ahmed Amine Jerraya. Buffer
memory optimization for video codec application modeled in Simulink.
In Proceedings of the ACM Annual conference on Design automation
(DAC), pages 689–694, San Francisco, CA, July 2006.
Zhengting He, Aloysius Mok, and Cheng Peng. Timed RTOS modeling for
embedded system design. In Proceedings of the Real Time and Embed-
ded Technology and Applications Symposium (RTAS), pages 448–457, San
Francisco, California, March 2005.
Sven Heithecker and Rolf Ernst. Traffic shaping for an FPGA based SDRAM
controller with complex QoS requirements. In Proceedings of the annual
conference on Design Automation (DAC), pages 575–578, New York, June
2005. ACM.
Tomas Henriksson, Pieter van der Wolf, Axel Jantsch, and Alistair Bruce.
Network calculus applied to verification of memory access performance in
SoCs. In Proceedings of the Workshop on Embedded Systems for Real-Time
Multimedia (ESTIMedia), pages 21–26, Salzburg, Austria, October 2007.
Jianghai Hu and Yung-Hsiang Lu. Buffer management for power reduction
using hybrid control. In Proceedings of the Conference on Decision and
Control and the European Control Conference, pages 6997–7002, Washing-
ton, December 2005. IEEE.
Christopher J. Hughes, Jayanth Srinivasan, and Sarita V. Adve. Saving
energy with architectural and frequency adaptations for multimedia appli-
cations. In Proceedings of the annual international symposium on Microar-
chitecture (MICRO), Austin, Texas, December 2001.
Chia-Hui Wang, Jan-Ming Ho, Ray-I Chang, and Shun-Chin Hsu. A
feedback-controlled EDF scheduling algorithm for real-time multimedia
transmission. Technical Report TR-IIS-01-008, Institute of Information Science,
Academia Sinica, Taipei, Taiwan, ROC, 2001.
Chaeseok Im and Soonhoi Ha. An energy optimization technique for latency
and quality constrained video applications. IEEE Design and Test, 21(5):
358–366, September-October 2003.
Chaeseok Im and Soonhoi Ha. Dynamic voltage scaling for real-time multi-
task scheduling using buffers. In Proceedings of the 2004 ACM SIG-
PLAN/SIGBED Conference on Languages, Compilers, and Tools for Em-
bedded Systems (LCTES), pages 88–94, Washington, DC, June 2004.
Chaeseok Im, Huiseok Kim, and Soonhoi Ha. Dynamic voltage scheduling
technique for low-power multimedia applications using buffers. In Proceed-
ings of the Symposium on Low Power Electronics and Design (ISLPED),
pages 34–39, New York, August 2001. ACM.
Ravindra Jejurikar and Rajesh Gupta. Dynamic voltage scaling for system
wide energy minimization in real-time embedded systems. In Proceedings
of the international symposium on Low power electronics and design
(ISLPED), Newport Beach, California, August 2004.
Dong-Ik Ko and Shuvra S. Bhattacharyya. Modeling and optimization of
buffering trade-offs for hardware implementation of image processing ap-
plications. In IEEE Workshop on Signal Processing Systems Design and
Implementation, pages 591–596, Athens, Greece, November 2005.
Kanishka Lahiri, Anand Raghunathan, and Sujit Dey. System level perfor-
mance analysis for designing on-chip communication architectures. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 20(6):768–783, June 2001a.
Kanishka Lahiri, Anand Raghunathan, and Sujit Dey. Performance and sta-
bility of communication networks via robust exponential bounds. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Sys-
tems, 20(6):952–961, June 2001b.
J.-Y. Le Boudec. Some properties of variable length packet shapers.
IEEE/ACM Trans. on Networking, 10(3):329–337, 2002.
libmpeg2. A free MPEG2 video stream decoder.
http://libmpeg2.sourceforge.net/, 2006.
Yanhong Liu, Alexander Maxiaguine, Samarjit Chakraborty, and Wei Tsang
Ooi. Processor frequency selection for SoC platforms for multimedia ap-
plications. In Proceedings of the IEEE Real Time Systems Symposium
(RTSS), pages 336–345, December 2004.
Yung-Hsiang Lu, Luca Benini, and Giovanni De Micheli. Dynamic frequency
scaling with buffer insertion for mixed workloads. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 21(11):1284–
1305, November 2002.
Jan Madsen, Kashif Virk, and Mercury Gonzales. Abstract RTOS model-
ing for multiprocessor system-on-chip. In Proceedings of the International
Symposium on System-on-Chip, pages 147–150, Tampere, Finland, Novem-
ber 2003.
Sorin Manolache, Petru Eles, and Zebo Peng. Buffer space optimisation
with communication synthesis and traffic shaping for NoCs. In Design,
Automation and Test in Europe (DATE), pages 718–723, Belgium, March
2006. European Design and Automation Association.
Alexander Maxiaguine, Samarjit Chakraborty, Simon Künzli, and Lothar
Thiele. Evaluating schedulers for multimedia processing on buffer-
constrained SoC platforms. IEEE Design and Test, 21(5):368–377, 2004.
R. Le Moigne, O. Pasquier, and J-P. Calvez. A generic RTOS model for
real-time systems simulation with SystemC. In DATE’04: Proceedings of
the Design, Automation and Test in Europe Conference and Exhibition,
volume 3, pages 82–87, Paris, France, March 2004.
Sue B. Moon, Jim Kurose, and Don Towsley. Packet audio playout delay
adjustment: performance bounds and algorithms. Multimedia Systems, 6
(1):17–28, January 1998.
Arno Moonen, Marco Bekooij, René van den Berg, and Jef van Meerbergen.
Decoupling of computation and communication with a communication as-
sist. In Proceedings of the Euromicro Conference on Digital System Design
Architectures, Methods and Tools (DSD), pages 63–68, Lübeck, Germany,
August 2007.
Praveen K. Murthy and Shuvra S. Bhattacharyya. Buffer merging: a powerful
technique for reducing memory requirements of synchronous dataflow
specifications. ACM Transactions on Design Automation of Electronic
Systems (TODAES), 9(2):212–237, April 2004.
Amit Nandi and Radu Marculescu. System-level power/performance analysis
for embedded systems design. In Proceedings of the annual conference on
Design automation (DAC), pages 599–604, June 2001.
Jason Nieh and Monica S. Lam. A SMART scheduler for multimedia ap-
plications. ACM Transactions on Computer Systems, 21(2):117–163, May
2003.
Nytimes. Apple introduces iPod that plays videos,
New mobile phone signals Apple's ambition, 2005, 2007.
www.nytimes.com/2005/10/13/technology/13apple.html,
www.nytimes.com/2007/01/09/technology/09cnd-iphone.html.
Donghwan Son, Chansu Yu, and Heung-Nam Kim. Dynamic voltage scaling on
MPEG decoding. In Proceedings of the International Conference on Par-
allel and Distributed Systems (ICPADS), Kyongju City, Korea, June 2001.
Preeti Ranjan Panda, Nikil D. Dutt, Alexandru Nicolau, Francky Catthoor,
Arnout Vandecappelle, Erik Brockmeyer, Chidamber Kulkarni, and Eddy
de Greef. Data and memory optimization techniques for embedded systems.
ACM Transactions on Design Automation of Electronic Systems, 6(2):
149–206, April 2001.
Ameet Patil and Neil Audsley. Implementing application specific RTOS poli-
cies using reflection. In RTAS ’05: Proceedings of the 11th IEEE Real Time
and Embedded Technology and Applications Symposium, pages 438–447,
San Francisco, California, March 2005.
JoAnn M. Paul, Alex Bobrek, Jeffrey E. Nelson, Joshua J. Pieper, and
Donald E. Thomas. Schedulers as model-based design elements in pro-
grammable heterogeneous multiprocessors. In DAC’03: Proceedings of the
40th conference on Design automation, pages 408–411, Anaheim, CA, June
2003.
Thomas Plagemann, Vera Goebel, and Otto Anshus. Operating system sup-
port for multimedia systems. The Computer Communications Journal, 23
(3):267–289, February 2000.
Christian Poellabauer and Karsten Schwan. Energy-aware traffic shaping
for wireless real-time applications. In Proceedings of the Real-Time and
Embedded Technology and Applications Symposium (RTAS), pages 48–55,
Washington, May 2004. IEEE.
Balaji Raman and Samarjit Chakraborty. Application-specific workload
shaping in multimedia-enabled personal mobile devices. In Proceedings
of the international conference on Hardware/software codesign and system
synthesis (CODES+ISSS), pages 4–9, New York, October 2006. ACM.
Balaji Raman, Samarjit Chakraborty, and Wei Tsang Ooi. Meeting CPU
constraints by delaying playout of multimedia tasks. In Proceedings of the
international workshop on Network and operating systems support for dig-
ital audio and video (NOSSDAV), pages 165–170, Stevenson, Washington,
June 2005.
Balaji Raman, Samarjit Chakraborty, Wei Tsang Ooi, and Santanu Dutta.
Reducing data-memory footprint of multimedia applications by delay
redistribution. In Proceedings of the ACM/IEEE annual conference on Design
automation (DAC), pages 738–743, June 2007.
Ramachandran Ramjee, Jim Kurose, Don Towsley, and Henning Schulzrinne.
Adaptive playout mechanism for packetized audio applications in wide area
networks. In Proceedings of the Annual Joint Conference of the Computer
and Communications Societies (INFOCOM), pages 680–688, Washington,
June 1998a. IEEE.
Ramachandran Ramjee, Jim Kurose, Don Towsley, and Henning Schulzrinne.
Adaptive playout mechanism for packetized audio applications in wide area
networks. In Proceedings of the IEEE Conference on Computer Commu-
nications (INFOCOM), pages 680–688, June 1998b.
Kai Richter and Rolf Ernst. Event model interfaces for heterogeneous systems
analysis. In Proceedings of the IEEE International Conference on Design
Automation and Test in Europe (DATE), pages 506–513, March 2002a.
Kai Richter and Rolf Ernst. Event model interfaces for heterogeneous systems
analysis. In Proceedings of the IEEE International Conference on Design
Automation and Test in Europe (DATE), pages 506–513, March 2002b.
Martijn J. Rutten, Jos T. J. van Eijndhoven, Egbert G. T. Jaspers, Pieter
van der Wolf, Evert-Jan D. Pol, Om Prakash Gangwal, and Adwin Tim-
mer. A heterogeneous multiprocessor architecture for flexible media pro-
cessing. IEEE Design and Test, 19(4):39–50, July 2002.
Nima Sarshar and Xiaolin Wu. Buffer size reduction through buffer sharing
for streaming applications. In IEEE International Conference on Multimedia
and Expo (ICME), pages 1635–1638, Taipei, Taiwan, June 2004.
Simon Schliecker, Matthias Ivers, and Rolf Ernst. Integrated analysis
of communicating tasks in MPSoCs. In Proceedings of the interna-
tional conference on Hardware/software codesign and system synthesis
(CODES+ISSS), pages 288–293, Seoul, Korea, October 2006.
Sander Stuijk, Marc Geilen, and Twan Basten. Exploring trade-offs in buffer
requirements and throughput constraints for synchronous dataflow graphs.
In Proceedings of the ACM Annual conference on Design automation
(DAC), pages 899–904, San Francisco, CA, April 2006.
Morihiko Tamai, Tao Sun, Keiichi Yasumoto, Naoki Shibata, and Minoru
Ito. Energy-aware video streaming with QoS control for portable com-
puting devices. In Proceedings of the international workshop on Network
and operating systems support for digital audio and video (NOSSDAV),
County Cork, Ireland, June 2004.
Tektronix. MPEG elementary streams.
ftp://ftp.tek.com/tv/test/streams/Element/index.html, 1996.
Lothar Thiele, Samarjit Chakraborty, and Martin Naedele. Real time cal-
culus for scheduling hard real time systems. In Proceedings of the IEEE
International Symposium on Circuits and Systems (ISCAS), pages 101–
104, May 2000.
Patrick Thiran, Jean-Yves Le Boudec, and Frederic Worm. Network calculus
applied to optimal multimedia smoothing. In Proceedings of the Annual
Joint Conference of the Computer and Communications Societies (INFO-
COM), pages 1474–1483, Washington, April 2001. IEEE.
Girish V. Varatkar and Radu Marculescu. Traffic analysis for on-chip net-
works design of multimedia applications. In Proceedings of the annual
conference on Design Automation (DAC), pages 416–434, New York,
June 2002. ACM.
Girish V. Varatkar and Radu Marculescu. On-chip traffic modeling and syn-
thesis for MPEG-2 video applications. IEEE Transactions on Very Large
Scale Integration Systems, 12(1):108–119, January 2004.
Ernesto Wandeler, Alexander Maxiaguine, and Lothar Thiele. Quantitative
characterization of event streams in analysis of hard real-time applications.
Real-Time Systems, volume 29, pages 205–225, Netherlands, 2005. Springer.
Ernesto Wandeler, Alexander Maxiaguine, and Lothar Thiele. Performance
analysis of greedy shapers in real-time systems. In Design, Automation and
Test in Europe (DATE), pages 444–449, Belgium, March 2006. European
Design and Automation Association.
Andreas Wieferink, Tim Kogel, Rainer Leupers, Gerd Ascheid, Hein-
rich Meyr, Gunnar Braun, and Achim Nohl. System level proces-
sor/communication co-exploration methodology for multiprocessor system-
on-chip platforms. IEE Proceedings on Computer and Digital Techniques,
152(1):3–11, January 2005.
Duminda Wijesekera and Jaideep Srivastava. Quality of service (QoS) metrics
for continuous media. Multimedia Tools and Applications, 3(2):127–166,
July 1996.
Duminda Wijesekera, Jaideep Srivastava, Anil Nerode, and Mark Foresti.
Experimental evaluation of loss perception in continuous media. Multimedia
Systems, 7(6):486–499, November 1999.
Hoeseok Yang, Hyunuk Jung, and Soonhoi Ha. Buffer minimization in RTL
synthesis from coarse-grained dataflow specification. In Proceedings of
the workshop on Synthesis And System Integration of Mixed Information
technologies (SASIMI), Nagoya, Japan, April 2006.
Sungjoo Yoo, Gabriela Nicolescu, Lovic Gauthier, and Ahmed Amine Jerraya.
Automatic generation of fast timed simulation models for operating
systems in SoC design. In Proceedings of the Design, Automation and
Test in Europe Conference and Exhibition (DATE), pages 620–627, Paris,
France, March 2002.
Wanghong Yuan and Klara Nahrstedt. Energy-efficient soft real-time CPU
scheduling for mobile multimedia systems. In Proceedings of the sympo-
sium on Operating systems principles (SOSP), pages 149–163, New York,
October 2003. ACM.
Nicholas H. Zamora, Xiaoping Hu, and Radu Marculescu. System-level per-
formance/power analysis for platform-based design of multimedia applica-
tions. ACM Transactions on Design Automation of Electronic Systems, 12
(1):2, 2007.
Xinping Zhu and Sharad Malik. A hierarchical modeling framework for
on-chip communication architectures of multiprocessing SoCs. ACM
Transactions on Design Automation of Electronic Systems, 12(1):6, January 2007.
