Perception-aware low-power audio processing techniques for portable devices by HUANG WENDONG
  
 
PERCEPTION-AWARE LOW-POWER AUDIO 










NATIONAL UNIVERSITY OF SINGAPORE  
2008 
                                  
   
 
PERCEPTION-AWARE LOW-POWER AUDIO 
PROCESSING TECHNIQUES FOR PORTABLE DEVICES  
 
HUANG WENDONG 
(B.Eng., Xidian University)






A THESIS SUBMITTED 
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY 
 
 
DEPARTMENT OF COMPUTER SCIENCE 
 






First and foremost, I sincerely thank my advisor, Dr. Wang Ye, for providing 
immediate help whenever I met difficulties in my study. I consider myself very 
fortunate to have studied in his group. I continuously benefit from his guidance, 
encouragement and support in so many ways. He identifies my problems and helps me 
to correct them, encourages me to pursue academic goals, and gives me ample 
opportunities to develop my research ability. Without his solid support, this thesis 
would not have been possible.
 
I would like to thank Dr. Samarjit Chakraborty for introducing me to the field of 
embedded systems. It was during the joint project with him that I learned the 
SimpleScalar tool set and gained an understanding of network calculus. Both have 
proved helpful for my thesis work.
 
I would like to thank Dr. Wei Tsang Ooi and Dr. Weng Fai Wong for their valuable 
suggestions on my thesis proposal. These suggestions have inspired me to consider my 
thesis work from new perspectives. 
 
I thank everyone in multimedia lab 3 and DIVA. They are all good lab-mates and 
always ready to help me. I bothered them again and again to take part in those boring 
audio subjective tests, and they never hesitated to do so. I will certainly miss Huaxin 
and Zhaoming for their kindness. I will miss Yicheng, Zhang Sheng, and Zhehui as 
well, for discussing interesting problems. I especially thank Tran Vu An for helping 
me to prepare video experimental data and organize subjective tests for the thesis work. 
I thank Xiaopeng for his kindness. I thank Ye Ning for his professional answers to my 
various questions about LaTeX. I thank Bingjun for sharing uncertainty modeling 
materials with me, although I have in fact spent little time reading them.
 
I am very grateful to my parents. They always encourage me, support me with 
dedication and ask for nothing in return. They are a constant source of spiritual 
strength for me.
 
Last, but not least, I would like to thank my wife, Liu Bo, for all she has done during 
these four years. She has managed to free me from the care of the housework. She has 
missed many opportunities for enjoyment while I have been occupied by work. And 
she has suffered a lot from my tension and frustration. But she has shown 
understanding and has said nothing about all of this, although she always complains that 




Table of Contents 
Chapter 1 Introduction
1.1 Background: System Organization and Power Consumption Issues
1.1.1 System Organizations and Sources of Power Consumption
1.1.2 Energy Efficient Approaches for Computation Components
1.2 Characteristics of Audio Decoding Applications
1.3 Related Works
1.3.1 Workload Reduction
1.3.2 DVS Techniques
1.3.3 Main Challenges of the Existing Techniques for Low Power Audio Applications
1.4 Our Methodology of Low Power Audio Techniques for Portable Devices
1.5 Contributions of the Thesis
Chapter 2 A Joint Encoder-Decoder Framework for Supporting Low Power Audio Decoding
2.1 Introduction
2.2 Related Works
2.2.1 Noise Shaping Techniques in AAC
2.2.2 Computation Efficient Techniques for Transforms
2.3 Overview of the Proposed Work
2.4 Joint ASP and Quantization Noise Shaping
2.4.1 Truncation Noise Shaping of SOPOT Coefficients
2.4.2 Noise Allocation over SOPOT Coefficient Blocks
2.4.3 Workload Estimation Module
2.5 Experimental Results
2.5.1 IFFT Workload Reduction
2.5.2 Subjective Evaluation
2.5.3 Increase of File Sizes
Chapter 3 An Optimal DVS Scheme Supported by Media Servers for Low-Power Multimedia Applications
3.1 Introduction
3.2 Problem Formulation
3.3 Energy Optimization Techniques
3.3.1 Bounds on the Processor Speed
3.3.2 Estimation of the Input Buffer and the Playback Buffer
3.3.3 The Optimal Speed Profile Algorithm
3.4 Experimental Results
3.4.1 Experimental Results for Audio
3.4.2 Experimental Results for Video
3.5 Proof of Optimality
Chapter 4 Frequency Band and Stereo Image based Workload Scalable Decoding Scheme
4.1 Introduction
4.1.1 Perception-Awareness in Audio Decoding
4.1.2 Perception aware Workload Scalable Processing
4.2 Frequency Band and Stereo Image Scalable Decoding
4.2.1 Frequency Bandwidth Scalability
4.2.2 Stereo Image Scalability
4.2.3 Multiple Level Decoding
4.3 Efficient Algorithm for Synthesis Filterbank
4.3.1 Asymmetric Partial Spectrum Reconstruction for Stereo Audio
4.3.2 Conceptual Framework
4.3.3 Cosine Re-modulation
4.3.4 Polyphase Subfilters
4.3.5 Up-Sampling by Repetition
4.4 Experimental Evaluation
4.4.1 Subjective Evaluation of BSS Decoding Scheme
4.4.2 Workload Estimation
Chapter 5 Conclusions and Future Works
5.1 Summary
5.2 Future Works




Energy efficiency is a critical design consideration for portable devices. With the 
popularity of multimedia applications on such platforms, energy efficiency methods 
optimized for these applications are becoming increasingly important. 
 
In this thesis, we study perception-aware low power audio processing techniques for 
portable devices. This work is mainly motivated by the fact that audio decoding 
is a significant source of energy consumption on portable devices, yet it has 
received much less attention so far. Energy efficient techniques 
have been widely studied for video decoding applications. Audio decoding 
applications, however, have different characteristics and more stringent requirements on 
playback quality. As a result, low power audio decoding is not 
sufficiently supported by current high-level design methodologies and concrete 
techniques for low power multimedia processing.
 
Targeting low power audio decoding, we propose a new conceptual methodology 
framework based on the usage modes of the portable device. It makes use of two kinds 
of design strategies. First, when the application’s resource requirements 
are satisfied, we extend the low power design to the encoder and media server side to 
optimize the energy efficiency of the decoding process without degrading 
playback quality. Second, when its requirements cannot be satisfied because of 
limited available resources, we propose the concept of workload scalable decoding to 
support low power resource scheduling.
 
The main contributions of the thesis are as follows. 
We present a novel scheme, a joint encoder-decoder framework (JEDF), which allows 
the decoder to achieve a desirable tradeoff between energy and storage consumption 
without sacrificing playback quality. JEDF employs an Approximate Signal Processing 
(ASP) technique at the decoder side to reduce the computational workload. To guarantee 
the playback quality, JEDF jointly shapes the ASP noise (introduced by the decoder) 
and the quantization noise (introduced by the encoder) subject to the masking 
threshold.
 
We propose a new scheme in which the media server supports Dynamic Voltage Scaling 
(DVS) for low power multimedia decoding, to overcome the inherent limitations of 
existing DVS techniques. In this new direction, we have designed an optimal speed 
control scheme, which achieves the maximal energy savings among all feasible speed 
profiles for the given multimedia bitstream.
 
We propose a frequency Band and Stereo-image Scalable (BSS) decoding framework 
based on an analysis of the perceptual relevance of different audio components. BSS 
provides the desired workload scalability for the resource scheduling process. 
In particular, we have designed a novel algorithm, namely asymmetric partial spectrum 
reconstruction (APSR), to remove the redundant computations associated with 
stereo-image scalability.
 
List of Tables 
Table 3.1 Experimental results on energy consumption and buffer requirements for 
audio bitstream. IB and PB: input buffer size and playback buffer size of the proposed 
scheme; EnR and PBR: energy consumption ratio and playback buffer requirement of 
the baseline over the proposed scheme, respectively.
Table 3.2 Configurations of decoding the six video clips. FI and FP: feasibility 
condition for input buffer and playback buffer respectively, both of them measured in 
Macro Blocks, the value in bold is used for both input buffer and playback buffer to 
estimate the other items; IB and PB: input buffer size and playback buffer size, 
measured in Kbytes, both of them derived from the max(FI,FP); Delay: introduced 
delay by buffering in sec.
Table 3.3 Comparisons between our scheme and the baseline 2. NEC: normalized 
energy consumption of the baseline 2 over our scheme; BUF: maximal buffer 
occupancy of the baseline 2 in terms of Macro Blocks; RED: reduced buffer size ratio 
achieved by our scheme (referred to Table 3.2).
Table 3.4 Energy consumption ratio between the proposed scheme and the TMEC
Table 4.1 Four different decoding groups
Table 4.2 Five decoding levels, where workload reduction is measured in terms of 
subbands, with a standard MP3 decoder (decoding level 5) as the baseline
Table 4.3 Perceptual evaluation results for different APSR profiles
Table 4.4 The average workload (MIPF) for the five decoding levels, along with the 
normalized workload reduction with respect to the standard MP3 decoder (decoding level 




List of Figures 
Figure 1.1 Power consumption ratios among three power hungry components of an 
iPAQ when running a video application
Figure 1.2 Power breakdown for StrongArm microprocessor at 60MHz 500mW and 
Alpha 21164
Figure 1.3 A two-level software architecture for energy efficiency
Figure 1.4 A two-state model of the voltage scheduler
Figure 2.1 Illustration of the proposed scheme, where MT, QN, AN, and MQD stand 
for masking threshold, quantization noise, ASP noise, and 
masking-to-quantization-noise-difference, respectively: a) for a conventional AAC 
encoder: the sum of the additional ASP noise and the quantization noise exceeds the 
masking threshold; b) for our scheme: with reduced quantization noise, the overall 
noise is below the masking threshold.
Figure 2.2 Architecture of the proposed audio encoder
Figure 2.3 The flow graph of a 16-point inverse FFT with marked coefficient blocks
Figure 2.4 Normalized workload for the test audio clips, where SF denotes the scaling 
factor of truncation noise
Figure 2.5 Averaged MOS values with standard deviations for the test audio clips, 
where SF denotes the scaling factor of truncation noise
Figure 2.6 Increase ratios of file sizes for various encoding bit rates
Figure 3.1 Architecture of the multimedia processing system at the client site
Figure 3.2 Illustration of the optimal speed profile algorithm
Figure 3.3 The Optimal Speed Profile Algorithm
Figure 3.4 Normalized energy consumption between our scheme and the baseline 1 for 
the six test video clips
Figure 3.5 Normalized energy consumption with the buffer sizes increased from the 
feasibility condition for the six video clips
Figure 3.6 Illustration of the speed profiles based on the clip “Hall”: a) comparison 
between the two baseline DVS schemes. In baseline 3, the moving window size is 32 
MacroBlocks. In baseline 4, we calculate the speed every 10 MacroBlocks; b) 
comparison between GAS (global averaged speed), the baseline 4 with buffer size 
1029 MacroBlocks and our scheme with the configuration in Table 3.1
Figure 3.7 (a) Illustration of the splitting operation; (b) Illustration of the speed 
profiles, thin lines stand for speed profile S*, thick lines stand for speed profile S.
Figure 4.1 High-level block diagram of the BSS decoding scheme supporting voltage 
scheduler in low power state and the user’s power saving switch.
Figure 4.2 Block diagram of the proposed frequency band scalability
Figure 4.3 Block diagram of the proposed BSS multi-level decoding algorithm with 
frequency band scalability and stereo image scalability, where B1-B4 stand for middle 
channel, side channel, left channel and right channel data, respectively.
Figure 4.4 Structure of synthesis filter bank in MPEG-1 audio
Figure 4.5 Evaluation results for different BSS configurations
 
Chapter 1  
 
Introduction 
Power consumption has become a critical design consideration for battery-powered 
portable devices, such as mobile phones, PDAs and audio/video players. In recent 
years, the power consumption of portable devices has grown rapidly as a result of 
technical advances in hardware and software. From the perspective of hardware, the 
power per unit area of the integrated circuit chip is growing, as the 
semiconductor industry continues to improve the performance of the circuit and to 
integrate more functions into the chip. From the perspective of software, multimedia 
applications, which are characterized by high computational requirements, have become 
popular on these platforms. On the other hand, battery technology has been 
progressing at a much slower pace. These facts suggest that battery life has 
become the major bottleneck for multimedia applications on portable devices. 
Energy efficient techniques are therefore becoming increasingly important for these applications.
1.1 Background: System Organization and Power Consumption Issues
In this thesis, we concentrate on audio processing techniques that can lower the power 
consumption of portable devices with general-purpose hardware platforms. These 
techniques do not rely on any specific hardware implementations of the decoder, or on 
any coprocessors to implement specific parts of the decoder. The significance of our 
work stems from the fact that increasingly many consumer electronics products, such as 
mobile phones, personal digital assistants (PDAs) and other similar portable devices, 
are being built using general purpose hardware platforms [63]. The only difference 
between these devices will be the software applications that run on them. Meanwhile, 
several different functionalities are simultaneously provided by a single device: a 
mobile phone also works as a PDA and a music player. Hence, the focus in the 
portable embedded systems domain is increasingly shifting towards appropriate software 
implementations of different functionalities, rather than tailored hardware for different 
applications.
We believe that it will soon be common to use PDAs, mobile phones or other 
portable devices, which will have powerful but general-purpose processors, as 
portable audio/video players, by running a suitable decoder application. Our solutions 
will be useful in such a scenario, where hardwired audio/video decoder chips 
implementing a specific codec will be of limited use.
1.1.1  System Organizations and Sources of Power Consumption  
When considering the general purpose hardware platform of a state-of-the-art portable 
device, we can distinguish three major constituents that consume significant power: 1) 
computation components, which mainly include general purpose processors and 
memory; 2) displays; 3) wireless network interface cards (WNICs). Leaving aside, for 
a moment, the issue of computation components, we first look at displays and WNICs. 
They may consume a significant fraction of the overall power, as illustrated by a 
video application running on an iPAQ in Figure 1.1 [95][22].
 
Figure 1.1  Power consumption ratios among three power hungry components of 
an iPAQ when running a video application 
From Figure 1.1, we can see that the display and the WNIC are responsible for around 
71% of the total power consumption of these three constituents. In such a case, 
considerable effort should be devoted to the energy efficiency of the display and the 
WNIC [82]. On the other hand, the three constituents may contribute differently to 
the overall power consumption in different applications. The low power design of each 
application should therefore focus on the constituents that dominate its power 
consumption.
Concerning audio decoding applications, computation components dominate the 
overall power consumption, as will be explained in section 1.2. We therefore 
restrict our attention to computation components. Computation components are 
commonly analyzed at two basic levels of abstraction. The lower level is the circuit level, 
where accurate information about the internal nodes of the circuit is available. Based 
on the parameters of the circuit, researchers are able to build general power 
consumption models with acceptable accuracy. These models are the foundation on 
which higher-level power consumption models are built, and they facilitate the 
exploration of the power-performance tradeoff for circuits. The higher level is the 
architecture level, where it is very difficult to estimate the parameters of the circuit 
accurately, since doing so involves intractable computation [36]. At this level we 
therefore investigate the power consumption characteristics of the computation 
components instead. This knowledge is helpful for designing energy efficiency 
strategies for these components.
We first discuss sources of power consumption at the circuit level. Nowadays, 
computation components are dominantly implemented using complementary metal-
oxide semiconductor (CMOS) logic circuits. For a CMOS circuit, its power 
consumption has three terms [9], which are defined in (1.1): 
$$P = a \cdot C \cdot V^{2} \cdot f + \tau \cdot a \cdot V \cdot I_{short} \cdot f + V \cdot I_{leak} \qquad (1.1)$$
The first term in (1.1) corresponds to dynamic power consumption caused by the 
charging and discharging of the capacitive load of each gate. It is proportional to the 
operating frequency of the system f, the square of the supply voltage V, the activity 
factor a, and the total capacitance C. The second term estimates the power dissipation 
caused by short-circuit current $I_{short}$, which flows between the supply voltage and the 
ground during $\tau$ when a CMOS logic gate’s output switches. The third term measures 
the power consumption from the leakage current $I_{leak}$.
It should be noted that for current CMOS circuits, the dynamic power consumption 
dominates the overall power consumption [61]. For this reason, we will focus on 
dynamic power consumption throughout the thesis.
 
Another important fact about CMOS is that its maximum operating frequency is 
determined by its supply voltage, as shown in (1.2): 
$$f_{max} \propto \frac{(V - V_{threshold})^{2}}{V} \qquad (1.2)$$
From (1.2), we can see that the maximum operating frequency is roughly proportional 
to the supply voltage V when V is well above the threshold voltage. As shown in (1.1), 
the dynamic power consumption is proportional to the square of the supply voltage, so 
reducing the supply voltage seems to be an effective means of saving power. But 
relationship (1.2) poses a fundamental limitation on this method: reducing the supply 
voltage also lowers the maximum operating frequency and thus prolongs the execution 
time of the target application, so the energy finally saved is smaller than the 
reduction in power alone would suggest.
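
The trade-off between (1.1) and (1.2) can be made concrete with a small numerical sketch. It assumes that only the dynamic term of (1.1) matters, that the processor always runs at the maximum frequency allowed by (1.2), and that the proportionality constant in (1.2) is 1. The function names and all parameter values (threshold voltage, activity factor, capacitance, cycle count) are illustrative assumptions in arbitrary units, not measurements from this thesis.

```python
def max_frequency(v, v_threshold=0.4, k=1.0):
    """Maximum clock frequency allowed by (1.2): k * (V - V_threshold)^2 / V."""
    if v <= v_threshold:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * (v - v_threshold) ** 2 / v


def dynamic_power(v, f, activity=0.5, capacitance=1.0):
    """Dynamic term of (1.1): a * C * V^2 * f."""
    return activity * capacitance * v ** 2 * f


def task_cost(v, cycles=1.0e6):
    """Return (power, time, energy) for a task of `cycles` clock cycles at voltage v."""
    f = max_frequency(v)          # assumed: run as fast as the voltage allows
    power = dynamic_power(v, f)
    time = cycles / f             # a lower voltage means a lower f and a longer run
    return power, time, power * time


if __name__ == "__main__":
    for v in (2.0, 1.5, 1.0):
        power, time, energy = task_cost(v)
        print(f"V={v:.1f}  power={power:.3g}  time={time:.3g}  energy={energy:.3g}")
```

With these assumed values, lowering the voltage from 2.0 to 1.0 cuts the power by more than a factor of ten but the energy only by a factor of four, because the same task takes several times longer to finish.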
 
At the architecture level, the significant energy consumers among the computation 
components are the clock distribution network, the data-path, and the memories. 
A computer system needs clock signals to define a time reference for operations 
within the system. The clock distribution network distributes the clock signals from a 
common point to all the units that need them. It often accounts for a 
significant fraction of the power consumption since clock signals 1) are loaded with 
the greatest fanout, 2) travel over the longest distances, and 3) operate at the highest 
speeds of any signal within the system [39]. The main contributors to energy 
consumption in the clock distribution network are the clock generation circuit, the clock 
distribution buffers and wires, and the clock load on the clock network [33].
Data-path refers to the collection of execution units that are required to perform data 
processing. In a computer, a data-path typically includes arithmetic logic 
units (ALUs), shifters, multipliers, register files, etc. It is closely related to the 
computation functionality of the processor. The energy consumed in a data-path is 
determined by the number, types, and sequence of instructions executed [60].
Power consumption of memory is strongly dependent on its organization [48]. In 
modern computer systems, a multi-level memory architecture is employed to improve 
system performance: low hierarchy levels are made of small, fast memories, 
and higher hierarchy levels are made of memories of increasing size and access 
time. Each level of memory is backed by the next level. The processor 
always attempts to access the requested data from the first level. When the data is not 
available at this level, it is retrieved from the next level of memory. 
Corresponding to this architecture, the interconnections among different levels of 
memory, namely buses, have different characteristics: low level memories are 
connected by on-chip buses, which are more energy efficient than the off-chip buses 
used to connect high level memories, owing to their shorter length, narrower bit width, 
and lower driving voltages.
Different levels of memory are made of different types of RAM. The low hierarchy 
levels of memory (cache) are usually made of static RAM (SRAM), whose power 
consumption mainly comes from accesses. The high hierarchy level of memory (main 
memory) is typically made of dynamic RAM (DRAM), which requires periodic refresh 
operations to maintain the stored data; these are an additional source of power 
consumption besides memory access [61].
Memory access is a significant source of power consumption since each access 
activates several power-hungry circuits: row and column decoders, wordlines, bitlines, 
and sense amplifiers [19]. Another important source associated with memory access is 
data transfer, whose cost depends on the number of transactions on the bus, the bus 
capacitance, the bus width, and the switching activity on the bus [99].
 
Finally we discuss the non-volatile storage of portable devices. In section 
1.2, we will see that an audio player running on a portable device is usually fed data 
from local storage, so the performance of non-volatile storage should be taken into 
consideration in low power design. In portable devices, flash ROM is 
employed as the non-volatile storage because of its lower power consumption, faster read 
access, and better shock resistance compared with hard disks. The power 
consumption of flash ROM is lower than that of DRAM [12][111], and the amount of data 
it handles is much smaller than that of main memory, since an audio player reads the 
compressed data only once during the decoding process. These two factors make the 
energy consumption of flash ROM much lower than that of computation components. This 
is well illustrated in [100]: for an MP3 software decoder running on a portable device, 
the energy consumption of flash ROM amounts to only 1.9% of the energy consumption of 
computation components. This result shows that the energy consumption of flash 
ROM is negligible in comparison with that of computation components.
  
At the architecture level, the power consumption of a component is influenced by 
various factors, such as cache sizes and hierarchy levels in the memory subsystem. 
Different architectures may choose diverse configurations for these factors, and 
consequently, the component will contribute with different weights to the total power 
consumption in different architectures. These weights may vary widely, as 
demonstrated by the power breakdowns for two representative processors in Figure 
1.2 [83][44]. Figure 1.2 reveals that, because of this diversity, a low power 
design optimized for one architecture may be much less efficient for others.  Low power






















Figure 1.2 Power breakdown for StrongArm microprocessor at 60MHz 500mW 
and Alpha 21164  
1.1.2  Energy Efficient Approaches for Computation Components 
As shown in (1.1), to achieve energy efficiency, we need to reduce the values of one 
or more variables: activity factor, capacitance, operating frequency and supply voltage. 
To this end, techniques have been developed at different levels, from the bottom up: 
the logic level, the architecture level and the software level. Logic level techniques 
include clock-gating, half-frequency and half-swing clocks, and asynchronous logic. 
Architecture level techniques include parallelism and speculation [84].
Besides the abovementioned logic and architecture level techniques, which directly 
lead to energy efficiency, hardware also provides supporting mechanisms that allow 
software to achieve energy efficiency. Two widely used mechanisms of this kind are 
Dynamic Voltage Scaling (DVS) and Dynamic Power Management (DPM). DVS 
changes the processor speed to match the workload requirement of applications. DPM, 
on the other hand, switches off parts of the system when they become 
idle. Today’s processors for portable devices widely support DVS. For example, 
Intel’s XScale and SA 11xx series and Transmeta’s Crusoe provide multiple levels 
of supply voltages and operating frequencies to allow DVS operations. DPM 
schemes were originally motivated by low power design for peripherals. With the 
advance of technology, some memory architectures have begun to support DPM as well. A 
well known method is memory banking [34], which splits the memory into banks and 
only activates the banks in use. With memory banking, a straightforward energy 
efficiency method for the memory subsystem is to reduce the memory footprint of the 
target application.
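
The effect of memory banking on footprint reduction can be sketched as follows. This is a minimal illustration only: it assumes the application's data is packed contiguously into the fewest possible banks, and the function names, bank size and per-bank power figures are hypothetical values chosen for the example, not data from [34] or from any particular memory device.

```python
def active_banks(footprint_bytes, bank_bytes):
    """Banks that must stay powered to hold a contiguous footprint (ceiling division)."""
    if footprint_bytes < 0 or bank_bytes <= 0:
        raise ValueError("footprint must be non-negative and bank size positive")
    return (footprint_bytes + bank_bytes - 1) // bank_bytes


def memory_power_mw(footprint_bytes, bank_bytes, total_banks,
                    active_mw=10.0, sleep_mw=1.0):
    """Total memory power when only the occupied banks are kept active.

    active_mw and sleep_mw are assumed per-bank figures, for illustration only.
    """
    used = active_banks(footprint_bytes, bank_bytes)
    if used > total_banks:
        raise ValueError("footprint does not fit in the available banks")
    return used * active_mw + (total_banks - used) * sleep_mw


if __name__ == "__main__":
    KB = 1024
    for footprint in (300 * KB, 200 * KB):
        power = memory_power_mw(footprint, bank_bytes=256 * KB, total_banks=8)
        print(f"footprint={footprint // KB} KB  memory power={power} mW")
```

In this sketch, shrinking the footprint from 300 KB to 200 KB lets one more 256 KB bank sleep, which is the sense in which a smaller memory footprint translates directly into lower memory power.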
 
Among these levels of techniques, we are most interested in the software level energy 
efficient techniques since they are the sole techniques applicable to the development 
of a portable device, whereas those lower levels techniques are only suitable for the 
design of the hardware architecture. Software has a substantial impact on the power 
consumption of a system since it is software that controls the activity of the hardware, 
including exploiting the low power mechanisms provided by hardware. We divide the 
software level energy efficient techniques into two classes. The first class adjusts the 
activities of the hardware to match the characteristics of the target application, such as 
switching off the unused components of the system to save energy. The second 
class shapes the behavior of the applications to support energy reduction, such as 
reducing the number of memory accesses. We call the former hardware matching 
techniques and the latter software shaping techniques.  
                                           
Hardware matching techniques for multimedia applications widely employ DVS since 
multimedia decoding processes are computationally expensive and their workloads 
exhibit high variability. Video decoding is a typical example: the ratio of 
its maximum to its average workload can be as high as 10 [52]. Without 
DVS, the processor speed must be set to a constant value that corresponds to the 
worst case workload to guarantee the playback quality [108] [56], which wastes much 
energy.  
When designing a DVS scheme for a wide range of portable devices, one of two approaches 
is usually adopted, each resting on a different assumption. The first approach 
assumes that the target processor will meet all the requirements of applications, 
including the dynamic range of the speeds and the continuity of the speed changes 
[71]. With this assumption, the processor speeds can be completely derived from the 
workload of the target application. Recognizing that this assumption may be infeasible 
for actual platforms, some techniques, such as dithering [76], have been developed to 
realize the derived speeds on the target platform. As an alternative, the second 
approach combines these two steps into a single one. It employs a voltage scaling 
model which has discrete voltage levels to abstract various actual platforms [73]. In 
implementation, the voltage scaling model needs to be mapped to the target platform. 
To facilitate such mapping, the voltage scaling model should be chosen to be simpler 
than actual platforms. For example, some works employ a voltage scaling model with 
two voltage levels, whereas actual platforms usually offer more levels. In the second 
approach, the resulting processor speeds are derived from the workload of the 
application and the voltage scaling model. The choice of the voltage scaling model is 
essentially a tradeoff between generality and performance: a generic solution needs to 
oversimplify the voltage scaling model to be applicable to diverse platforms, which 
results in higher fluctuation levels in speed profiles, or larger buffer sizes, and thus 
in performance degradation.   
Therefore, in terms of energy efficiency, the first approach discussed above is superior 
to the second one. We formalize it as two parts: 1) software oriented DVS, which 
covers the techniques for deriving processor speeds from the workload of the 
application; and 2) platform oriented DVS, which covers the techniques for mapping 
the processor speeds from software oriented DVS to the target platform while fully 
exploiting the scaling levels that the platform provides. This is more efficient than 
relying on an oversimplified voltage scaling model.   
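As a hedged illustration of platform oriented DVS, the following minimal sketch maps an ideal speed produced by software oriented DVS onto a platform with discrete speed levels by time-sharing the two neighbouring levels, which is one common reading of dithering. The function name, the level values and the assumption of negligible switching overhead are illustrative and are not taken from [76].

```python
from bisect import bisect_left

def dither(target, levels):
    """Return [(level, time_fraction), ...] whose weighted mean equals target.

    levels: sorted list of discrete speeds offered by a hypothetical platform.
    Switching overhead is ignored (an assumption of this sketch).
    """
    if not levels[0] <= target <= levels[-1]:
        raise ValueError("target speed outside the platform's range")
    i = bisect_left(levels, target)
    if levels[i] == target:            # target coincides with a real level
        return [(levels[i], 1.0)]
    lo, hi = levels[i - 1], levels[i]  # neighbouring levels around the target
    frac_hi = (target - lo) / (hi - lo)
    return [(lo, 1.0 - frac_hi), (hi, frac_hi)]

# Hypothetical platform with four speed levels (MHz); the ideal speed is 250 MHz.
plan = dither(250.0, [100.0, 200.0, 400.0, 600.0])
average = sum(level * frac for level, frac in plan)
```

Here the plan spends three quarters of the time at 200 MHz and one quarter at 400 MHz, so the average speed matches the ideal 250 MHz while using only levels that the platform actually offers.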
Another aspect of DVS is its implementation, which has a significant influence on its 
energy efficiency. DVS can be performed at two levels: the application level and the 
operating system level. Application level DVS is only applicable to single task 
mode [102], where the processor speed is scaled according to the needs of the target 
application. Application level DVS has achieved impressive energy efficiency. When 
multiple applications are concurrently executed, operating system level DVS has to be 
applied since only the operating system can access the information of the global 
resource usage and allocate the computation resources in a consistent way. The 
voltage scaling for multiple tasks is defined as voltage scheduling, which adds a new 
dimension to the conventional task scheduling and resource allocation of the operating 
system. Operating system level DVS has to tackle much more complicated problems 
than application level DVS, which leads to a significant degradation of its energy 
efficiency. As compared in [102], for the same application, operating system level 
DVS can achieve 17% energy reduction while application level DVS can achieve 90% 
energy reduction. This example illustrates that there is an urgent need to improve the 
performance of operating system level DVS. One important method is to have 
applications support the DVS operations: some works investigate the possibility of 
having the application assist the workload estimation. As voltage scaling can be 
considered a special case of voltage scheduling, for convenience, we refer to 
both cases as the voltage scheduler.  
Hardware matching DVS has three main limitations. First, DVS only focuses on 
energy efficiency of the processor. Other sources of energy consumption, such as 
memory, are beyond its scope. Second, due to the limitation of equation (1.2), the 
saved energy has a linear relationship with the reduction of processor speeds, whereas 
a super-linear relationship, which saves more energy for the same workload reduction, 
is preferable. Third, the lower bound of the energy consumption of DVS 
is determined by the overall workload of the target application (see section 3.1 for a 
detailed discussion). Hardware matching DVS is incapable of improving its energy 
efficiency beyond this bound. These three weaknesses can be remedied by software 
shaping techniques.    
   
Software shaping techniques are closely related to the development of a program, 
which can be divided into two levels: the high level focuses on designing an 
algorithm that is effective for the problem and efficient in terms of time and storage 
complexity; the low level focuses on implementing the algorithm, mapping 
algorithmic operations to the available hardware with maximum energy 
efficiency. Corresponding to this development strategy, the low level software 
shaping techniques change the computational structure of algorithms in a 
manner that preserves their input/output behavior. Techniques of this kind are called 
code transformation. Code transformation techniques usually fall into three categories 
[109]: 1) minimizing the number and cost of memory accesses; 2) selecting the least 
expensive instructions or instruction sequences; and 3) exploiting the power minimization 
features of the hardware. In implementation, code transformation involves 
translating a specification in a high-level language into optimized machine code for 
the target processor [100] [103]. From this perspective, compilers become the main 
tools for code optimization. The advantages of code transformation techniques are that 
they are not application specific and require no effort from the programmer, largely 
alleviating his/her burden. However, different platforms have different 
characteristics, so compiler optimization is platform dependent. 
Consequently, code transformation relies on the availability of an optimizing compiler 
for the target platform. More importantly, because they focus only on the lower level of 
software development, code transformation techniques have less impact on energy 
consumption than algorithm design, the higher level of software development. 
 
Specific to multimedia processing, a more aggressive software shaping 
approach, which corresponds to the algorithm design level, is to reduce the workload of 
the target application. There is a fundamental difference between multimedia 
processing and other applications such as scientific computation, which stems from the 
characteristics of the human perception system: exact reconstruction of the original 
multimedia content is usually unnecessary since the human perception system tolerates 
a certain amount of distortion. This property enables the precise algorithms in 
multimedia processing to be replaced by approximate algorithms, which yield 
reduced workload at the cost of additional approximation noise. This approach has found 
widespread use in multimedia processing since a small degradation of quality is 
exchanged for a large workload reduction [5]. 
Workload reduction is an important low power approach. First, it is a general approach 
for energy efficiency, not relying on any specific hardware implementations. Second, 
workload reduction can achieve significant energy efficiency for both the processor 
and memory. The reduced workload enables the target application to run at lower 
processor speeds without prolonging the execution time, which is superior to the 
performance of hardware matching DVS. Furthermore, workload reduction implies 
that the activity factor in (1.1) is reduced accordingly [84]. Considering these two 
factors, as a first order approximation, there is a quartic relationship between the 
reduced workload and the reduced energy consumption of the processor. In addition, 
workload reduction leads to fewer memory accesses, achieving energy savings in 
the memory.  
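The quartic relationship can be sketched numerically as follows. This is only a first-order illustration: it assumes that (1.1) has the usual dynamic power form P = a * C * V^2 * f in the four variables named earlier, that the supply voltage scales in proportion to the operating frequency, and that both the frequency and the activity factor scale linearly with the workload. The function name and the scaling factor r are hypothetical.

```python
def relative_power(r):
    """Relative dynamic power when the workload is scaled by r (0 < r <= 1).

    Assumes P = a * C * V**2 * f with the capacitance C unchanged and with
    f, V and a each scaling linearly with r (first-order assumptions).
    """
    activity = r      # fewer operations, hence a lower activity factor
    frequency = r     # same execution time with r times as many cycles
    voltage = r       # supply voltage assumed proportional to frequency
    return activity * voltage ** 2 * frequency

# Halving the workload under these assumptions:
half = relative_power(0.5)   # 0.5 ** 4 = 0.0625
```

Because the execution time is unchanged, the relative energy equals the relative power, so under these assumptions halving the workload reduces the processor energy to about one sixteenth.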
 
After reviewing the software level techniques, an important problem naturally arises: 
is it possible to improve energy efficiency of these techniques? We seek to address it 
from two aspects. The first aspect is to exploit complementarities among different 
types of techniques, such as workload reduction and code transformation, workload 
reduction and DVS. Such complementarities offer the potential to improve the energy 
efficiency. The second aspect is to address the diversity of platforms. Although it 
has received less attention, we believe that platform diversity may be an important 
source of inefficiency. As discussed in section 1.1, the hardware platforms of current 
portable devices are highly diverse. A generic energy efficiency scheme 
aimed at various platforms always encounters the conflict between the diversity 
of platforms and the requirement of a generic solution. Such a conflict may cause 
performance degradation, which has already manifested itself in the case of DVS. 
Platform diversity implies that we cannot design a single optimized scheme for all of 
these platforms. There is a gap between a generic algorithm and its 
implementation targeted at a specific platform. 
Given the above considerations, we propose a two-level energy efficient software 
architecture, which is shown in Figure 1.3. The high level combines workload 
reduction and software oriented DVS, which serve as generic energy efficiency 
approaches and are shared by all platforms. The low level includes code 
transformation and platform oriented DVS, which are used to perform platform 
specific optimization. The low level bridges the gap between a generic solution and 
the targeted platform. We believe that the proposed architecture achieves 
the desired objectives, i.e. exploiting complementarities and addressing the diversity 
issue, and will significantly improve energy efficiency. 
 
Figure 1.3  A two-level software architecture for energy efficiency 
In this thesis, we will limit our focus to workload reduction and software oriented 
DVS, both of which fall into the class of platform independent techniques. Throughout 
the thesis, we will use the resulting workload to represent the effectiveness of our 
workload reduction techniques, rather than the energy consumption of some specific 
platform. This is mainly because, according to the proposed software architecture, 
the algorithms with reduced workload need further optimization by code 
transformation techniques for the target platform. Moreover, workload is an intrinsic 
metric for such techniques and is closely related to power consumption. Compared 
with energy consumption, workload measurement has two main merits. 
First, workload is very consistent across different platforms for a given algorithm [26]. 
Second, accurate workload estimates can be obtained efficiently, for example through 
algorithm analysis or simulation, which facilitates their application. 
 
1.2 Characteristics of Audio Decoding Applications 
So far, research on energy efficient multimedia applications has concentrated on video 
decoding applications. In comparison, low power audio decoding techniques have 
received much less attention. However, we believe that audio decoding applications 
are a significant source of energy consumption and require different techniques for 
low power design. Both points stem from the users’ experience with audio, which 
can be characterized as follows. First, users are very critical of the playback quality of 
audio clips. Due to the long history of digital audio, they may require CD quality in 
the context of portable devices and are usually reluctant to sacrifice playback quality 
in exchange for energy efficiency. Second, users tend to play back their 
favorite audio clips repeatedly, whereas they play back a video clip only once in most 
cases. This leads to different usage modes for video and audio: users prefer 
download-playback mode for audio and streaming mode for video, which gives low power 
audio techniques a different focus from video. Third, there are far fewer limiting 
factors for listening to audio in the context of portable devices. Portable devices are 
typically used alongside other activities such as walking, exercising 
or driving a car. In such situations, watching video is inconvenient for users since they 
have to look at the screen, which interferes with their other activities. In contrast, users 
can listen to music while doing other things. This makes audio applications among the 
most frequently used applications on portable devices.  
Based on these observations, we can characterize audio decoding applications from 
the following aspects:  
• An important source of energy consumption: Based on the above discussion, 
audio decoding applications are responsible for a large fraction of the usage of 
portable devices. The energy consumption of an application is the product of 
its power consumption and execution time. Although its power consumption is 
less than that of video decoding applications, the audio decoding application 
becomes one of the most energy consuming applications due to its long 
execution time.   
• Computation dominant energy consumption: As mentioned above, most 
audio clips are accessed in download-playback mode, which does not involve 
the display and the WNIC, two significant energy consuming components of the 
portable device system. As a result, computation dominates the energy 
consumption of audio decoding applications. In this case, workload 
reduction and the related technique of DVS are effective methods to achieve 
energy efficiency for audio decoding.  
• High expectation of the playback quality: This requirement is clearly shown 
in the above analysis and it has a significant implication. As workload reduction is 
identified as the major method to achieve energy efficiency for audio decoding, 
workload reduction must not lead to degradation of the playback 
quality. This contradicts the fundamental principle in multimedia processing 
that workload reduction is achieved at the cost of playback quality, and this 
requirement is not supported by existing methods of workload reduction 
(see section 1.3). Thus workload reduction without degradation of 
quality is a special requirement for audio, and it becomes the most challenging 
issue in low power audio processing.  
 
In summary, audio decoding applications are an important source of energy 
consumption and require special techniques, especially workload reduction 
techniques, to meet the critical requirements of the users. Based on these observations, 
we will study perception-aware low power audio processing techniques for portable 
devices.  
1.3 Related Works   
In this thesis, we pursue low power audio processing through two fundamental 
approaches: workload reduction and DVS. In this section, we briefly review the 
related work on them to highlight the challenges of the current techniques.  
1.3.1  Workload Reduction 
Modern audio codecs, including MPEG-1 audio layer III (MP3), MPEG-2 and -4 
Advanced Audio Coding (AAC), widely employ transform based methods. These 
transform modules are responsible for a dominant fraction of the overall workload of 
the decoding process. Taking the MPEG-2 AAC low complexity profile as an example, the 
Inverse Modified Discrete Cosine Transform (IMDCT) module is responsible for 86% of the 
overall workload. Therefore, in this thesis, we will concentrate on workload reduction 
for transform algorithms. The workload reduction techniques for transform 
computation can be divided into three classes: data driven approaches, Partial 
Spectrum Reconstruction (PSR) and data representation based approaches. 
 
Data driven approaches are based on the statistical distribution of the input frequency 
coefficients. There are a number of zero-valued coefficients after the forward transform. 
This class of approaches eliminates the calculation for zero-valued coefficients since 
these coefficients make no contribution to the output of the transform. Data driven 
approaches include pruning techniques [77][106][49] and forward mapping 
IMDCT [78].  
 
PSR exploits the spectral characteristics of the transform. An attractive property of 
transforms such as the DCT is that they concentrate a large part of the energy of 
spatially correlated data into a small number of low frequency components. This results in 
high-magnitude low frequency coefficients and low-magnitude high frequency 
coefficients. Moreover, low frequency components are perceptually more important 
than high frequency components. Exploiting this property, PSR discards the 
low-magnitude high frequency coefficients and reconstructs only the low frequency 
coefficients, resulting in a low pass version of the original spatial samples [80] [2]. 
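The idea behind PSR can be sketched with a naive orthonormal DCT pair standing in for the IMDCT of an actual audio decoder. The function names and the sample values are hypothetical, and the O(N^2) loops are chosen for clarity rather than speed. The inner loop of the inverse runs over only the `keep` lowest-frequency coefficients, so the number of multiply-adds falls in proportion to the retained bandwidth.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of samples (naive O(N^2) form)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def partial_idct(coeffs, keep):
    """Inverse DCT using only the `keep` lowest-frequency coefficients."""
    n = len(coeffs)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(keep):   # coefficients keep..n-1 are never touched
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            acc += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(acc)
    return out

samples = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]   # hypothetical input block
coeffs = dct(samples)
full = partial_idct(coeffs, len(coeffs))   # exact reconstruction
low = partial_idct(coeffs, 4)              # half the bandwidth, half the work
error_energy = sum((a - b) ** 2 for a, b in zip(samples, low))
```

Since the transform in this sketch is orthonormal, the reconstruction error energy equals the energy of the discarded coefficients, which is small when the signal energy is concentrated at low frequencies.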
 
Computational workload is closely related to the data representation method. Floating 
point calculations have the highest workload requirement. The CPUs of portable devices 
are rarely equipped with floating point units, so the workload is further increased 
because floating point operations are emulated by software packages. To reduce workload, 
floating point calculations are approximated by fixed point calculations, which are further 
divided into two methods: fixed point multiplication [41] and Sum-Of-Power-Of-Two 
(SOPOT) methods [66] [16]. Fixed point multiplication scales the floating point 
coefficients and approximates them by integers. The resulting transforms have much lower 
computational workload than the floating point version. SOPOT methods decompose a 
multiplication operation into a sum of power-of-two operations. 
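A minimal sketch of the SOPOT idea follows. The greedy selection of signed power-of-two terms, the function names and the 16-bit fixed point scaling are assumptions made for illustration and are not the specific methods of [66] or [16]. Each retained term costs one shift and one addition, so the number of terms trades accuracy for workload.

```python
import math

def sopot_terms(coeff, num_terms):
    """Greedily approximate coeff by num_terms signed powers of two.

    Returns a list of (sign, exponent) pairs meaning sign * 2**exponent.
    """
    terms, residual = [], coeff
    for _ in range(num_terms):
        if residual == 0:
            break
        sign = 1 if residual > 0 else -1
        exponent = round(math.log2(abs(residual)))
        terms.append((sign, exponent))
        residual -= sign * 2.0 ** exponent
    return terms

def sopot_multiply(x, terms, frac_bits=16):
    """Multiply integer x by the SOPOT coefficient using shifts and adds only.

    The result is a fixed point integer scaled by 2**frac_bits.
    """
    acc = 0
    for sign, exponent in terms:
        shift = exponent + frac_bits
        acc += sign * (x << shift if shift >= 0 else x >> -shift)
    return acc

# 0.8125 = 2**0 - 2**-2 + 2**-4, so three terms represent it exactly.
three = sopot_terms(0.8125, 3)
two = sopot_terms(0.8125, 2)
exact = sopot_multiply(100, three) / 2 ** 16    # 81.25
coarse = sopot_multiply(100, two) / 2 ** 16     # 75.0, cheaper but less accurate
```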
 
After reviewing workload reduction techniques, we need to identify the techniques 
appropriate for our work. In terms of multimedia applications, a desired model of 
workload reduction is represented by Approximate Signal Processing (ASP) [85]. In 
ASP, algorithms are structured to allow tradeoffs between the accuracy of their results 
and their utilization of resources, such as time, power, memory, etc. An important 
characteristic of ASP is its incremental refinement property: an ASP algorithm 
consists of a succession of steps, each of which refines the result produced by the 
previous one. The incremental refinement property of ASP implies that the tradeoffs 
provided by ASP techniques are tunable as we change the configurations of the 
algorithm. 
In the light of ASP, data driven approaches and fixed point multiplication techniques 
are not suitable for our work since they cannot provide tunable tradeoffs between 
their workload and the accuracy of the results. On the other hand, PSR and SOPOT 
techniques fall into the class of ASP. We can tune the workload of PSR and SOPOT 
by adjusting the bandwidth of the spectrum and the number of SOPOT terms, 
respectively.    
 
1.3.2  DVS Techniques  
DVS needs to be performed at run time while meeting the required Quality of Service 
(QoS) of target applications. Multimedia decoding applications, including video and 
audio, have large variations in workload demand, which has proved to be a major 
challenge for DVS. When applied to multimedia decoding, the performance of a DVS 
scheme can be evaluated in terms of its energy efficiency and the QoS it offers. 
Different DVS schemes provide different tradeoffs between these two metrics. In the 
class of hard real-time DVS schemes, the QoS is guaranteed at the cost of reduced 
energy efficiency. This class of approaches performs the DVS operations using a 
global worst case execution time [108], or adaptive worst case execution times [56] 
[102]. The energy savings are therefore quite limited since the large variations in 
workload of multimedia decoding cannot be fully exploited. This leads to the class of 
soft real-time DVS schemes. As multimedia decoding exhibits non-stationary 
workload requirements, the conventional interval based workload prediction methods 
result in unacceptably suboptimal solutions [107] [91]. The effectiveness of DVS 
techniques largely depends on the ability to predict the workload of multimedia 
decoding. To this end, three subclasses of approaches have been developed. The first 
subclass improves the prediction accuracy by incorporating frame parameters of the 
multimedia bitstream, such as frame types [23] and code sizes [1], into the estimation, 
since there is a strong correlation between workload and these 
parameters. The second subclass only meets a certain percentage of frame deadlines, 
based on the probability distribution of workload demands [109]. This subclass of 
work provides tunable tradeoffs between workload threshold and QoS. The third 
subclass lets content providers supply the workload information together with 
the video clips, so that workload prediction at the client side is not needed [26]. 
Despite considerable research effort, accurate workload prediction remains a challenge.   
To alleviate the accuracy issue of workload prediction, some works explore the 
possibility of avoiding missed deadlines through buffering mechanisms. This concept 
can be traced back to [87], where processor speeds are dynamically scaled based on 
the filling level of the input buffer, to avoid its overflow or underflow. In recent years, 
buffer based DVS techniques have been developed to compensate for the inaccuracy 
of the workload prediction [75], to average the workloads of multiple frames [73], and 
to reduce the idle periods of the processor [54]. In general, buffer based DVS 
approaches can achieve significant energy savings. 
1.3.3  Main Challenges of the Existing Techniques for Low Power 
Audio Applications 
From sections 1.1 and 1.2, we have identified four main challenges of the existing 
techniques when addressing low power design of audio decoding. These challenges 
are summarized as follows.    
1. Degradation of playback quality. Modern perceptual audio encoders 
widely exploit the masking effect of the human auditory system, where noise below 
the masking threshold is inaudible. Based on the masking effect, the largest 
quantization step sizes allowed by the masking threshold are used, 
which produces the shortest bit length of coded audio signals with quantization noise 
just below the masking threshold. On the other hand, the existing workload reduction 
techniques are carried out at the decoder side, unaware of the masking threshold. 
Therefore these workload reduction approaches inevitably add 
ASP noise to the quantization noise. As a result, the sum of ASP noise and 
quantization noise may exceed the masking threshold and become audible. This leads 
to degradation of the playback quality, which is especially undesirable for audio, as 
already discussed in section 1.2.    
2. Conflicting functions of buffers in DVS. As mentioned above, buffer based 
DVS approaches can achieve significant energy savings. We believe this is mainly 
because fluctuations of multimedia decoding workload are smoothed out by the 
buffers. As the energy consumption of a processor is a convex function of its speed, 
for the same average workload, energy consumption increases with the degree of 
fluctuation in the processor speed. Since multimedia decoding exhibits high fluctuation 
levels in workload, smoothing is an effective method to reduce energy consumption. 
In existing DVS approaches, however, buffers serve two functions: 1) smoothing 
out fluctuations of workload demand to reduce energy consumption; 2) avoiding 
missed deadlines to guarantee QoS. These two functions impose conflicting 
requirements and interfere with each other. Consequently, energy efficiency is 
degraded.   
3. Insufficient support for the voltage scheduler. With current DVS schemes, 
audio decoding is designed to work in a binary quality mode, which is described as 
follows. The audio decoding application requires a certain amount of computational 
resources to decode a frame. If its demand can be satisfied by the resources allocated 
by the voltage scheduler, the application will successfully decode the current frame; 
otherwise, the application will fail to decode the frame. With binary quality mode, the 
voltage scheduler cannot dynamically reduce the workload of an audio decoder to 
achieve low power computation [70] [9].   
4. Excluding users’ intention. Portable devices have a large diversity of 
application scenarios. In some cases, users are more tolerant of playback quality 
degradation for the following reasons. 1) Perceptual characteristics of individual 
users: Although most perceptual high quality audio codecs, such as MP3, cover a 
frequency range of 22 kHz, most adults can hardly hear frequency components above 
16 kHz. We can therefore leave those irrelevant high frequency components 
undecoded. 2) Listening environment: It is far more common to use portable audio 
players on the move and in a variety of environments such as in a bus, train, or plane, 
using simple earpieces. In such noisy environments, it is difficult for most users to 
distinguish between various playback qualities, and they appear to be more tolerant of 
small quality degradation. 3) Service types and signal characteristics: 
Different applications and signals require different bandwidths. For example, a 
storytelling audio clip requires significantly less bandwidth than a music clip. 
4) User preferences associated with battery level: A user might want better playback 
quality with a fully charged battery, but may be willing to sacrifice some playback 
quality for longer battery life when the battery is running low. Based on these 
observations, we believe that it would be an interesting feature of the portable audio 
player to allow users to control the tradeoff between the battery life and the decoded 
audio quality.  
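The convexity argument in challenge 2 can be illustrated numerically. The sketch below assumes a cubic power model, power = k * speed^3, purely as an example of a convex function; the workload values, the frame period and the function name are hypothetical. It compares running each frame at exactly the speed it needs with running all frames at one constant speed, which ideal buffering would permit.

```python
def energy(workloads, speeds, k=1.0):
    """Energy to run each frame's workload (cycles) at its own speed.

    Assumes power = k * speed**3 (a convex model chosen for illustration),
    so a frame taking workload/speed seconds costs k * workload * speed**2.
    """
    return sum(k * w * s ** 2 for w, s in zip(workloads, speeds))

period = 1.0                          # playback time per frame
workloads = [2.0, 10.0, 3.0, 5.0]     # hypothetical cycles per frame

# Without smoothing: every frame must finish within its own period.
per_frame = [w / period for w in workloads]
# With ideal buffering: one constant speed that covers the average workload.
mean_speed = sum(workloads) / (len(workloads) * period)
smoothed = [mean_speed] * len(workloads)

e_fluctuating = energy(workloads, per_frame)   # 1160.0
e_smoothed = energy(workloads, smoothed)       # 500.0
```

Both schedules finish the same total work in the same total time, yet the smoothed schedule uses less than half the energy in this example. The constant speed is only feasible if a buffer absorbs the frames that take longer than one period, which is why the smoothing and deadline functions of the buffer interact.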
 
1.4  Our Methodology of Low Power Audio Techniques 
for Portable Devices 
The issues listed in section 1.3.3 represent conflicting requirements on low power 
audio techniques. There is no single solution that can solve all of these issues. This 
is illustrated by the following example. Issue 1) implies that we should minimize the 
energy consumption of the audio decoding application without sacrificing its playback 
quality. The resulting workload is a single optimized value, which is not suitable for 
issues 3) and 4). Further investigation shows that the heterogeneity of these problems 
results from different usage modes of the portable devices, where the user has 
different preferences and expects different characteristics of the decoding applications. 
Typically, we can distinguish between two cases when using portable devices. In 
the normal case, when the remaining battery capacity is sufficient, the user has high 
expectations of the playback quality. When the remaining battery capacity is 
low, the user is willing to sacrifice some playback quality in exchange for longer 
battery life.  
 
Figure 1.4 A two-state model of the voltage scheduler 
Corresponding to this point of view, we propose a two-state model for the voltage 
scheduler, which is shown in Figure 1.4. We associate different user preferences with 
different states of this model. The resource allocation policies of the voltage scheduler 
are then determined by the users’ preferences. In the normal state, users prefer playback 
quality to energy efficiency. The voltage scheduler needs to satisfy the computational 
resource requirements of the audio decoding applications. As the remaining battery 
capacity runs out, the voltage scheduler transitions to the low power state. In 
this thesis, we assume that the transition is triggered by one of two mechanisms: 1) the 
user’s manual switching, or 2) automatic detection of the remaining battery capacity 
level. The low power state can be switched back to the normal state when the battery is 
recharged or by the user’s choice. In the low power state, energy efficiency takes 
precedence over playback quality. This allows the voltage scheduler to allocate fewer 
computational resources than required by the audio decoding applications to prolong the 
battery life.  
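The two-state model can be sketched as a small state machine. The event names, the `allocate` helper and its `budget` parameter are hypothetical and only illustrate the transitions and allocation policies described above; the actual resource allocation policy is not specified at this point in the text.

```python
from enum import Enum

class State(Enum):
    NORMAL = "normal"
    LOW_POWER = "low_power"

# Transitions described in the text; event names are illustrative only.
TRANSITIONS = {
    (State.NORMAL, "user_low_power"): State.LOW_POWER,    # manual switching
    (State.NORMAL, "battery_low"): State.LOW_POWER,       # automatic detection
    (State.LOW_POWER, "user_normal"): State.NORMAL,       # user's choice
    (State.LOW_POWER, "battery_recharged"): State.NORMAL,
}

def next_state(state, event):
    """Return the scheduler state after an event; unknown events are ignored."""
    return TRANSITIONS.get((state, event), state)

def allocate(state, requested, budget):
    """Cycles granted to the decoder.

    NORMAL: the full request is satisfied, since quality takes precedence.
    LOW_POWER: the grant may be capped by a hypothetical energy budget.
    """
    return requested if state is State.NORMAL else min(requested, budget)

state = next_state(State.NORMAL, "battery_low")
granted = allocate(state, requested=100, budget=60)
```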
 
The low power design of the audio codec is largely dependent on the above two-state 
model of the voltage scheduler. In the normal state, since the voltage scheduler 
guarantees not to degrade the designed playback quality of the audio codec, the 
performance of the audio applications is completely determined by the audio codec. 
Therefore the primary objective of the low power audio codec in the normal state is to 
optimize its energy efficiency subject to the constraint of non-degradation of its 
playback quality, which can be achieved at the design phase of the audio codec. On the 
other hand, in the low power state, the allocated computational resources are changed 
dynamically by the voltage scheduler to achieve desirable tradeoffs between workload 
reduction and quality, so the resources actually allocated are unknown when designing 
the audio codec. It is impossible to optimize the audio codec in such a case. Therefore 
the primary design goal of the low power audio codec in the low power state is to 
support the energy efficient operations of the voltage scheduler at runtime.  
From the above analysis, it is noticed that different states of the voltage scheduler 
require different design strategies of the audio codec, which represents a natural 
classification of energy efficiency techniques. In this thesis, following this 
classification, we address the energy efficiency issues from two perspectives.  
First we consider the energy efficiency techniques for the normal state of the voltage 
scheduler. As summarized in issues 1) and 2) of section 1.3.3, the main issues are to 
achieve workload reduction without degradation of quality, and to improve the energy 
efficiency of DVS by eliminating the interference caused by guaranteeing QoS. To address 
the first issue, the fundamental idea is to keep the sum of quantization noise and ASP 
noise below the masking threshold. To achieve the required workload reduction, 
we tune the ASP noise and correspondingly adjust the quantization noise to meet 
the constraint of the masking threshold. This implies that we need access to the masking 
threshold and control of the quantization process when performing workload reduction. 
To solve the second issue, we exploit bitstream analysis, rather than the filling level 
of buffers, to avoid missed deadlines. This enables buffers to focus on smoothing out 
fluctuations of workload. Such a method requires knowledge of the workloads and bit 
lengths of the frames, and powerful computing resources to perform the analysis.  
In all existing energy efficient schemes, including workload reduction and DVS, the 
decoder is solely responsible for energy efficiency. These decoder based schemes can 
only provide suboptimal solutions due to their inherent limitations. They have no 
access to important information, such as the masking threshold and the accurate workload. 
Even if these data were available at the decoder/client, the decoder/client-only 
approaches could not afford the computations required to find a globally optimal 
solution, due to the real-time demand and limited computing resources. 
As energy efficiency has become a major design consideration for portable devices, 
we believe that it is necessary to involve all related parts in the low power design to 
overcome the limitations of existing techniques. Following this idea, we extend the 
low power design to the encoder/server, making the encoder/server support low power 
audio decoding. More specifically, we investigate the possibility of: 1) workload 
reduction supported by encoder; and 2) DVS supported by media servers. These two 
proposed schemes are superior to the existing schemes in the following 
respects. First, accurate information about the masking thresholds and the workload is 
available to the energy efficiency techniques. Second, the encoder/server can yield 
optimal solutions to the related issues by solving them off-line with globally accessible 
information and the powerful computing resources of the 
encoder/server. Third, these two schemes alleviate the computational burden of the 
decoder/client: the major parts of the computations relevant to energy efficiency, such as 
workload estimation, are moved from the decoder/client to the encoder/server. 
Otherwise, these computations would be additional sources of energy consumption at the decoder/client.   
We now consider the energy efficiency techniques for the low power state of the voltage 
scheduler. These techniques involve solving issues 3) and 4) in section 1.3.3. In 
fact, both of these issues are associated with the binary quality mode of current 
audio decoders. Although a binary quality mode is appropriate for characterizing 
applications with precise computations, such as scientific calculations, audio processing is 
not necessarily of a binary quality mode: multiple levels of tradeoffs can be provided 
between playback quality and workload [9]. This observation inspires us to solve these 
two issues with workload scalability. Towards this, we propose a new concept of 
workload scalable audio decoding. In a workload scalable audio decoding scheme, the 
decoding process is partitioned into several layers. The lowest layer decoding 
performs the essential reconstruction of the audio data with the minimal workload. As 
the number of the involved decoding layers grows, the quality of the reconstructed 
signals is improved with increased workloads. In addition, to help the voltage 
scheduler find the desired tradeoffs, workload scalable audio decoding 
applications provide performance profiles, which describe the relationships between 
the possible allocated resources and a numerical measure of playback quality. 
Workload scalable decoding also offers additional support for a voltage scheduler that handles multiple 
tasks. By adding a new dimension of scheduled units to the voltage scheduler, 
workload scalable audio decoding techniques improve the performance of the voltage 
scheduler significantly. Similarly, the workload scalable decoding scheme 
supports users’ switching operations as well.  
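
A performance profile of this kind can be pictured as a small lookup table. The following is a minimal sketch, assuming a hypothetical three-layer decoder: the workload figures, the quality scores, and the function name `pick_layer` are illustrative assumptions and not values from this thesis.

```python
# Hypothetical performance profile: (decoding layers, workload, quality score).
# All numbers are made up for illustration only.
PROFILE = [
    (1, 20, 2.5),
    (2, 45, 3.6),
    (3, 100, 4.8),
]

def pick_layer(profile, budget):
    """Return the profile entry with the best quality whose workload fits the budget.

    Returns None when even the lowest layer exceeds the budget.
    """
    feasible = [entry for entry in profile if entry[1] <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda entry: entry[2])
```

With a budget of 60 workload units, `pick_layer(PROFILE, 60)` selects the two-layer entry, which is the kind of query a voltage scheduler could issue each time its resource allocation changes.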
 
We conclude this section by re-examining current techniques from the perspective of 
our taxonomy of methods, which provides insights into their key issues at the level of 
methodology. First, all existing techniques have been developed 
under the implicit assumption of a single usage mode: users’ preferences and the resource 
allocation policies of the voltage scheduler remain unchanged during the entire 
process. This unawareness of the heterogeneous usage modes of portable audio devices 
leads to incomplete solutions to the energy efficiency issues since, as discussed 
above, approaches for different usage modes are incompatible with each other. Second, 
most existing energy efficiency techniques do not work well for both states of the 
voltage scheduler. These techniques fall into the category of energy priority 
approaches, which are associated with the low power state of the voltage scheduler. Their 
performance may be unacceptable for the normal state. This is a serious problem for their 
application since the normal state is dominant when portable devices are used as music 
players. Playback quality priority techniques for the normal state should therefore be the primary 
concern of study. On the other hand, as energy efficiency techniques for the low power 
state, most of them fail to sufficiently support the voltage scheduler since they only 
provide a single level of reduced workload.  
1.5  Contributions of the Thesis 
The main contributions of this thesis are summarized as follows:    
• We present a novel joint encoder-decoder framework (JEDF), 
which allows the decoder to achieve a desirable tradeoff between energy and 
memory consumption without sacrificing playback quality. This work exploits 
the technology trend that, in comparison with the relatively slow progress of 
battery technology, semiconductor memory has improved much more 
rapidly, making storage a less critical limiting factor in designing low 
power embedded systems such as PDAs. We employ the SOPOT technique, an 
ASP technique, in an MPEG AAC decoder to reduce the computational 
workload. SOPOT introduces additional ASP noise (in the decoder) on top 
of the quantization noise introduced in the process of lossy compression (in the 
encoder). The sum of these two kinds of noise may become audible when it 
exceeds the masking threshold. We tackle this problem from a new perspective: 
the proposed JEDF allows the ASP and quantization noises to be shaped 
jointly to satisfy the masking threshold. When the perceptual room 
between the masking threshold and the quantization noise is insufficient for the 
ASP noise, the JEDF can reduce the quantization noise level, which results in 
an increase in bitrate. To implement the proposed scheme, we have developed 
two new techniques: 1) SOPOT truncation noise shaping; and 2) truncation noise 
allocation based on a perceptual model. Experimental results show the 
effectiveness of our approach. With transparent playback quality, the proposed 
JEDF achieves around 40% workload reduction of the entire decoding process, 
while incurring less than 10% file size growth.  
• We propose a new concept of DVS for low power multimedia decoding in 
battery-powered portable devices. Most existing DVS techniques are 
suboptimal in achieving energy efficiency while providing guaranteed QoS, 
mainly due to the inherent limitations of client-only approaches. To 
address this problem, we investigate the possibility of the media server supporting 
DVS techniques by incorporating accurate workload estimation and a buffering 
mechanism. Towards this new direction, we propose an optimal speed control 
scheme, namely OSP-DVS, which achieves the maximal energy savings 
among all feasible speed profiles for the given buffers. Compared to 
representative existing techniques, our scheme significantly improves the 
performance of DVS. Moreover, the proposed scheme only requires very small 
buffers and speed profiles. This largely facilitates its application on a 
wide range of portable devices and provides additional opportunities for energy 
savings. Compared to existing buffer based DVS, our scheme achieves improvements of 
11%~16% in energy consumption and 10%~17% in playback buffer requirement 
for audio bitstreams.   
• We propose a new workload scalable audio decoding scheme that supports 1) 
the voltage scheduler in the low power state and 2) users’ choices of playback quality 
with tunable tradeoffs between playback quality and power consumption in 
battery-powered portable audio players. The proposed scheme is based on a 
frequency Band and Stereo-image Scalable (BSS) decoding framework for 
single-layer audio formats such as MP3, which exploits an analysis of the 
perceptual relevance of different audio components in the compressed 
bitstream. The frequency band and stereo-image scalability directly translates 
into scalability in terms of the computational workload generated by the 
decoder. When implementing the proposed scheme, the most challenging 
problem is how to effectively reduce the workload of Pseudo-Quadrature 
Mirror Filters (PQMF), since conventional techniques cannot exploit the 
benefits of stereo-image scalability. To address this problem, we design a 
novel algorithm of a scalable and efficient PQMF, namely asymmetric partial 
spectrum reconstruction (APSR). As a new extension to the conventional 
partial spectrum reconstruction techniques, APSR removes irrelevant 
computations associated with stereo-image scalability. In our scheme, the 
workload is roughly proportional to the frequency bandwidth to be decoded, 
and the scheme achieves workload reductions from 23.9% to 85% at the cost of a small 
playback quality degradation.  
These techniques have been developed under the taxonomy presented in section 1.4, 
offering a comprehensive solution to low power audio processing for portable devices. 
They can be classified into two groups based on the two modes of portable devices 
and the decoder can flexibly combine them in different modes. JEDF and OSP-DVS 
are designed for normal state. They offer a single optimized energy efficiency solution 
without sacrificing playback quality. As more workload reduction is required in the low 
power state, BSS can then be incorporated into the decoding process to provide the 
desirable tradeoffs between energy efficiency and playback quality.  
 
The rest of the thesis is organized as follows. In chapter 2, we present the scheme of 
JEDF. In chapter 3, we discuss the techniques of media server supported DVS. The 
proposed work is a general method for multimedia bitstreams, including audio and 
video. As video decoding applications exhibit higher variations, the experimental 
results are mainly based on video to demonstrate its effectiveness. It should be noted 
that the work is suitable for audio as well. In chapter 4, we discuss the BSS scheme. 
We first present the framework of workload scalability scheme based on frequency 
band and stereo-image, and then concentrate on the technique of asymmetric partial 
spectrum reconstruction targeting stereo-image scalability. Finally, we conclude the 
thesis and point out possible future works in chapter 5. 
 
Chapter 2  
 
A Joint Encoder-Decoder Framework for 
Supporting Low Power Audio Decoding 
2.1 Introduction 
Among the techniques to reduce energy consumption of multimedia decoding 
applications, a fundamental approach is to reduce their computational workload. The 
reduced workload can be exploited by a voltage/frequency scalable processor to save 
energy and to prolong the battery life. Towards this, approximate signal processing 
(ASP) techniques have been widely adopted [89], which exploit algorithms with 
flexible structures, such as tunable word length, filter order, etc., to achieve the desired 
tradeoff between the accuracy of their results and their utilization of resources. A well-known 
ASP technique is partial spectrum reconstruction (PSR), which only 
reconstructs the spectrum of a part of the coded signals, resulting in a low pass version 
of the original spatial samples [80] [2]. Essentially, ASP techniques may result in a 
degradation of playback quality in exchange for a prolonged battery life [5]. 
However, if the user requires both CD-quality audio and long battery life, it is difficult 
to solve the dilemma with existing methods. To address this problem, we propose a 
new approach towards saving energy. We achieve energy efficient audio decoding with 
a joint encoder-decoder framework (JEDF). This approach allows us to introduce a 
less critical limiting factor, storage, into the tradeoff, so that we can significantly 
reduce the decoding workload while maintaining transparent playback quality, at the cost of a 
possible sacrifice in compression efficiency. 
In our JEDF framework, the decoder employs ASP techniques to reduce the 
computational workload, which results in additional ASP noise. The saved 
computational workload is determined by the configuration of the ASP structure, 
where more workload is reduced at the expense of introducing more ASP noise. In our 
scheme, the configurations of the ASP structures of the decoder are specified in the 
encoder. The encoder inserts additional side information, which describes the desired 
configurations of the ASP structure, into the compressed bitstream. In the process of 
playback, the decoder reads the related side information and adopts the specified 
configurations for the ASP algorithm accordingly. In other words, the encoder can 
completely control the ASP computational workload at the decoder.  
To guarantee the playback quality, we extend the current audio coding techniques by 
jointly shaping the ASP noise and the quantization noise according to masking 
thresholds. Masking thresholds are a fundamental concept in modern perceptual audio 
codecs, such as MPEG-1 Audio Layer III (MP3) [57] and Advanced Audio Coding 
(AAC) [58]. As a property of the human auditory system, the masking threshold 
indicates that noise lower than this threshold is inaudible. Exploiting this principle, 
perceptual audio encoders compute maximal allowed quantization step sizes subject to 
masking threshold constraints, which will produce the minimal bit length of coded 
audio signals. We define Masking-to-Quantization-noise-Difference (MQD) as the 
difference between the masking threshold and the quantization noise. MQD indicates 
the maximum level of extra noise allowed by the masking threshold, which will not 
degrade the playback quality. 
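
The MQD definition above can be illustrated with a short sketch. It assumes that the masking threshold, the quantization noise and the ASP noise are all given per scale factor band as energies in the same (linear) units; the function names and the example numbers are hypothetical and only serve to show the bookkeeping.

```python
def mqd(masking_threshold, quant_noise):
    """Per-band MQD: the room left between masking threshold and quantization noise.

    Negative differences are clamped to zero, meaning no extra noise is allowed.
    """
    return [max(t - q, 0.0) for t, q in zip(masking_threshold, quant_noise)]

def bands_needing_finer_quantization(masking_threshold, quant_noise, asp_noise):
    """Indices of bands whose ASP noise does not fit into the available MQD."""
    headroom = mqd(masking_threshold, quant_noise)
    return [i for i, (h, a) in enumerate(zip(headroom, asp_noise)) if a > h]
```

For thresholds `[1.0, 0.5, 0.2]`, quantization noise `[0.6, 0.45, 0.05]` and ASP noise `[0.3, 0.1, 0.1]`, only the second band is reported: its MQD of about 0.05 cannot accommodate an ASP noise of 0.1, so its quantization noise would have to be reduced.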

































Figure 2.1 Illustration of the proposed scheme, where MT, QN, AN, and MQD 
stand for masking threshold, quantization noise, ASP noise, and 
masking-to-quantization-noise-difference, respectively: a) for a conventional AAC encoder, 
the sum of the additional ASP noise and the quantization noise exceeds the 
masking threshold; b) for our scheme, with reduced quantization noise, the 
overall noise is below the masking threshold. 
The key idea of our proposed work, as shown in Figure 2.1, is to reduce the 
quantization noise level in the encoder when necessary, enabling the increased MQD 
to accommodate the ASP noise introduced by the decoder. This approach ensures that 
the overall distortion is below the masking threshold. In comparison with conventional 
encoders, this approach may increase the size of the compressed audio files. Its effects 
can be analyzed from two perspectives. First, the increased file size will lead to 
additional energy consumption in the data reading process from non-volatile storage. 
However, as analyzed in section 1.1.1, in the context of portable devices, such reading 
operations are only responsible for a small fraction of the overall energy consumption. 
As pointed out, for a software MP3 decoder on a portable device, the file reading 
operations only consume 1.9% of the energy of the decoding process. On the other 
hand, our proposed scheme can achieve around 40% workload reduction of the 
decoding process with less than 10% increase in file size for AAC bitstreams of 128 kb/s 
and above. These figures show the advantage of the proposed scheme. The second 
concern is the storage capacity of portable devices: if a portable device has only 
limited storage capacity, then we would prefer storage efficient schemes. Fortunately, 
the rapid advance of semiconductor technologies has made storage a less 
critical limiting factor in portable devices. For instance, the well-known Apple iPod 
nano series has already been equipped with 2 GB, 4 GB and 8 GB of storage. Large 
storage capacities allow us to exploit flexible design strategies in low power 
techniques for portable devices. Our scheme offers an appealing tradeoff between 
local storage and energy consumption. The projected application of our scheme is 
download-playback services, where the user downloads audio clips once and plays them 
back multiple times from local storage. In such an application scenario, we argue that 
the audio file size is less critical than battery life. 
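
As a rough plausibility check of the figures quoted above, suppose (purely as an assumption for illustration) that file reading accounts for 1.9% of the decoding energy, that reading energy grows linearly with file size, and that the remaining 98.1% scales linearly with the computational workload. A 10% larger file combined with a 40% workload reduction would then give:

```latex
\frac{E_{\text{new}}}{E_{\text{old}}} \approx 0.019 \times 1.10 + 0.981 \times 0.60
  = 0.0209 + 0.5886 \approx 0.61
```

Under these assumptions, the extra reading cost (about 0.2 percentage points) is negligible next to the saving in computation. The linear scaling is a simplification; the actual saving depends on how the processor exploits the reduced workload.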
It should be noted that the proposed scheme differs fundamentally from existing 
energy efficient techniques, where the decoder is solely responsible for energy 
savings. We believe that the proposed scheme represents a new direction for low 
power media decoding applications. Its superiority to the existing approaches is 
twofold. First, the encoder is responsible for computing the desired configurations of 
the ASP structures, which alleviates the relevant computation at the decoder. Second, 
the encoder can access more information to achieve the goal than the decoder. For 
example, our scheme accesses the masking threshold calculated in the encoder to 
shape the ASP noise. This is impossible for a decoder-only approach. 
The rest of the chapter is organized as follows. In section 2.2, we briefly review the 
noise shaping technique in AAC and computation efficient techniques for transforms. 
Then we provide an overview of our work in section 2.3. In section 2.4, we present a 
detailed description of the technology of joint ASP and quantization noise shaping for 
AAC encoding. In section 2.5, we present the experimental results.  
2.2 Related Works 
2.2.1 Noise Shaping Techniques in AAC 
In AAC [58], the full spectrum of a frame is partitioned into 49 scale factor bands. 
Different scale factor bands may have different masking thresholds for two reasons: 1) 
different sensitivities of the human auditory system over these scale factor bands; and 2) 
the characteristics of the audio signal. To fully exploit the variations of masking 
thresholds, different quantization step sizes are used to quantize the frequency 
coefficients with the goal of keeping the quantization noise below the masking 
threshold. This results in different quantization noises for different scale factor bands. 
On the other hand, the transparent playback quality is guaranteed if for each scale 
factor band, the following relationship holds:  
$$ E_{Q(i)} = \sum_{k \in \beta(i)} \left( F(k) - Q(F(k)) \right)^2 \le E_{T(i)}, \quad 0 \le i < 49 \tag{2.2.1} $$
where $\beta(i)$ is the range of frequency coefficients for scale factor band i, F(k) and 
Q(F(k)) are the original k-th frequency coefficient and its quantized value, and $E_{Q(i)}$ and 
$E_{T(i)}$ are the quantization noise and the masking threshold for scale factor band i, 
respectively. 
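
The per-band condition (2.2.1) can be checked numerically. The sketch below is only an illustration: it assumes a plain uniform quantizer with step size `step`, whereas the AAC quantizer is non-uniform, and the function names are hypothetical.

```python
def band_quant_noise(coeffs, step):
    """Sum of squared quantization errors of one scale factor band.

    Assumption: uniform rounding quantizer Q(F) = step * round(F / step).
    """
    return sum((f - step * round(f / step)) ** 2 for f in coeffs)

def is_transparent(bands, steps, thresholds):
    """True if every band keeps its quantization noise at or below its masking threshold."""
    return all(
        band_quant_noise(coeffs, step) <= threshold
        for coeffs, step, threshold in zip(bands, steps, thresholds)
    )
```

For the coefficients `[1.2, -0.7, 3.4]`, a step size of 1.0 gives a noise energy of about 0.29, while a step size of 0.5 gives about 0.09, so with a masking threshold of 0.1 only the finer step satisfies the condition.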











                                                (2.2.2) 
where $\Delta_Q$ is the quantization step size. As shown in (2.2.2), the quantization noise 
increases as the quantization step size becomes larger. 
2.2.2 Computation Efficient Techniques for Transforms 
In the literature, typical computation efficient algorithms for transforms, including 
(Inverse) Fast Fourier Transform ((I)FFT), (Inverse) Discrete Cosine Transform 
((I)DCT) and (Inverse) Modified Discrete Cosine Transform ((I)MDCT), can be 
divided into two classes: data driven approaches and fixed point approximation 
approaches. Data driven approaches include pruning techniques [77] [106] [49] and 
forward mapping IMDCT [79]. This class of approaches eliminates the calculation for 
zero-valued coefficients since these coefficients make no contribution to the output of 
the transform. The workload reduction of data driven approaches largely depends on 
the statistical properties of the input sequence and the block length of the transform. An 
important property of this class of approaches is that the efficiency of workload 
reduction degrades rapidly as the size of the transform block grows. 
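
The idea behind data driven approaches can be sketched with a direct-form inverse DFT that simply skips zero-valued coefficients. This is not one of the cited pruning algorithms, which operate on fast transforms; it is a simplified illustration, and the function name and the multiplication count are assumptions made for this example.

```python
import cmath

def pruned_idft(F):
    """Direct-form inverse DFT that skips zero-valued frequency coefficients.

    Returns the time samples and the number of complex multiplications spent
    on coefficient rotations (one per nonzero coefficient and output sample).
    """
    N = len(F)
    nonzero = [(k, value) for k, value in enumerate(F) if value != 0]
    samples = [
        sum(value * cmath.exp(2j * cmath.pi * n * k / N) for k, value in nonzero) / N
        for n in range(N)
    ]
    return samples, len(nonzero) * N
```

For `F = [0, 4, 0, 0]` only one of four coefficients is processed, so the sketch spends 4 rotations instead of 16.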
Fixed point approximation approaches include fixed point multiplication methods [41] 
and Sum-Of-Powers-Of-Two (SOPOT) methods [66] [16]. Both of them replace 
floating point multiplications with fixed point operations. In transforms including 
IMDCT, a significant part of the computation involves floating point multiplications. As 
portable devices are rarely equipped with a floating point unit, floating point 
multiplication is emulated by software packages. As a result, the required workload 
increases significantly. To achieve computation efficient implementations, fixed point 
multiplication scales the floating point coefficients and approximates them by integers. 
The resulting transforms have much lower computational workload than the floating 
point version and therefore are widely employed in various applications. However, it 
is argued that 32-bit integers cannot provide sufficient accuracy for long block 
transforms [16]. Moreover, fixed point multiplication cannot provide tunable tradeoffs between computational 
workload and ASP noise. As an alternative, SOPOT methods decompose a 
multiplication into a sum of power-of-two operations. For example, we can 
calculate $0.40625 \cdot x$ as $2^{-1} \cdot x - 2^{-3} \cdot x + 2^{-5} \cdot x$. For SOPOT, an effective way to save 
computational workload is to reduce the number of SOPOT terms, which results in 
truncation noise: more workload can be saved at the cost of introducing more 
truncation noise.  
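
The SOPOT example above can be reproduced with a short sketch. The greedy decomposition used here (repeatedly taking the signed power of two closest to the residual in the log domain) is an assumption chosen for illustration; the cited SOPOT methods may derive their terms differently, and the function names are hypothetical.

```python
import math

def sopot_terms(coeff, max_terms):
    """Greedily approximate coeff by at most max_terms signed powers of two.

    Returns a list of (sign, exponent) pairs, largest term first.
    """
    terms = []
    residual = coeff
    for _ in range(max_terms):
        if residual == 0.0:
            break
        exponent = round(math.log2(abs(residual)))
        sign = 1 if residual > 0 else -1
        terms.append((sign, exponent))
        residual -= sign * 2.0 ** exponent
    return terms

def sopot_multiply(x, terms):
    """Multiply the integer x by a SOPOT coefficient using only shifts and additions."""
    acc = 0
    for sign, exponent in terms:
        shifted = x << exponent if exponent >= 0 else x >> -exponent
        acc += sign * shifted
    return acc
```

For the coefficient 0.40625 the sketch yields the three terms of the example, and multiplying 1024 by them gives 416. Keeping only the first two terms gives 384: one shift and one addition are saved, and the difference of 32 is the truncation noise.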














































Figure 2.2 Architecture of the proposed audio encoder 
Although the proposed scheme can be implemented with most existing audio codecs, 
we have implemented our scheme with the ISO/IEC 13818-7 Advanced Audio Coding 
format [58] as a proof of concept. In [58], three profiles are defined: the main 
profile, the low complexity profile, and the scalable sample rate profile. Among them, the low 
complexity profile has found the most widespread use. For this reason, we have 
implemented our scheme in the low complexity profile.  
The block diagram of the proposed audio encoder is depicted in Figure 2.2. Our 
scheme is essentially a two-pass encoder based on the frame structure of AAC. The 
first pass is implemented by a conventional AAC encoder core, which analyzes the 
PCM data of the current frame subject to the bit rate constraint, and provides three 
kinds of information to support the second pass processing: masking thresholds for all 
scale factor bands, frequency coefficients, and their associated side information, 
including the quantization step sizes. 
Analogous to the bit rate constraint of a conventional AAC encoder, we introduce a 
computational workload level of the ASP algorithm as the constraint for the second 
pass. The second pass of the proposed scheme searches for ASP parameters such 
that both the MQD requirement and the computational workload requirement are met. 
This pass involves two important modules: ASP workload reduction and workload 
estimation. The workload estimation module controls the processing of the second 
pass. Based on the ASP parameters provided by the ASP workload reduction module, the 
workload estimation module derives the corresponding workload of decoding the 
current frame. If the workload is lower than the workload constraint, the workload 
estimation module invokes the quantization and multiplexer (MUX) modules, etc., with the 
updated side information. The invoked modules compress the frequency coefficients 
of the current frame and multiplex the compressed data and side information into the 
coded bitstream. If the workload is higher than the workload constraint, the workload 
estimation module reduces the quantization step size of some scale factor band to yield an 
increased MQD level. With the updated side information, it then 
invokes the ASP workload reduction module again. This process repeats until the actual 
workload is below the workload constraint or the required quantization step sizes 
cannot be supported by the AAC specifications.  
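
The control flow of the second pass can be sketched as a simple loop. Everything numerical in this sketch is a stand-in chosen for illustration and is not the model used in this thesis: quantization noise is taken as step²/12 (uniform quantizer), the truncation noise of a band with t SOPOT terms is taken as 4^(-t), the workload is the total number of terms, and the band to refine is the one currently needing the most terms.

```python
def min_sopot_terms(mqd, max_terms=16):
    """Smallest term count whose assumed truncation noise 4**-t fits into mqd."""
    for t in range(max_terms + 1):
        if 4.0 ** -t <= mqd:
            return t
    return max_terms

def second_pass(thresholds, steps, budget, min_step=1.0 / 64):
    """Shrink quantization steps until the estimated workload meets the budget.

    Returns (steps, terms_per_band, success). success is False when no step
    can be reduced further, mirroring the case where the format cannot
    support the required step sizes.
    """
    steps = list(steps)
    while True:
        # Assumed noise model: uniform quantizer noise step^2 / 12 per band.
        mqd = [t - s * s / 12.0 for t, s in zip(thresholds, steps)]
        terms = [min_sopot_terms(m) for m in mqd]
        if sum(terms) <= budget:
            return steps, terms, True
        candidates = [i for i, s in enumerate(steps) if s / 2 >= min_step]
        if not candidates:
            return steps, terms, False
        worst = max(candidates, key=lambda i: terms[i])
        steps[worst] /= 2  # finer quantization, hence larger MQD in that band
```

With thresholds `[0.1, 0.1]`, initial steps `[1.0, 0.5]` and a budget of 4 terms, the loop halves the first step once and then succeeds with two terms per band. The price is a finer quantization of the first band, which corresponds to the bitrate increase discussed in the text.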
The ASP workload reduction module is responsible for reducing the computational 
workload of the decoding process. Among various processing modules in an AAC 
decoder of low complexity profile, we mainly consider the IMDCT due to two factors: 
1) it is challenging to design an effective ASP method for a long block of IMDCT 
(AAC employs a 2048-point IMDCT): as shown in section 2.2.2, for existing 
techniques, either the efficiency of workload reduction is limited or the ASP noise is 
not acceptable; 2) IMDCT is responsible for a large part of the computational 
workload of the whole decoding process, especially with high accuracy computation. 
The latter will be shown in section 2.5.1. In most AAC decoders, the 2048-point 
IMDCT is computed using a 512-point IFFT and some pre- and post-rotations. In this 
method, IFFT is responsible for about 50% of the workload of the IMDCT. As the 
first step, we concentrate on IFFT to achieve workload reduction. Based on the 
analysis in section 2.2.2, we have chosen SOPOT as the implementation of IFFT at 
decoder, since it provides: 1) high dynamic range of computation accuracy; 2) tunable 
tradeoffs between computational workload and truncation noise. These two properties 
suggest that it is appropriate for noise shaping in our scheme. In addition, it only 
requires two simple arithmetic operations, shift and addition, found in a wide range of 
portable devices. To perform workload reduction, we have developed joint ASP and 
quantization noise shaping for AAC encoding. This method comprises two parts. The 
first part includes the techniques to shape the SOPOT truncation noise to fit the noise 
shaping framework used by an AAC encoder. The second part concentrates on how to 
allocate MQD to its associated SOPOT coefficients to effectively reduce the 
computational workload. These two parts are presented in sections 2.4.1 and 2.4.2, 
respectively. 
2.4 Joint ASP and Quantization Noise Shaping  
2.4.1 Truncation Noise Shaping of SOPOT Coefficients 
As discussed above, truncated SOPOT coefficients cause additional noise in the 
reconstructed data. To guarantee the transparent playback quality, we need to keep 
these truncation noises below MQD. These MQD levels vary dynamically as we may 
change the quantization step sizes of some scale factor bands. Towards this, we shape 
the truncation noise of the SOPOT coefficients to match the level of MQD. In our 
work, the SOPOT coefficient truncation noise shaping has two implications. First, we 
need to truncate different coefficients at different positions. This requirement enables 
us to fully exploit the variations of MQDs: for larger MQDs, we can discard more 
SOPOT terms to save computations. Second, the truncation noise over some frequency 
coefficients should be orthogonal to other frequency coefficients. We call this property 
the orthogonality of the truncation noise. Orthogonality of the truncation noise can be 
explained as follows. When we perform IFFT on a frequency coefficient with the 
truncated SOPOT coefficients, time domain noise is produced over the 
reconstructed data. The orthogonality of truncation noise requires that the spectrum of 
this time domain noise is related only to the source frequency coefficient; no other 
frequency coefficients are involved. The orthogonality of truncation noise eliminates 
the cross scale factor band noise, which would otherwise complicate the truncation noise shaping 
process.  
The truncation of SOPOT coefficients appears to be equivalent to coefficient 
quantization problems of various transforms in the literature, such as [59] [88][41], 
which have attracted the attention of many researchers. However, these results cannot 
be applied to our work, for the following reasons. All of them have been developed to 
address the issue of finite register word length, and these schemes model 
the truncation errors as independent and identically distributed random variables. This 
implies that their results are not valid for the shaped truncation noise in our scheme. 
Moreover, in those works, the truncation noise resulting from a frequency coefficient 
is spread to other frequency coefficients. 
To effectively shape the truncation noise, in subsection 2.4.1.1, we propose a method 
to achieve the orthogonality of truncation noise using IFFT coefficient blocks.  In 
subsection 2.4.1.2, we propose a method to deal with the cross terms among 
coefficient blocks.  


























































































































Figure 2.3  The flow graph of a 16-point inverse FFT with marked coefficient 
blocks 
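The N-point Discrete Fourier Transform and its inverse transform are defined as (2.4.1) 
and (2.4.2), respectively: 
$$ F(k) = \sum_{n=0}^{N-1} f(n) \cdot W^{nk}, \quad 0 \le k \le N-1 \tag{2.4.1} $$
$$ f(n) = \frac{1}{N} \sum_{k=0}^{N-1} F(k) \cdot W^{-nk}, \quad 0 \le n \le N-1 \tag{2.4.2} $$
where $W^{n} = \exp(-j \cdot 2\pi \cdot n / N)$ and f(n) denotes the n-th time domain sample. 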
When we apply the inverse FFT to compute (2.4.2), the rotation operation $W^{-nk}$ is 
decomposed into a series of sub-rotations $W^{-\alpha(\cdot)}_{(n,k)}$, where v(k) is the number of 
sub-rotations for F(k): 
$$ W^{-nk} = \prod_{i=1}^{v(k)} W^{-\alpha(i)}_{(n,k)} \tag{2.4.3} $$
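In particular, from (2.4.2), we can derive $\tau_k(n)$, $0 \le n \le N-1$, which are the temporal 
samples contributed by the frequency coefficient F(k): 
$$ \tau_k(n) = \frac{1}{N} \cdot F(k) \cdot \prod_{i=1}^{v(k)} W^{-\alpha(i)}_{(n,k)} \tag{2.4.4} $$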
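
The relationship between the full rotation, its sub-rotations and the per-coefficient contributions can be checked numerically. The sketch below assumes one particular split, with one sub-rotation per binary digit of n; the actual sub-rotations follow the FFT flow graph of Figure 2.3 and may be grouped differently. The function names are hypothetical.

```python
import cmath

def rotation(exponent, N):
    """W^exponent with W = exp(-j*2*pi/N), the sign convention of (2.4.1)."""
    return cmath.exp(-2j * cmath.pi * exponent / N)

def sub_rotations(n, k, N):
    """Assumed split of W^{-nk} into one factor per binary digit of n (N a power of two)."""
    factors = []
    bit = 0
    while (1 << bit) < N:
        digit = (n >> bit) & 1
        factors.append(rotation(-digit * (1 << bit) * k, N))
        bit += 1
    return factors

def tau(F, k, n):
    """Contribution of F[k] to time sample n, built as a product of sub-rotations."""
    N = len(F)
    product = 1.0 + 0j
    for w in sub_rotations(n, k, N):
        product *= w
    return F[k] * product / N

def idft_from_contributions(F):
    """Inverse DFT obtained by summing the contributions of all coefficients."""
    N = len(F)
    return [sum(tau(F, k, n) for k in range(N)) for n in range(N)]
```

Summing `tau(F, k, n)` over k reproduces the inverse DFT of (2.4.2), which confirms that the product of the assumed sub-rotations equals the full rotation.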
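The truncation of a SOPOT coefficient is equivalent to introducing an additive error δ, 
and the realistic reconstructed sample $\hat{f}(n)$ can be calculated as:  
$$ \hat{f}(n) = \frac{1}{N} \sum_{k=0}^{N-1} F(k) \cdot \prod_{i=1}^{v(k)} \left( W^{-\alpha(i)}_{(n,k)} + \delta^{(i)}_{(n,k)} \right) \tag{2.4.5} $$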
Next we investigate the truncation errors when we only truncate the SOPOT 
coefficients of a single IFFT coefficient block. We define an IFFT coefficient block 
as $\{ W^{-\alpha(i)}_{(n,k)} \mid 0 \le n \le N-1 \}$, which are those coefficients grouped in the same box in 
Figure 2.3. An important property of the coefficient block is that the transform of a 
frequency coefficient can be decomposed into a series of calculations using the 
coefficient blocks. This property allows us to control the output of the coefficient 
block to achieve the desired truncation noise shaping. 
Without loss of generality, let the j-th IFFT coefficient block associated with F(k) be 
truncated. From (2.4.4), we have:   
$$ \hat{\tau}_k(n) = \frac{1}{N} F(k) \cdot \left( W^{-\alpha(j)}_{(n,k)} + \delta^{(j)}_{(n,k)} \right) \cdot \prod_{i=1, i \ne j}^{v(k)} W^{-\alpha(i)}_{(n,k)} \tag{2.4.6} $$
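To derive the spectrum of the truncation error, we conduct DFT over the time domain 
errors. Based on (2.4.1) and (2.4.4), we can represent the spectral error over the frequency 
coefficient F(m) as: 
$$ \Delta F(m) = \sum_{n=0}^{N-1} \left( \hat{\tau}_k(n) \cdot W^{nm} - \tau_k(n) \cdot W^{nm} \right) \tag{2.4.7} $$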
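Substituting (2.4.4) and (2.4.6) into (2.4.7), we have: 
$$ \begin{aligned}
\Delta F(m) &= \frac{F(k)}{N} \sum_{n=0}^{N-1} \left( \prod_{i=1}^{v(k)} W^{-\alpha(i)}_{(n,k)} + \delta^{(j)}_{(n,k)} \prod_{i=1, i \ne j}^{v(k)} W^{-\alpha(i)}_{(n,k)} - \prod_{i=1}^{v(k)} W^{-\alpha(i)}_{(n,k)} \right) W^{nm} \\
&= \frac{F(k)}{N} \sum_{n=0}^{N-1} \delta^{(j)}_{(n,k)} \prod_{i=1, i \ne j}^{v(k)} W^{-\alpha(i)}_{(n,k)} \cdot W^{nm}
\end{aligned} \tag{2.4.8} $$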
Equation (2.4.8) implies that the truncation noise of the frequency coefficient F(k) will be 
spread to other frequency coefficients when $\{ \delta^{(j)}_{(n,k)} \mid 0 \le n \le N-1 \}$ are random variables. 
This is exactly the case in the conventional analysis of coefficient quantization problems. 
To guarantee the orthogonality of the truncation noise, the truncation errors of a 
coefficient block should satisfy the orthogonal condition of the truncation noise, which 
is described in (2.4.9): 
$$ \delta^{(j)}_{(m,k)} \cdot W^{\alpha(j)}_{(m,k)} = \delta^{(j)}_{(n,k)} \cdot W^{\alpha(j)}_{(n,k)}, \quad 0 \le m, n < N \tag{2.4.9} $$
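In other words, $\{ \delta^{(j)}_{(n,k)} \cdot W^{\alpha^{(j)}_{(n,k)}} \mid 0 \le n \le N-1 \}$ have equal values. This is shown as
follows:
\[ \Delta F(m) = \frac{F(k)}{N} \sum_{n=0}^{N-1} \delta^{(j)}_{(n,k)} \prod_{i=1, i \ne j}^{v(k)} W^{-\alpha^{(i)}_{(n,k)}} \, W^{nm} \]
\[ \qquad = \frac{F(k)}{N} \sum_{n=0}^{N-1} \delta^{(j)}_{(n,k)} \cdot W^{\alpha^{(j)}_{(n,k)}} \cdot W^{-nk} \, W^{nm} \]
\[ \qquad = \frac{F(k)}{N} \, \delta^{(j)}_{(n,k)} \cdot W^{\alpha^{(j)}_{(n,k)}} \cdot \sum_{n=0}^{N-1} W^{-nk} \, W^{nm} = 0 \]
for any $m \ne k$. This justifies the orthogonality of the
truncation noise.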
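The effect of condition (2.4.9) can be checked numerically. The following is a minimal sketch, not the thesis implementation: the transform size, the bin index, the value of F(k) and the per-sample exponents `alpha_j` are all hypothetical, and the product over the untouched blocks is modeled as the IFFT kernel with the block-j factor divided out. It compares the leakage into other bins for shaped errors (which satisfy (2.4.9)) and for unshaped pseudo-random errors.

```python
import cmath
import random

N, k = 8, 3                                  # toy transform size and bin (assumed)
W = cmath.exp(-2j * cmath.pi / N)            # DFT kernel; the IFFT uses W**(-n*k)
F_k = 1.0 + 0.5j                             # arbitrary value of F(k) (assumed)
alpha_j = [(n * k) % 4 for n in range(N)]    # hypothetical exponents of block j


def spectral_error(delta):
    """DFT of the time-domain error caused by truncating block j alone."""
    # Product over i != j equals W**(-n*k) with the block-j factor divided out.
    err = [F_k / N * delta[n] * W ** (alpha_j[n] - n * k) for n in range(N)]
    return [sum(err[n] * W ** (n * m) for n in range(N)) for m in range(N)]


c = 0.01 * cmath.exp(1j * cmath.pi / 4)      # common value of delta * W**alpha
shaped = [c * W ** (-alpha_j[n]) for n in range(N)]   # satisfies (2.4.9)

random.seed(0)
unshaped = [0.01 * complex(random.uniform(-1, 1), random.uniform(-1, 1))
            for _ in range(N)]

# Largest error magnitude that lands in any bin other than k.
leak_shaped = max(abs(v) for m, v in enumerate(spectral_error(shaped)) if m != k)
leak_unshaped = max(abs(v) for m, v in enumerate(spectral_error(unshaped)) if m != k)
print(leak_shaped, leak_unshaped)
```

Under these assumptions the shaped errors leave only floating point residue outside bin k, while the unshaped errors spread to the other bins.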
It should be noted that when we let the various $\{ \delta^{(j)}_{(n,k)} \cdot W^{\alpha^{(j)}_{(n,k)}} \mid 0 \le n \le N-1 \}$ have equal values,
$\{ \delta^{(j)}_{(n,k)} \mid 0 \le n \le N-1 \}$ are no longer truncation errors. However, they support the truncation
operation in that these errors dominate the distortion and make the actual
truncation errors, which are of smaller levels, negligible. Thus the magnitude of $\delta^{(j)}_{(n,k)}$ can serve
as an indicator for the truncation position of the corresponding SOPOT coefficient,
which will be shown in section 2.4.2. In this sense, we call them “truncation errors”.
2.4.1.2 Modeling the Noise from a Series of IFFT Coefficient Blocks 
We have proposed a method to shape the truncation errors by a single IFFT coefficient 
block in subsection 2.4.1.1. As illustrated in Figure 2.3, most frequency coefficients 
are associated with a series of blocks. To build a model to represent the sum of 
truncation noises, which results from a series of blocks, for a frequency coefficient, we 
need to extend the results presented in subsection 2.4.1.1.  
The sum of truncation errors for frequency coefficient F(k) can be represented as:
\[ e_T(k) = \hat{F}(k) - F(k) = \frac{F(k)}{N} \sum_{n=0}^{N-1} \left( \prod_{i=1}^{v(k)} \left( W^{-\alpha^{(i)}_{(n,k)}} + \delta^{(i)}_{(n,k)} \right) - \prod_{i=1}^{v(k)} W^{-\alpha^{(i)}_{(n,k)}} \right) \cdot W^{nk} \qquad (2.4.10) \]
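As the errors are of small values, we can neglect the higher order error terms.
Substituting (2.4.3), we have:
\[ \prod_{i=1}^{v(k)} \left( W^{-\alpha^{(i)}_{(n,k)}} + \delta^{(i)}_{(n,k)} \right) - \prod_{i=1}^{v(k)} W^{-\alpha^{(i)}_{(n,k)}} \approx \sum_{i=1}^{v(k)} \delta^{(i)}_{(n,k)} \prod_{j=1, j \ne i}^{v(k)} W^{-\alpha^{(j)}_{(n,k)}} \]
\[ \qquad = W^{-n \cdot k} \sum_{i=1}^{v(k)} \delta^{(i)}_{(n,k)} \cdot W^{\alpha^{(i)}_{(n,k)}} \qquad (2.4.11) \]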
From (2.4.11), for any $i \in [1, v(k)]$, if we let $\{ \delta^{(i)}_{(n,k)} \cdot W^{\alpha^{(i)}_{(n,k)}} \mid 0 \le n \le N-1 \}$ have equal
values, the orthogonality of the truncation noise holds for a series of coefficient
blocks.
For convenience, when the orthogonality of the truncation noise holds, we
denote $\delta^{(i)}_{(n,k)} \cdot W^{\alpha^{(i)}_{(n,k)}} = \delta^{(i)}_{(k)} \cdot e^{j\varphi(k,i)}$.
Thus the sum of truncation noises for frequency coefficient F(k), which results from a
series of blocks, is represented as:
\[ e_T(k) = F(k) \cdot \sum_{i=1}^{v(k)} \delta^{(i)}_{(k)} \cdot e^{j\varphi(k,i)} \qquad (2.4.12) \]
As we further develop (2.4.12), a critical issue is how to deal effectively with the cross
terms between truncation errors. In statistical model based approaches [59][88], these
truncation errors are assumed to be independent zero-mean random variables, so their
cross terms vanish in the expectation of the sum of truncation noises.
By contrast, in our work these errors are subject to the orthogonality conditions and
therefore cannot be assumed to be independent zero-mean random variables. The resulting
non-zero cross terms lead to two undesired consequences. First, they
complicate the analysis of noise allocation for IFFT coefficient blocks. Second,
and more importantly, they potentially lead to additional noise, which
degrades the efficiency of workload reduction. To address the
issue of the cross terms between truncation errors, we suppress the sum of truncation
noises by shaping the angles of the truncation errors. Furthermore, the angle shaping
technique enables us to develop a cross-term-deleted representation of the sum of
truncation noises.
To facilitate the angle shaping, we limit the value set of
$\{ \varphi(k,j) \mid 0 \le k \le N-1,\ 1 \le j \le v(k) \}$ to be $\{ \pi/4,\ 5\pi/4 \}$. These two angles are opposite to
each other. When we assign different angles to the block truncation errors, these
truncation errors subtract from each other, making the noise as small as possible.
Further, this reduces the angle shaping process to a “sign” assignment operation: “+”
denotes $\pi/4$, and “-” denotes $5\pi/4$. We have designed the following algorithm to
accomplish the assignment for all the blocks.
We organize the IFFT coefficient blocks in the following structure: a block and all of its
left-covering blocks form a block tree, and the largest block acts as the root of the tree.
In addition, for block k, we define P(k) as the set of blocks that left-cover block k.
For example, in Figure 2.3, {B(1), B(3), B(6), B(7)}, {B(3), B(7)} and
{B(7)} are three such block trees; P(6) = {B(1)} and P(7) = {B(3), B(1)}.
We start with the largest tree and perform the assignment for the root. Deleting the 
root which has been processed, the current tree is decomposed into several smaller 
trees. We then iteratively move to these block trees and conduct the same computation 
for their roots until all blocks are processed. 
We denote $\delta_j$ and S(j) as the noise level and sign of block j. For the current block k, given
the truncation noise levels of all coefficient blocks, we have:
\[ S(k) = -\mathrm{SIGN}\left( \sum_{j \in P(k)} S(j) \cdot \delta_j \right) \qquad (2.4.13) \]
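The sign assignment can be sketched in a few lines. This is an illustrative sketch only: the function name, the dictionary layout, the noise levels and P(3) are assumptions (P(6) and P(7) follow the example given for Figure 2.3), and SIGN(0) is treated as negative so that a root, whose covering set is empty, receives “+”.

```python
def assign_signs(order, covering, delta):
    """Assign +1 / -1 to every block following (2.4.13).

    order    : block indices, roots first, so that every block in
               covering[k] is processed before block k
    covering : covering[k] is P(k), the blocks that left-cover block k
    delta    : delta[k] is the noise level of block k
    """
    sign = {}
    for k in order:
        total = sum(sign[j] * delta[j] for j in covering[k])
        # Assumption: SIGN(0) is treated as negative, so a root gets "+".
        sign[k] = -1 if total > 0 else 1
    return sign


# P(6) and P(7) follow the text; P(3) and the noise levels are assumed.
covering = {1: [], 3: [1], 6: [1], 7: [3, 1]}
delta = {1: 0.4, 3: 0.3, 6: 0.2, 7: 0.2}
signs = assign_signs([1, 3, 6, 7], covering, delta)
print(signs)
```

Each block receives the sign opposite to the signed noise already accumulated by the blocks that cover it, which is what lets the block errors partially cancel.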
According to this algorithm, we can derive an upper bound on the overall truncation
noise.
From (2.4.13), we have:
\[ \varphi(k, v(k)) = \pi + \angle \left( \sum_{i=1}^{v(k)-1} \delta^{(i)}_{(k)} \cdot e^{j\varphi(k,i)} \right) \qquad (2.4.14) \]
Thus
\[ \left| \sum_{i=1}^{v(k)} \delta^{(i)}_{(k)} \cdot e^{j\varphi(k,i)} \right|^2 \le \left( \delta^{(v(k))}_{(k)} \right)^2 + \left| \sum_{i=1}^{v(k)-1} \delta^{(i)}_{(k)} \cdot e^{j\varphi(k,i)} \right|^2 \qquad (2.4.15) \]
In this way, we can iteratively decompose the last term on the right side of (2.4.15) and
obtain a cross-term-deleted representation of the sum of truncation noise:
\[ e_T^2(k) = F^2(k) \cdot \left| \sum_{i=1}^{v(k)} \delta^{(i)}_{(k)} \cdot e^{j\varphi(k,i)} \right|^2 \le F^2(k) \cdot \sum_{i=1}^{v(k)} \left( \delta^{(i)}_{(k)} \right)^2 \qquad (2.4.16) \]
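The bound behind (2.4.15) and (2.4.16) can be checked with a small experiment. The sketch below is a simplification under stated assumptions: it considers the chain of blocks of a single frequency coefficient, lets each block take the sign opposite to the running signed sum, and uses pseudo-random noise levels; the function name and the ranges are hypothetical.

```python
import random


def shaped_sum(levels):
    """Signed sum when each level opposes the sign of the running sum."""
    total = 0.0
    for d in levels:
        total += -d if total > 0 else d
    return total


random.seed(1)
worst_ratio = 0.0
for _ in range(1000):
    levels = [random.uniform(0.01, 1.0) for _ in range(random.randint(1, 9))]
    bound = sum(d * d for d in levels)          # cross-term-free bound
    worst_ratio = max(worst_ratio, shaped_sum(levels) ** 2 / bound)

print(worst_ratio)
```

Because every new term opposes the running sum, its cross term is never positive, so the squared sum never exceeds the sum of squares; the ratio reaches 1 only in degenerate cases such as a chain with a single block.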
2.4.2 Noise Allocation over SOPOT Coefficient Blocks  
Given the MQDs of all the scale factor bands, we need to allocate them to the IFFT 
coefficient blocks. Based on its allocated noises, we can perform truncation over an 
IFFT coefficient block to achieve workload reduction. 
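First we need to establish the relationship between the number of reserved SOPOT
terms and their corresponding truncation noise. We denote b as the truncation position
of a SOPOT coefficient. The number of reserved SOPOT terms is then:
\[ \sum_{i \le b} I(i), \qquad I(i) = \begin{cases} 1, & \text{the coefficient has a SOPOT term at position } i \\ 0, & \text{otherwise} \end{cases} \qquad (2.4.17) \]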
Its implication is straightforward: we can effectively reduce the number of SOPOT
terms by left-shifting the truncation position. Therefore we use the truncation position
as the estimate of the computational workload of a SOPOT coefficient. On the other
hand, the relationship between the truncation noise $e_t^2$ and the truncation position b has been
well established in the literature.
For a fixed point coefficient, we have [59]:
\[ e_t^2(b) = 2^{-2b} / 6 \qquad (2.4.18) \]
Next we need to associate the truncation noise with the allocated noise $\delta_i^2$ of block i.
Towards this, we choose the truncation position for block i as follows, where c > 0 is
a constant for all coefficients:
\[ b = \arg\min_j \left( I_{\delta_i}(j) = 1 \right) + c \qquad (2.4.19) \]
This indicates that we only reserve c bits of $\delta_i$ from its first non-zero most significant
bit and the rest of the bits are discarded to save computation. Thus we can derive the
relationship between the allocated noise $\delta_i^2$ of block i and its associated truncation
noise. The expectation of the ratio between the value of those reserved bits of $\delta_i$ and
the weight $2^{-b}$ of the truncation position is:
\[ 2^c + 0.5 \cdot \sum_{j=0}^{c-1} 2^j \qquad (2.4.20) \]
From (2.4.20) and (2.4.18), we have:
\[ \delta_i^2 = 6 \cdot \left( 2^c + 2^{c-1} - 0.5 \right)^2 \cdot e_t^2(b) \qquad (2.4.21) \]
In (2.4.21), we can see that the constant c determines the scaling operation by right-
shifting the truncation position. Due to this, we call c the scaling factor of truncation 
noise. The relationship described in (2.4.21) implies that the truncation noise is 
reduced rapidly as c grows. For example, when c equals 3, the truncation noise is only 
a factor of 0.00126 of the allocated noise. Therefore these actual truncation errors can 
be safely neglected when using appropriate values of the scaling factor of truncation 
noise.    
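The quoted factor can be reproduced with a short calculation. The sketch below rests on assumptions that should be read as such: position j is taken to carry weight 2^-j, the truncation position is the position of the first non-zero bit plus c, and the expected ratio of the allocated noise level to 2^-b is modeled as a leading bit plus c further bits that are each set with probability one half. The function names are hypothetical.

```python
import math


def truncation_position(delta, c):
    """Truncation position b for 0 < delta < 1: first non-zero bit plus c."""
    _, exponent = math.frexp(delta)      # delta = m * 2**exponent, 0.5 <= m < 1
    first_nonzero = 1 - exponent         # position j carries weight 2**-j
    return first_nonzero + c


def noise_ratio(c):
    """Expected ratio of truncation noise to allocated noise (assumed model)."""
    expected_scale = 2 ** c + 0.5 * (2 ** c - 1)   # expected delta / 2**-b
    return 1.0 / (6.0 * expected_scale ** 2)


for c in (3, 4, 5):
    print(c, noise_ratio(c))
print(truncation_position(0.3, 3))
```

For c = 3 this model gives a ratio of about 0.00126, matching the figure quoted above, and the ratio keeps shrinking as c grows.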
From (2.4.18) and (2.4.21), we have $b = \log_2 \left( 2^c + 2^{c-1} - 0.5 \right) - 0.5 \log_2 \delta_i^2$.
Neglecting the constants, we can estimate the workload of a SOPOT coefficient of
the i-th block as $-\log_2 \delta_i^2$. Based on (2.2.1), we can formulate the noise allocation
problem as follows:
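\[ \text{Minimize} \quad \sum_{j=1}^{N-1} -l(j) \cdot \log_2 \delta_j^2 \]
\[ \text{Subject to} \quad \sum_{k \in \beta(i)} e_T^2(k) + E_{Q(i)} \le E_{T(i)}, \quad 0 \le i < 49 \qquad (2.4.22) \]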
where N is the block size, l(j) is the number of coefficients of the j-th coefficient block, and
$e_T(k)$ is the truncation error of coefficient k. Based on (2.4.16) and the result in [68],
which calculates $E_{Q(i)}$ in terms of the quantization step size and the coefficient
values of scale factor band i, we can develop (2.4.22) into the following, where
$\Delta_{Q(i)}$ denotes the quantization step size for scale factor band i:
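\[ \sum_{k \in \beta(i)} F^2(k) \sum_{j=1}^{v(k)} \left( \delta^{(j)}_{(k)} \right)^2 = E_{T(i)} - \Delta_{Q(i)}^2 \cdot \sum_{k \in \beta(i)} |F(k)|^{0.5}, \quad 0 \le i < 49 \qquad (2.4.23) \]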
Although this is a problem of finding extrema under constraints, which can in principle be
solved by the Lagrange multiplier method, the actual computation is prohibitively
complex: it requires solving a system of 49 non-linear equations in
49 variables with a maximal power of 255.
To address this issue, we resort to a heuristic to accelerate the computation and find a 
satisfactory solution. We make use of the same structure of IFFT coefficient blocks 
used in algorithm 4.1. Instead of finding the globally optimal solution, we search for a value
of the truncation noise for the root of a block tree that minimizes the sum of reserved
bits of the current block tree. By deleting the root which has been
assigned a truncation noise value, the current tree is decomposed into several smaller 
trees. We iteratively move to these block trees and conduct the same computation for 
their roots until each block is assigned a truncation noise.   
In this method, we are frequently required to allocate the allowed noise of a scale 
factor band or a frequency coefficient over its associated IFFT coefficient blocks to 
achieve the minimal workload, without involving other scale factor bands or 
frequency coefficients. We call this problem “independent noise allocation”. In this 
case, the allocation algorithm should allocate more noise to blocks including more 
coefficients. For IFFT, although different stages have different numbers and sizes of 
blocks, the total number of coefficients remains the same for all stages. Based on this
observation, the same amount of noise should be allocated to all of the associated
coefficient blocks.
The proposed allocation method consists of two steps. The first step is to allocate the
MQD of a scale factor band to its frequency components. Following the independent
noise allocation principle, we should allocate the same amount of noise to all the blocks
associated with the scale factor band. In this way, for scale factor band L, let its associated
blocks be B(L), and we have $\delta^{(i)}_{(m)} = \delta^{(j)}_{(n)}$ for any $m, n \in B(L)$, $i \in v(m)$ and $j \in v(n)$.
Then we can derive the initial allowed noise for each coefficient of the scale factor
band from (2.4.23). For a frequency coefficient k, let its initial allowed noise be $\varepsilon(k)$,
its number of associated blocks be $B(k)$, and the average noise of each block for the
j-th scale factor band be $\zeta(j)$; then we have:
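\[ \zeta(j) = \frac{E_{T(j)} - \Delta_{Q(j)}^2 \cdot \sum_{k \in \beta(j)} |F(k)|^{0.5}}{\sum_{k \in \beta(j)} F^2(k) \cdot B(k)} \qquad (2.4.24) \]
\[ \varepsilon(k) = B(k) \cdot \zeta(j), \quad k \in \beta(j) \qquad (2.4.25) \]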
After the first step, a coefficient block usually has multiple noise values assigned by
various frequency coefficients. In the second step, we choose an appropriate value
for each coefficient block. As discussed above, we reduce this problem to finding a
desired noise value for the root of a block tree. Let the block index of the root be i, its
associated frequency coefficients be T(i), the number of blocks between coefficient F(j) and
block i be R(i, j), the residual allowed noise for F(j) be $\breve{\varepsilon}(j)$, and the desired noise value
of the root be $\varepsilon_i$. To estimate the workload for F(j), we perform the independent
noise allocation over its residual allowed noise. Thus $\varepsilon_i$ will minimize the following:
















                                (2.4.26)  
Numerical techniques can be employed to find the desired value of $\varepsilon_i$ in (2.4.26).
2.4.3 Workload Estimation Module 
As discussed in section 2.3, the workload estimation module provides three 
functionalities: 1) to estimate the workload for the set of truncated SOPOT 
coefficients; 2) to choose a scale factor band to decrease its quantization step; 3) to 
control the second pass processing.  In this section, we will discuss these three aspects 
in detail.  
In SOPOT operations, the basic calculation units are the shift and add operations. This
enables us to measure the computational workload of the SOPOT IFFT in terms of the
number of shift and add operations. This measure is similar to the widespread
workload estimation for the floating point version of the IFFT using the number of
multiplications. Moreover, we can derive the exact number of shift and add
operations from the sum of reserved SOPOT terms of the IFFT.
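The link between reserved SOPOT terms and shift/add counts can be illustrated as follows. This is a minimal sketch rather than the decoder's code: the function name, the (sign, position) term layout and the example coefficient are assumptions, and the addition count uses the convention that accumulating n terms takes n - 1 additions.

```python
def sopot_multiply(x, terms, b):
    """Multiply integer x by a SOPOT coefficient truncated at position b.

    terms : list of (sign, position); each term is sign * 2**-position
    b     : truncation position; terms with position > b are discarded
    Returns (product, number of shifts, number of additions).
    """
    kept = [(s, p) for s, p in terms if p <= b]
    product = 0
    for s, p in kept:
        product += s * (x >> p)          # one shift per reserved term
    adds = max(len(kept) - 1, 0)         # accumulating n terms takes n - 1 adds
    return product, len(kept), adds


# 2**-1 + 2**-3 + 2**-4 + 2**-6 + 2**-8 = 0.70703125, close to cos(pi/4)
terms = [(1, 1), (1, 3), (1, 4), (1, 6), (1, 8)]
full = sopot_multiply(1024, terms, b=8)
cut = sopot_multiply(1024, terms, b=4)
print(full, cut)
```

Moving the truncation position from 8 to 4 drops two of the five terms, so the operation count falls from 5 shifts and 4 additions to 3 shifts and 2 additions, at the cost of a coarser product.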
When the estimated workload is greater than the workload constraint, the workload 
estimation module needs to choose a scale factor band to reduce its quantization step 
size. From (2.4.22), we can see that the MQD of the j-th scale factor band allocates
truncation noise $\zeta(j)$ to each of its associated coefficient blocks. On the other hand,
these coefficient blocks are also shared by other scale factor bands, so they
receive different allocated truncation noises from different scale factor bands. To provide
transparent playback quality, the performance of workload reduction is limited by the
minimal level of the various allocated truncation noises. For this reason, we should
choose the scale factor band j whose MQD is to be increased such that $\zeta(j) \le \zeta(i)$ for all $0 \le i < 49$.
The workload estimation module employs an iteration procedure to control the second 
pass processing. As the procedure of control is presented in section 2.3, we only 
present the termination conditions of the loop. Normally the loop terminates when the 
estimated workload is below the specified workload level. However, this is not always 
possible to achieve. In this case we exploit two other termination conditions, which are 
in accordance with those used in a conventional AAC encoder [58]: 1) The next 
iteration would require all scale factor bands to be amplified; 2) The next iteration 
would cause the difference between two consecutive scale factors to exceed 60. 
2.5 Experimental  Results  
To evaluate the performance of the proposed scheme, we implemented a prototype.
We employed Free Advanced Audio Coder (FAAC) version 0.60 [112] as the
conventional AAC encoder core in our work, as FAAC is a well-known open source
AAC encoder. Note that FAAC 0.60 is not the latest version. (In August 2006,
FAAC version 1.25 was released.) We have chosen FAAC 0.60 as our implementation
platform because the important algorithms used in FAAC 0.60 are in accordance
with those described in the informative parts of the AAC standard [58]. FAAC 1.25, on
the other hand, develops new techniques for psychoacoustic modeling and noise
allocation. Both versions produce output for the second pass processing. In this case,
well-documented techniques serve better as a proof of concept.
By analyzing the output of the conventional encoder core, we have found that the 
masking threshold constraints of some scale factor bands are not always satisfied. For 
these scale factor bands, we define the initial level of their MQDs as zero, rather than 
a negative value. This method ensures that the generated audio file will not have lower 
quality than the version by the conventional encoder core. 
An important coding parameter, which is not involved in the conventional AAC 
encoder core, is the scaling factor of the truncation noise (referring to section 2.4.2). 
The value of the scaling factor has a close relationship with the computational 
workload and the playback quality. Due to its importance, we perform experiments for 
various values of the scaling factor, ranging from 3 to 5. 
We carried out experiments on 6 selected audio clips, including 5 popular songs and 1
instrumental piece. All of them were extracted from CDs and coded in WAV format, at a
sampling rate of 44100 samples/sec, 16 bits per sample, in stereo mode.
2.5.1 IFFT Workload Reduction  
A primary motivation of our work is to reduce the computational workload at the
decoder side. To achieve this, we implemented the IFFT, which is an important
step in the IMDCT module, in SOPOT.
We first estimated the workload portion of IMDCT module in the low complexity 
profile. Towards this, we executed an AAC decoder, Free Advanced Audio Decoder 2 
(FAAD2) 2.0 [112] on an ARM simulation tool: Simplescalar/ARM [114]. We carried 
out the simulation in two settings: 1) floating point version: we made use of a software 
package to implement the floating point operations for all the processing modules; 2) 
fixed point version: we employed un-truncated SOPOT coefficients for IMDCT, 
which does not introduce any coefficient truncation noise, and fixed point 
multiplication for the other modules. This was because fixed point multiplication 
cannot provide the required accuracy for 2048-point IMDCT. Simulation results 
showed that IMDCT is responsible for 86% and 92% workload of the entire decoding 
process, in the floating point version, and the fixed point version, respectively. These 
results provide strong motivation for workload reduction on IMDCT. In addition, 
IFFT is responsible for around 55% workload of IMDCT in both settings. 
Next we estimated the workload reduction of IFFT. We measured the computational 
workload of IFFT of SOPOT version in terms of the number of shift and add 
operations, as described in section 2.4.3. We have counted the exact number of shift 
and add operations when performing the IFFT. As mentioned above, the workload is 
related to the scaling factor of truncation noise. We varied the values of the scaling 
factor of truncation noise from 3 to 5 to change the workload. The baseline
workload was calculated using the un-truncated SOPOT coefficients for the same
input data. We present the results in Figure 2.4, normalized to this baseline.
Figure 2.4  Normalized workload for the test audio clips, where SF denotes the 
scaling factor of truncation noise 
As shown in Figure 2.4, we have achieved significant workload reduction for IFFT
computation in an AAC decoder. Encoding the audio data using the proposed scheme
with scaling factors of truncation noise of 3, 4 and 5, we save 77.8%, 75.0% and 72.8%
of the computational workload on average, respectively. To our knowledge,
the presented results are better than all the results reported in the literature for a
512-point transform. Although various methods have been proposed to save
computations for transforms, it is hard to find an effective way to reduce the workload
of a long-block transform. We illustrate this fact with the following examples.
In terms of pruning techniques, for the IDCT, the relationship between the workload
reduction ratio G, the block size B, and the number of non-zero frequency coefficients
b can be described as follows [106]:
\[ G \approx 1 - \frac{\log b}{\log B} \qquad (2.5.1) \]
We can then estimate the workload reduction for a 512-point IDCT using the
pruning technique. According to (2.5.1), even if we discard three quarters of the
frequency coefficients, we can only achieve 22.2% computation savings in the best
case.
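The 22.2% figure follows directly from the pruning relationship G ≈ 1 − log b / log B. A small sketch of the arithmetic (the function name is hypothetical):

```python
import math


def pruning_gain(nonzero, block_size):
    """Workload reduction ratio G = 1 - log(b) / log(B) for a pruned transform."""
    return 1.0 - math.log(nonzero) / math.log(block_size)


# Keep one quarter (128) of the 512 frequency coefficients.
gain = pruning_gain(128, 512)
print(round(gain * 100, 1))
```

Since log 128 / log 512 = 7/9, the gain is 2/9, i.e. about 22.2%, even though 75% of the coefficients are discarded.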
On the other hand, for approximate techniques, the basic assumption is that the noise 
introduced by approximation is negligible in comparison with the energy of the signal. 
This assumption is valid for transforms with short block [16]. However, the 
approximation noise accumulates as the block size increases.  As a consequence, the 
approximation noise can no longer be neglected for a long block transform.  
We solved the problem in an alternative way. In a similar manner to the “lossy 
compression” used in audio encoding to achieve high compression ratios, for the long 
block size of transform, we perform “lossy transform”, where we allow the truncation 
noises with higher levels to achieve significant workload reduction. We then 
addressed these noises by MQD.  
2.5.2 Subjective Evaluation  
To evaluate our scheme, we carried out subjective tests on a group of 30 subjects 
(male and female undergraduate students with normal hearing). All subjects were 
asked to evaluate audio quality using the mean opinion score (MOS), which is a five-
point scale (5-excellent, 4-good, 3-fair, 2-poor, and 1-bad). 
We encoded the selected 6 audio clips into AAC bitstreams using the prototype 
encoder. We set the encoding parameters for the conventional AAC encoder core as 
follows: bitrate 128kbits/s, with low complexity profile, temporal noise shaping, 
Middle/Side coding switched on, and with long term prediction, perceptual noise 
substitution, and intensity coding switched off. These are typical options for an AAC 
encoder. We prepared four copies for each program. Three copies were generated by 
our scheme with scaling factor of truncation noises of 3, 4 and 5, and the 
corresponding workload reduction ratios were chosen as 77.8%, 75.0% and 72.8%, 
respectively. The fourth copy was generated by a FAAC 0.60 encoder with the same 
encoding parameters, which served as the baseline sample. In addition, each program 
was also given the uncompressed clip as reference (MOS=5). For fairness, all test 
samples were arranged in random order. We present the averaged MOS values and
their associated standard deviation levels for each copy in Figure 2.5.
From Figure 2.5, we cannot identify a regular order for these copies of different 
scaling factors of truncation noise. This implies that there are no perceptually 
significant differences among these test samples.  
Another important result is the comparison with the baseline samples. It determines 
whether our scheme can provide transparent playback quality. Towards this, we 
averaged the MOS values for all six test clips and we had: baseline= 4.11, SF=3: 4.14, 
SF=4: 4.17 and SF=5: 4.25. As these values are so close to each other, we can
conclude that, with a scaling factor of truncation noise no less than 3, our scheme
achieves playback quality with no perceivable difference from the baseline encoder.
Figure 2.5 Averaged MOS values with standard deviations for the test audio clips, 
where SF denotes the scaling factor of truncation noise  
2.5.3 Increase of File Sizes 
High compression ratios are an important goal for audio encoders. As aforementioned, 
our scheme will sacrifice some compression efficiency in exchange for the reduction 
of the decoder's computational workload.  In this section, we will investigate the 
compression characteristics of our work by comparing the file sizes generated by our 
scheme with that by the baseline AAC encoder.  
The baseline encoder is chosen as FAAC 0.60, with the identical bit rates and 
encoding parameters as those used by the conventional encoder core in the prototype 
encoder. We used the same encoding parameters as those presented in 2.5.2, except 
the bit rate. The workload reduction ratio was chosen as 77.8%. The file size generated 
by our scheme depends on the specified bit rate for the conventional encoder core. In 
the AAC standard, ten bit rates are provided, ranging from 64 Kb/s to 320 Kb/s. As
our scheme targets high quality audio entertainment, the basic requirement for the
bit rate should be 128 Kb/s. However, we added two lower bit rates for evaluation purposes.
Thus we chose the following bit rates: 96 Kb/s, 112 Kb/s, 128 Kb/s, 160 Kb/s, 192
Kb/s, 224 Kb/s, 256 Kb/s and 320 Kb/s. For each bit rate, we measured the generated
file sizes and computed the increase ratios compared to the baseline, as shown in
Figure 2.6.






























Figure 2.6 Increase ratios of file sizes for various encoding bit rates 
From Figure 2.6, we can see that the increase ratios of the compressed file sizes 
decrease as the bit rate increases. This can be explained as follows. The initial levels 
of MQDs increase as the specified bit rates become larger. The required changes of the 
quantization step sizes for each scale factor band become smaller. Consequently, this 
results in smaller increases of the bit length of the coded frequency coefficients. For 
the bit rate of 128 Kb/s, the average file size increase ratio is 9.52%. This implies that 
our scheme only incurs a modest increase in file sizes. We believe that this is 
acceptable for the targeted application scenarios. On the other hand, the results shown 
in Figure 2.6 also justify our work. When we encode the audio data at 320 Kb/s, which
is the largest bit rate supported by the AAC standard, the scheme still incurs a 4.13%
increase in file size. This suggests that the ASP noise still violates the constraints of the
MQDs. This implies that the initial levels of the MQDs provided by the conventional AAC
encoder cannot mask all the ASP noise, and special techniques, such as the proposed
scheme, are needed.



Chapter 3  
An Optimal DVS Scheme Supported by Media 
Servers for Low-Power Multimedia Applications  
3.1 Introduction  
Dynamic Voltage Scaling (DVS) is one of the most widely used approaches to reducing
processor energy consumption. It adjusts the clock frequency and/or supply
voltage level at run time while meeting the required Quality of Service (QoS).
Multimedia decoding applications, among the most popular applications running on
portable devices, have large variations in workload, which has proved to be a major
challenge for DVS. When applied to multimedia decoding, the performance of a DVS
scheme can be evaluated in terms of its energy efficiency and the QoS it offers. Different
DVS schemes provide different tradeoffs between these two factors. In the class of hard
real-time DVS schemes, the QoS is guaranteed at the cost of degraded energy
efficiency. This class of approaches performs the DVS operations using a global worst
case execution time [108], or adaptive worst case execution times [56] [102]. The
energy savings are thus limited, since the large variations in the workload of multimedia
decoding cannot be fully exploited. This leads to the class of soft real-time DVS
schemes. As multimedia decoding exhibits non-stationary workload requirements, the
conventional interval based workload prediction methods result in unacceptably
suboptimal solutions [107] [91]. The effectiveness of DVS techniques largely depends
on the ability to predict the workload of the multimedia decoding. Towards this, three
subclasses of approaches have been developed. The first subclass improves the
prediction accuracy by incorporating frame parameters of the multimedia bitstream,
such as frame types [23] and code sizes [4], into the estimation, since there is a
strong correlation between workload and these parameters. The second subclass meets
a certain percentage of frame deadlines, based on the probability distribution of
workload demands [109]. This subclass of work provides tunable tradeoffs between
workload threshold and QoS. In the third subclass [26], the workload information is
supplied by content providers in conjunction with the video clips, so workload
prediction at the client site is not needed. Although much effort has been made,
workload prediction remains a challenge.
To alleviate the accuracy issue of workload prediction, some works explore the
possibility of avoiding missed deadlines with a buffering mechanism. This concept
can be traced back to [87], where processor speeds are dynamically scaled based on the
filling level of the input buffer, to avoid its overflow or underflow. In recent years,
buffer based DVS techniques have been developed to compensate for the inaccuracy
of the workload prediction [75], to average the workloads of multiple frames [73], and
to reduce the idle periods of the processor [54]. In general, buffer based DVS
approaches can achieve significant energy savings. According to our analysis, this is
mainly because fluctuations of the multimedia decoding workload are smoothed out by the
buffers. As the energy consumption of a processor is a convex function of its speed [72],
for the same average workload, energy consumption increases with the degree of
fluctuation in the processor speed. This can be demonstrated by a simple example.
Consider a decoding task with two frames: in case A, both frames require the same
processor speed of 3; in cases B and C, they require processor speeds of 2 and 4, and 1
and 5, respectively. Cases A, B and C thus have the same average workload, but their
fluctuation levels increase. Assuming that the energy consumption of the
processor is the square of its speed, which is a convex function, we have
$E_A = 3^2 + 3^2 < E_B = 2^2 + 4^2 < E_C = 1^2 + 5^2$.
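The three cases can be evaluated directly. The sketch below only restates the example's own assumption that the energy spent on a frame equals the square of the processor speed; the function and variable names are illustrative.

```python
def energy(speeds):
    """Energy of a schedule, assuming energy per frame equals speed squared."""
    return sum(s * s for s in speeds)


cases = {"A": (3, 3), "B": (2, 4), "C": (1, 5)}
energies = {name: energy(speeds) for name, speeds in cases.items()}
print(energies)
```

All three schedules have the same average speed of 3, yet the energy grows from 18 to 20 to 26 as the fluctuation increases, which is the convexity argument in numbers.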
As multimedia decoding exhibits large fluctuations in workload, smoothing becomes 
an effective mechanism to reduce energy consumption. The attractiveness of smoothing 
stems from three aspects. First, energy consumption can be substantially reduced 
without sacrificing the playback quality [46]. Second, it offers an appealing 
compromise between prolonging the battery life and small latency (in our scheme, an 
averaged latency of less than 0.1 sec will be introduced, which is negligible). Third, it 
does not require additional buffers for implementation since most multimedia decoders 
have already made use of input and playback buffers to improve performance (referring 
to Figure 3.1). These facts suggest that the smoothing mechanism is very promising for
multimedia applications on portable devices.
The effect of smoothing has received relatively little attention so far, and
consequently existing buffer based DVS techniques yield suboptimal performance in
terms of smoothing. In these approaches, buffers provide two conflicting functions:
1) smoothing out fluctuations of the processor speed to reduce energy consumption; 2)
avoiding missed deadlines to guarantee QoS. Consequently, the QoS requirement interferes
with the smoothing effect and energy efficiency is degraded. Taking the algorithm
proposed in [75] as an example, processor speeds are scaled according to the filling level
of the buffer following control-theoretic principles. To avoid overflow and underflow of
the buffer, sufficient marginal buffer space needs to be reserved during the speed
control process. This shows that the buffer space is not fully exploited to achieve
smoothing.
Motivated by this observation, we pose the following question: is it possible 
to address energy efficiency and QoS separately in a DVS scheme, letting buffers focus 
on energy savings and exploiting an alternative mechanism to guarantee QoS? This 
question represents a novel insight into the DVS approach. If we have accurate 
knowledge of the workloads and bit lengths of the decoding units, missed deadlines can be 
avoided through bitstream analysis rather than by monitoring the filling level of the 
buffer. This avoids unnecessary speed scaling operations, so fluctuations of the 
processor speed are further smoothed out.  
This strategy, however, cannot be supported by current techniques. In most existing 
DVS approaches, the client is solely responsible for the DVS operations, and these 
client-only schemes have inherent limitations. First, it is hard to obtain accurate 
workload information using prediction techniques in client-only schemes, as 
has been clearly shown in our brief literature survey. Furthermore, due to the real-time 
requirement and limited computing resources, client-only approaches can only 
afford computationally efficient but suboptimal solutions. These issues can, however, 
be solved in the media server. The server can: 1) obtain an accurate 
workload estimation by simulation or measurement; and 2) solve the DVS problem in an 
off-line manner, employing powerful computational resources to yield the globally optimal 
solution. In addition, this method shifts the major computations relevant to energy 
efficiency from the client to the server, which significantly simplifies the design of the 
speed control scheme of the client. Based on these observations, in this chapter we 
investigate the possibility of a media server supported DVS scheme, which generates the speed 
profile for a given multimedia bitstream at the server site by incorporating accurate 
workload estimation and the smoothing mechanism. The resulting speed profile is then 
sent to clients together with the multimedia content. At runtime, instead of performing 
conventional DVS operations, the client reads the associated speed information and 
scales the processor speed accordingly. To the best of our knowledge, the proposed 
approach has not been studied before.   
The proposed scheme has a wide range of applications involving pre-recorded media 
content, such as video on demand or simple download and playback of media content. 
For such services, however, our scheme has to address the large 
diversity of client architectures, since it employs static time analysis 
techniques that take the decoding workloads and the memory sizes as input parameters, both 
of which depend on the targeted client architecture. In comparison with buffer sizes, 
workload is less critical, since multimedia decoding workloads on different processors 
can be derived by a simple scaling computation [26]. To address the different buffer sizes 
of different clients, a possible solution is to generate different speed profiles for 
different groups of memory sizes. However, this is unnecessary in the case of our 
scheme, as we explain below. 
Besides the concept of media server supported DVS, the other significant contributions 
of the proposed scheme are as follows.  
First, we have developed a bitstream analysis framework to address the inherent 
fluctuation of the decoding workload of a given media bitstream by manipulating the 
sizes and decoding workloads of its media units and the input and playback buffers of the 
media decoder. As a generic framework, it can be used to solve various problems and 
provides analytical results.   
Second, based on the preceding framework, we have proposed an algorithm to 
compute an optimal speed profile, which achieves the minimal energy consumption 
among all feasible speed profiles with guaranteed QoS. Compared to existing buffer 
based DVS, our scheme achieves an improvement of 11% to 16% in energy consumption and 
10% to 17% in playback buffer requirement for audio bitstreams (details can be found in 
section 3.4.1). The proposed algorithm has two important properties which support practical 
applications. Since our scheme is applicable to both audio and video bitstreams, and 
video is more challenging for buffer based DVS, the following description is mainly 
based on video to show the effectiveness of our scheme.  
• The energy consumption converges rapidly with buffer growth. Based on 
our experimental results, we have found that small buffers are sufficient 
to provide satisfactory performance for video decoding: on average, 14.40 
Kbytes of input buffer and 453.48 Kbytes (1181 Macro blocks, less than 3 
frames) of playback buffer lead to less than 2% additional energy 
consumption in comparison with the theoretical lower bound (details can be 
found in Table 3.2 and Table 3.4). Such buffer sizes are so small that they can 
be met by most mobile devices. The speed profile is therefore computed according to 
the buffer sizes of the feasibility condition, rather than the actual buffer sizes of 
specific mobile devices. This property satisfactorily solves the 
diversity issue. Meanwhile, the reduction of buffer requirements provides 
additional opportunities for energy savings. Through a memory control 
mechanism, the unused memory can be shut down or switched to an idle 
state to save energy. In such a case, the energy reduction is closely related to 
the reduction of the buffer requirement. As memory operations are responsible 
for a significant portion of the overall energy consumption, the reduction of 
the buffer requirement makes an important contribution to the overall energy 
efficiency.   
• The number of speed scaling operations is greatly reduced. The algorithm 
keeps the processor speed constant (details can be found in section 3.3.3) until the 
speed has to be changed due to the constraints of the buffers. This property 
allows us to record the speed only when it changes, which 
significantly reduces the size of the speed profile. According to our 
measurements on actual DVD movies, for a 100-minute video clip the 
corresponding speed profile is around 61 Kbytes (at Macro block level). We 
believe this is quite acceptable for practical applications: even if we assume 
that the size of a 100-minute video clip is 305 Mbytes (which is far less than the 
size of a DVD movie), the size of the speed profile is only 1/5000 of the 
video content. 
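The idea of recording the speed only when it changes can be illustrated with a small sketch. The function `compress_profile` and the sample speeds are hypothetical and are not taken from the thesis implementation; the last two lines simply redo the 1/5000 arithmetic under the assumption that K and M denote decimal multiples.

```python
def compress_profile(speeds):
    """Keep (index, speed) only where the speed differs from the previous entry."""
    changes = []
    for index, speed in enumerate(speeds):
        if not changes or changes[-1][1] != speed:
            changes.append((index, speed))
    return changes

per_block_speeds = [200, 200, 200, 350, 350, 200]  # hypothetical speed per Macro block
change_points = compress_profile(per_block_speeds)
print(change_points)  # [(0, 200), (3, 350), (5, 200)]

# Size ratio quoted in the text, taking K = 10**3 and M = 10**6 bytes.
ratio = (305 * 10**6) / (61 * 10**3)
print(ratio)  # 5000.0
```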
This chapter is organized as follows. Section 3.2 formulates the problem. In section 3.3, 
we present our solutions in three steps. In section 3.4, we evaluate the proposed 
scheme. In section 3.5, we prove the optimality of our speed profile algorithm. 
3.2 Problem Formulation  
 
Figure 3.1 Architecture of the multimedia processing system at the client site 
In this chapter, we consider the multimedia processing system architecture 
shown in Figure 3.1. The targeted client consists of: 1) an input buffer, which 
stores the incoming compressed media stream before it is processed; 2) a playback 
buffer, which holds the decoded media data for the display device; and 3) a processor 
supporting DVS, whose dynamic range is assumed to be sufficient for our scheme. Both 
the input and playback buffers have fixed capacities and work in a FIFO manner. Based 
on the discussion in section 3.4.2, we will see that the assumption on the dynamic range 
is reasonable. For the sake of generality, we model the power-speed relationship of the 
processor as a convex function [72] [55]. The system-level view of the media bitstream 
throughout this chapter is as follows: it is made up of a sequence of media objects. A media 
object in this chapter can be a frame or a Macro block. Before compression, all these 
objects have an identical bit length. However, the encoding process changes the bit 
lengths of the media objects. Furthermore, the workload of decoding these media objects 
also differs. This model represents most multimedia bitstreams generated by current 
compression techniques, such as MP3, AAC audio, and MPEG video.  
For a given media bitstream over the period $[0, T]$, our proposed method produces a 
speed profile $\pi = \{(t_1, \omega_1), \ldots, (t_n, \omega_n)\}$, with $t_0 = 0$ and $t_n = T$, which means that the 
processor speed is set to $\omega_i$ over the time interval $[t_{i-1}, t_i]$, for $1 \le i \le n$. 
We assume that the incoming bitstream arrives at the input buffer at a constant rate of $r$ 
bits/sec and that the playback device reads media objects from the playback buffer at a rate 
of $C$ objs/sec. Let the function $\alpha(k)$ denote the sum of the bit lengths of media objects 
1 to $k$. Similarly, let the function $\Gamma(k)$ denote the sum of the cycle numbers required to 
decode media objects 1 to $k$. These two functions can be obtained by analyzing the 
given media bitstream. 
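Both cumulative functions are prefix sums over the per-object values. The sketch below assumes the per-object bit lengths and cycle counts are already available as lists; the numbers are hypothetical and the helper name `cumulative` is ours.

```python
from itertools import accumulate

def cumulative(values):
    """Return [0, v1, v1+v2, ...] so that result[k] is the sum over objects 1..k."""
    return [0] + list(accumulate(values))

bit_lengths = [1200, 300, 450, 900]   # hypothetical compressed sizes in bits
cycles = [5000, 1500, 2000, 4200]     # hypothetical decoding cycles per object

alpha = cumulative(bit_lengths)   # alpha[k]: total bits of objects 1..k
gamma = cumulative(cycles)        # gamma[k]: total cycles of objects 1..k
print(alpha)  # [0, 1200, 1500, 1950, 2850]
print(gamma)  # [0, 5000, 6500, 8500, 12700]
```

Storing a leading zero makes the index equal to the number of objects, which keeps later lookups free of off-by-one adjustments.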
The problem can be formally stated as follows. We assume that the first bit of the 
bitstream arrives at the input buffer at time $t = 0$. Let $y_\pi(t)$ denote the number 
of media objects processed under the speed profile $\pi$ during the time interval $[0, t]$.  
Given the input and playback buffer sizes $b$ and $B$, respectively, the cumulative cycle 
requirement $\Gamma(k)$, the cumulative bit length $\alpha(k)$, and the bit length $U$ of a decoded 
media object, what is the speed profile $\pi$ that achieves maximal energy savings while 
satisfying: 1) the playback buffer never underflows or overflows; and 2) the input buffer 
never underflows or overflows? That is: 
$\min_{\pi} \sum_{i=1}^{n} P(\omega_i) \cdot (t_i - t_{i-1})$, where $P(\cdot)$ is the 
convex function describing the power-speed relationship                        (3.2.1) 
                                   s.t.       
$0 \le r \cdot t - \alpha(y_\pi(t)) \le b$, $\forall t > 0$                                 (3.2.2) 
and $0 \le (y_\pi(t) - C \cdot (t - t_d)) \cdot U \le B$, $\forall t > 0$, where $t_d$ is the playback delay    (3.2.3) 
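The two buffer constraints can be evaluated at a single time instant as follows. This is a minimal sketch rather than the thesis implementation: the function name `buffers_ok`, the sample numbers, and the rule that nothing is played back before $t_d$ are our assumptions.

```python
def buffers_ok(t, y, alpha, r, C, U, b, B, t_d):
    """Check the input and playback buffer constraints at one time instant t.

    y     : number of media objects decoded by time t, i.e. y_pi(t)
    alpha : list with alpha[k] = total bits of objects 1..k (alpha[0] = 0)
    """
    input_level = r * t - alpha[y]        # bits waiting in the input buffer
    played = max(0.0, C * (t - t_d))      # assumption: nothing is played before t_d
    playback_level = (y - played) * U     # decoded bits waiting for playback
    return 0 <= input_level <= b and 0 <= playback_level <= B

alpha = [0, 1200, 1500, 1950, 2850]   # hypothetical cumulative bit lengths
ok = buffers_ok(t=2, y=3, alpha=alpha, r=1000, C=1, U=8000, b=100, B=20000, t_d=1)
bad = buffers_ok(t=2, y=4, alpha=alpha, r=1000, C=1, U=8000, b=100, B=20000, t_d=1)
print(ok, bad)  # True False
```

In the second call the decoder would have consumed 2850 bits although only 2000 have arrived, so the input buffer underflows.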
3.3 Energy Optimization  Techniques 
Our solution to the problem (3.2.1) consists of three parts. The first part (section 3.3.1) 
establishes the relationships between the constraints of the buffers and the bounds of the 
processor speed. These relationships form the basis of the remaining two parts. As defined 
in 3.2, buffer sizes are the primary constraints in our scheme, and we compute an optimal 
speed profile subject to the given buffers. Thus, before generating the speed profile for 
a given media bitstream, we need to estimate what buffer sizes are appropriate. The 
second part (section 3.3.2) addresses this problem. As we will see later, two parameters, the 
sizes of the input and playback buffers, determine the existence of a feasible speed 
profile and the energy efficiency of the scheme. Finally, given the 
parameters estimated by the second part, the third part (section 3.3.3) computes the optimal 
speed profile. 
3.3.1 Bounds on the Processor Speed 
Given the cumulative bit length function $\alpha(k)$, the cumulative cycle requirement 
function $\Gamma(k)$, and the buffer sizes $b$ and $B$, we can compute the upper and lower 
bounds of the processor speed as follows. 
From (3.2.2) we have 
$r \cdot t - b \le \alpha(y_\pi(t)) \le r \cdot t$                                          (3.3.1) 
For the function $\alpha(k)$, we compute its inverse function $\alpha^{-1}(n)$, which returns the integer 
$k$ such that the sequence of media objects $[1, k]$ has a total bit length of $n$. Since $\alpha(k)$ 
describes a cumulative process, both $\alpha(k)$ and $\alpha^{-1}(n)$ are monotonically increasing 
functions. Applying the inverse function $\alpha^{-1}(\cdot)$ to (3.3.1), we have:  
$\alpha^{-1}(r \cdot t - b) \le y_\pi(t) \le \alpha^{-1}(r \cdot t)$                                   (3.3.2) 
Applying the cumulative cycle requirement function $\Gamma(\cdot)$, which is also monotonically 
increasing, to (3.3.2) gives: 
$\Gamma(\alpha^{-1}(r \cdot t - b)) \le \Gamma(y_\pi(t)) \le \Gamma(\alpha^{-1}(r \cdot t))$                           (3.3.3) 
$\Gamma(y_\pi(t))$ is the exact cumulative cycle requirement function of the desired 
speed profile. So (3.3.3) forms the upper and lower bounds on the processor speed 
under the constraint of the input buffer. 
Similarly, we can derive the upper and lower bounds of the processor speed 
under the constraint of the playback buffer according to (3.2.3):  
$\Gamma(C \cdot (t - t_d)) \le \Gamma(y_\pi(t)) \le \Gamma(C \cdot (t + T_p - t_d))$, where $T_p = \frac{B}{C \cdot U}$         (3.3.4) 
We then combine (3.3.3) and (3.3.4) to form the global upper and lower 
bounds of the processor speed: 
$\max\left(\Gamma(\alpha^{-1}(r \cdot t - b)),\ \Gamma(C \cdot (t - t_d))\right) \le \Gamma(y_\pi(t)) \le \min\left(\Gamma(\alpha^{-1}(r \cdot t)),\ \Gamma(C \cdot (t + T_p - t_d))\right)$                      (3.3.5) 
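On a discrete trace, the global bounds on the cumulative workload can be evaluated with binary search over the cumulative bit lengths. The sketch below is one possible discretization and not the thesis implementation: rounding the lower bounds up and the upper bounds down, clamping object counts to the stream length, and all function names and sample values are our assumptions.

```python
import math
from bisect import bisect_left, bisect_right

def count_floor(alpha, bits):
    """Largest k with alpha[k] <= bits (0 when bits is negative)."""
    return max(0, bisect_right(alpha, bits) - 1)

def count_ceil(alpha, bits):
    """Smallest k with alpha[k] >= bits, capped at the last object."""
    return min(bisect_left(alpha, bits), len(alpha) - 1)

def clamp(k, n):
    """Restrict an object count to the valid range 0..n."""
    return max(0, min(n, k))

def global_bounds(t, alpha, gamma, r, C, b, T_p, t_d):
    """Return (lower, upper) bounds on the cumulative cycles completed by time t."""
    n = len(alpha) - 1
    lower_in = count_ceil(alpha, r * t - b)               # input buffer must not overflow
    upper_in = count_floor(alpha, r * t)                  # input buffer must not underflow
    lower_pl = clamp(math.ceil(C * (t - t_d)), n)         # playback buffer must not underflow
    upper_pl = clamp(math.floor(C * (t + T_p - t_d)), n)  # playback buffer must not overflow
    lower = max(gamma[lower_in], gamma[lower_pl])
    upper = min(gamma[upper_in], gamma[upper_pl])
    return lower, upper

alpha = [0, 1200, 1500, 1950, 2850]    # hypothetical cumulative bits
gamma = [0, 5000, 6500, 8500, 12700]   # hypothetical cumulative cycles
bounds = global_bounds(2, alpha, gamma, r=1000, C=1, b=800, T_p=2, t_d=1)
print(bounds)  # (5000, 8500)
```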
3.3.2 Estimation of the Input Buffer and the Playback Buffer  
The smoothing effect is closely related to the buffer size: a larger buffer yields better 
smoothing performance. On the other hand, since buffers are a shared resource and a 
significant source of energy consumption, it is desirable to reduce the buffer space 
allocated to the application. Estimating the input and playback buffers 
therefore becomes an important problem: we need to compute appropriate buffer sizes 
that keep the energy consumption resulting from the fluctuations below some 
threshold.  
Our bitstream analysis framework provides insight into the buffering mechanism. 
For the sake of clarity, we denote the area enclosed by (3.3.3) as $S_{in}$, the area enclosed by 
(3.3.4) as $S_{pl}$, and the area enclosed by (3.3.5) as $S_{gb}$. In other words, $S_{in}$ ($S_{pl}$) 
represents the range of processor speeds that will not lead to underflow or overflow of the 
input buffer (playback buffer), and $S_{gb}$ is the range of processor speeds allowed by both 
buffers. As shown in (3.3.3), $S_{in}$ has one tunable parameter, the input buffer size $b$: 
a larger input buffer increases the area of $S_{in}$. From (3.3.4), $S_{pl}$ is controlled by 
two parameters: the playback buffer size $B$, which controls its area, and the playback 
delay $t_d$, which controls its position. According to (3.3.5), $S_{gb}$ can be geometrically 
interpreted as the intersection of $S_{in}$ and $S_{pl}$. A larger $S_{gb}$ is desirable, since it leaves 
more room for the smoothing operation. According to the above analysis, we can shape $S_{gb}$ by 
tuning the three parameters $b$, $B$ and $t_d$. The effects of $b$ and $B$ are straightforward. 
The relative position of $S_{in}$ and $S_{pl}$ also plays an important role in 
shaping $S_{gb}$: when the upper bound of the playback buffer aligns with the upper bound of 
the input buffer, $S_{gb}$ yields the maximal space for the smoothing algorithm with minimal 
playback delay. 
Our estimation algorithm consists of two parts. The first part computes the buffer sizes 
of the feasibility conditions, which refer to the minimal buffer requirements that guarantee 
the existence of a feasible speed profile. This is the basic requirement for a media 
decoding process. When the buffer sizes of the feasibility conditions cannot provide 
a satisfactory smoothing effect, we need to invoke the second part. The principle of the 
second part is straightforward. It increases the buffer sizes in fine steps, and then 
computes the speed profile for the given buffers using the optimal speed profile algorithm 
(details can be found in 3.3.3). From the speed profile, we can compute the energy 
consumption level resulting from the fluctuations. This process repeats until the energy 
consumption resulting from the fluctuations is below some threshold.  
3.3.2.1 Feasibility conditions  
To guarantee that the global bounds in (3.3.5) admit feasible speed profiles, $b$, $B$ and 
$t_d$ should satisfy certain conditions. We call these conditions feasibility conditions. 
They are subject to the following two constraints: the lower bound of the input buffer 
should be less than the upper bound of the playback buffer (3.3.6), and the lower bound of 
the playback buffer should be less than the upper bound of the input buffer (3.3.7). 
$\Gamma(\alpha^{-1}(r \cdot t - b)) \le \Gamma(y_\pi(t)) \le \Gamma(C \cdot (t + T_p - t_d))$                           (3.3.6) 
$\Gamma(C \cdot (t - t_d)) \le \Gamma(y_\pi(t)) \le \Gamma(\alpha^{-1}(r \cdot t))$                           (3.3.7) 
  
From (3.3.6) and (3.3.7), we can derive feasibility conditions for $B$ and $b$. Let 
$\Delta = T_p - t_d$, which indicates the position of the upper bound of the playback buffer, and 
let $\Delta_0$ denote the desirable value at which the upper bound of the playback buffer aligns 
with the upper bound of the input buffer. 
One possible way to calculate $\Delta_0$ is given by (3.3.8): 
$\Delta_0 = \arg\min_{\Delta \ge 0} \sum_{t} \left( \Gamma(C \cdot (t + \Delta)) - \Gamma(\alpha^{-1}(r \cdot t)) \right)$                            (3.3.8) 
From (3.3.6) and (3.3.7) we have: 
$b \ge r \cdot t - \alpha(C \cdot (t + \Delta_0)), \quad t \ge 0$ 
$B \ge C \cdot U \cdot (t + \Delta_0) - \alpha^{-1}(r \cdot t) \cdot U, \quad t \ge 0$                             (3.3.9) 
Finally we can calculate the playback delay as:                               
$t_d = T_p - \Delta_0$, where $\Delta_0$ is given by (3.3.8) and $T_p = \frac{B}{C \cdot U}$                   (3.3.10) 
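The feasibility conditions can be turned into a small search for the smallest buffer sizes on a sampled time grid. The sketch below follows directly from requiring the lower bound of each buffer to stay below the upper bound of the other, with $\Delta = T_p - t_d$ passed in as a parameter. It is only an illustration: the function name `feasibility_sizes`, the integer time grid, the truncation of object counts, and the sample values are our assumptions.

```python
from bisect import bisect_right

def feasibility_sizes(alpha, r, C, U, delta, times):
    """Smallest input buffer b (bits) and playback buffer B (bits) over the sampled times."""
    n = len(alpha) - 1
    b_min, B_min = 0, 0
    for t in times:
        allowed = max(0, min(n, int(C * (t + delta))))    # playback upper bound, in objects
        arrived = max(0, bisect_right(alpha, r * t) - 1)  # objects fully received so far
        b_min = max(b_min, r * t - alpha[allowed])        # input lower bound vs playback upper bound
        B_min = max(B_min, (allowed - arrived) * U)       # playback lower bound vs input upper bound
    return b_min, B_min

alpha = [0, 1200, 1500, 1950, 2850]   # hypothetical cumulative bits
sizes = feasibility_sizes(alpha, r=1000, C=1, U=8000, delta=1, times=[0, 1, 2])
print(sizes)  # (50, 16000)
```

The sampled times should stay within the duration of the stream; beyond its end the arrival term keeps growing while the cumulative bit length is capped.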
3.3.3 The Optimal Speed Profile Algorithm  
Given the global bounds of the processor speed derived in 3.3.1, with the buffer 
parameters estimated in 3.3.2, we have developed the optimal speed profile algorithm 
to find a feasible speed profile which is as smooth as possible.  
To reduce the fluctuation level of the generated speed profile, the key idea behind the 
proposed algorithm is to keep the speed profile close to the mean value of the 
processing workload. This principle has two implications. First, when the average 
workload is feasible for a given segment, we use it as the speed profile of the segment, 
since its fluctuation level is minimal. Second, when the processor speed must be 
changed to ensure feasibility, we change the processor speed based on the points of 
largest deviation from the average workload, since it is the closest feasible speed. 
 
Figure 3.2  Illustration of the optimal speed profile algorithm (curves: global lower bound, global upper bound) 
We illustrate the critical part of the proposed algorithm in Figure 3.2: given the global 
bounds, our algorithm is required to find a smooth speed profile over the period [O, A]. 
We first construct the straight line OA, since it achieves the maximal energy saving. However, 
it is infeasible since it violates the global bounds near points B and C. Among the 
violated bounds, we identify point B as having the largest deviation from line OA. The 
speed profile is then split into two parts, line OB and line BA, to satisfy the speed bounds at 
B. The new speed profile may violate the global bounds as well, as at point C in Figure 3.2. 
We then perform the same splitting process, resulting in lines OC and CB. The process 
continues iteratively until all violations are eliminated.  
We give the detailed description of the proposed algorithm in Figure 3.3.  
 
 
INPUT:  B_max(·): global upper bound of the speed 
        B_min(·): global lower bound of the speed 
OUTPUT: Ω: set of speed profile triples (t_s, t_e, ω), meaning 
        processor speed ω over period [t_s, t_e] 
FUNCTIONS: 
    RecursiveSmoothing(t_s, w_s, t_e, w_e): finds a smoothing speed profile 
        from time t_s with accomplished workload w_s to time t_e with 
        workload w_e 
    max_w(s): the maximal workload in sequence s 
    max_t(s): time index of the maximal workload in s 
BEGIN 
    Ω = RecursiveSmoothing(t_in, 0, max_t(B_min), max_w(B_min)) 
    Ω ← Ω + (0, t_in, 0) 
END 
 
FUNCTION RecursiveSmoothing(t_s, w_s, t_e, w_e) 
    Ψ_max ← ∅,  Ψ_min ← ∅,  t ← t_s 
    WHILE t ≤ t_e 
        d_max ← B_max(t) − ((t − t_s)(w_e − w_s)/(t_e − t_s) + w_s) 
        d_min ← B_min(t) − ((t − t_s)(w_e − w_s)/(t_e − t_s) + w_s) 
        IF d_max < 0 
            Ψ_max ← Ψ_max ∪ (t, d_max) 
        ELSEIF d_min > 0 
            Ψ_min ← Ψ_min ∪ (t, d_min) 
        ENDIF 
        t ← t + 1 
    ENDWHILE 
    IF Ψ_max = ∅ AND Ψ_min = ∅ 
        Ω ← (t_s, t_e, (w_e − w_s)/(t_e − t_s)) 
    ELSE 
        IF max_w(Ψ_max) > max_w(Ψ_min) 
            t_m = max_t(Ψ_max),  w_m = B_max(t_m) 
        ELSE 
            t_m = max_t(Ψ_min),  w_m = B_min(t_m) 
        ENDIF 
        Ω_1 = RecursiveSmoothing(t_s, w_s, t_m, w_m) 
        Ω_2 = RecursiveSmoothing(t_m, w_m, t_e, w_e) 
        Ω ← Ω_1 + Ω_2 
    ENDIF 
    RETURN Ω 
 
Figure 3.3  The Optimal Speed Profile Algorithm  
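The recursive splitting can be sketched in Python as follows. This is an illustrative reading of the procedure rather than the thesis code: the bounds are given as lists indexed by integer time, deviations are compared by magnitude, the segment endpoints are assumed to lie within the bounds and are therefore not re-checked, and the sample bounds are hypothetical.

```python
def recursive_smoothing(lower, upper, ts, ws, te, we):
    """Return (t_start, t_end, speed) segments joining (ts, ws) to (te, we) within the bounds."""
    slope = (we - ws) / (te - ts)
    worst_t, worst_w, worst_dev = None, None, 0.0
    for t in range(ts + 1, te):                       # interior points only
        line = ws + slope * (t - ts)                  # workload on the straight line at time t
        if line > upper[t] and line - upper[t] > worst_dev:
            worst_t, worst_w, worst_dev = t, upper[t], line - upper[t]
        elif line < lower[t] and lower[t] - line > worst_dev:
            worst_t, worst_w, worst_dev = t, lower[t], lower[t] - line
    if worst_t is None:                               # the straight line is feasible
        return [(ts, te, slope)]
    # Split at the point of largest violation and smooth both halves.
    return (recursive_smoothing(lower, upper, ts, ws, worst_t, worst_w)
            + recursive_smoothing(lower, upper, worst_t, worst_w, te, we))

lower = [0, 0, 0, 0, 2, 5, 12]     # hypothetical global lower bound (cumulative cycles)
upper = [0, 3, 4, 3, 7, 10, 12]    # hypothetical global upper bound (cumulative cycles)
profile = recursive_smoothing(lower, upper, 0, 0, 6, 12)
print(profile)  # [(0, 3, 1.0), (3, 6, 3.0)]
```

In this example the straight line from (0, 0) to (6, 12) exceeds the upper bound most at t = 3, so the profile is split there into two constant-speed segments.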
Concerning the performance of the optimal speed profile algorithm, we have the 
following theorem. 
THEOREM 3.3.1. Given the same configurations of the algorithm, the speed profile 
generated by the optimal speed profile algorithm achieves the minimal energy 
consumption among all feasible speed profiles. 
PROOF: see section 3.5. 
3.4 Experimental Results 
3.4.1 Experimental Results for Audio 
In this section, we present experimental results concerning audio. We used five 
musical programs selected from pop music MP3 clips for evaluation. These clips 
include light music and female/male solos, and cover several common rhythms. They are 
all in joint stereo mode, with a sampling rate of 44.1 kHz and a bitrate of 128 kbits/s. We 
chose the granule as the basic media object. We obtained the computational workload of each 
granule by simulation based on the ARM architecture [114]. As 
pointed out in section 1.1.1, the dynamic power consumption of the processor is 
proportional to the cube of its speed. This enables us to estimate the dynamic energy 
consumption ratios from speed profiles. 
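Under the cubic power model, an energy ratio between two speed profiles can be estimated without knowing the proportionality constant, since it cancels out. The sketch below uses hypothetical profiles given as (duration, speed) pairs; the function name `dynamic_energy` is ours.

```python
def dynamic_energy(profile):
    """Relative dynamic energy of a profile of (duration, speed) pairs, with power = speed**3."""
    return sum(duration * speed ** 3 for duration, speed in profile)

smooth = [(2, 3)]            # speed 3 for 2 time units
bursty = [(1, 2), (1, 4)]    # same total work (6), larger fluctuation
ratio = dynamic_energy(bursty) / dynamic_energy(smooth)
print(round(ratio, 3))  # 1.333
```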
We implemented our scheme based on the minimal requirement of the input and 
playback buffers. To this end, we computed the speed profile as follows. Given the 
workload trace of an audio clip, we first computed the feasibility sizes of the input and 
playback buffers according to (3.3.9). We then obtained the playback delay from 
(3.3.10). Finally, we computed the speed profile using the optimal speed profile (OSP) 
algorithm, based on the obtained buffer sizes and playback delay.  
The baseline is based on the panic scheme in [75], which is an effective buffering 
scheme for generic media bitstreams. In our experiments, for the baseline, we 
performed the buffering on an input buffer of the same size as in our scheme. We then 
compared the energy consumption and playback buffer requirements, which are listed 
in Table 3.1.  
Clip index    1      2      3      4      5 
IB (Byte)     573    489    602    627    522 
PB (KB)       6.92   6.21   7.12   7.26   6.53 
EnR           1.18   1.12   1.19   1.19   1.16 
PBR           1.12   1.11   1.16   1.20   1.11 
Table 3.1  Experimental results on energy consumption and buffer requirements 
for audio bitstream.  IB and PB: input buffer size and playback buffer size of the 
proposed scheme; EnR and PBR: energy consumption ratio and playback buffer 
requirement of the baseline over the proposed scheme, respectively.  
From Table 3.1, compared to the baseline, our scheme reduces energy consumption by 
11% to 16% and the playback buffer requirement by 10% to 17%. These results 
show the effectiveness of our proposed scheme. 
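The quoted percentage ranges appear consistent with converting each baseline-over-proposed ratio R in Table 3.1 into a saving of 1 - 1/R. This conversion is our reading of the table, not a formula stated in the text; the sketch below applies it to the tabulated values.

```python
def saving_percent(ratio):
    """Saving of the proposed scheme relative to the baseline, given baseline/proposed ratio."""
    return (1 - 1 / ratio) * 100

enr = [1.18, 1.12, 1.19, 1.19, 1.16]   # EnR row of Table 3.1
pbr = [1.12, 1.11, 1.16, 1.20, 1.11]   # PBR row of Table 3.1

energy_savings = [saving_percent(x) for x in enr]
buffer_savings = [saving_percent(x) for x in pbr]
print(round(min(energy_savings)), round(max(energy_savings)))  # 11 16
print(round(min(buffer_savings)), round(max(buffer_savings)))  # 10 17
```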
3.4.2 Experimental Results for Video 
The proposed DVS scheme is a generic solution, capable of working with audio or 
video applications. Since video applications usually exhibit higher fluctuation levels 
than audio ones and are therefore more challenging for DVS, we next present 
experimental results with video to demonstrate the effectiveness of our scheme.  
In this section, we present two kinds of experimental results concerning the proposed 
scheme. The first is a comparison with representative buffer based speed control 
schemes, which is used to evaluate the effectiveness of the proposed scheme. The 
second is a comparison with the Theoretical Minimal Energy Consumption of the 
workload trace, which provides insights into the relationship between the buffer sizes 
and the energy savings.   
We carried out experiments on six video clips selected from the MPEG test dataset: 
Akiyo, Highway, Coastguard, Container, Hall and Walk. All of them have an identical 
resolution of 352*288 and frame rate of 29.97 fps. They are encoded with a standard 
MPEG-2 encoder in progressive mode, CBR, bitrate 1150 Kb/s, frame pattern: I B B P 
B B P. We employed The Core Pocket Media Player (TCPMP) [113] as the decoder 
application, since it is an open-source media player optimized for portable devices.  
To obtain an accurate estimate of the required buffer sizes and of the latency introduced by 
the buffering mechanism, it is necessary to run the algorithm at a fine 
granularity. For this reason, we chose the Macro block as the basic media object, in 
line with [71]. We obtained the computational workload of each Macro block by 
simulation based on the ARM architecture [114].  
We based our energy estimation on the XScale, which is a popular processor with discrete 
speed levels for portable devices. We used dithering techniques to convert the 
computed continuous speeds to discrete levels [76]. The measured power 
consumption of each XScale speed level for compression applications is given in [27].  
With these values, we can compute the total energy consumption $E_{tot}$ as in (3.4.1), where $N$ is 
the number of speed changes, $P(\omega_i)$ is the power consumption at speed $\omega_i$, and $\Delta t_i$ is 
the time spent at speed $\omega_i$: 
$E_{tot} = \sum_{i=1}^{N} P(\omega_i) \cdot \Delta t_i$                                                  (3.4.1) 
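One common form of dithering splits each interval between the two neighbouring discrete levels so that the time-weighted average speed equals the continuous target. The sketch below combines this with the energy sum. Whether [76] uses exactly this form is an assumption on our part, and the level table holds hypothetical (speed, power) pairs, not measured processor data.

```python
# Hypothetical (speed in MHz, power in W) pairs; NOT measured values for any real processor.
LEVELS = [(100, 0.1), (200, 0.3), (400, 1.0)]

def dither(target, duration, levels):
    """Return [(level_index, time)] whose time-weighted average speed equals target."""
    for i in range(len(levels) - 1):
        lo, hi = levels[i][0], levels[i + 1][0]
        if lo <= target <= hi:
            t_hi = duration * (target - lo) / (hi - lo)   # time spent at the higher level
            return [(i, duration - t_hi), (i + 1, t_hi)]
    raise ValueError("target speed outside the supported range")

def total_energy(profile, levels):
    """Sum power times time over all dithered pieces of a (target_speed, duration) profile."""
    energy = 0.0
    for target, duration in profile:
        for index, time in dither(target, duration, levels):
            energy += levels[index][1] * time
    return energy

profile = [(150, 2.0), (300, 1.0)]   # hypothetical continuous speed profile
e_tot = total_energy(profile, LEVELS)
print(round(e_tot, 2))  # 1.05
```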
We implemented our scheme based on the minimal requirement of the input and 
playback buffers. Towards this, we computed the speed profile as follows. Given the 
workload trace of a video clip, we first computed the feasibility sizes of the input and 
playback buffers according to (3.3.9). To fully utilize the buffers, we used the larger 
one of these two feasibility sizes for both input and playback buffers. We then 
obtained the playback delay from (3.3.10). Finally we computed the speed profile 
using the OSP algorithm, based on the obtained buffer sizes and playback delay. The 
resultant configurations for the six video clips are listed in Table 3.2. We made use of 
these configurations to conduct the experiments in section 3.4.2.1 and 3.4.2.2. 







FI(Macroblock) 670 1320 866 881 1249 1367 
FP(Macroblock) 1052 1028 1035 1029 1062 1401 
IB(KB) 12.87 15.97 12.76 12.54 15.37 16.90 
PB(KB) 403.9 506.9 397.4 395.1 479.6 538.0 
Delay(s) 0.088 0.11 0.075 0.076 0.10 0.12 
Table 3.2  Configurations of decoding the six video clips.  FI and FP: feasibility 
condition for input buffer and playback buffer respectively, both of them 
measured in Macro blocks, the value in bold is used for both input buffer and 
playback buffer to estimate the other items; IB and PB: input buffer size and 
playback buffer size, measured in Kbytes, both of them derived from the 
max(FI,FP); Delay: introduced delay by buffering in sec. 
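The averages quoted elsewhere in this chapter (1181 Macro blocks, 14.40 Kbytes of input buffer and 453.48 Kbytes of playback buffer) can be reproduced from the rows of Table 3.2. The sketch below only re-derives these averages from the tabulated values; the variable names are ours.

```python
fi = [670, 1320, 866, 881, 1249, 1367]                # FI row (Macro blocks)
fp = [1052, 1028, 1035, 1029, 1062, 1401]             # FP row (Macro blocks)
ib_kb = [12.87, 15.97, 12.76, 12.54, 15.37, 16.90]    # IB row (Kbytes)
pb_kb = [403.9, 506.9, 397.4, 395.1, 479.6, 538.0]    # PB row (Kbytes)

used = [max(a, b) for a, b in zip(fi, fp)]   # size used for both buffers, per clip
avg_used = sum(used) / len(used)
avg_ib = sum(ib_kb) / len(ib_kb)
avg_pb = sum(pb_kb) / len(pb_kb)
print(avg_used, round(avg_ib, 2), round(avg_pb, 2))  # 1181.0 14.4 453.48
```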
3.4.2.1 Comparisons with Existing Buffer Based Speed Control Schemes for 
Video 
We compared our scheme with two representative existing buffer based speed control 
schemes in two ways: 1) comparison of energy consumption at the same buffer level; and 2) 
comparison of maximal buffer occupancy at the same energy consumption level. 
Baseline 1 is based on [54], which employs buffers to reduce the idle processor periods 
resulting from the large variation of video decoding time. To fully exploit its potential, 
we assume that: 1) accurate worst case information is available; and 2) input data are 
always available. For fairness, we set its output buffer size to the sum of the input and 
output buffer sizes in our scheme and execute both algorithms at MB level. The 
energy consumption ratios between our scheme and baseline 1 are shown in 
Figure 3.4. Compared to baseline 1, our scheme achieves 28.3% energy savings 
on average with the same buffer sizes.  
Baseline 2 is based on feedback control with a PI controller [75], which scales the 
processor speed by monitoring the filling level of the playback buffer. To keep its 
energy consumption close to the optimal energy consumption at MB level, we scale 
the controller parameters from frame level by the number of MBs in a frame: 
the proportional factor and integral factor (0.01 and 0.0145, respectively, according to 
the author's suggestion) are scaled down by 396, while the dead-zone, forward window and 
feedback window are scaled up by 396. The energy consumption ratio and the 
required buffer sizes are listed in Table 3.3.  
From Table 3.3, we can see that at a similar energy consumption level, our method 
achieves a 51.2% reduction in buffer size on average. 
Baselines 1 and 2 represent recent advances in buffer based DVS techniques. The 
above experimental results show that the proposed scheme significantly improves the 
performance over existing DVS schemes. The improvements can be interpreted from 
two aspects. First, due to the elimination of QoS interference, the energy efficiency of 
the smoothing mechanism has been increased by 28.3%. Second, with accurate workload 
estimation, only about half of the buffer size is needed to provide the guaranteed QoS. 
They demonstrate the superiority of the media server supported DVS scheme and 
 
Figure 3.4 Normalized energy consumption between our scheme and the baseline (legend: Baseline 1, Our scheme) 










NEC 1.04 1.06 1.01 1.02 1.03 1.03 
BUF 1999 2436 2021 2047 2171 3565 
RED 0.474 0.578 0.488 0.497 0.425 0.607 
 
Table 3.3 Comparisons between our scheme and the baseline 2. NEC: normalized 
energy consumption of the baseline 2 over our scheme; BUF: maximal buffer 
occupancy of the baseline 2 in terms of Macro blocks; RED: reduced buffer size 
ratio achieved by our scheme (referred to Table 3.2). 
3.4.2.2 A Comparison with the Theoretical Minimal Energy Consumption 
Due to the limited sizes of the buffers, our scheme consumes more energy than 
the Theoretical Minimal Energy Consumption (TMEC), which is the energy 
$= \sum$ , $\{\omega(t)\}_{0 \le t \le T}$ is the required speed for each media object. 
Intuitively, we can increase the buffer sizes to further reduce the energy consumption. 
It is then important to investigate the following questions: 1) how much can we further 
improve the energy savings? 2) what is the relationship between the 
increase in buffer size and the improvement in performance? 3) what are the 
appropriate buffer sizes for a given media clip?  
To answer these questions, for each test video clip, we first compared the energy 
consumption of the TMEC with that of our scheme using the minimal requirement of the input 
and playback buffers. The results are summarized in Table 3.4. Then we increased the 
buffer sizes and computed the corresponding energy consumption, as shown in 
Figure 3.5, which gives the relationship between the increase in buffer size and the 
improvement in performance. 










TMEC 1.012 1.022 1.016 1.016 1.012 1.018 
 
Table 3.4  Energy consumption ratio between the proposed scheme and the 
TMEC 
These results have two important implications. Using the configurations in Table 3.2, 
namely 14.40 Kbytes of input buffer and 453.48 Kbytes of playback buffer on average, 
the energy consumption of the proposed scheme is very close to the theoretical lower 
bound, with only 1.2% to 2.2% additional overhead. The results suggest that the 
proposed scheme works well with sufficiently small buffers. This also implies 
that increasing the buffer sizes or prolonging the latency will hardly reduce the energy 
consumption further. This conclusion is supported by the results in Figure 3.5, where we 
achieved less than 2% additional energy reduction as we increased the buffer 
sizes by 33% to 50%. These facts show that the buffer sizes of the feasibility conditions are 
appropriate for video decoding applications.  
 
Figure 3.5  Normalized energy consumption with the buffer sizes increased from 
the feasibility condition for the six video clips 
To give an insight into our proposed scheme, we further conducted comparisons with 
baseline 3 (prediction with a moving window) and baseline 4 (averaged speed within the 
given buffer), as shown in Figure 3.6. We believe that two properties of the OSP 
algorithm are responsible for its performance. First, the OSP algorithm exploits global 
information to compute the speed profile, which makes it superior to existing buffer based 
DVS techniques, where processor speeds are scaled based on local information. This is 
shown in Figure 3.6: baseline 3 and baseline 4 introduce unnecessary scaling 
operations, which incur additional energy consumption. Second, the OSP algorithm 
“pushes” the processor speed towards the average speed. As a result, the 
major part of the generated speed profile is very close to the global averaged speed, 
which is clearly shown in Figure 3.6(b). Figure 3.6 also reveals some interesting 
properties of our scheme. 1) The dynamic range required by our scheme is far less 
than the natural dynamic range of the bitstream: from Figure 3.6(a), the dynamic range of the 
actual workload is around 20000 cycles, while the dynamic range of our scheme is around 
10000 cycles. Thus our assumption that the processor can support the required 
dynamic range is reasonable. 2) The buffers of the feasibility condition work well for the 
major part of the whole bitstream. We believe this is because of the large fluctuation 
levels among media objects: the buffer sizes, which just satisfy the feasibility 
condition in the worst case, are sufficient to provide satisfactory performance for the 
whole bitstream.   
 
Figure 3.6 Illustration of the speed profiles based on the clip “Hall”: a)
comparison between the two baseline DVS schemes; in baseline 3, the moving
window size is 32 macroblocks, and in baseline 4, we calculate the speed every 10
macroblocks; b) comparison between GAS (global averaged speed), baseline 4 with
a buffer size of 1029 macroblocks, and our scheme with the configuration in
Table 3.1
 
3.5 Proof of Optimality 
Lemma 1: for a convex function P(·), if  b > a, then we have  
P(b+c)-P(b) > P(a+c)-P(a), c > 0                              (3.5.1) 
 
Consider a period [Ts, Te] in the workload trace of a media bitstream. We define the
instants Ts and Te as the starting point and the ending point of the period, and the
sum of the workload between Ts and Te as the cumulative workload.
Lemma 2: For a given period in the workload trace of a media bitstream, two speed 
profiles have the same average speed between the starting point and ending point of 
this period. 
Lemma 3: For a given period in the workload trace of a media bitstream, the average 
speed profile between the starting point and the ending point incurs the minimal 
energy consumption. 
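Lemma 3 can be illustrated numerically. The sketch below is only an illustration under
stated assumptions: the cubic power model P(s) = s^3 is a common textbook choice for a
convex power function and is not taken from this thesis, and the function names and
the example profile are hypothetical. It compares a varying speed profile with the
constant profile that completes the same cumulative workload in the same period.

```python
def energy(speeds, durations, power=lambda s: s ** 3):
    """Energy of a piecewise-constant speed profile: sum of P(speed) * duration.

    The cubic power model is an assumption; any convex P gives the same ordering.
    """
    return sum(power(s) * t for s, t in zip(speeds, durations))


def averaged_profile(speeds, durations):
    """Constant-speed profile that completes the same workload in the same time."""
    work = sum(s * t for s, t in zip(speeds, durations))
    avg = work / sum(durations)
    return [avg] * len(speeds)


# Hypothetical example: three intervals with different speeds.
varying = [1.0, 3.0, 2.0]
tau = [2.0, 1.0, 1.0]
flat = averaged_profile(varying, tau)

print("varying profile energy:", energy(varying, tau))
print("average profile energy:", energy(flat, tau))
```

Both profiles process the same cumulative workload over the period, but the averaged
profile consumes less energy, which is the behaviour Lemma 3 states for a convex power
function.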

























By adding the above two inequalities, we have:
$P(b) + P(a+c) \le P(b+c) + P(a)$
Then we have
$P(b+c) - P(b) \ge P(a+c) - P(a)$.
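For reference, one standard way to obtain two inequalities of this kind from the
convexity of P is sketched below. It is not necessarily the derivation used in the
original proof; the weight $\lambda$ is introduced here only for illustration.

```latex
% Assumption: a < b and c > 0, so that a < b < b + c and a < a + c < b + c.
% Both b and a + c are convex combinations of the end points a and b + c:
\[
\lambda = \frac{c}{b - a + c} \in (0, 1), \qquad
b = \lambda a + (1 - \lambda)(b + c), \qquad
a + c = (1 - \lambda) a + \lambda (b + c).
\]
% Convexity of P applied to each combination gives two inequalities:
\[
P(b) \le \lambda P(a) + (1 - \lambda) P(b + c), \qquad
P(a + c) \le (1 - \lambda) P(a) + \lambda P(b + c).
\]
% Adding them, the weights on the right-hand side sum to one:
\[
P(b) + P(a + c) \le P(b + c) + P(a)
\quad\Longrightarrow\quad
P(b + c) - P(b) \ge P(a + c) - P(a).
\]
```

If P is strictly convex, both inequalities are strict, which gives the strict form
stated in (3.5.1).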
Since Lemmas 2 and 3 are quite straightforward, we skip their proofs.
We define change points as those instants when the speed of the processor is changed.
For example, points A, B and C are all change points in Figure 3.2. Considering
$S^{*} = \{\omega^{*}(j) \mid 1 \le j \le N\}$, which is the speed profile generated by
the OSP algorithm, all the change points of S* can be divided into two classes: 1)
min-change points, where the processor speed changes are due to the global lower
bound; 2) max-change points, where the processor speed changes are due to the global
upper bound.
Then we can derive that, at a change point of S*, the cumulative workload of any
feasible speed scheme must be not lower than that of S* at a min-change point and not
higher than that of S* at a max-change point.
We can see that the beginning point of the speed scheme is a max-change point and
the ending point of the speed scheme is a min-change point.
Considering a feasible speed profile, its cumulative workload can only intersect that
of S* at the following points: the beginning point, the end point, and some point
between a neighboring pair of a min-change point and a max-change point. We call
these points intersection points. The intersection points divide the whole speed
scheme into segments, and each segment contains only homogeneous change points, i.e.,
change points of a single class.
Next we will show that in each segment, our scheme is the optimal one among all 
feasible speed schemes. Then we can immediately conclude that our scheme is optimal 
for the full speed scheme. 
We first consider a segment which contains the min-change points. We construct a
speed profile $S = \{\omega(n) \mid 1 \le n \le N\}$, subject to: 1) its cumulative
workload intersects that of S* at the beginning and end points of the segment; 2) it
only changes speeds at the change points of S*; 3) within the segment, the cumulative
workload of S is not less than that of S*. It should be noted that S is not
necessarily a feasible speed profile.
In a segment which contains min-change points, S* has an important property: its
speeds are monotonically decreasing. We show this property as follows. According to
the OSP algorithm, S* is generated based on the deviation from the average value of
the speed of some region. As illustrated in Figure 3.7(a), let point A have the
largest deviation from the average value of the segment, let point B have the largest
deviation from the average value of region [O,A], and let points C and D have the
largest deviation from the average value of regions [O,B] and [B,A], respectively. In
our algorithm, we first split the region [O,A] into [O,B] and [B,A], then split [O,B]
and [B,A], which results in the speed profile shown as the solid line in Figure
3.7(a). We denote the cumulative workloads of B, C, and D as x, y, z, the temporal
offsets from point O of B, C, and D as b, c, d, and the average speed over [O,A] as
r. Since B has the largest deviation from the average speed, we have
$x - r \cdot b > y - r \cdot c$ and $x - r \cdot b > z - r \cdot d$. Since c < b < d,
we then have
$\frac{x - y}{b - c} > r > \frac{z - x}{d - b}$.
It is noted that (x-y)/(b-c) is the speed over [C,B] and (z-x)/(d-b) is the
speed over [B,D]. This relationship holds for each splitting operation performed in the
algorithm. This proves the property. In a similar way, we can derive that for a
segment which contains only max-change points, the speeds of S* are monotonically
increasing.
















Figure 3.7 (a) Illustration of the splitting operation; (b) Illustration of the speed
profiles: thin lines stand for speed profile S*, thick lines stand for speed profile S.
 
Based on the construction rules, the cumulative workload of S is not less than that of
S*, as shown in Figure 3.7 (b). We have:
$\sum_{j=1}^{n} \omega(j) \cdot \tau(j) \ge \sum_{j=1}^{n} \omega^{*}(j) \cdot \tau(j)$, $1 \le n \le N$                              (3.5.2)
For region $\tau(j)$, we denote $\Delta\omega(j) = \omega(j) - \omega^{*}(j)$.
From (3.5.2), we have
$\sum_{j=1}^{n} \Delta\omega(j) \cdot \tau(j) \ge 0$,      $1 \le n \le N$                           (3.5.3)





$\Delta\omega^{+}(j) = \omega(j) - \omega^{*}(j)$, if $\omega(j) > \omega^{*}(j)$
$\Delta\omega^{-}(j) = \omega^{*}(j) - \omega(j)$, if $\omega(j) < \omega^{*}(j)$                            (3.5.4)
We denote 
( ) ( ) ( ){ }





G i k k i k k









                                  (3.5.5) 
There must exist a partition R,
( ) ( ) ( ) ( ) ( ) ( )( ) ( )
, , ,
, and , , and  ,
j j k
j
r i i r i r i i G j
R r i
j k j k G N j k G i
τ +
− −
 ⊂ = ∅ ∈ 
=  
≠ ∈ ∉  
I
 
For any ( )j G j−∈ , we have:  









∆ ⋅ ≤ ∆ ⋅∑                          (3.5.6) 
Next we prove it by contradiction.
We suppose that some $\Delta\omega^{-}(k)$ cannot be covered. Then for k, we have
$\sum_{j=1}^{k} \Delta\omega(j) \cdot \tau(j) < 0$, which contradicts (3.5.3).
As the power is a convex function of the speed of the processor, substituting Lemma 1
into (3.5.6), we have:
$\sum_{i \in G^{+}(i)} \big( P(\omega(i)) - P(\omega^{*}(i)) \big) \cdot \tau(i) > \sum_{j \in G^{-}(j)} \big( P(\omega^{*}(j)) - P(\omega(j)) \big) \cdot \tau(j)$                 (3.5.7)
This shows that the constructed speed profile S will have larger energy consumption 
than the speed profile S* for the given segment.  
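The conclusion for a min-change segment can be checked on a small numerical example.
The sketch below is an illustration only: the speed values are hypothetical, the cubic
power model is an assumed stand-in for a convex power function, and the names are not
taken from the thesis. The profile s_star is monotonically decreasing, and s_alt
satisfies the three construction rules (same end points, same change points,
cumulative workload not less than that of s_star).

```python
def cumulative_workload(speeds, durations):
    """Running sum of speed * duration after each region."""
    total, out = 0.0, []
    for s, t in zip(speeds, durations):
        total += s * t
        out.append(total)
    return out


def profile_energy(speeds, durations, power=lambda s: s ** 3):
    """Energy under an assumed convex power model (cubic, for illustration)."""
    return sum(power(s) * t for s, t in zip(speeds, durations))


tau = [1.0, 1.0, 1.0]        # region lengths (hypothetical)
s_star = [3.0, 2.0, 1.0]     # monotonically decreasing, as in a min-change segment
s_alt = [3.5, 2.0, 0.5]      # constructed profile S: runs ahead, then slows down

cum_star = cumulative_workload(s_star, tau)
cum_alt = cumulative_workload(s_alt, tau)

# Construction rule 3: S never falls below S*; rule 1: both meet at the segment end.
dominates = all(a >= b for a, b in zip(cum_alt, cum_star))
same_end = abs(cum_alt[-1] - cum_star[-1]) < 1e-9

print("S dominates S*:", dominates, "| same end point:", same_end)
print("energy of S*:", profile_energy(s_star, tau))
print("energy of S :", profile_energy(s_alt, tau))
```

In this example the surplus of S appears in the high-speed region and the deficit in
the low-speed region, so by Lemma 1 the extra energy spent exceeds the energy saved.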
Now we consider the segment which only contains max-change points. We construct a 
speed profile S, subject to: 1) its cumulative workload intersects that of S* at the 
beginning and end points of the segment; 2) it only changes speeds at the change 
points of S*; 3) within the segment, the cumulative workload of S is not greater than
that of S*. Thus we have:
$\sum_{j=1}^{n} \omega(j) \cdot \tau(j) \le \sum_{j=1}^{n} \omega^{*}(j) \cdot \tau(j)$,     $1 \le n \le N$                           (3.5.8)
On the other hand, from Lemma 2, we have:
$\sum_{j=1}^{n} \omega(j) \cdot \tau(j) + \sum_{j=n+1}^{N} \omega(j) \cdot \tau(j) = \sum_{j=1}^{n} \omega^{*}(j) \cdot \tau(j) + \sum_{j=n+1}^{N} \omega^{*}(j) \cdot \tau(j)$           (3.5.9)
We consider the situation from the end of the segment.
$\sum_{j=N}^{n} \omega(j) \cdot \tau(j) \ge \sum_{j=N}^{n} \omega^{*}(j) \cdot \tau(j)$,    $1 \le n \le N$                          (3.5.10)
As shown above, the speeds are monotonically increasing for such a segment. Thus we
have the same result for a segment which contains only max-change points.
Finally, we consider a feasible speed profile S′. The cumulative workload of a
feasible speed profile must be not less than that of S* at a min-change point and not
greater than that of S* at a max-change point, and it intersects the cumulative
workload of S* at the intersection points. But it can use arbitrary feasible curves to
connect these points. For S′, we can construct a speed profile S with the same
cumulative workload at the change points and the intersection points, using straight
lines to connect those points. According to Lemma 3, the energy consumption of S′ is
not less than that of S. And we have shown that the energy consumption of S* is not
greater than that of S. So we can conclude that the energy consumption of S* is not
greater than that of any feasible speed profile.
 
Chapter 4  
Frequency Band and Stereo Image based 
Workload Scalable Decoding Scheme 
4.1 Introduction 
In chapter 2, we proposed JEDF, which minimizes the computational workload of audio
decoding without degradation of quality. It is suitable for the normal state of the
voltage scheduler, where its resource requests are always satisfied. However, when
switched into the low power state, the voltage scheduler needs to reduce the
computation resources it allocates in order to lower the energy consumption of the
system. With insufficient computation resources to process the current frame, JEDF, as
well as a standard audio decoder, will fail to work well. Moreover, these decoders do
not allow the user to choose a low power level according to the application scenario.
To address these two problems, we propose a new workload scalable audio decoding
scheme that reduces the workload further than JEDF.
As discussed in section 1.3.3, the solution to the above two problems relies on
overcoming the binary quality mode of current decoders. Towards this, the new
scheme provides multiple decoding levels, each of which is associated with a different
level of power consumption and playback quality. Our scheme is perception-aware, in
the sense that the difference in the perceived quality associated with the different
levels is relatively small, while decoding the same audio clip at a lower playback
quality level leads to significant energy savings. We call the resulting scheme the
frequency Band and Stereo image Scalable (BSS) decoding scheme.
4.1.1 Perception-Awareness in Audio Decoding 
Our workload scalable decoding scheme is mainly motivated by the following 
observations. 
Perceptual characteristics of individual users: Most perceptual audio codecs are 
designed to achieve transparent audio quality at least at high bitrates. The frequency 
range of a high quality audio codec such as MP3 is up to about 20 kHz. However, 
most adults, particularly older ones, can hardly hear frequency components above 16 
kHz. Therefore, it is unnecessary to compute the perceptually irrelevant frequency 
components. Further, within the wide swath of frequencies that most people can hear, 
some bands register more loudly than others [81] [105]. In general, the high frequency 
bands are perceptually less important than the low frequency bands. There is little 
perceptual degradation if we leave some high frequency components un-decoded. A 
standard MP3 decoder will simply decode everything in the bitstream without 
considering the hearing ability of individual users with or without hearing loss. This 
could result in a significant amount of irrelevant computation, thereby wasting battery 
power.  
Listening environment: To evaluate the perceptual quality of any audio codec,
rigorous subjective listening tests are carried out. These tests are usually conducted in
a quiet environment with high quality headphones by expert listeners or panels
without any hearing loss. However, the realistic environments for ordinary users are
usually very different. It is relatively rare for a portable audio player to be used
in a quiet environment, for example in the living room of one’s home. It is far more
common to use portable audio players on the move and in a variety of environments,
such as on a bus, train, or flight, using simple earpieces. These differences have
important implications for the audio quality required. According to our experiments, it
is hard for most users to distinguish between CD and FM quality audio in a noisy
environment; they appear to be more tolerant of small quality degradation in such
environments. The BSS decoding scheme enables the user to change the decoding
profile to adapt to the listening environment, while a standard MP3 decoder cannot.
Service types and signal characteristics: Different applications and signals require 
different bandwidth. For example, a storytelling audio clip requires significantly less 
bandwidth compared to a music clip. The BSS decoding scheme allows the user to 
choose an appropriate decoding profile suitable for the particular service and signal 
type, in the process also prolonging the battery life. 
User preferences associated with battery level: A user might want better playback
quality with a fully charged battery, but may be willing to sacrifice some playback
quality for longer battery life when the battery is running low.
The above observations represent a new opportunity for improving energy efficiency. 
We believe that perception-aware workload scalability is a useful option enabling 
either the voltage scheduler to tune the audio decoding process automatically or the 
user to select manually the tradeoff between playback quality and computational 
workload, as illustrated in Figure 4.1. The selected decoding level determines the 
decoding workload. With the reduced workload, the supply voltage and clock 







































































Figure 4.1  High-level block diagram of the BSS decoding scheme supporting 
voltage scheduler in low power state and the user’s power saving switch.  
 
4.1.2 Perception aware Workload Scalable Processing 
As discussed in section 1.3.1, among various workload reduction techniques, only 
SOPOT and PSR are suitable for workload scalability. Although their workload can be 
adjusted by changing the number of SOPOT terms involved in the computation, 
SOPOT schemes cannot divide the entire spectrum into clearly structured subbands 
with different perceptual relevance. The perceptual relevance of those SOPOT terms 
strongly depends on the audio signal characteristics and may change from frame to 
frame.  
PSR therefore becomes a natural choice for our scheme. PSR enables the division of the
entire frequency band into several subbands according to their perceptual relevance,
and the decoding of only those subbands with the highest perceptual relevance to
satisfy the user’s need, thus scaling the workload accordingly. Our workload scalable
audio decoding scheme consists of two components, namely frequency band scalability
and stereo image scalability.
Although our scheme can be applied to most existing audio formats, we have
implemented our prototype with MP3 as a proof of concept, due to its popularity.
The rest of this chapter is organized as follows. In the next section we describe the
implementation techniques for the proposed scheme, emphasizing the design of multiple
decoding levels, which will facilitate the quality control. In Section 4.3, we develop
an efficient algorithm for the synthesis filterbank in terms of stereo image
scalability. In Section 4.4, we present experimental results of the proposed decoding
scheme.
 
4.2 Frequency  Band and Stereo Image Scalable  
Decoding  
The workload of a standard MP3 decoder is not scalable. The standard decoder parses 
the bitstream, decodes the side information first, and then runs several signal 
processing modules to convert the MP3 bitstream to PCM audio samples, including 
requantization and reordering of the spectral data, joint stereo processing, IMDCT and 
polyphase synthesis filterbank (see Figure 4.2). Two modules which incur the most 
computational workload are IMDCT and polyphase synthesis filterbank. The standard 
decoder processes the entire frequency range of both channels, which corresponds to 
the highest computational workload. 
In our proposed scheme, different decoding levels are associated with different
frequency ranges. When an audio clip is decoded at level 1, only the frequency range
associated with this level is decoded by the processing modules. At higher levels, a
larger frequency range is decoded, and at the highest level the entire frequency range
is decoded, which is equivalent to a standard decoder. The attractiveness of this
scheme stems from the fact that although the computational workload associated with
the decoding process scales roughly linearly with the frequency range involved (see
section 4.2.3), the lower frequency bands are perceptually more relevant than the
higher bands, especially those beyond 16 kHz [50] [105]. Therefore, when a clip is
decoded at a lower level, by sacrificing only a small fraction of the playback
quality, the processor can be run at a much lower clock frequency (and voltage) than
at a higher decoding level.
In the proposed scheme, we first partition the frequency bandwidth of an MP3 decoder
into 4 groups and then combine the frequency bandwidth groups of the different
channels (stereo image) into 5 decoding levels. We will present the details in the
following subsections.
4.2.1 Frequency Bandwidth Scalability 
 
Figure 4.2 Block diagram of the proposed frequency band scalability 
As defined in [57], the full frequency range of MP3 is decomposed at two levels: it is 
first divided into 32 subbands and then each subband is further decomposed into 18 
coefficients. We perform frequency band scalability at the level of subbands. In 
principle, it is possible to decode these subbands independently and this enables us to 
design a scalable decoding scheme with a granularity that corresponds to all the 32 
subbands. However, for the sake of facilitating quality control, we need to partition 
these 32 subbands into a small number of groups, which comply with the user’s 
listening experience. Based on our investigation, those four groups listed in Table 4.1 









Group     Subband index   Frequency range (Hz)   Perceived quality
Group 1   0-7             0 – 5512.5             AM quality
Group 2   0-15            0 – 11025              Near FM quality
Group 3   0-23            0 – 16537.5            Near CD quality
Group 4   0-31            0 – 22050              CD quality
Table 4.1 Four different decoding groups 
In our scheme, group 1 covers the lowest frequency bandwidth (5.5 kHz), which we
define as the base layer. Although the base layer occupies only a quarter of the
total bandwidth and contributes roughly a quarter of the total computational
workload, it covers the most perceptually relevant frequency bands. The playback
quality corresponding to this group is sufficient for services like news and sports
commentary. Group 2 covers a bandwidth of 11 kHz and almost reaches FM radio
quality, which is sufficiently good even for listening to music clips, especially in
noisy environments. Group 3 covers a bandwidth of 16.5 kHz and produces an output that
is very close to CD quality. Finally, Group 4 corresponds to the standard MP3 decoder,
which decodes the full bandwidth of 22 kHz. Groups 1, 2 and 3 process only a part of
the data, whereas Group 4 processes all the data and is therefore computationally more
expensive. According to our experiments, the audio quality levels corresponding to
Groups 3 and 4 are almost indistinguishable in noisy environments, but are associated
with substantially different power consumption levels.
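The frequency ranges in Table 4.1 follow directly from the subband structure. The
short sketch below reproduces them, assuming a 44.1 kHz sampling rate (implied by the
22050 Hz full bandwidth but not stated explicitly here) and 32 uniform subbands; the
constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 44100      # assumption: implied by the 22050 Hz full bandwidth
NUM_SUBBANDS = 32           # uniform subbands of the MP3 polyphase filterbank
SUBBAND_WIDTH_HZ = SAMPLE_RATE_HZ / 2 / NUM_SUBBANDS

# Number of lowest subbands decoded in each group of Table 4.1.
GROUP_SUBBANDS = {1: 8, 2: 16, 3: 24, 4: 32}


def group_bandwidth_hz(group):
    """Upper frequency edge of the band decoded by the given group."""
    return GROUP_SUBBANDS[group] * SUBBAND_WIDTH_HZ


def workload_fraction(group):
    """Rough workload share, using the number of subbands as the estimate."""
    return GROUP_SUBBANDS[group] / NUM_SUBBANDS


for g in sorted(GROUP_SUBBANDS):
    print(f"Group {g}: 0 - {group_bandwidth_hz(g)} Hz, "
          f"workload share ~ {workload_fraction(g):.2f}")
```

Using the subband count as the workload estimate is the approximation discussed in
section 4.2.3, so the printed shares are indicative rather than measured values.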
4.2.2 Stereo Image Scalability  
A large number of MP3 files are encoded in joint/MS stereo mode. This mode
exploits the similarity between the left channel signals and the right channel signals
and encodes middle (M) and side (S) channel data into the MP3 file instead of left and
right channel data. The i-th M coefficient is $M_i = (L_i + R_i)/\sqrt{2}$ and the
i-th S coefficient is $S_i = (L_i - R_i)/\sqrt{2}$, where $L_i$ and $R_i$ are the i-th
coefficients of the left and right channels respectively, 0 ≤ i ≤ 575. Consequently,
the M channel signals contain information from both the left channel and the right
channel, and the S channel signals only provide auxiliary information to give the
stereo effect. In other words, the M and S channels have different perceptual
significance, even though they consume almost the same computation in decoding
operations when an equal number of coefficients is decoded. This property allows us to
decode the M and S channels asymmetrically, so that the computation for the S channel
can be reduced accordingly. Moreover, we use middle data from the left channel to
compensate for the loss in the high frequency bands of the right channel.
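The effect of decoding fewer S subbands than M subbands can be sketched as follows.
This is a simplified illustration, not the decoder implementation of this thesis: it
treats the missing S coefficients as zero when converting M/S back to left/right,
whereas the proposed scheme passes the middle data of the upper subbands on as a
separate block. The function name and parameters are hypothetical; m and s are the
numbers of decoded M and S subbands.

```python
import math

COEFFS_PER_SUBBAND = 18
NUM_SUBBANDS = 32
NUM_COEFFS = COEFFS_PER_SUBBAND * NUM_SUBBANDS   # 576 coefficients per block


def asymmetric_ms_decode(mid, side, m, s):
    """Convert M/S coefficients to L/R using only m M-subbands and s S-subbands.

    Below subband s both channels are reconstructed exactly; between s and m the
    S data is absent, so both channels receive the same (mono) middle signal;
    above m everything is left at zero.
    """
    assert 0 <= s <= m <= NUM_SUBBANDS
    inv_sqrt2 = 1.0 / math.sqrt(2.0)
    left = [0.0] * NUM_COEFFS
    right = [0.0] * NUM_COEFFS
    for i in range(m * COEFFS_PER_SUBBAND):
        if i < s * COEFFS_PER_SUBBAND:
            left[i] = (mid[i] + side[i]) * inv_sqrt2
            right[i] = (mid[i] - side[i]) * inv_sqrt2
        else:
            left[i] = right[i] = mid[i] * inv_sqrt2
    return left, right


# Hypothetical test signal: encode L/R to M/S, then decode at two settings.
L = [math.sin(0.01 * i) for i in range(NUM_COEFFS)]
R = [math.cos(0.02 * i) for i in range(NUM_COEFFS)]
M = [(l + r) / math.sqrt(2.0) for l, r in zip(L, R)]
S = [(l - r) / math.sqrt(2.0) for l, r in zip(L, R)]

full_left, full_right = asymmetric_ms_decode(M, S, 32, 32)   # standard decoding
mono_left, mono_right = asymmetric_ms_decode(M, S, 8, 0)     # all S data discarded
```

With m = s = 32 the original left and right coefficients are recovered, while with
s = 0 both outputs are identical, i.e. the stereo audio becomes mono audio.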
 
4.2.3 Multiple Level Decoding  
In our proposed workload scalable decoding scheme, we combine frequency band
partitioning and stereo image scalability to provide satisfactory playback quality
with further reduced workload. In this process, we first apply the frequency band
partitioning to both the M channel and the S channel. We then keep only the
combinations in which the S channel level is not greater than the M channel level,
according to the asymmetric significance in joint stereo mode. In addition, we allow
all S channel data to be discarded, in which case the workload is reduced by almost a
half at the cost that the stereo audio becomes mono audio. Under this strategy, the
workload of a specific decoding level has a roughly linear relationship with the
number of subbands involved. This is an important fact for designing the workload
scalability, and it is analyzed as follows. For all the MP3 processing modules shown
in Figure 4.2, except the polyphase synthesis filterbank, the workloads are
proportional to the number of subbands involved. The polyphase synthesis filterbank
(see section 4.3 for details) includes two steps: cosine re-modulating and polyphase
subfiltering. Their workloads can be represented as K·logK and 16·K respectively,
where K is the number of subbands involved. The only term violating the linear
relationship is K·logK, coming from the cosine re-modulating. Since logK ranges from 3
to 5 in our scheme, the maximal workload deviation from the linear relationship is one
sixteenth of the polyphase subfiltering workload. The deviation ratio becomes smaller
when taking the other processing modules into consideration. Therefore the linear
component dominates the workload of the entire decoding process. This enables us to
use the number of subbands involved as an approximate workload estimate, which is
convenient in the design phase of the BSS scalability scheme.
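The size of the nonlinear term can be checked with a few lines of arithmetic. The
sketch below evaluates the filterbank workload model K·logK + 16·K from the text and
compares it with a linear estimate. Two points are assumptions made for illustration:
the logarithm is taken to base 2, and the linear estimate uses the mid-range value
logK = 4, so that the cosine re-modulation term is approximated by 4·K.

```python
import math


def filterbank_workload(k):
    """Workload model from the text: cosine re-modulation + polyphase subfiltering."""
    return k * math.log2(k) + 16 * k


def linear_estimate(k, mid_log=4.0):
    """Linear approximation; mid_log = 4 is an assumed mid-range value of log2(K)."""
    return (mid_log + 16) * k


def deviation_ratio(k):
    """Deviation from linearity, relative to the polyphase subfiltering workload."""
    return abs(filterbank_workload(k) - linear_estimate(k)) / (16 * k)


for k in (8, 12, 16, 24, 32):
    print(f"K={k:2d}  model={filterbank_workload(k):7.2f}  "
          f"linear={linear_estimate(k):6.1f}  deviation={deviation_ratio(k):.4f}")
```

Under these assumptions the deviation ratio peaks at K = 8 and K = 32, where it equals
one sixteenth, which matches the bound quoted above.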
In all, 14 levels of scalability result from the above strategy. This is undesirable
since: 1) 10 additional levels of scalability are introduced, which may confuse the
user; 2) these scalability levels do not have a consistent relationship between
workload and playback quality. For example, according to our evaluation, configuration
(M:32, S:0) has more workload than configuration (M:16, S:8), but the former has
perceived quality degradation in comparison with the latter.  To address these two problems, we








Decoding level   Configuration   Workload reduction (subbands)   Perceived quality
1                M: 8, S: 0      56                              AM quality
2                M: 16, S: 8     40                              Near FM quality
3                M: 24, S: 12    28                              Near CD quality
4                M: 32, S: 16    16                              CD quality
5                M: 32, S: 32    --                              CD quality
Table 4.2 Five decoding levels, where workload reduction is measured in terms of 
subbands, with a standard MP3 decoder (decoding level 5) as the baseline  
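The counts quoted in this subsection can be reproduced with a short enumeration. The
sketch below lists every (M, S) combination allowed by the strategy (S taken from
{0, 8, 16, 24, 32} with S not greater than M) and recomputes the workload reductions
of Table 4.2, using the number of decoded subbands as the workload estimate. The names
are hypothetical.

```python
GROUP_SUBBANDS = (8, 16, 24, 32)   # group boundaries from Table 4.1
FULL_WORKLOAD = 32 + 32            # standard decoder: 32 M + 32 S subbands


def candidate_configs():
    """All (M, S) pairs with S in {0, 8, 16, 24, 32} and S <= M."""
    return [(m, s)
            for m in GROUP_SUBBANDS
            for s in (0,) + GROUP_SUBBANDS
            if s <= m]


def workload(config):
    """Workload estimate in subbands: decoded M subbands plus decoded S subbands."""
    m, s = config
    return m + s


# The five decoding levels of Table 4.2.
LEVELS = {1: (8, 0), 2: (16, 8), 3: (24, 12), 4: (32, 16), 5: (32, 32)}


def workload_reduction(level):
    """Reduction relative to the standard decoder (decoding level 5)."""
    return FULL_WORKLOAD - workload(LEVELS[level])


print("number of candidate configurations:", len(candidate_configs()))
print("workload (M:32, S:0) vs (M:16, S:8):", workload((32, 0)), workload((16, 8)))
print("reductions:", [workload_reduction(level) for level in sorted(LEVELS)])
```

Note that level 3 uses S = 12, which is not one of the group boundaries, so it is not
among the enumerated candidates; the text explains this choice by the up-sampling
operations in the scalable filterbank.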
The rationale underlying these 5 decoding levels is twofold. First, their
corresponding perceived quality levels should comply with the user’s listening
experience. Second, their required workloads should be distributed as evenly as
possible. The latter will guarantee the effectiveness of workload reduction. In our
scheme, we have designed decoding level 1 as mono audio, as is the case for AM.
Decoding level 1 is only suitable for some specific scenarios, such as an urgent need
to prolong the battery life. Decoding levels 2, 3 and 4 follow their
counterparts in frequency band scalability and exploit stereo image scalability to
achieve additional workload reductions. It should be noted that we decode 12
subbands for the S channel in decoding level 3, rather than 8 or 16 subbands. This is
mainly because such a configuration facilitates the up-sampling operations in the
scalable polyphase synthesis filterbank; more details can be found in section 4.3.5.
Subjective evaluation shows that although stereo image scalability and the scalable
polyphase synthesis filterbank are both employed, the playback qualities of decoding
levels 2~4 are indistinguishable from those of their corresponding groups in frequency
band scalability. This fact demonstrates the effectiveness of stereo image scalability
in perception-aware workload reduction. Finally we have reserved configuration (M:32,











[Figure 4.3 diagram labels: left and right channel audio data in blocks of 576 samples; data blocks B1, B2, B3, B4; M channel; outputs Left + M channel and Right + M channel]





Figure 4.3  Block diagram of the proposed BSS multi-level decoding algorithm 
with frequency band scalability and stereo image scalability, where B1-B4 stand 
for middle channel, side channel, left channel and right channel data, respectively.   
The structure of the proposed BSS scalability is shown in Figure 4.3. Let m subbands
of the M channel and s subbands of the S channel be decoded by our scheme, where
s ≤ m ≤ 32. After Huffman decoding, our scheme keeps only the lowest m and s subbands
of the M channel and the S channel respectively. In the subsequent processes before
the polyphase synthesis filterbank, asymmetric operations are conducted for the M
channel (or left channel) and the S channel (or right channel). Due to the absence of
subbands [s+1, m] in the side channel, subbands [s+1, m] of the left channel cannot be
reconstructed. Joint stereo processing therefore outputs real data of the lowest s
subbands for both channels and, in place of the missing subbands, the middle data of
subbands [s+1, m], which is carried along with the left channel. For the processing
modules from requantization to IMDCT, as the computation of one subband is independent
of the other subbands, the implementation of the BSS scalability is straightforward:
we just leave the M channel data of subbands [m+1, 32] and the S channel data of
subbands [s+1, 32] unprocessed, and their computation is saved accordingly. Unlike
those preceding modules, the polyphase synthesis filterbank module needs to be
redesigned to exploit the frequency band and stereo image scalability to achieve
workload reduction. Towards this, we have developed a general method, namely
asymmetric partial spectrum reconstruction (APSR), which we present in the next
section. When we estimate the workload of each decoding level in terms of the
number of its associated subbands, the average step of an even distribution of
workload between (M:8, S:0) and (M:32, S:32) should be 14 subbands. In our scheme,
the steps from decoding level 1 to decoding level 5 are 16, 12, 12 and 16 subbands
respectively, close to the optimal solution.
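The step sizes quoted above follow from simple arithmetic on the five configurations.
The sketch below recomputes them, again using the number of decoded subbands
(M subbands plus S subbands) as the workload estimate; the variable names are
hypothetical.

```python
# (M subbands, S subbands) for decoding levels 1 to 5.
LEVEL_CONFIGS = [(8, 0), (16, 8), (24, 12), (32, 16), (32, 32)]

# Workload estimate per level, in subbands.
level_workloads = [m + s for m, s in LEVEL_CONFIGS]

# Workload increase from each level to the next.
steps = [b - a for a, b in zip(level_workloads, level_workloads[1:])]

# Step size of a perfectly even distribution between level 1 and level 5.
ideal_step = (level_workloads[-1] - level_workloads[0]) / (len(level_workloads) - 1)

print("workloads:", level_workloads)
print("steps:", steps)
print("ideal even step:", ideal_step)
```

The actual steps deviate from the even step of 14 subbands by only 2 subbands in
either direction.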
  
4.3 Efficient Algorithm for Synthesis Filterbank  
4.3.1  Asymmetric Partial Spectrum Reconstruction for Stereo 
Audio 
In audio decoding techniques, a fundamental approach to reducing computation 
complexity is partial spectrum reconstruction (PSR). Here, only the spectrum of a part 
of the coded subbands is reconstructed, resulting in a low-pass version of the original 
audio. Much work on PSR has been reported in the literature. In [86], general 
principles of PSR via digital filter banks are discussed. In [2], the design of PSR 
synthesis filter banks for MPEG audio is presented.  
It should be noted that frequency band scalability is in line with PSR, and we can make
use of the techniques reported in [2] to save the computation cost of the polyphase
synthesis filterbank module. On the other hand, no work has been reported that exploits
stereo image scalability in stereo audio, where fewer side channel subbands than
middle channel subbands are decoded. According to our observation, a large fraction
of the MP3 audio files on the market is coded in joint/MS mode. This justifies the
need for an efficient algorithm that exploits the characteristics of stereo image
scalability.
In typical asymmetric decoding, the lower subbands [0, L+M-1] of the middle channel
and the lower subbands [0, L-1] of the side channel are used to reconstruct audio
samples [49]. After processing by the modules preceding the synthesis filter bank in
the MPEG audio decoder, three blocks of data are generated, namely subbands [0, L-1]
of the left and right channels, and subbands [L, L+M-1] of the middle channel, which
are the input of the synthesis filterbank module. In MPEG-1 audio [57], the synthesis
filterbank is implemented with a Pseudo-Quadrature Mirror Filter (PQMF). Since the
middle channel data provide common high frequency components for both the left and
right channels, a straightforward way to deal with the middle channel data is to add
them to both the left and right channels before performing PQMF [50]. However, this
method results in a significant amount of redundant and irrelevant computation. For
this reason, we propose APSR to eliminate the redundant and irrelevant computation
while maintaining the same perceptual quality of the decoded audio.
Notations: In the rest of the section, $A_{L}^{m \times n}$ means matrix A has m rows
and n columns, and it is labeled by L. L denotes either the left channel or the number
of left channel subbands involved; the exact meaning can be determined from the
context. The same applies for M and R. Where the number of subbands is concerned, R=L
holds. The superscript T denotes transpose.
4.3.2 Conceptual Framework  
As defined in [57], the PQMF algorithm used in MPEG audio is essentially based on
cosine-modulated filter banks. In the synthesis filterbank, which performs PQMF at the
decoder, the set of filters is derived from a single low-pass prototype filter by
cosine modulation, which yields a series of frequency shifts of the prototype
[104]. To reduce computational complexity, MPEG audio makes use of polyphase
decomposition to implement the filterbank. Consequently, the synthesis process
comprises two computationally intensive steps to reconstruct the analyzed signal,
namely cosine re-modulating and polyphase subfiltering. A block diagram of the




    
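[Figure 4.4 diagram labels: cosine re-modulation matrix D of size K×2K; polyphase subfilters H_0(-z^2), H_1(-z^2), ...; outputs Y_0(m), Y_1(m), ..., Y_{2K-1}(m)]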
 
Figure 4.4 Structure of synthesis filter bank in MPEG-1 audio 
The structure of the synthesis filter bank allows us to interpret the synthesis process in 
an alternative way which provides insights into the proposed APSR algorithm: the 
cosine re-modulation module performs frequency shifting operation on input 
frequency coefficients, and polyphase subfilters transfer shifted coefficients of the 
frequency domain into samples of the time domain. In the light of this interpretation, 
M data (high frequency components) are distinguishable from L and R data (low 
frequency components) at the output of cosine re-modulation. On the other hand, 
different spectral components are merged after polyphase subfiltering, and it is 
impossible to separate them. These two facts influence the design of our proposed 
approach.  
The key idea of our approach is to eliminate redundant computation by enabling the
right channel to share the processed M data of the left channel. That is, the
dimensions of the input data to the PQMFs of the left and right channels are (L+M) and
R respectively. As the computational complexities of cosine re-modulation and
polyphase subfiltering are O(K·log K) and O(16·K) respectively, where K is the number
of subbands involved, the computational workload of the synthesis process of the right
channel is significantly reduced in comparison to the original scheme in [50].
An important part of the proposed APSR is the reconstruction of the M data, which are
removed from the right channel to reduce computational workload. One possibility
is to share the cosine-re-modulated M data before polyphase subfiltering as in our 
earlier scheme [50]. This can be easily accomplished as M data are separated from low 
frequency components (L data) at this stage. The polyphase subfiltering of the left and 
right channels can then be executed separately, and the reconstructed data of both 
channels are of the same sampling rate. While this scheme may be implemented with 
ease, the redundant computation of M data in the step of polyphase subfiltering for the 
right channel is not removed. 
The technique proposed in this section tackles the above-mentioned problem.
To this end, we postpone the sharing of M data until after polyphase subfiltering. The
main challenge in implementing this scheme lies in the extraction of processed M data
for the right channel, since the output of the left channel contains the sum of converted
L and M data, which are not easily separable. To solve this problem, we
first compute the residue between R and L data (R-L), which is used as the input of the
right channel to the filterbank instead of the original R data. After the polyphase
subfiltering step, the (R-L) time samples are up-sampled to match the sampling
frequency of the left channel. As the final step, the sum of time samples from both
PQMFs produces the desired right channel samples with high frequency coefficients
(R+M).
To facilitate easy sampling rate conversion, we restrict the subband dimension (L+M)
to be a multiple of R. As a result, our proposed technique incurs a computational workload
close to that of processing (2L+M) subbands, compared with (2L+2M) in our earlier
scheme.
4.3.3 Cosine Re-modulation  
For the sake of generality, we represent cosine re-modulation in terms of the number 
of subbands K as follows: 
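$Y_{2K \times 1} = D_{2K \times K} \cdot X_{K \times 1}$                                                           (4.4.1)
[Equation (4.4.2): definition of the entries of the cosine re-modulation matrix D in terms of K and N; garbled in the source]                                        (4.4.2)
where N is the number of prototype filter coefficients, subject to N=16·K+1.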
From (4.4.2), we can then derive the respective cosine re-modulation matrices Ds for 
the left and right channels by substituting the specified values of K. 
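As an illustration of (4.4.1), the following Python sketch builds a 2K×K re-modulation matrix and applies it to a vector of K subband samples. The matrix entries use the MPEG-1 synthesis matrixing formula cos((16 + i)(2k + 1)π/64), with 32 subbands replaced by K. This generalization and the names `remodulation_matrix` and `remodulate` are assumptions made for illustration and are not taken from (4.4.2).

```python
import math

def remodulation_matrix(K):
    """Return a 2K x K cosine matrix D as a list of rows.

    Assumption: entries follow the MPEG-1 matrixing formula
    cos((16 + i)(2k + 1)pi/64), with 32 subbands replaced by K.
    """
    return [[math.cos((K / 2 + i) * (2 * k + 1) * math.pi / (2 * K))
             for k in range(K)]
            for i in range(2 * K)]

def remodulate(D, X):
    """Compute Y = D * X, as in (4.4.1), by direct multiplication."""
    return [sum(d * x for d, x in zip(row, X)) for row in D]

if __name__ == "__main__":
    K = 8
    D = remodulation_matrix(K)
    X = [1.0] + [0.0] * (K - 1)   # only subband 0 is non-zero
    Y = remodulate(D, X)
    print(len(D), len(D[0]), len(Y))   # 16 8 16
```

Substituting K = L+M or K = R gives matrices of the sizes needed for the two channels. The direct product shown here costs 2K·K multiplications; it is only meant to show the dimensions, not a fast implementation.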
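According to (4.4.1), the calculation for the left channel is represented as (4.4.3a). By [...]
[Equations (4.4.3a)-(4.4.3c): garbled in the source. The surviving fragments indicate that (4.4.3a) applies a re-modulation matrix of dimension 2(L+M)×(L+M) to the left channel input, and that (4.4.3c) involves the left channel matrix $D_L$.]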
For right channel input, we make use of (4.4.4) to re-modulate it: 
$Y_{R-L} = D_{2R \times R} \cdot (X_R - X_L)$                                                       (4.4.4)
As we have mentioned, the residue between R and L data is calculated, rather than the 
original R data, which facilitates the reconstruction of right channel samples. This can 
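be justified as follows: Let $f_L(t)$ and $f_R(t)$ denote sample values at instant t of the left
and right channels respectively, and let $P(\cdot)$ denote the PQMF synthesis operation, so that
$f_L(t) = P(X_L), \quad f_R(t) = P(X_R)$                                     (4.4.5)
Since PQMF is a linear system [32], we have:
$f_R(t) = P(X_L) + P(X_R - X_L) = f_L(t) + P(X_R - X_L)$                                           (4.4.6)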
Therefore, the desired right channel samples can be obtained from the sum of the
residual samples and the left channel samples. For convenience, we denote the second
term in (4.4.6) as $f_{R-L}(t)$.
By comparing (4.4.3c) and (4.4.4), we can see that $D_{2R \times R}$ is lower in dimension
than $D_L$, which implies that $Y_{R-L}$ only provides a low-pass version of $f_{R-L}(t)$.
Moreover, some distortion is introduced into $f_{R-L}(t)$ by the additional up-sampling
operation (see Section 4.4.4). Thus, our scheme only yields an approximation to $f_R(t)$.
Fortunately, the distortion introduced by the up-sampling is inaudible to the human ear
under appropriate profiles, which we will verify in Section 4.4.1.
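The linearity argument behind the residue scheme can be checked numerically on a toy example. The Python sketch below uses a generic linear synthesis operator (a fixed cosine matrix standing in for the PQMF; `synth` and the test vectors are hypothetical) and confirms that synthesizing the right channel data directly gives the same samples as adding the synthesized residue to the synthesized left channel. Both paths are deliberately kept at the same dimension, so the sketch does not model the low-pass and up-sampling approximations discussed above.

```python
import math

def synth(X):
    """Toy linear synthesis: y[n] = sum_k X[k] * cos((2k+1)(n+K/2)pi/(2K)).

    Stand-in for the PQMF; any linear operator would serve here.
    """
    K = len(X)
    return [sum(X[k] * math.cos((2 * k + 1) * (n + K / 2) * math.pi / (2 * K))
                for k in range(K))
            for n in range(2 * K)]

if __name__ == "__main__":
    X_L = [0.5, -1.0, 0.25, 2.0]     # hypothetical left-channel subband data
    X_R = [0.75, -0.5, 0.0, 1.5]     # hypothetical right-channel subband data
    residue = [r - l for r, l in zip(X_R, X_L)]

    f_L = synth(X_L)                 # left channel output
    f_res = synth(residue)           # output of the residue path
    f_R_direct = synth(X_R)          # right channel synthesized directly
    f_R_shared = [a + b for a, b in zip(f_L, f_res)]

    err = max(abs(a - b) for a, b in zip(f_R_direct, f_R_shared))
    print(err < 1e-12)
```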
4.3.4 Polyphase Subfilters  
In this section, we present a generalized version of polyphase subfiltering capable of 
conducting calculation according to the number K of subbands involved, which is 
required in our scheme. For this, we need to re-design the prototype filter in terms of 
K, which has been proposed in [2].  
The redesigned prototype filter can be decomposed into 2K polyphase components as 
follows: 
$H(z) = \sum_{i=0}^{I-1} \sum_{j=0}^{2K-1} h(2iK + j) \cdot z^{-(2iK+j)} = \sum_{j=0}^{2K-1} z^{-j} H_j(z^{2K})$                                     (4.4.7)
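To make the decomposition in (4.4.7) concrete, the following Python sketch splits a filter's coefficients into 2K polyphase components, where component j holds the coefficients h(2Ki + j), and then interleaves them again to confirm that nothing is lost. The function names and the placeholder coefficient values are hypothetical; only the length N = 16·K + 1 is taken from the text.

```python
def polyphase_split(h, K):
    """Split h into 2K components; component j is h[j], h[2K + j], h[4K + j], ..."""
    return [h[j::2 * K] for j in range(2 * K)]

def polyphase_merge(components):
    """Interleave the components back into the original coefficient order."""
    total = sum(len(c) for c in components)
    step = len(components)
    h = [0.0] * total
    for j, comp in enumerate(components):
        h[j::step] = comp
    return h

if __name__ == "__main__":
    K = 2
    h = [float(n) for n in range(16 * K + 1)]   # placeholder values, N = 16K + 1 = 33
    parts = polyphase_split(h, K)
    print(len(parts), [len(p) for p in parts])  # 4 [9, 8, 8, 8]
    print(polyphase_merge(parts) == h)          # True
```

Because N = 16·K + 1 is not a multiple of 2K, the first component is one coefficient longer than the others in this sketch.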
Based on these polyphase components, the polyphase subfiltering calculation is 
represented in (4.4.8) [32], and it accomplishes the required calculations: 
[Equation (4.4.8): garbled in the source; the surviving symbols are $d(z)$, $z^{-K}$, $S(0)$, $S(K)$ and $Y$]                                                      (4.4.8)
4.3.5 Up-Sampling by Repetition  
A common method for up-sampling rate conversion is to employ an interpolation filter 
[28]. The operation can be represented as: 
$\hat{X}_R^{(L+M) \times 1} = U^{(L+M) \times R} \cdot \tilde{X}_R^{R \times 1}$                                     (4.4.9)
Although the interpolation filter method can provide optimized performance, its
computational complexity is very high: (4.4.9) requires (L+M)*R multiplications and
(L+M)*(R-1) additions, which contradicts our main design objective.
As discussed in Section 4.1, small distortions are tolerable in the application scenarios
of portable devices. This allows us to exploit computationally efficient up-sampling 
methods. Through investigation, we choose repetition to perform up-sampling rate 
conversion; this choice yields satisfactory performance with very low computational 
complexity, especially in the case that (L+M) is a multiple of R.  
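A minimal Python sketch of up-sampling by repetition is given below. It assumes that the target length (L+M) is an integer multiple of the source length R, so that each sample is simply repeated (L+M)/R times; the function name `upsample_by_repetition` and the sample values are hypothetical. Unlike the matrix form in (4.4.9), no multiplications or additions are needed.

```python
def upsample_by_repetition(samples, target_len):
    """Repeat each input sample so that the output has target_len samples.

    Assumes target_len is a positive integer multiple of len(samples).
    """
    R = len(samples)
    if R == 0 or target_len % R != 0:
        raise ValueError("target_len must be a multiple of the input length")
    factor = target_len // R
    return [s for s in samples for _ in range(factor)]

if __name__ == "__main__":
    residual = [0.1, -0.2, 0.3, 0.0]            # R = 4 hypothetical samples
    print(upsample_by_repetition(residual, 8))  # [0.1, 0.1, -0.2, -0.2, 0.3, 0.3, 0.0, 0.0]
```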
4.4 Experimental Evaluation 
4.4.1 Subjective Evaluation of BSS Decoding Scheme 
To evaluate the effectiveness of the BSS decoding scheme, we carried out experiments
with a group of 13 subjects (male and female graduate students). All subjects were asked
to evaluate the audio quality using the mean opinion score (MOS), which is a 5-point
scale (5-excellent, 4-good, 3-fair, 2-poor, and 1-bad). We used 5 musical
programs for evaluation, selected from pop music MP3 clips: Song from a secret
garden (sample 1), Casablanca (sample 2), Love fool (sample 3), Lydia (sample 4),
and TNT for the brain (sample 5). These clips include light music and female/male solos,
and cover several common rhythms. They are all in joint stereo mode, with a sampling rate
of 44.1 kHz and a bitrate of 128 kbits/s. There are three sources of distortion in our proposed
BSS scheme: frequency band scalability, stereo image scalability, and APSR of the
polyphase synthesis filterbank. As we have designed the frequency band scalability
according to the user’s listening experience, we focus on the latter two kinds of distortion
in this section and present experimental results for each of these sources
in turn.
We first investigate stereo image scalability. For each program, we
prepared 5 copies for testing. These copies were generated by our BSS decoder, using a
standard polyphase synthesis filterbank, with configurations of (M:32, S:32), (M:32,
S:24), (M:32, S:16), (M:32, S:8), (M:32, S:0), respectively. Each program has two
additional copies, the original audio clip and (M:8, S:8), given as references,
the former at MOS 5 and the latter at MOS 3. For the sake of fairness, all test [...]
Figure 4.5  Evaluation results for different BSS configurations 
As shown in Figure 4.5, a large part of the (M:32, S:0) copies differ significantly
in score from the other copies. This is mainly because subjects are
sensitive to the difference between mono and stereo audio, and prefer stereo.
Another interesting observation is that the quality scores of the (M:32, S:24),
(M:32, S:16) and (M:32, S:8) copies are not consistent with the configurations of S
channels. This suggests that the subjects cannot tell the differences among the
copies with these configurations in terms of quality. Together, these results show that the BSS
decoder can achieve very satisfactory quality with low configurations of S channels,
such as 8 subbands, which require considerably less computation.
We also carried out experiments to evaluate the quality degradation
introduced by our APSR algorithm. For each program, we prepared four copies for
testing. These copies were generated by our APSR algorithm with profiles of (M:32,
S:32), (M:32, S:16), (M:32, S:8) and (M:32, S:4), respectively. Each program had an
additional copy with (M:32, S:32) given as a reference. The results of our subjective
evaluation are shown in Table 4.3.
M:32    Sample 1    Sample 2    Sample 3    Sample 4    Sample 5
S:32    4.90        4.85        4.93        4.90        4.90
S:16    4.90        4.90        4.90        4.90        4.95
S:8     4.75        4.85        4.70        4.85        4.87
S:4     4.35        4.65        4.80        4.33        4.57
Table 4.3  Perceptual evaluation results for different APSR profiles 
We can see that profiles (M:32, S:4) and (M:32, S:8) incur only slight quality
degradation. In particular, the profile (M:32, S:16) is almost indistinguishable from the
full decoding profile used in the standard MPEG audio decoder. These observations
show that the proposed APSR algorithm can provide satisfactory playback quality.
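For a compact view of Table 4.3, the scores can be averaged over the five samples for each profile. The Python sketch below does this with the values copied from the table; the per-profile mean is a summary added here for illustration and is not a statistic reported in the evaluation itself.

```python
# MOS values from Table 4.3 (M:32 in all profiles), samples 1 to 5.
mos = {
    "S:32": [4.90, 4.85, 4.93, 4.90, 4.90],
    "S:16": [4.90, 4.90, 4.90, 4.90, 4.95],
    "S:8":  [4.75, 4.85, 4.70, 4.85, 4.87],
    "S:4":  [4.35, 4.65, 4.80, 4.33, 4.57],
}

def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)

if __name__ == "__main__":
    for profile, scores in mos.items():
        print(profile, round(mean(scores), 2))
```

The means come out at roughly 4.90, 4.91, 4.80 and 4.54 for S:32, S:16, S:8 and S:4, so only the S:4 profile falls noticeably below the reference.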
4.4.2 Workload Estimation  
We evaluated our decoder using two classes of audio clips, one with a
bitrate of 160 kbits/sec and the other with a bitrate of 128 kbits/sec. In the
former class, the average number of bits per frame is higher than in the latter
class. We measured the workload of these clips with an ARM simulation tool [55]. We
implemented our BSS decoder based on the MP3 decoder from Fraunhofer IIS,
which is the reference source code of the MPEG-1 audio standard. The main merit of the
reference decoder is that its implementation is well documented in the informative part of
[57], which facilitates the reader’s analysis and comparison. All the audio clips we used
had a sampling frequency of 44.1K PCM samples/sec per channel, which corresponds
to CD quality audio, and a duration of 20 sec.
We measured the workload of a task in terms of the number of instructions executed to
accomplish the task, a common measure of computer performance. The
workload W of decoding an MP3 frame is closely related to the required processor speed,
which can be illustrated by the following example. As audio decoding is a real-time
application, each audio frame must be decoded within a period T to avoid
playback jitter. The processor speed should therefore be set to at least W/T. Through
workload reduction, we can lower the required processor speed, which results in
energy savings when DVS is used.
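As a worked example of the W/T bound, the Python sketch below converts a per-frame workload into a minimum instruction rate. It assumes an MPEG-1 Layer III frame of 1152 PCM samples per channel, so that the frame period at 44.1 kHz is 1152/44100 s (about 26.1 ms), and it takes the level 5 workload of 11.76 MIPF from Table 4.4. The function name is hypothetical, and equating the instruction rate with a clock frequency (one instruction per cycle) would be a further simplification.

```python
SAMPLES_PER_FRAME = 1152      # MPEG-1 Layer III frame length per channel
SAMPLE_RATE_HZ = 44100

def min_instruction_rate(instructions_per_frame, period_s):
    """Lower bound W/T on the instruction rate needed to decode in real time."""
    return instructions_per_frame / period_s

if __name__ == "__main__":
    T = SAMPLES_PER_FRAME / SAMPLE_RATE_HZ        # frame period in seconds
    W = 11.76e6                                   # level 5 workload (Table 4.4)
    rate = min_instruction_rate(W, T)
    print(f"T = {T * 1e3:.2f} ms, W/T = {rate / 1e6:.1f} million instructions/s")
```

Under these assumptions the bound is about 450 million instructions per second for level 5. Any reduction in W lowers the bound proportionally, which is the headroom that DVS can turn into energy savings.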
Table 4.4 lists the average workloads, in million instructions per frame (MIPF),
for five decoding levels using the high bitrate class of clips. We obtained almost
identical results for the low bitrate (128 kbits/sec) class of clips as well. Based on
these MIPF values, we calculated the workload reduction with respect to decoding
level 5. For a given MP3 clip, let its workload at decoding level j be $W_j$, 1≤j≤5. For
decoding level j, the normalized workload reduction is $(W_5 - W_j)/W_5$. The
significance of the normalized workload reduction is twofold. First, as pointed out in
[26], MIPF values of the same application vary across platforms, but these
values can be derived from one another by a scaling factor. In this case, normalized workload
reduction is more meaningful and provides generic information across various
platforms. Second, the processing modules may be implemented by various
techniques. For example, IMDCT can be performed with a direct implementation or a fast
implementation, which makes the workload vary widely between implementations.
Normalized workload reduction is insensitive to particular implementations and
remains stable when the linear component dominates the overall workload.













Decoding level        5        4        3        2        1
MIPF                  11.76    8.95     6.85     4.77     1.71
Workload reduction    --       23.9%    41.7%    59.4%    85.0%
Table 4.4 The average workload (MIPF) for the five decoding levels, along with 
the normalized workload reduction with respect to the standard MP3 decoder 
(decoding level 5) 
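The normalized workload reduction can be recomputed from the MIPF row of Table 4.4. The Python sketch below assumes that the reduction for level j is (W_5 − W_j)/W_5 and that the MIPF columns are ordered from decoding level 5 down to level 1; both assumptions are inferred from the table and its caption. Because the inputs are the rounded MIPF values, the results may differ from the tabulated percentages by up to about half a percentage point.

```python
# MIPF values from Table 4.4, assumed to be ordered level 5, 4, 3, 2, 1.
mipf = {5: 11.76, 4: 8.95, 3: 6.85, 2: 4.77, 1: 1.71}

def normalized_reduction(workloads, level, baseline=5):
    """Return (W_baseline - W_level) / W_baseline."""
    w_base = workloads[baseline]
    return (w_base - workloads[level]) / w_base

if __name__ == "__main__":
    for level in (4, 3, 2, 1):
        print(level, f"{100 * normalized_reduction(mipf, level):.1f}%")
```

Since the measure is a ratio, scaling every MIPF value by a common platform-dependent factor leaves it unchanged.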
 
Table 4.4 verifies that the workload is roughly proportional to the frequency 
bandwidth to be decoded. Clearly, the decoding configuration (M:32, S:32) requires 
the highest amount of computation. Compared to this baseline, significant and 







Chapter 5  
Conclusions and Future Works 
5.1 Summary 
In this thesis, we looked into perception-aware low power audio processing techniques
for portable devices. We exploited two methods, namely workload reduction and DVS, to
achieve energy efficiency. For low power audio decoding applications, the most
challenging problem is that they have different characteristics and design requirements
from those addressed by current low power multimedia technologies. We approached this
problem at two levels, high-level design methodology and concrete techniques, and covered three
important aspects: workload reduction without degradation of playback quality, DVS
with fluctuation smoothing, and workload scalability. These three techniques
provide a comprehensive solution for the application scenarios of portable devices.
 
We believe that low power audio processing is still in its infancy, which is
reflected at the levels of both methodology and concrete techniques. Unlike more mature
disciplines, it has no established methodology framework to guide the
design of low power audio processing. This is the main reason for the
ineffectiveness of existing techniques. We therefore began by establishing a
methodology framework for designing low power audio processing technology. The
key ideas of our work can be summarized as three concepts. First, we have developed
a taxonomy of low power audio processing techniques. It is based on the
heterogeneous usage modes of portable devices. The taxonomy consists of two kinds
of techniques with different objectives and design strategies. Second, we have
proposed extending low power audio decoding design to the encoder side. This is the
key to achieving optimized energy efficiency without playback
quality degradation, a problem that existing techniques are unable to solve due
to their inherent limitations. This idea is also applicable to DVS. Third, we have
proposed the concept of workload scalable decoding to support energy-efficient
operation of voltage schedulers and users’ choices. The significance of these three
concepts lies in the fact that they partially fill the gap between the requirements of audio
applications and existing low power techniques, and may lead to a number of
innovative low power techniques targeting audio processing.
 
More specifically, following the methodology proposed above, the main results
obtained in this thesis can be summarized as follows.
• We have proposed a novel framework, JEDF, which allows the decoder to
achieve a desirable tradeoff between energy and memory consumption without
sacrificing playback quality. This is achieved by a joint noise shaping process.
With no statistically significant differences from a standard encoder of the
same configuration, JEDF can on average save around 40% of the overall
computational workload of an AAC low complexity decoder. On the other
hand, it incurs only a modest increase in file size: for a bit rate of 128 Kb/s,
the average file size increase ratio is 9.52%, and the increase ratio of the
compressed file size decreases as the bit rate increases. More importantly,
JEDF represents a new direction for workload reduction in long block
transform computations. Although various methods have been
proposed to save computation in transforms, it is hard to find an effective
way to reduce the workload of a long block transform. In a manner similar to
the “lossy compression” used in audio encoding to achieve high compression
ratios, JEDF performs a “lossy transform” for the long block transform, where
higher noise levels are allowed in order to achieve significant workload
reduction. JEDF then addresses this noise by joint noise shaping.
• We have proposed a new concept of a media server supporting DVS, which is
superior to existing client-only approaches. As a first step in this new direction,
we have designed an optimal speed control scheme for intra-task voltage
scheduling. The significance of the optimal solution is twofold. First, it is of
significant theoretical interest, since it identifies for the first time the lower
bound of energy consumption achievable for the given buffers. Second, it
requires much less memory than the existing approaches. This largely
facilitates its application and provides additional opportunities for energy
savings. In terms of video, experimental results show that the proposed
techniques: 1) lead to only 2% additional energy consumption compared to the
theoretical minimum energy consumption; 2) require buffer sizes of less than 3
frames, and introduce a delay of less than 0.1 sec. Compared to representative
buffering-based DVS techniques, our work improves
energy efficiency by 28.3% with the same buffer sizes, or reduces the
buffer requirement by 50% at the same level of energy consumption. These results
demonstrate the superiority of the media server supported DVS scheme.
• We have, for the first time, identified the dynamic nature of perception
awareness in audio playback applications in the context of portable devices. It
provides a new opportunity for workload scalability. In implementing
workload scalability, we have solved two key issues. First, how should the
decoding levels be designed to offer desirable tradeoffs between playback quality
and workload? This problem turned out to be closely related to the user’s
listening experience. Second, how can stereo image scalability be exploited to reduce
the workload of the synthesis filterbank? To address this problem, we have
developed the APSR technique, which is a useful extension to the well-known
PSR technology.
5.2 Future Works  
Based on the study presented in this thesis, some possible directions for future work are
listed below.
• In our implementation of JEDF, we have concentrated on the truncation noise
shaping of the IFFT in the IMDCT to achieve the specified workload reduction. As
an immediate next step, we plan to extend the proposed scheme to other parts
of the IMDCT. Furthermore, how to represent the new side information, such
as the truncation positions of SOPOT coefficients, in an efficient way remains
to be addressed. Our current focus on truncation noise shaping results in irregular
truncation positions for different SOPOT coefficients, which implies that we
should insert the side information of the truncation positions into the coded
bitstream. A critical issue is then how to handle this side information
efficiently, as we have 255 coefficient blocks for a 512-point IFFT. We
plan to solve this problem by clustering all the frames into a limited number of
groups, where all members of the same group share the same side information of
truncation positions. This method will efficiently compress the side
information of truncation positions.
• In terms of media server supported DVS, we plan to extend the proposed
scheme to multiple task scenarios. Moreover, the analysis framework also
provides insights into the selection of input and playback buffer
configurations for individual media bitstreams with more accurate
estimations. The new estimation can be used to identify the built-in buffer
ranges needed to support a class of multimedia processing applications, which is an
important issue in designing a SoC platform.
• BSS provides the characteristics required to support energy-efficient operation
of a voltage scheduler. A critical open issue is how to
design the voltage scheduling algorithms to find the optimized tradeoff





[1] Acquaviva, A.; Benini, L.; and Ricco, B., “Software-controlled Processor Speed 
Setting for Low-power Streaming Multimedia,” IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 20(11), November 2001, pp. 
1283-1292 
[2] Argenti, F.; Del Taglia, F.; and Del Re, E., “Audio Decoding with Frequency and 
Complexity Scalability,” IEE Proceedings on Vision, Image and Signal 
Processing, 149(3), 2002, pp.152-158 
[3] Austin, T.; Larson, E.; and Ernst, D., “SimpleScalar: An Infrastructure for 
Computer System Modeling,” IEEE Computer, 35(2), 2002, pp. 59-67
[4] Bavier, A.; Montz, A.; and Peterson, L., “Predicting MPEG Execution Times,” 
SIGMETRICS/ PERFORMANCE International Conference On Measurement and 
Modeling of Computer Systems, 1998, pp. 131-140 
[5] Benini, L.; and De Micheli, G., “System-level Power Optimization: Techniques 
and Tools,”  ACM Transaction on Design automation of electronic systems, Vol.5, 
No.2, Apr. 2000, pp.115-192 
[6] Bernhard, G., “A Bit Rate Scalable Perceptual Coder for MPEG-4 Audio,” 103rd 
Audio Engineering Society Convention, 1997, Preprint 4620 
[7] Bosi, M.; and Goldberg, R. E., “Introduction to Digital Audio Coding and 
Standards,”  Kluwer Academic Publishers, 2002 
[8] Breuer, M. A., “Determining Error Rate in Error Tolerant VLSI Chips”, IEEE 
International Workshop on Electronic Design, Test and Applications, Jan, 2004, 
pp.321-326 
[9] Breuer, M. A., “Multi-media Applications and Imprecise Computation,” 
Euromicro Conference on Digital System Design, Aug., 2005. pp.2-7   
[10] Britanak, V.; Rao, K. R., “An Efficient Implementation of the Forward and 
Inverse MDCT in MPEG Audio Coding,”  IEEE Signal processing Letters, Vol. 
8(2), 2001, pp.48-51 
[11] Brooks, D.; Tiwari, V.; and Martonosi, M., “Wattch: A Framework for 
Architecture-level Power Analysis and Optimization,” International Symposium 
on Computer Architecture, 2000, pp.83-94 
[12] Bruno, D.; Dorgival G.; Wagner, M.; and Ricardo, B., “Limiting the Power 
Consumption of Main Memory,” ACM SIGARCH Computer Architecture, San 
Diego, USA, June, 2007, pp. 290-301 
[13] Burd, T. D.; Pering, T. A.; Stratakos, A. J.; and Brodersen, R. W., “A Dynamic 
Voltage Scaled Microprocessor System,” IEEE Journal of Solid-State Circuit, vol. 
35, no.11, Nov. 2000, pp.1571-1580. 
[14] Cai, L.; and Lu, Y. H., “Dynamic Power Management using Data Buffers,”  
conference on Design, Automation and Test in Europe, 2004, pp. 526-531 
[15] Chakraborty, S.; Wang, Y.; and Huang, W., “A Perception-Aware Low-Power 
Software Audio Decoder for Portable Devices,” IEEE Workshop on Embedded 
Systems for Real-Time Multimedia, September 22-23, 2005, New York, pp. 13-18 
[16] Chan, S. C.; and Yiu, P. M., “An Efficient Multiplierless Approximation of the 
Fast Fourier Transform Using Sum-of-Powers-of-Two (SOPOT) Coefficients,” 
IEEE Signal Processing Letters,   Vol. 9. PART 10, 2002, pp. 322-325 
[17] Chandrakasan, A.; Gutnik, V.; and Xanthopoulos, T., “Data Driven Signal 
Processing: An Approach for Energy Efficient Computing,” International 
Symposium on Low Power Electronics and Design, Monterey, CA, USA, 1994,  
pp.347-352 
[18] Chang, N.; Kim, K.; and Lee, H. G., “Cycle-accurate Energy Consumption 
Measurement and Analysis: Case Study of Arm7tdmi,” International. Symposium 
on Low-Power Electronics and Design, 2000, pp.185-190 
[19] Chen, C.; Chang, C.; and Ku, C., “A Low Power-Consuming Embedded System 
Design by Reducing Memory Access Frequencies,” IEICE Transactions on 
Information and Systems, Vol.  E88-D,  12,   Dec,  2005, pp.2748-2756 
[20] Chen, R. Y.;  Irwin, M. J.; and Bajwa, R. S., “Architecture-level Power Estimation 
and Design Experiments,” ACM Transactions on Design Automation of Embedded 
Systems, 2001, pp.50-66 
[21] Chen, W. H.; Smith, C. H.; and Fralick, S., “A Fast Computational Algorithm for 
the Discrete Cosine Transform,” IEEE Transactions on Communications, COM-
25(9), September 1977, pp. 1004-1009 
[22] Cheng, L.; Mohapatra, S.; Zarki, M. E.; Dutt, N.; and Venkatasubramanian, N., 
“A backlight optimization scheme for video playback on mobile devices,” IEEE 
Consumer Communications and Networking Conference,  Vol.2, 8-10, Jan, 2006,  
pp.883-887 
[23] Cho, Y. C.; Choi, S., “Nonnegative features of spectro-temporal sounds for 
classification,” Pattern Recognition Letters, Vol. 26(9), 2005, pp.1327-1336 
[24] Choi, K.; Dantu, K.; Cheng, W.; and Pedram, M.,  “Frame-based Dynamic 
Voltage and Frequency Scaling for a MPEG Decoder,”   International Conference 
on Computer Aided Design, 2002, pp.732-737 
[25] Choi, K.; Soma, R.; and Pedram, M., “Off-chip Latency-driven Dynamic Voltage 
and Frequency Scaling for an Mpeg Decoding,” Design Automation Conference, 
2004, pp.544-549 
[26] Chung, E.; Benini, L.; and Micheli, G., “Contents Provider-Assisted Dynamic 
Voltage Scaling for Low Energy Multimedia Applications,” International 
Symposium on Low Power Electronics and Design, 2002, pp.42-47 
[27] Contreras, G.; and Martonosi, M., “Power Prediction for Intel XScale Processors 
Using Performance Monitoring Unit Events”  International Symposium on Low 
Power Electronics and Design, Aug., 2005, pp.221-226  
[28] Crochiere, R. E.; and Rabiner, L. R., “Multirate Digital Signal Processing”, 
Prentice-Hall, 1983 
[29] Daudet, L.; Torrésani, B., “Hybrid Representations for Audiophonic Signal 
Encoding,” Signal processing,  82(11), 2002, pp.1595-1617 
[30] De Smet, P.; Bruyland, I., “Optimized Recursive Subband Synthesis Windowing 
for Implementing Efficient MPEG Audio Decoders,” IEEE Signal Processing 
Letters, Vol.10 (10), 2003, pp.303-306 
[31] De Smet, P.; Rooms, F.; Luong, H.; and Philips, W., “Do Not Zero-Pute: An 
Efficient Homespun MPEG-Audio Layer II Decoding and Optimization Strategy,” 
ACM Multimedia Conference, Oct. 2004, pp. 376 - 379 
[32] Diniz, P.S.; Silva, E.A.; and  Netto, S.L., “Digital Signal Processing : System 
Analysis and Design”, New York : Cambridge University Press, 2001  
[33] Duarte, D. E.; Vijaykrishnan, N.; and Irwin, M., “A Clock Power Model to 
Evaluate Impact of Architectural and Technology Optimizations,” IEEE 
Transactions on VLSI, vol. 10, no. 6, Dec. 2002, pp. 844-855 
[34] Fan, X.; Ellis, C.; and Lebeck, A., “Memory Controller Policies for DRAM Power 
Management,” International Symposium on Low Power Electronics and Design, 
Aug. 2001, pp. 129-134 
[35] Feig, E.; and Winograd, S., “Fast Algorithms for the Discrete Cosine Transform,” 
IEEE Transactions on Signal Processing, 40(9), September 1992, pp. 2174–2193 
[36] Figueras, J., “Modeling Power at Different Levels” in “Low Power Design in 
Deep Submicron Electronics,” Edited by Nebel, W.; and Mermet, J., Kluwer 
Academic Publishers, 1996 
[37] Flautner, K.; and Mudge, T., “Vertigo: Automatic Performance-setting for Linux,”  
the 5th symposium on Operating systems design and implementation, Boston, 
MA, Dec. 2002, USENIX, pp.105-116 
[38] Fogel, D. B., “What Is Evolutionary Computation?”  IEEE Spectrum, Vol. 37, No. 
2, February 2000, pp. 26-32 
[39] Forsyth, N. T.; Chambers, J. A.; Naylor, P. A., “Alternating fixed-point algorithm 
for stereophonic acoustic echo cancellation,” IEE proceedings. Vision, image and 
signal processing, Vol. 149 (1), pp.1-9 
[40] Friedman, E. G., “Clock Distribution Networks in Synchronous Digital Integrated 
Circuits,” Proceedings of IEEE, Vol.89(5), May 2001, pp.665-692 
[41] Gazor, S.; Liu, T., “Adaptive Filtering with Decorrelation for Coloured AR 
Environments,” IEE proceedings. Vision, image and signal processing, Vol. 152 
(6),  2005, pp.806-818 
[42] Ghurumuruhan, G.; and Prabhu, K. M. M., “Fixed-point Fast Hartley Transform 
Error Analysis,” Signal Processing, Vol.84, 2004 , pp.1307-1321  
[43] Gluth, R., “Regular FFT-Related Transform Kernels for DCT/DST-based 
Polyphase Filter Banks,” International Conference on Acoustics, Speech, and 
Signal Processing, Vol.3, 1991, pp.2205-2208  
[44] Gronowski, P. E.;  Bowhill, W. J.; Preston, R. P.; Gowan, M. K.; and Allmon, R. 
L., “High-Performance Microprocessor Design,” IEEE Journal of Solid-State 
Circuits, Vol. 33 (5), May 1998, pp.676-686 
[45] Grunwald, D.; Levis, P.; Farkas, K.; Morrey, C.; and Neufeld, M., “Policies for 
Dynamic Clock  Scheduling,” Symposium on Operating Systems Design and 
Implementation, Oct 2000, pp.6-6 
[46] Gutnik, V.; and Chandrakasan, A. P., “Embedded Power Supply for Low Power 
DSP,” IEEE Transactions on VLSI Systems, Vol.5, No.4, Dec, 1997, pp.425-435 
[47] Haid, J.; Schoegler, W.; and Manninger, M., “Design of an Energy-Aware system-
in-package for playing MP3 in Wearable Computing devices,” IEEE International 
Systems-on-Chip(SOC) conference, Austria, 2003, pp. 35- 38 
[48] Hicks, P.; Walnock, M.; and Owens, R. M., “Analysis of Power Consumption in 
Memory Hierarchies,” International Symposium on Low Power Electronics and 
Design, 1997, pp.239-242 
[49] Hu, Z.; and Wan, H., “A Novel Generic Fast Fourier Transform Pruning 
Technique and Complexity Analysis,” IEEE Transactions on Signal Processing, 
Vol. 53, No. 1, Jan. 2005, pp.274-282 
[50] Huang, W.; Wang ,Y.; and Chakraborty, S., “Power-Aware Bandwidth and 
Stereo-Image Scalable Audio Decoding,”  ACM Multimedia Conference, 
November 06-11, 2005, Hilton, Singapore, pp.291-294 
[51] Huang. W.; and Wang, Y., “Efficient Partial Spectrum Reconstruction using an 
Asymmetric PQMF Algorithm for MPEG-Coded Stereo Audio,” IEEE 
International Conference on Multimedia & Expo, July 9-12, 2006, Toronto, 
Canada, pp. 901 - 904 
[52] Huang, Y.; Chakraborty, S.; and Wang, Y., “Using Offline Bitstream Analysis for 
Power-Aware Video Decoding in Portable Devices,” ACM Multimedia 
Conference, November 06-11, 2005, Hilton, Singapore, pp. 299 - 302 
[53] Hughes, C. J.; Srinivasan, J.; and Adve, S. V., “Saving Energy with Architectural 
and Frequency Adaptations for Multimedia Applications,” 34th Annual 
International Symposium on Microarchitecture (MICRO), 2001, pp. 250- 261 
[54] Im, C.; Ha, S.; and Kim, H., “Dynamic Voltage Scheduling with Buffers for Low-
power Multimedia Applications,” ACM Transactions on Embedded Computing 
Systems, 3(4), 2004, pp. 686–705  
[55] Irani, S.; Shukla, S. K.; and Gupta, R., “Algorithms for Power Savings,” ACM-
SIAM symposium on Discrete algorithms, 2003, pp.37-46 
[56] Ishihara, T.; and Yasuura, H., “Voltage Scheduling Problem for Dynamically 
Variable Voltage Processors,” International Symposium on Low Power 
Electronics and Design, 1999, pp.197-202 
[57] ISO/IEC, “MPEG1 11172-3: Audio Coding ,” 2000 
[58] ISO/IEC, “MPEG2 13818-7: Advanced Audio Coding,” 2006 
[59] James, D. V., “Quantization Errors in the Fast Fourier Transform,” IEEE 
Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, No.3, 
June 1975, pp.277-283 
[60] Jochens, G.; Kruse, L.; Schmidt, E.; and Nebel, W.,  “A New Parameterizable 
Power Macro-model for Datapath Components,” Design, Automation and Test in 
Europe,  1999, pp.29-36 
[61] Kadayif, I.; Kandemir, M.; Chen, G.; Vijaykriishnan, N.; Irwin, M.J.; and 
Sivasubramaniam, M. J., “Compiler-Directed High-Level Energy Estimation and 
Optimization,” ACM Transactions on Embedded Computing Systems, Vol. 4, Nov, 
2005, pp.819-850  
[62] Keltcher, P.; Richardson, S.; and Siu, S., “An Equal Area Comparison of 
Embedded DRAM and SRAM Memory Architectures for a Chip Multiprocessor,” 
Technical Report HPL-2000-53, Computer Systems Technology HP Laboratories 
Palo Alto, Apr 2000 
[63] Keutzer, K.; Malik, S.; Newton, A. R.; Rabaey, J. M.; and Sangiovanni, V., 
“System Level Design: Orthogonalization of Concerns and Platform-based 
Design,” IEEE Transactions on Computer-Aided Design of Integrated Circuits 
and Systems, 19(12), December 2000, pp. 1523 - 1543 
[64] Lee, H. G.; and Chang, N., “Energy-aware Memory Allocation in Heterogeneous 
Non-volatile Memory Systems,” International Symposium on Low Power 
Electronics and Design, Aug. 2003, pp. 420-423 
[65] Lee, S.; Ermedahl, A.; and Min, S.L., “An Accurate Instruction-level Energy 
Consumption Model for Embedded RISC Processors,” ACM SIGPLAN Workshop 
on Languages, Compilers and Tools for Embedded Systems, 2001, pp.1-10 
[66] Liang, J.; and Tran, T. D., “Fast Multiplierless Approximations of the DCT with 
the Lifting Scheme,” IEEE Transactions on Signal Processing, Vol. 49, No. 12, 
Dec, 2001, pp.3032-3044  
[67] Lim, J.; Kim, M.; Kim, J.; and Kim, K., “Semantic Transcoding of Video based 
on Regions of Interest,” Visual Communications and Image Processing, 2003, 
pp.1232-1243 
[68] Liu, C. M.; Lee, W. C.; and Chien, C. T., “Bit Allocation for Advanced  Audio 
Coding using Bandwidth-Proportional Noise-Shaping Criterion,” Proc. of the 6th 
International Conference on Digital Audio Effects (DAFX-03), London, UK, 
September 8-11, 2003  
[69] Liu, J.; Chou, P. H.; Bagherzadeh, N.; and Kurdahi, F., “Power-Aware Scheduling 
under Timing Constraints for Mission-Critical Embedded Systems,”  Design 
Automation Conference, 2001, pp.840-845 
[70] Liu, J.; Shih, W.; Lin, K.; Bettati, R.; and Chung, J., “Imprecise Computations,” 
Proceedings of the IEEE, Vol. 82(1), Jan, 1994, pp.83-94 
[71] Liu, Y.; Maxiaguine, A.; Chakraborty, S.; and  Ooi, W. T.,  “Processor Frequency 
Selection for SoC Platforms for Multimedia Applications,”  IEEE Real-Time 
Systems Symposium, Lisbon,  December 2004, pp. 336-345 
[72] Lorch, J. R.; and Smith, A. J., “PACE: A New Approach to Dynamic Voltage 
Scaling,” IEEE Transactions on Computers, 53 (7), July, 2004, pp.856-869 
[73] Lu, Y. H.; Benini, L.; and De Micheli, G., “Dynamic Frequency Scaling with 
Buffer Insertion for Mixed Workloads,” IEEE Transactions on Computer-Aided 
Design of Integrated Circuits and Systems, 21(11), Nov. 2002 , pp.1284-1305 
[74] Lu, Z.; Lach, J.; Stan, M.; and Skadron, K., “Reducing Multimedia Decode Power 
using Feedback Control,” International Conference on Computer Design, San 
Jose, CA, Oct. 2003, pp. 489- 496 
[75] Lu, Z.; Lach, J.; Stan, M.; and Skadron, K., “Design and Implementation of an 
Energy Efficient Multimedia Playback System,” Asilomar Conference on Signals, 
Systems and Computers, 2006, pp.1491-1497 
[76] Luo, J.; and Jha, N. K., “Power-profile Driven Variable Voltage Scaling for 
Heterogeneous Distributed Real-time Embedded Systems,” International 
Conference on VLSI Design, 2003, pp.369–375 
[77] Markel, J. D., “FFT Pruning,” IEEE Transactions on Audio and Electroacoustics, 
Vol. AU-19, Dec. 1971, pp.305-311 
[78] Maxiaguine, A.; Zhu, Y.; Chakraborty, S.; and Wong, W.-F., “Tuning SoC 
Platforms for Multimedia Processing: Identifying Limits and Tradeoffs,” 
International Conference on Hardware/Software Codesign and System Synthesis, 
2004, pp.128-133 
[79] McMillan, L.; and Westover, L., “A Forward-Mapping Realization of the Inverse 
Discrete Cosine Transform,” IEEE Data Compression Conference,  March 24-27, 
1992,  pp. 219-228 
[80] Mesarina, M.; and Turner, Y., “Reduced Energy Decoding of MPEG Streams,” 
Multimedia Systems, 9(2), 2003, pp.202–213 
[81] Mock, T., “Music Everywhere,” IEEE Spectrum, Sep 2004, pp.42-47 
[82] Mohapatra, S.; Cornea, R.; Dutt, N.; Nicolau, A.; and Venkatasubramanian, N., 
“Integrated Power Management for Video Streaming to Mobile Handheld 
Devices,” ACM Multimedia Conference, Nov, 2003, pp.582-591 
[83] Montanaro, J.; et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” 
IEEE Journal of Solid-State Circuits, Vol.31 (11), Nov., 1996, pp.1703-1714 
[84] Mudge, T., “Power: A First-Class Architectural Design Constraint,” IEEE 
Computer, April 2001, pp.52-58 
[85] Nawab, S. H.; Oppenheim, A. V.; Chandrakasan, A. P.; Winograd, J. M.; Ludwig, 
J. T., “Approximate Signal Processing,” Journal of VLSI Signal Processing 
Systems, Vol. 15 (1-2), Jan 1997, pp.177-200  
[86] Nguyen, T. Q., “Partial Spectrum Reconstruction using Digital Filter Banks,” 
IEEE Transactions on Signal Processing, 41(9), 1993, pp.2778-2795 
[87] Nielsen, L. S.; Niessen, C.; Sparso, J.; and Van Berkel, K., “Low-power Operation 
using Self Timed Circuits and Adaptive Scaling of the Supply Voltage,” IEEE 
Transactions on VLSI Systems, 2(4), Dec. 1994, pp.391-397 
[88] Oppenheim, A.; and Weinstein, C. J., “Effects of Finite Register Length in Digital 
Filtering and the Fast Fourier Transform,” Proceedings of the IEEE, Vol. 60, No. 
8, Aug. 1972, pp.957-976 
[89] Oppenheim, A.; Nawab, H.; Verghese, G.; and Wornell, G., “Algorithms for 
Signal Processing,” 1st Rapid Prototyping of Application Specific Signal 
Processors (RASSP) Conference, 1994, pp.146-153 
[90] Pedram, M., “Design Technologies for Low Power VLSI,” Encyclopedia of 
Computer Science and Technology, 1995, pp.73-96 
[91] Pering, T.; Burd, T.; and Brodersen, R., “The Simulation and Evaluation of 
Dynamic Voltage Scaling Algorithms,” International Symposium on Low Power 
Electronics and Design, 1998, pp.76-81  
[92] Ponomarev, D.; Kucuk, G.; and Ghose, K., “Dynamic Allocation Of Datapath 
Resources For Low Power,” Workshop on Complexity-Effective Design, Jul, 2001, 
pp.90-102 
[93] Pouwelse, J.; Langendoen, K.; Lagendijk, I.; and Sips, H., “Power Aware Video 
Decoding,” the 22nd Picture Coding Symposium, Seoul, Korea, 2001, pp.303-306 
[94] Qu, G.; and Potkonjak, M., “Energy Minimization with Guaranteed Quality of 
Service,” International Symposium on Low Power Electronics and Design, 2000, 
pp.43-48 
[95] Raghunathan, V.; Pering, T.; Want, R.; Nguyen, A.; and Jensen, P., “Experience 
with A Low Power Wireless Mobile Computing Platform,” The International 
Symposium on Low Power Electronics and Design, Aug, 2004, pp.363-368 
[96] Rao, R.; and Vrudhula, S., “Energy Optimal Speed Control of Devices with 
Discrete Speed Sets,” Design Automation Conference, 2005, pp.901-904 
[97] Roberts, A. W.; and Varberg, D. E., “Convex Functions,” Academic Press, 1973 
[98] Roy, K., “Software Design for Low Power,” in “Low Power CMOS VLSI Circuit 
Design,” John Wiley & Sons, Inc, 2000 
[99] Roy, K.; and Prasad, S., “Low-power Static RAM Architectures,” in “Low Power 
CMOS VLSI Circuit Design,” John Wiley & Sons, Inc, 2000  
[100] Simunic, T.; Benini, L.; De Micheli, G.; and Hans, M., “Source Code 
Optimization and Profiling of Energy Consumption in Embedded Systems,”  
International Symposium on System Synthesis, 2000, pp.193-199 
[101] Sinevriotis, G.; Leventis, A.; Anastasiadou, D.; Stavroulopoulos, C.; 
Papadopoulos, T.; Antonakopoulos, T.; and Stouraitis, T., “SOFLOPO: Towards 
Systematic Software Exploitation for Low-Power Designs,” International 
Symposium on Low-Power Electronics and Design, 2000. 
[102] Shin, D.; Kim, J.; and Lee, S., “Intra-Task Voltage Scheduling for Low-Energy 
Hard Real-Time Applications,” IEEE Design & Test of Computers, Vol. 18, No.2, 
2001, pp.20-30 
[103] Steinke, S.; Knauer, M.; Wehmeyer, L.; and Marwedel, P., “An Accurate and Fine 
Grain Instruction-level Energy Model supporting Software Optimizations,” 
International Workshop on Power And Timing Modeling, Optimization and 
Simulation, 2001. 
[104] Vaidyanathan, P.P., “Multirate Systems and Filter Banks”,  Prentice-Hall, 1992 
[105] Wang, Y.; Huang, W.; and Korhonen, J., “A Framework for Robust and Scalable 
Audio Streaming,” ACM Multimedia Conference, 2004, pp.144-151 
[106] Wang, Z., “Pruning the Fast Discrete Cosine Transform,” IEEE Transactions on 
Communications, Vol.39, No.5, May 1991, pp.640-643 
[107] Weiser, M.; Welch, B.; Demers, A.; and Shenker, S., “Scheduling for Reduced 
CPU Energy,” Operating Systems Design and Implementation, 1994, pp.13-23 
[108] Yao, F.; Demers, A.; and Shenker, S., “A Scheduling Model for Reduced CPU 
Energy,” IEEE Annual Foundations of Computer Science, 1995, pp.374-382 
[109] Yuan, W.; and Nahrstedt, K., “Energy-efficient Soft Real-time CPU Scheduling 
for Mobile Multimedia Systems,” ACM Symposium on Operating Systems 
Principles, 2003, pp.149-163 
[110] Yuan, W.; and Nahrstedt, K., “Practical Voltage Scaling for Mobile Multimedia 
Devices,” ACM Multimedia Conference, Oct. 2004, pp. 924 - 931 
[111] Zheng, F.; Garg, N.; Sobti, S.; Zhang, C.; Joseph, R.; Krishnamurthy, A.; and 
Wang, R., “Considering the Energy Consumption of Mobile Storage 
Alternatives,” IEEE/ACM International Symposium on Modeling, Analysis and 
Simulation of Computer and Telecommunication Systems, Oct 2003, pp.36-45 
[112] http://www.audiocoding.com 
[113] http://tcpmp.corecodec.org/ 
[114] SimpleScalar/ARM: http://www.simplescalar.com/v4test.html 
[115] MAD (libmad):  http://sourceforge.net/projects/mad/ 
[116] Libavcodec: http://www.afterdawn.com/glossary/terms/libavcodec.cfm 
 
