Abstract-Parallel structures may be used to increase a system's processing speed when handling large amounts of data or highly complex calculations. Dynamic Voltage and Frequency Scaling (DVFS) may be used during simpler calculations to decrease the system voltage or frequency and achieve lower power consumption. Combining these two mechanisms may lead to higher efficiency and lower power consumption. In this paper, we introduce a parallel decoding process with Digital Signal Processing (DSP) for power efficiency in a heterogeneous multi-core embedded system. We describe a parallel low-power design at the system level. While preserving the original decoding process, we manage the size of the system's multimedia buffer by considering the spontaneous streaming transfer, and we tune the scheduling time of the decoding process using DVFS in order to decrease the multimedia data dependency and achieve a multi-core embedded system with an accurate, low-power detection mechanism.
I. INTRODUCTION
Many current products employ embedded systems. The improved quality of commercial products and the demand for multimedia applications require an increasing number of data operations. Due to the demand for higher system frequency, newly proposed embedded hardware systems have begun using multi-core designs. These new architectures pose many challenges to developers:
1. Many embedded multimedia applications exhibit dependency problems because the decoding process refers to the previous segment. Developers of multi-core systems need to consider how to distribute data effectively to different cores for processing and how to avoid dependency problems.
2. Compared to single-core platforms, multi-core systems need a power management mechanism to prevent excess power usage, especially in embedded systems such as handheld and battery-powered devices. Dynamic Voltage and Frequency Scaling (DVFS) is a viable solution: it dynamically adjusts the system voltage or frequency while the computational load is low, which effectively decreases power consumption. The design challenge is to predict the appropriate system voltage or frequency for a running application process in order to achieve low power.
3. An important and realistic problem for a system developer is the overhead time that must be invested to turn a single-core system platform into a fully working multi-core system. This remains a major issue for developers and manufacturers.
In this paper, we introduce a front-wave parallel power management stream decoding system. We consider the entire system structure. While preserving the original single-core decoding process, we achieve parallel decoding with a simple yet effective concept: a buffer management mechanism that, within limits acceptable to the end user, removes time slack and the data loss caused by estimation error. We then combine parallel processing and buffer management to adjust both the system voltage and frequency according to parameters received from the two mechanisms.
In Section II, we introduce the DVFS system and review related proposals in the area of parallel structures and single core decoding procedures. In Section III, we introduce the front-end parallel DVFS system and describe system structure and module design. The implementation of the experimental platform and power efficiency prediction is given in Section IV. We conclude with Section V.
II. RELATED WORK AND BACKGROUND
We classify related proposals into two types: parallel decoding and the DVFS system.
A. Parallel Decoding
Many designs employ parallel decoding. If a complete decoding frame is used as a separation point, the proposed designs may be grouped into two main types: front-wave parallel processing and internal parallel processing.
Front-Wave Parallel Processing
In front-wave parallel processing, the decoding frame data is distributed in parallel first. The decoder is then used for decoding calculations, where one video segment is split by Group of Pictures (GoP) [1], [2] and each GoP is distributed to a processor for decoding. Flierl et al. [3] proposed a B frame parallel decoding method. The main concept is that B frames are not referenced by other frames and, hence, may be distributed to different processors for decoding. However, this method is not applicable to H.264 since B frames may be referenced by other frames in H.264 decoding.
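The GoP-based distribution described above can be sketched as follows. This is a minimal illustration, not the scheme of [1] or [2]: it assumes closed GoPs (each GoP starts at an I frame and references nothing outside itself), and the function names and the round-robin policy are hypothetical choices made for the example.

```python
from collections import defaultdict


def split_into_gops(frame_types):
    """Group frame indices into GoPs; a new GoP starts at every I frame."""
    gops = []
    for index, frame_type in enumerate(frame_types):
        if frame_type == "I" or not gops:
            gops.append([])
        gops[-1].append(index)
    return gops


def assign_gops(gops, num_cores):
    """Assign whole GoPs to decoder cores in round-robin order (assumed policy)."""
    assignment = defaultdict(list)
    for gop_index, gop in enumerate(gops):
        assignment[gop_index % num_cores].append(gop)
    return dict(assignment)


# Hypothetical frame-type sequence for a short stream.
frames = list("IPBBPBBIPBBIPP")
gops = split_into_gops(frames)
plan = assign_gops(gops, num_cores=2)
```

Because every GoP is handed over whole, no core needs a reference frame held by another core, which is what makes the split possible before decoding starts.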
Internal Parallel Processing
The front-wave processing structure completes splitting the data before delivering it to the system for processing. In contrast, internal parallel processing delivers the frame to the system and allows the system to perform the splitting. This approach has an advantage: letting the system do the process scheduling and planning may produce better parallel decoding efficiency [4], [5]. However, the drawback is usually that the entire decoding structure needs to be changed. There are H.264 decoding proposals that perform splitting according to slices [6], because slices are the smallest independent decoding units. Using slices to separate the decoding frames can produce good parallel decoding efficiency. Van der Tol et al. [7] achieve a parallel structure by separating each decoding procedure into various tasks and assigning different tasks to different decoders.
B. Dynamic Voltage and Frequency Scaling (DVFS)
For many commercial electronic systems, a good power manager is a necessity, especially for handheld or battery-powered devices. Various dynamic power management systems have been proposed [8], [11]. These management systems dynamically adjust the system voltage or frequency to complete a process before its deadline with the smallest possible power consumption. We calculate the power consumption of a processor built in CMOS technology as:

$$P = C V^2 f \qquad (1)$$

where $C$ is the effective switched capacitance, $V$ is the operating voltage, and $f$ is the operating frequency. $T_P$, the time needed to complete the process, is calculated as:

[...] (2)

According to the energy equation:

$$E = P \cdot T_P \qquad (3)$$

Hence, we can reduce the power consumption by adjusting the system voltage or frequency so that $T_P$ leaves no time slack, while ensuring that the process finishes before the deadline and the system does not lag while playing the multimedia data.
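A small worked example may help show why slowing down to just meet the deadline saves energy. It uses the CMOS dynamic power relation (power equals capacitance times voltage squared times frequency) and energy as power times completion time. The completion-time model (cycle count divided by frequency) is an assumption of this sketch, and the capacitance, cycle count, and operating points are hypothetical values, not measurements from the platform.

```python
def dynamic_power(capacitance, voltage, frequency):
    """CMOS dynamic power P = C * V^2 * f, in watts."""
    return capacitance * voltage ** 2 * frequency


def completion_time(cycles, frequency):
    """Assumed model: time to execute `cycles` clock cycles at `frequency` Hz."""
    return cycles / frequency


def energy(capacitance, voltage, frequency, cycles):
    """Energy E = P * T_P, in joules."""
    power = dynamic_power(capacitance, voltage, frequency)
    return power * completion_time(cycles, frequency)


C_EFF = 1e-9         # hypothetical effective switched capacitance (farads)
CYCLES = 10_000_000  # hypothetical cycles needed to decode one frame
DEADLINE = 1 / 30    # frame period at 30 fps (seconds)

# Two hypothetical operating points: (1.2 V, 600 MHz) and (1.0 V, 400 MHz).
fast = energy(C_EFF, 1.2, 600e6, CYCLES)
slow = energy(C_EFF, 1.0, 400e6, CYCLES)
saving = 1 - slow / fast
```

Under these assumptions both operating points finish within the frame period, but the slower one does so at a lower voltage. With a fixed cycle count the frequency cancels out of the energy, so the saving comes from the quadratic voltage term.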
III. PROPOSED PARALLEL DECODER STREAMING PROCESS
In this Section, we describe the parallel DVFS decoding system and its design. We also introduce a model for the parallel structure and the DVFS mechanism for stream processing.
A. System Overview
The diagram of the proposed system is shown in Fig. 1. As the data stream enters the heterogeneous multi-core platform, the Micro Processing Unit (MPU) takes the data stream and performs parallel scheduling. The DVFS decoding prediction is performed according to the video dependency and video format. In heterogeneous multi-core platforms, the decoding process is added to the video and audio parts of the digital signal processing (DSP) system. This proposal focuses on a platform with a single MPU and multiple DSP cores and addresses parallel decoding. We use the MPU to manage parallel planning together with the DVFS prediction and settings. Because of the front-wave process design, the DSP decoding need not be changed to implement the DVFS process.
Fig. 1 The architecture of the proposed system.
B. Parallel DVFS on Stream Decoding
In past proposals, the main approaches to decreasing energy consumption in multimedia decoding systems were: 1. reducing time slack and 2. correctly predicting the system voltage or frequency needed to process the next frame. To achieve these two goals, we use a simple yet applicable concept. To build the entire DVFS structure, we utilize two buffers: front-end and back-end. The front-end buffer enables the parallel mechanism and also readily eliminates time slack. We define the deadline as the time allotted for decoding each frame. We do not change the scheduling deadline and use an effective method to predict the system voltage or frequency. We then adjust the system voltage and frequency according to the predicted values and the priorities of decoding tasks.
This proposal combines an offline method and an online mechanism to decrease the prediction error rate. Before decoding begins, the system uses the DVFS model, the encoded frame format, the previous frame size, and the decoding time to determine the initial system voltage and frequency. The DSP end then dynamically adjusts the system voltage or frequency according to the time spent executing each function and the relevant information from the decoding process.
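The two-stage idea can be illustrated with a short sketch. It is only one possible reading of the description above: the level table, the per-frame-type offline profile, the progress measure, and the margin are all hypothetical, and the actual predictor may use different inputs and rules.

```python
LEVELS_MHZ = [200, 400, 600]  # hypothetical DVFS frequency levels

# Hypothetical offline profile: initial level index per encoded frame type.
OFFLINE_LEVEL = {"I": 2, "P": 1, "B": 0}


def initial_level(frame_type):
    """Offline step: choose the starting level from the frame format."""
    return OFFLINE_LEVEL[frame_type]


def adjust_level(level, elapsed, progress, deadline, margin=0.1):
    """Online step: compare the share of the deadline already used with the
    share of the frame already decoded (`progress`, between 0 and 1)."""
    time_used = elapsed / deadline
    if time_used > progress and level < len(LEVELS_MHZ) - 1:
        return level + 1  # behind schedule: move to a faster level
    if time_used < progress - margin and level > 0:
        return level - 1  # comfortably ahead: move to a slower level
    return level
```

For example, a P frame would start at level 1 under this profile; if half of the frame is decoded after 60% of the frame period has passed, the online step raises the level by one.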
C. DVFS Algorithm for Independent Frames
The system voltage should be chosen to achieve a low-power system. In embedded systems, the DVFS hardware module usually provides several set voltages and respective frequencies to allow the developer to use software tools to control the system voltage or frequency. Let $\upsilon = \{V_1, \ldots, V_n\}$ be the adjustable voltages provided by the hardware and [...] is the respective [...] to the worst-case [...] (5)

In an ideal situation, the predicted workload satisfies the applied deadline. If the system frequency cannot satisfy T [...] = T [...], time slack occurs. We propose here a time-oriented prediction that differs from the worst-case prediction rule. It is based on finding the closest T [...] frequency given by the inequality:

|T [...] T [...]| [...] |T [...] T [...]| && |T [...] T [...]|. (6)

This prevents estimation errors and the next time segment may be estimated as:

[...] (7)

We then use T [...] to estimate the system voltage and frequency of the next time segment.
D. DVFS Algorithm for Dependent Frames
In this Section, we discuss the immediate data dependency issue in order to set a suitable system voltage and frequency to reduce system delay. Since the encryption decoding process, which involves changing and re-ordering data in one frame, has no data dependency issues with other frames, the entire system may begin decoding when the third column of the reference frame completes decoding, which ensures that the reference part is entirely decoded and avoids data misses. Hence, we only need to ensure that the column number remains above 3, as shown in:

[...] (8)

where [...] ED is the time spent for encryption decoding, $T_{ED}$ is the reference time needed for encryption decoding, and $T_C$ is the time needed to decode the first three columns of the frame. We use (8) and, based on (2), the frame defined by [...], and the system average priority to determine $T_{ED}$ and $T_C$:

[...]

where $C$ is the frame column number that can determine the system voltage and frequency needed by the decoding frame in the encryption decoding. In ideal circumstances, this can correct the data dependency issue. However, in realistic decoding schemes, various frame format decoding schemes have different speeds.
IV. IMPLEMENTATION AND ANALYSIS
A. Implementation of the DVFS Algorithm on a Heterogeneous Multicore Platform
The proposed system employs as the hardware platform the Parallel Architecture Core (PAC) Duo developed by the Industrial Technology Research Institute (ITRI), Taiwan. It implements the proposed power efficiency perceptive system on the Android OpenCORE. The system structure and operating procedure are shown in Fig. 3. The system operates from the upper application layer Android Package (APK), which calls the OpenCORE multimedia framework to perform video playback. The OpenCORE is responsible for coordinating the DSP for processing. It employs the DVFS predictor decoder load, which is based on the previous frame size and decoded time, to perform prediction of the appropriate DVFS level. It then uses the I/O controller to transfer data to the DSP Power Management Driver and performs DSP voltage and frequency control for the DVFS controller to achieve coordination. Finally, the OpenCORE activates the DSP to perform video decoding. All bit streams are 30 fps, with Common Intermediate Format (CIF) resolution, for a total of 300 frames.
Fig. 3 The Android system structure and procedure.
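The prediction step that maps the previous frame size and decoded time to a DVFS level could look like the sketch below. It is not the predictor used on the PAC Duo platform: the operating points are hypothetical, the linear scaling of cycles with frame size is an assumption, and the sketch further assumes that the size of the next frame is already known from the buffered stream.

```python
FRAME_DEADLINE = 1 / 30  # seconds per frame at 30 fps

# Hypothetical DVFS operating points: (frequency in MHz, voltage in V).
LEVELS = [(200, 0.9), (400, 1.0), (600, 1.2)]


def predict_cycles(prev_size, prev_time, prev_mhz, next_size):
    """Estimate the cycles for the next frame by scaling the previous frame's
    cycle count with the ratio of frame sizes (a simple linear assumption)."""
    prev_cycles = prev_time * prev_mhz * 1e6
    return prev_cycles * next_size / prev_size


def choose_level(cycles, deadline=FRAME_DEADLINE):
    """Return the slowest operating point that still meets the deadline,
    or the fastest one if none does."""
    for mhz, volts in LEVELS:
        if cycles / (mhz * 1e6) <= deadline:
            return mhz, volts
    return LEVELS[-1]


# Previous frame: 4000 bytes decoded in 20 ms at 400 MHz; next frame: 6000 bytes.
cycles = predict_cycles(prev_size=4000, prev_time=0.020, prev_mhz=400, next_size=6000)
level = choose_level(cycles)
```

Choosing the slowest level that still fits the frame period is what removes time slack; falling back to the fastest level when nothing fits is the point at which a deadline miss becomes possible.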
B. Bitrate Effects on Energy Consumption
Different bit rates affect the decoding: the higher the bit rate, the larger the frame size. Hence, the system needs a higher frequency to complete decoding. We tested bit rates of 200 kbps, 400 kbps, and 600 kbps and compared their energy consumption to the baseline without the proposed mechanism, as shown in Fig. 4. As predicted, the energy consumption increases with the bit rate. However, with the designed prediction module, the energy consumption at all three bit rates is still between 36.2% and 41.9% lower than the baseline.
C. Deadline Miss Analysis
A deadline miss occurs when a frame does not complete decoding before the deadline. It may be seen as the marker for tuning the DVFS algorithm. With the proposed prediction module, the level of deadline misses varies between bit streams. When the module predicts a high DSP load, less energy is saved even though the deadline miss ratio is small. For example, in the case of the news bit stream, the energy consumption is the highest even though the deadline miss ratio is the smallest. However, when the deadline miss error rates are taken into account, as shown in Fig. 5, the error for bit streams other than news stays within 5%.
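The deadline miss ratio defined above can be computed directly from per-frame decode times. The sketch below assumes the deadline is the frame period at 30 fps, matching the test streams; the function name and the sample times are hypothetical.

```python
def deadline_miss_ratio(decode_times, fps=30):
    """Fraction of frames whose decode time exceeds the frame period."""
    if not decode_times:
        return 0.0
    deadline = 1 / fps
    misses = sum(1 for t in decode_times if t > deadline)
    return misses / len(decode_times)


# Hypothetical per-frame decode times in seconds.
times = [0.030, 0.035, 0.020, 0.040]
ratio = deadline_miss_ratio(times)
```

Tracking this ratio alongside the energy saved gives the trade-off used for tuning: a more conservative prediction lowers the ratio but leaves less energy to save.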
V. CONCLUSIONS
In this paper, we introduced a parallel decoder streaming process for power efficiency perception in a multi-core embedded system. It combines multi-core scheduling with a DVFS mechanism to provide a highly efficient, energy-saving multimedia decoding mechanism. The DVFS mechanism decreases the system power usage through scheduling and correcting calculations and resolves the multimedia data dependency issues. This mechanism was implemented on the Android system. We also analyzed the effectiveness of the developed platform. The experimental results show a decrease of 36.2% to 41.9% in energy consumption. The proposed framework provides a new approach for integrating power-efficiency-oriented mechanisms by tuning the system voltage or frequency in multi-core embedded systems.
