Multimedia support on hand-held devices is growing rapidly.
INTRODUCTION
Advancement of digital media has brought about many changes in our society. Digital information content can be stored, processed, and transmitted in many digital file formats using digital systems. We can watch movies on our mobile devices, surf the Internet, and download content in different formats. Recently, there has been a rapidly growing demand for 3D feature films, especially in cinemas and for home entertainment units such as 3D HDTV.
Information formats for audio, text, images, and video can be integrated into a single file or a "stream format". This stream format enables the user to tailor it to a specific target such as compression, analysis, and decompression. This is made possible with standards from organizations such as ISO and MPEG.
Multiview and 3D videos have high image content that needs to be intensely processed during the compression or decompression of video frames. Processing and transmitting high video content requires extensive processing power and resources. Mobile devices have both memory and battery limitations. Some processors have low power consumption, making them suitable for running complex mobile applications or streaming video [1]. Nevertheless, powerful processors, as well as good-quality displays, are necessary for real-time decoding in mobile environments.
The new video standards use coding algorithms and processes similar to those of previous standards, but they achieve higher compression rates and demand more processing from hardware. Despite this need for processing, support for real-time video decoding is lacking, especially for handheld devices.
The aim of this paper is to incorporate SIMD (Single Instruction, Multiple Data) technology into the existing architecture of processors to optimize the multiview coding decoder. This work mainly focuses on the decoding of MBs (macroblocks), specifically IQ (Inverse Quantization) based on the chroma prediction method, including the quarter-pel filter (on MVC), using SIMD. Section 2 briefly discusses SIMD technology used on coprocessors, Section 3 describes the testing platform constraints, Section 4 elaborates on MB decoding, and the results of the profiled video are presented in the last section.
SIMD
Image processing applications consume a lot of memory and the computation requires enormous power. However, most mobile devices are limited in terms of battery duration, processing speed, and memory. ASIC can be seen as a superior option in terms of hardware acceleration and optimization of video programs [2] , but it has drawbacks. ASIC is program-specific, and any modification to the program requires the integrated chip to be changed. Additionally, because it is necessary to change the resolution to fit most handheld devices, ASIC designs are rather costly and inefficient.
Video codecs operate on large amounts of data, generally in 8-bit units. When such data are processed on a 32-bit microprocessor, some units are not utilized during computation, yet they still consume power. General processors take three clock cycles to execute a single instruction (fetching, decoding, and execution), with the exception of processors that make use of MIPS (Million Instructions Per Second).
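The under-utilization described above can be illustrated with a minimal sketch. It assumes 8-bit samples packed four to a 32-bit word and adds all four in one pass of ordinary integer arithmetic. The function names `add_u8x4` and `add_u8x4_scalar` are hypothetical and are not taken from the decoder; the sketch only shows the idea that a dedicated SIMD unit implements in hardware.

```c
#include <stdint.h>

/* Hypothetical helper: adds four packed 8-bit samples lane by lane
 * (each lane wraps modulo 256) using one 32-bit addition. */
uint32_t add_u8x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every byte; a carry cannot cross into the
     * neighbouring lane because each partial sum fits in 8 bits. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of each byte without propagating a carry. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* Scalar reference: one 8-bit addition per loop iteration, so three
 * quarters of the 32-bit datapath stay idle on every step. */
uint32_t add_u8x4_scalar(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        out |= (uint32_t)(uint8_t)(x + y) << (8 * lane);
    }
    return out;
}
```

Both functions return the same word for any input; the packed version simply does the work of the four loop iterations at once.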
SIMD-architecture-based processors are energy efficient for applications that support easy data parallelization during computation [3]. Figure 1 shows the architecture of an ARM Cortex-A8 coprocessor that supports an SIMD instruction set. The registers are treated as vectors for different data types ranging from 8 × 8-bit-wide vectors to 4 × 16-bit-wide vectors and 32 × 4-bit-wide vectors. Arithmetic operations such as addition, subtraction, bit shifting, multiplication, and division instructions [4] are supported by the coprocessor. SIMD accelerates detection algorithms [5] by scaling down the overall execution time of the algorithm. This is achieved by computing matrices in parallel. Using SIMD calls within interpreted languages provides the developer with the freedom to manipulate the internal data without much hardware monitoring, even though some compilers allow automated SIMD vectorization.
Most speech recognition systems rely on statistical models such as HMMs (Hidden Markov Models) to "perform acoustic modeling of speech recognition" [6]. To speed up the computation of intensive likelihood-based statistics, SIMD technology is used, and approximately 27% of the total duration is saved on LVCSR (Large Vocabulary Continuous Speech Recognition) systems. This is a clear advantage for systems that implement real-time speech recognition, because the power budget is not exhausted.
TESTING PLATFORM
The tests discussed in the Results section were performed using a Cortex-A8 processor with a 64- or 128-bit configurable bus architecture and a dedicated pipeline to execute SIMD instruction sets. The L1 cache is configurable to 16 KB or 32 KB, and the L2 cache can be configured between 0 KB and 1 MB. The L2 is efficient because it eliminates latency between direct memory access and the ALU (Arithmetic Logic Unit) during the data-fetch phase. The processor also contains a VFP (Vector Floating Point) coprocessor that supports single-precision add, multiply, divide, and square root operations. (More details are provided in the reference manual [7].) The SIMD technology provided on this processor architecture has 16 × 128-bit quad-word registers and 32 × 64-bit double-word registers. It supports 8-, 16-, 32-, and 64-bit signed and unsigned integer data types. The architecture includes 32-bit single-precision floating-point elements. Our target system operates at 800 MHz with 512 MB of memory, running a Linux OS release 2.6.32.9. The size of the compiled decoder is 6,546 KB.
The source code for the decoder we used is the JMVC software, written in C++ and compiled using the ARM tool chain to run on our target system. The software is written for high-performance computing machines. To run the decoder on the target board, a few adjustments had to be made to the source code. The software was first optimized by the compiler, without SIMD optimization, to increase the decoding speed and the efficiency of the decoder. The next section discusses the video sequences used and the profiling results of the sequences.
PROPOSED WORK
Real-time encoding/decoding is of paramount importance as the temporal and spatial resolutions of the views increase. Most researchers working on optimization of codecs based on H.264 have shown that the motion-compensation block consumes more than half of the total decoding time [8] [9] [10]. Hence, current research focuses on finding techniques and algorithms to reduce the data computation of MC (motion compensation) in the H.264/AVC standard [9] [11], including MP (Motion Prediction) and ME (Motion Estimation).
When optimizing software, profiling needs to be performed. Because MVC is backward compatible with H.264/MPEG-4 AVC, single-view videos were decoded first during profiling. Figure 2 depicts the payloads of the modules used by the decoder and their total contribution in MCPS (Millions of Cycles Per Second). The testing sequences are in single-view form. The MC (Motion Compensation) module consumes approximately 34-36%; the DB (Deblocking) filter consumes an average of 9-16%; and Entropy decoding (UVLC, Universal Variable Length Coding) consumes 6-8%. The modules that contribute less are the inverse transform and the reconstruction of frames. MC profiling was conducted at a granular level, and Table 1 shows the innermost modules that are also costly in terms of processing.
MOTION COMPENSATION (MVC)
Two or more cameras capturing the same scene would result in a large amount of redundant data from common frames from adjacent cameras or frames within the same view. The MVC scheme exploits temporal redundancies between successive images in each video, whereas inter-view prediction exploits the similarities between adjacent camera views [12]. Although significant improvements have been made, more research needs to be conducted because the goal of MVC is to support 3D video applications such as teleconferencing and to provide viewers with free-viewpoint video. The aim of MVC is to enable viewers to use the free-viewpoint option in real time [13], where the viewpoint and view direction can be changed interactively.
Recent solutions such as fast search algorithms within frames have been introduced [14], including a fast mode search during motion estimation to improve the prediction precision and the efficiency of coding based on multi-block and multi-reference frames [15], providing a great enhancement to the standard. To calculate vector prediction, the MBs of the inter-modes or intra-modes are assigned weights according to the cost incurred during motion estimation; the greater the cost, the more accurate the block is considered to be, and the more likely it is to be usable for estimation of the next sequence frame.
The decoder can be configured as a main or baseline profile for achieving a coding gain at the cost of computational power. The encoder configuration during the encoding of the sequences was set to three reference frames. Because MC is performed on inter-block coding, MBs are divided into squares with a minimum size of 4 × 4 pixels. To provide random access points to the views when decoded, the pictures contain intra-coded frames that are not encoded with prediction from other frames. Regardless of the type of the slice (I, P, or B), the MBs in the slice have to be decoded with the data in the header of that slice. This paper focuses on computing data in parallel for the chroma and luma prediction during the IQ (Inverse Quantization) stage of MB reconstruction [16]. The two components are calculated separately because the chroma and luma predictions are independent of each other in intra-frame prediction. Each component is based on the mode selected, and the component can be decoded for horizontal and vertical prediction.
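The horizontal and vertical prediction modes mentioned above can be sketched for a single 4 × 4 block as follows. This is a minimal illustration that follows the usual H.264 definition of these two intra modes, not the JMVC implementation; the function names and the `BLK` constant are hypothetical. Each row is written in one operation, which is the access pattern that maps naturally onto a vector store.

```c
#include <stdint.h>
#include <string.h>

enum { BLK = 4 };  /* 4x4 block, the minimum partition size in the text */

/* Vertical mode: every row repeats the reconstructed row above the block. */
void predict_vertical_4x4(const uint8_t top[BLK], uint8_t pred[BLK][BLK])
{
    for (int y = 0; y < BLK; ++y)
        memcpy(pred[y], top, BLK);      /* one 4-byte row copy */
}

/* Horizontal mode: every row is filled with the reconstructed pixel
 * immediately to the left of that row. */
void predict_horizontal_4x4(const uint8_t left[BLK], uint8_t pred[BLK][BLK])
{
    for (int y = 0; y < BLK; ++y)
        memset(pred[y], left[y], BLK);  /* broadcast one value across the row */
}
```

Because luma and chroma blocks are predicted independently, the same two routines could be applied to each component in turn.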
The rescaling of MBs with data from the slice header during the reconstruction of the frame is repeated for all MBs. A typical 4 × 4 MB reconstruction would loop four times. However, with SIMD and a suitable data type, be it 16 or 32 bits, the computation can be executed without looping. The same applies to 8 × 8 or 16 × 16 MB scaling, which can be achieved efficiently using the NEON instruction set.
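The difference between the looped and the vectorized rescaling can be sketched as follows. The sketch assumes a simplified rescaling model in which each coefficient is multiplied by a per-position scale factor and shifted left by a common amount, and it assumes that all results fit in 32 bits; the function names are hypothetical and the code is not taken from JMVC. The NEON path uses the `vmull_s16`, `vshlq_s32`, `vld1_s16`, `vdupq_n_s32`, and `vst1q_s32` intrinsics and is compiled only when the target supports NEON; otherwise a portable fallback mimics the four lanes.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __ARM_NEON
#include <arm_neon.h>   /* NEON intrinsics, only on targets that provide them */
#endif

/* Loop-based reference: one coefficient per iteration. */
void rescale_4x4_scalar(const int16_t coeff[16], const int16_t scale[16],
                        int shift, int32_t out[16])
{
    assert(shift >= 0 && shift < 15);
    for (int i = 0; i < 16; ++i)
        out[i] = (int32_t)coeff[i] * scale[i] * (1 << shift);
}

/* One row of four coefficients handled as a single vector operation. */
void rescale_row4(const int16_t *coeff, const int16_t *scale,
                  int shift, int32_t *out)
{
#ifdef __ARM_NEON
    /* Widening multiply: four 16-bit lanes become four 32-bit products. */
    int32x4_t wide = vmull_s16(vld1_s16(coeff), vld1_s16(scale));
    /* A positive per-lane count shifts every product left. */
    vst1q_s32(out, vshlq_s32(wide, vdupq_n_s32(shift)));
#else
    /* Portable fallback that mimics the four lanes. */
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = (int32_t)coeff[lane] * scale[lane] * (1 << shift);
#endif
}

/* 4x4 block: four row operations written out, with no per-coefficient loop. */
void rescale_4x4(const int16_t coeff[16], const int16_t scale[16],
                 int shift, int32_t out[16])
{
    assert(shift >= 0 && shift < 15);
    rescale_row4(coeff,      scale,      shift, out);
    rescale_row4(coeff + 4,  scale + 4,  shift, out + 4);
    rescale_row4(coeff + 8,  scale + 8,  shift, out + 8);
    rescale_row4(coeff + 12, scale + 12, shift, out + 12);
}
```

An 8 × 8 or 16 × 16 block could reuse `rescale_row4` on consecutive groups of four coefficients, or switch to eight-lane 16-bit vectors where the value range allows it.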
Chroma and Luma Prediction
Quarter-pel filter reconstructs the MB using the already mapped coefficients during the encoding period.
$CL = Pred[x, y] + Rec'[x, y]$
Figure 4 shows part of our implementation of the SIMD instruction set (VMUL, VPADD, and VSLHQ). The execution time was greatly reduced on the chroma prediction module because the submodules have 8- and 16-bit-long data. This implies that we can multiply a vector of unsigned 16 × 4 bits by a 16-bit scalar, and the result would be stored in a 32 × 4-bit-long vector.
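The widening multiply described above can be sketched in the context of chroma sample interpolation. The sketch uses the bilinear eighth-sample formula from the H.264 standard, ((8-dx)(8-dy)A + dx(8-dy)B + (8-dx)dy C + dx dy D + 32) >> 6, and multiplies four 16-bit samples by one 16-bit weight to obtain four 32-bit products. The function names are hypothetical, the code is not the implementation shown in Figure 4, and the NEON path (using `vmull_n_u16`) is compiled only when the target supports NEON.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __ARM_NEON
#include <arm_neon.h>   /* NEON intrinsics, only on targets that provide them */
#endif

/* Multiply four unsigned 16-bit lanes by one 16-bit scalar; the four
 * products are widened to 32 bits so that nothing overflows. */
void mul_u16x4_by_scalar(const uint16_t v[4], uint16_t k, uint32_t out[4])
{
#ifdef __ARM_NEON
    vst1q_u32(out, vmull_n_u16(vld1_u16(v), k));
#else
    /* Portable fallback that mimics the four lanes. */
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = (uint32_t)v[lane] * k;
#endif
}

/* Bilinear chroma interpolation for four neighbouring output pixels.
 * a, b are the top-left and top-right samples of each pixel, c, d the
 * bottom-left and bottom-right ones; (dx, dy) is the eighth-sample offset. */
void chroma_interp_row4(const uint16_t a[4], const uint16_t b[4],
                        const uint16_t c[4], const uint16_t d[4],
                        unsigned dx, unsigned dy, uint8_t out[4])
{
    uint32_t pa[4], pb[4], pc[4], pd[4];

    assert(dx < 8 && dy < 8);
    mul_u16x4_by_scalar(a, (uint16_t)((8 - dx) * (8 - dy)), pa);
    mul_u16x4_by_scalar(b, (uint16_t)(dx * (8 - dy)), pb);
    mul_u16x4_by_scalar(c, (uint16_t)((8 - dx) * dy), pc);
    mul_u16x4_by_scalar(d, (uint16_t)(dx * dy), pd);

    /* The four weights sum to 64, so 8-bit inputs give 8-bit outputs. */
    for (int i = 0; i < 4; ++i)
        out[i] = (uint8_t)((pa[i] + pb[i] + pc[i] + pd[i] + 32) >> 6);
}
```

With a zero offset the output equals the top-left samples, and with dx = dy = 4 each output is the rounded average of its four neighbours, which gives a quick sanity check for the lane arithmetic.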
RESULTS
We used the JMVC 8.5 (open-source) C++ decoder, optimized for the target system described in Section 3. An ARM tool chain, supplied along with d-stream as a cross-compiling tool for the decoder, was used. To enable optimization with the SIMD instruction set, the ARM tool chain provides the libraries needed for using intrinsic functions. Figure 6 displays the profiled video sequences using only three views; Figure 7 displays the profiled video sequences using eight different views. Figure 2 shows the profiling results for single-view videos. Table 3 summarizes the overall performance achieved using SIMD operations. The payload of the chroma prediction and quarter-pel filter, summed under the MC submodule, was reduced to nearly half of the initial overall payload when the optimization was implemented. Vectorization is necessary to gain much from optimization with the SIMD instruction set. In future research, we will focus on the power consumption when the MVC decoder is running and on the differences when the optimized decoder is executed.
It is notable that more complex algorithms that can be easily broken into vectors can be processed by processors with less computational power, as they can be processed without many hardware constraints. This is because the coprocessor uses the same hardware features as the main processor and has precedence over the L2 cache memory; hence, it does not depend on the main processor. The macroblock control and buffer load modules are fairly expensive, but these modules are directly linked to memory access for acquiring buffers during decoding, initializing the bitstream, controlling the reference frames, managing the POC order, and managing the number of slices in a group. The system's memory delay time affects these modules directly. The payloads of these modules can be seen in Figures 2, 6, and 7.
CONCLUSION
Simultaneous parallel execution of data is an advantage to image processing or video compression/decompression using MVC because a large amount of data can be processed without delay or much processing power.
Figures 6 and 7 show the IQ module consuming about 12-15% of the total payload on the decoder. This was reduced by 30-35%. From these results, it can be concluded that coprocessors are efficient and offer acceleration to multimedia applications running on portable devices. The implementation is made in the source code, and the required processor features should be enabled at run time. Figures 8 and 9 show the profiled results of the decoder on the target system in terms of memory, system clock cycles, and scheduler. The system does not require or use additional resources compared to when only the main processor is running. Thus, the amount of memory and the frequency remain the same throughout the duration of decoding.
