Abstract
The high computing power, low power consumption, and integrated peripherals of the latest generation of multicore digital signal processors (DSPs) will allow them to be embedded in the next generation of smart cameras. Such DSPs allow designers to evolve the vision landscape and simplify the developer's task of running more complex image and video processing applications without relying on a separate personal computer. This paper explains how the computing power of the multicore DSP TMS320C6472 is exploited to implement a real-time H264/AVC video encoder. This work can be considered a milestone toward the implementation of the new High Efficiency Video Coding standard (HEVC-H265). To improve the encoding speed, an enhanced Frame Level Parallelism (FLP) approach is presented and implemented. A fully functional real-time video demo is given, taking into account video capture and bitstream storage. Experimental results show how we efficiently exploit the potential and the features of the multicore platform without inducing video quality degradation in terms of PSNR or bitrate increase. The enhanced FLP using five DSP cores achieves an average speedup factor of more than four compared to a mono-core implementation for Common Intermediate Format (CIF 352 × 288), Standard Definition (SD 720 × 480), and High Definition (HD 1280 × 720) resolutions. This optimized implementation meets the real-time constraint by reaching an encoding speed of 99 f/s (frames/second) and 30 f/s for CIF and SD resolutions respectively, and saves up to 77 % of encoding time for HD resolution.
Introduction
Nowadays, smart cameras and machine vision solutions [1], [2] need to run complex image and video processing applications on growing amounts of data while meeting hard real-time constraints. New programmable processor technologies, such as multicore digital signal processors (DSPs) and embedded heterogeneous systems (ARM-DSP [3], DSP-FPGA, ARM-FPGA), offer a very promising solution for these applications, which require high computing performance. They are characterized by a high processing frequency with low power consumption compared to general purpose processors (GPPs) or graphics processing units (GPUs). Several manufacturers [4], such as Freescale [5] and Texas Instruments (TI) [6], address the challenges of smart cameras with their high performance multicore DSP processors. By exploiting these embedded technologies, smart cameras are changing the vision landscape and pushing developers to run several applications without the need for a connected PC. In the area of video applications, compression is one of the main tasks of a smart camera or machine vision system, alongside other tasks such as object detection, tracking, and recognition. Commercial encoding IPs deliver real-time performance but lack flexibility: they cannot be upgraded to follow the latest protocol enhancements and the latest advances in video compression. Even though a new video coding standard, namely H265/HEVC, has appeared on the market, several smart cameras still integrate older video coding standards such as Motion JPEG or MPEG4.
Digital signal processors offer the software flexibility that is needed for upgradability. They allow us to build highly flexible and scalable cameras that can follow the latest advances in video compression. Encoder parameters can also be finely tuned depending on the application's requirements. DSPs are also characterized by relatively low software development cost and a shorter time to market compared to ASIC development or FPGA implementations, which require considerable VHDL expertise that may not be compatible with time-to-market constraints.
In this context, TI's high performance multicore DSP processor TMS320C6472 is used to achieve a real-time implementation of the H264/AVC [7] video encoder. This work will be our starting point for the new video standard HEVC [8]. Indeed, since the HEVC encoder adopts the majority of H264/AVC features (GOP, frame, and slice structures), our proposed approach will also benefit a future H265/HEVC implementation.
The H264/AVC encoder is characterized by high coding efficiency compared to previous standards. However, this efficiency is accompanied by a high computational complexity that requires high performance processing capability to satisfy the real-time constraint (25-30 f/s) for iterative coding. When moving to high resolutions, encoding time increases drastically. The frequency limitation of embedded mono-core processors makes it hard to achieve real-time encoding, especially for HD resolutions. Using parallel and multicore architectures is therefore crucial to reduce the processing time of the H264/AVC encoder.
Several works have been published exploiting the potential parallelism of the H264/AVC standard by applying functional partitioning algorithms, data partitioning algorithms, or both. Multi-processor, multicore, and multithreading encoding systems and parallel algorithms have been discussed in many papers [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28]. This paper presents the frame level parallelism (FLP) approach and describes its complete implementation for the H.264/AVC encoder using a multicore DSP TMS320C6472. The main purpose of our work is to design a smart camera video encoder based on the H264/AVC standard and implemented on a six-core DSP with each core running at 700 MHz. This encoder should be able to process SD video sequences (720 × 480) in real time at 25 f/s. The maximum quality distortion should not exceed 1.5 dB in terms of PSNR. The maximum power consumption is 10 W. This smart camera should operate at temperatures around 50 °C and use 220 V AC/12 V DC @ 50/60 Hz as its power source.
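To make the SD target above concrete, the following minimal sketch converts the stated requirements (720 × 480 at 25 f/s, 700 MHz per core) into a per-frame time budget and a per-macroblock cycle budget. The function names are illustrative, the 16 × 16 macroblock size comes from the H.264 standard, and the multi-core figure assumes a perfectly even workload split, which is an idealization rather than a measured result.

```python
# Back-of-the-envelope real-time budget for the SD target.
# Clock, resolution and frame rate are taken from the text; an even split of
# the load across cores is an idealizing assumption.

CORE_HZ = 700_000_000      # clock of one DSP core
TARGET_FPS = 25            # real-time target for SD
WIDTH, HEIGHT = 720, 480   # SD resolution
MB_SIZE = 16               # H.264 macroblock side in luma pixels


def macroblocks_per_frame(width, height, mb=MB_SIZE):
    """Number of macroblocks in a frame, rounding partial blocks up."""
    return ((width + mb - 1) // mb) * ((height + mb - 1) // mb)


def cycle_budget_per_mb(core_hz, fps, width, height, cores=1):
    """Cycles available per macroblock if `cores` cores share the load evenly."""
    mbs_per_second = macroblocks_per_frame(width, height) * fps
    return cores * core_hz / mbs_per_second


mbs = macroblocks_per_frame(WIDTH, HEIGHT)
frame_ms = 1000 / TARGET_FPS
single = cycle_budget_per_mb(CORE_HZ, TARGET_FPS, WIDTH, HEIGHT)
five = cycle_budget_per_mb(CORE_HZ, TARGET_FPS, WIDTH, HEIGHT, cores=5)

print(f"{mbs} MBs/frame, {frame_ms:.0f} ms/frame")
print(f"~{single:.0f} cycles/MB on one core, ~{five:.0f} with five ideal cores")
```

A single core therefore has roughly twenty thousand cycles to spend on each macroblock at this resolution, which gives a sense of why a multicore mapping is needed.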
The remainder of this paper is organized as follows. The next section provides an overview of data dependencies and parallelism in the H.264/AVC standard. Section 3 details the related works on parallel implementations of the H264/AVC encoder. The internal architecture of the multicore DSP TMS320C6472 is described in Sect. 4. Section 5 presents our optimized implementation of the H264/AVC encoder on a single DSP core. Section 6 focuses on the FLP algorithm implementation on five DSP cores. It details the whole coding chain (image capture, bitstream transfers) and gives experimental results. The best approach, based on the enhanced FLP, is detailed in Sect. 7, which also includes experimental results. Finally, Sect. 8 concludes this paper and presents some perspectives.
Overview of data dependencies and parallelism in H264/AVC encoder
H.264/AVC is a video compression standard used to reduce the amount of video data, in order to overcome transmission bandwidth limitations and the huge memory requirements for storing high definition video sequences. This standard defines several profiles, such as the baseline profile on which we focus our work. The encoder performs several functions to generate the compressed bitstream corresponding to the input video, as shown in Fig. 1. The standard divides a video sequence into a hierarchical structure with six levels, as shown in Fig. 2. The top level of this structure is the sequence, which contains one or more groups of pictures (GOPs). Each GOP is composed of one or more frames. Finally, the frames are divided into one or more independent slices, which are themselves subdivided into macroblocks (MBs) of 16 × 16 pixels and down to blocks of 4 × 4 pixels for different prediction modes.
Each MB undergoes two prediction types: (1) intra prediction, which consists of performing intra 16 × 16 and intra 4 × 4 prediction modes to reduce spatial redundancies in the current frame; (2) inter prediction, which consists of determining the motion vector of the current MB relative to its position in the reference frames. It includes seven prediction modes to reduce the temporal redundancies that exist among successive frames. A mode decision is then performed to select the best prediction mode based on a distortion cost. Integer transform and quantization are applied to the residual MB, obtained by subtracting the best predicted MB from the source MB, to keep only the most significant coefficients. Entropy coding is finally performed to generate the compressed bitstream. A decoding chain is included in the encoder structure to produce the reconstructed frame, which is filtered by a deblocking filter to eliminate artifacts. The reconstructed frame is then used as a reference for motion estimation in the next frames.
According to functional organization and hierarchical sequence structure in H.264/AVC encoder, there are mainly two partitioning families:
Task-level parallelization (TLP) or functional partitioning: it consists of splitting the encoder into several steps, arranging them into groups of tasks equal in number to the threads available on the system, and running these groups of tasks simultaneously as a pipeline. Thus, the functions that can be grouped together to be processed in parallel, and those that must be executed serially to respect data dependencies, should be carefully chosen. Also, the computational complexity of each task should be taken into consideration to maximize the encoding gain and ensure a balanced workload between the parallel tasks. Finally, when grouping functions, synchronization overhead should be minimized as much as possible by eliminating data dependencies between the different function blocks. For example, intra prediction modes (13 modes) and inter prediction modes (seven modes) can be processed in parallel because no dependencies exist among them. On the other hand, integer transform, quantization, and entropy coding have to be processed serially given the dependencies among them.
Data-level parallelization (DLP) or data partitioning: it exploits the hierarchical data structure of the H264/AVC encoder by simultaneously processing several data units on multiple processing units. DLP is limited by data dependencies among different data units.
For H264/AVC encoder, there are two major types of data dependencies:
Spatial dependencies: they exist among macroblocks within the frame being encoded. To perform intra prediction, motion vector prediction, and reconstructed MB filtering for the current MB, data are required from its already encoded neighboring MBs (left, top left, top, and top right), as shown in Fig. 3. So, the current MB can be encoded only once its neighboring MBs have been encoded.
Temporal dependency: to determine the motion vector of the current MB relative to its position in the previously encoded frames, a motion estimation (ME) algorithm such as MB matching is performed. To reduce the computational complexity, the search for the corresponding MB is restricted to a specific area of the reference frames (the previously encoded frames), called the "search window", instead of scanning the whole frame. Consequently, a partial dependency among MBs of successive frames is imposed and limited to the search windows.
As data partitioning is restricted by these data dependencies, several points can be noted. No dependencies exist among different GOPs, because each GOP starts with an intra frame "I" in which only intra prediction is performed, so dependencies exist only among MBs of the same frame. The remaining frames of the GOP are predicted frames "P", in which both intra and inter prediction are performed. Hence, several GOPs can be encoded in parallel. This method is called GOP Level Parallelism [14]. A partial dependency exists among successive frames of the same GOP due to motion estimation in the search window. Thus, multiple frames can also be encoded in parallel once the search window has been encoded; this method is called Frame Level Parallelism [12]. When a frame is divided into independent slices, several slices can be processed in parallel; this approach is called slice level parallelism [13]. Finally, within the frame itself, several MBs can be encoded in parallel once their neighboring MBs have been encoded; this scheme is called MB level Parallelism [12].
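The spatial dependency pattern described above (left, top left, top, and top right neighbors) leads to the classic wave-front ordering used for MB level parallelism. The following minimal sketch assumes every MB takes one time step and that an unlimited number of processing units is available; the function names are illustrative and are not taken from any encoder discussed here.

```python
# Wave-front ordering for MB level parallelism.
# Assumptions: each MB takes one time step, and an MB may start as soon as its
# left, top-left, top and top-right neighbors are finished.


def wavefront_step(mb_x, mb_y):
    """Earliest time step at which MB (mb_x, mb_y) can be encoded.

    The top-right neighbor (mb_x + 1, mb_y - 1) is the binding constraint,
    which shifts each MB row by two steps relative to the row above it.
    """
    return mb_x + 2 * mb_y


def wavefront_schedule(mb_cols, mb_rows):
    """Group all MBs of a frame by the time step at which they can run."""
    steps = {}
    for y in range(mb_rows):
        for x in range(mb_cols):
            steps.setdefault(wavefront_step(x, y), []).append((x, y))
    return steps


def dependencies_respected(mb_cols, mb_rows):
    """Check that every existing neighbor is scheduled strictly earlier."""
    for y in range(mb_rows):
        for x in range(mb_cols):
            neighbors = ((x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1))
            for nx, ny in neighbors:
                inside = 0 <= nx < mb_cols and 0 <= ny < mb_rows
                if inside and wavefront_step(nx, ny) >= wavefront_step(x, y):
                    return False
    return True


# CIF frame: 352 x 288 pixels = 22 x 18 macroblocks.
schedule = wavefront_schedule(22, 18)
total_steps = len(schedule)
max_parallel = max(len(mbs) for mbs in schedule.values())
print(total_steps, max_parallel, dependencies_respected(22, 18))
```

Under these assumptions, a CIF frame of 396 MBs needs 56 steps and never exposes more than 11 independent MBs at once, which illustrates why MB level parallelism alone scales poorly at low resolutions.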
Related works
To overcome the high complexity of the H264/AVC encoder and the frequency limitation of mono-core processors, much research has been conducted on the parallelization of the H264/AVC encoder, in order to meet the real-time constraint and achieve a good encoding speedup, which is defined by the following equation.
\[ \text{Speedup} = \frac{\text{Time of sequential encoding}}{\text{Time of parallel encoding}} \tag{1} \]
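As a small illustration of Eq. (1), the sketch below relates the speedup factor to the fraction of encoding time saved, since both metrics are quoted in this paper. The helper names are illustrative; the only figures taken from the text are the 77 % time saving reported for HD and the five cores used by the enhanced FLP.

```python
# Relating speedup (Eq. 1) to the fraction of encoding time saved.


def speedup(t_sequential, t_parallel):
    """Eq. (1): sequential encoding time divided by parallel encoding time."""
    return t_sequential / t_parallel


def time_saving(speedup_factor):
    """Fraction of the sequential encoding time that is saved."""
    return 1.0 - 1.0 / speedup_factor


def speedup_from_saving(saving):
    """Inverse relation: speedup implied by a given time saving."""
    return 1.0 / (1.0 - saving)


def parallel_efficiency(speedup_factor, cores):
    """Speedup normalized by the number of cores (1.0 would be ideal)."""
    return speedup_factor / cores


hd_speedup = speedup_from_saving(0.77)           # 77 % saving reported for HD
hd_efficiency = parallel_efficiency(hd_speedup, 5)
print(f"speedup ~{hd_speedup:.2f}, efficiency ~{hd_efficiency:.2f}")
```

A 77 % time saving thus corresponds to a speedup of about 4.35, consistent with the "more than four" average factor quoted for five cores.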
Several implementations exploiting multi-thread, multiprocessor and multicore architectures are discussed in many papers.
Zhibin Xiao et al. [9] exploited the task level parallelism approach. They partitioned and mapped the dataflow of the H.264/AVC encoder onto a 167-core asynchronous array of simple processors (AsAP) computation platform coupled with two shared memories and a hardware accelerator for motion estimation. They processed the luminance and the chrominance components in parallel. Intra 4 × 4 modes and intra 16 × 16 modes are calculated in parallel. Only three modes for intra 4 × 4 (instead of nine) and three modes for intra 16 × 16 are considered, to reduce the top right dependency. Eight processors are used for the ICT transform and quantization modules and 17 processors for CAVLC. Despite all these hardware resources, a real-time implementation is not achieved: the presented encoder is capable of encoding VGA (640 × 480) video sequences at 21 f/s. Moreover, reducing the number of candidate modes for intra 4 × 4 and intra 16 × 16 induces visual quality degradation and a bitrate increase.
Sun et al. [10] implemented a parallel algorithm for the H.264/AVC encoder based on MB region partition (MBRP). They split the frame into several MB regions composed of adjoining columns of MBs. Then, they mapped the MB regions onto different processors to be encoded while satisfying data dependencies in the same MB row. Simulation results on four processors running at 1.7 GHz show that the proposed partitioning achieves a speedup factor of 3.33 without any rate distortion (quality, bitrate) compared to the H264/AVC reference software JM10.2 [11]. On the other hand, they are still far from a real-time implementation, which requires at least 25 f/s. This implementation encodes only one frame every 1.67 s for CIF resolution and one frame every 6.73 s for SD resolution.
Zhao et al. [12] proposed a new wave-front parallelization method for the H.264/AVC encoder. They mixed two partitioning methods: MB row level parallelism and frame level parallelism. All MBs in the same MB row are processed by the same processor or thread to reduce data exchanges among processors. MBs in different frames can be processed concurrently if the reconstructed MBs in the reference frame forming the search window are all available. This approach is implemented using JM 9.0 on a Pentium 4 processor running at 2.8 GHz. Simulations on four processors show that the obtained speedup is about three (3.17 for QCIF resolution (Quarter CIF 176 × 144) and 3.08 for CIF). Encoding quality remains the same as with the original JM 9.0 software. On the other hand, the run-time is still far from real time: only one frame every 1.72 s is encoded for CIF resolution.
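The availability condition that governs frame level parallelism can be expressed as a simple check on MB rows. The sketch below is not the implementation of [12]; it is a minimal illustration that assumes a vertical search range of ±16 pixels by default (an assumed parameter) and ignores any extra rows that sub-pixel interpolation or deblocking might require. All names are hypothetical.

```python
# When may an MB row of the current frame start, given the progress made on
# the reference frame? Rows are counted in 16-pixel macroblock units.

MB_SIZE = 16


def last_ref_row_needed(cur_row, mb_rows, search_range=16):
    """Index of the lowest reference-frame MB row touched by the search window."""
    extra_rows = -(-search_range // MB_SIZE)   # ceiling division
    return min(cur_row + extra_rows, mb_rows - 1)


def can_encode_row(cur_row, ref_rows_done, mb_rows, search_range=16):
    """True if reference rows 0 .. ref_rows_done - 1 cover the search window."""
    return ref_rows_done > last_ref_row_needed(cur_row, mb_rows, search_range)


# CIF has 18 MB rows. With a +/-16 pixel range, row 0 of the current frame
# needs rows 0 and 1 of the reference frame to be reconstructed.
print(can_encode_row(0, ref_rows_done=1, mb_rows=18))
print(can_encode_row(0, ref_rows_done=2, mb_rows=18))
```

With this rule, a core working on the next frame only needs to stay a couple of MB rows behind the core that produces its reference frame, rather than waiting for the whole frame.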
Yen-Kuang et al. [13] parallelized the H.264/AVC encoder by exploiting thread-level parallelism using the OpenMP programming model. Slice level partitioning is performed on four Intel Xeon processors with Hyper-Threading Technology. Results show speedups ranging from 3.74 to 4.53. The drawback of slice parallelism is that it affects the rate distortion performance: it causes PSNR degradation and a significant increase in bitrate, especially when the frame is decomposed into several independent slices. Sankaraiah et al. [14, 15] applied GOP level parallelism using a multithreading algorithm to avoid data dependencies. Each GOP is handled by a separate thread. Frames in each GOP are encoded by two threads: I and P frames by the first thread and B frames by the second thread. The speedups obtained using dual-core and quad-core processors are 5.6 and 10 respectively. The drawback of GOP level parallelism is its very high encoding latency, which is not compatible with video conference applications.
Rodriguez et al. [16] proposed an implementation of the H.264/AVC encoder using GOP level parallelism combined with slice level parallelism on clustered workstations using the Message Passing Interface (MPI). The first approach speeds up the processing but introduces a high latency; the second approach is used to reduce this latency by dividing each frame into several slices and distributing these slices to computers belonging to a subgroup of the cluster. With this technique, the encoding latency is relatively reduced. However, increasing the number of slices per frame has significant adverse effects on the rate distortion (bitrate increase). Also, clustered workstations are a costly solution and are not intended for embedded applications.
Shenggang Chen et al. [17] presented an implementation of an on-chip parallel H.264/AVC encoder on a hierarchical 64-core DSP platform. This platform consists of 16 super nodes (four DSP cores per node). A 2D wave-front algorithm for macroblock level parallelism is used, and one macroblock is assigned to one super node. Subtasks for encoding one macroblock, such as motion estimation, intra prediction, and mode decision, are further parallelized to keep the four DSP cores of a node busy. Speedup factors of 13, 24, 26 and 49 are achieved for QCIF, SIF (352 × 240), CIF and HD sequences respectively. The proposed wave-front parallel algorithm does not introduce any quality loss; however, the CABAC-based bitrate estimation and the parallel CABAC evolutional entropy coder cause a bitrate increase. Real-time performance is not reported in this paper.
Yang et al. [18] implemented the H264/AVC encoder on the dual-core DSP processor ADSP-BF561 using functional partitioning. Core A of the BF561 processor is dedicated to mode decision, intra prediction, motion compensation, integer transform (IT), quantization, de-quantization, inverse integer transform, and entropy encoding. Core B is assigned to in-loop filtering, boundary extension, and half-pel interpolation. Core A and core B execute tasks in two pipeline stages. The proposed encoder system achieves real-time encoding for CIF resolution but not for higher resolutions (VGA, SD and HD).
Zrida et al. [19] presented a parallelization approach for embedded Systems on Chip (SoCs). It is based on the exploration of task and data level parallelism, the parallel Kahn process network (KPN) model of computation, and the YAPI C++ run-time programming library. The SoC platform used is based on four MIPS processors. Simulation results show that a speedup of 3.6 is achieved for QCIF format, but real-time encoding is not reached even for low resolutions (7.7 f/s for QCIF format).
António Rodrigues et al. [20] implemented the H264/AVC encoder on a 32-core Non-Uniform Memory Access (NUMA) computational architecture, with eight AMD 8384 quad-core processors running at 2.7 GHz. Two parallelism levels are combined: slice level and macroblock level. A multithreading algorithm based on the OpenMP model is applied to the JM reference software. The frame is decomposed into slices, and each slice is processed by a group of cores. Several MBs in the same slice are encoded in parallel by different cores of the group while respecting data dependencies. The speedup achieved using the whole set of 32 cores is between 2.5 and 6.8 for 4CIF video (704 × 576). These speedups are modest given the number of cores used for encoding. MB level parallelism requires data to be shared among cores, which leads to a memory bottleneck and higher latency. Also, increasing the number of slices degrades the bitrate. Real-time performance is not mentioned in this work.
Lehtoranta et al. [21] presented a row-wise data parallel video coding method on a quad TMS320C6201 DSP system. The frame is decomposed row-wise into slices, and each slice is mapped to a slave DSP. A master DSP is devoted to moving data to and from the slave DSPs. Real-time performance is reached only for CIF resolution, not for higher resolutions. The main drawback of this approach is an increase in bitrate and a PSNR degradation caused by the use of slice level parallelism.
In [22], the author developed a parallel implementation of the H.264/AVC intra prediction module using the computational resources of a GPU and exploiting the Compute Unified Device Architecture (CUDA) programming model. The author applied two partitioning methods: (1) data partitioning, by processing the luminance Y and the two chroma components Cr and Cb in different parallel tasks; (2) task partitioning, by processing the intra 16 × 16 prediction, the intra 4 × 4 prediction, and the chroma intra prediction in parallel. This implementation of the intra prediction module achieves a speedup of 11 relative to the sequential implementation of the reference software. However, inter prediction is the most time-consuming module of the H264/AVC encoder and takes the lion's share of the processing time, so it should preferably be accelerated in addition to intra prediction. Moreover, processing chrominance data in parallel with the luminance component does not give a significant speedup, because the run-time of the chrominance processing is relatively small compared to that of the luminance processing. Finally, the luminance and chrominance processes are not totally independent: some dependencies exist between the two components during the filtering and entropy coding processes, which induces high latency.
Ji et al. [23] proposed an H264/AVC encoder on an MPSoC platform using the GOP level parallelism approach. They built three MicroBlaze soft cores on a Xilinx FPGA. A main processor is devoted to preparing the frames in shared memory. Then, each of the remaining coprocessors encodes its own GOP. Experiments show that the average speedup is 1.831. The problem with the GOP approach is its higher latency. Moreover, real time is not achieved: this solution encodes only 3 f/s for QCIF resolution.
Peng et al. [24] proposed a pure line-by-line coding scheme (LBLC) for intra frame coding. The input image is processed line by line sequentially, and each line is divided into small fixed-length segments. The encoding of all segments, from prediction to entropy coding, is completely independent and runs concurrently on many cores. Results on a general purpose computer show that the proposed scheme reaches a speedup factor of 13.9 with 15 cores. On the other hand, this method affects the rate distortion because the left dependency of each MB is discarded.
Su et al. [25] introduced a parallel framework for the H.264/AVC encoder based on a massively parallel architecture implemented on an NVIDIA GPU using CUDA. They presented several optimizations to accelerate the encoding speed on the GPU. A parallel implementation of inter prediction is proposed based on a novel algorithm, MRMW (Multi-resolutions Multi-windows), which uses the motion trend of a lower resolution frame to estimate that of the original (higher resolution) frame. The steps of MRMW are parallelized with CUDA on different cores of the GPU. They also performed multilevel parallelism for intra coding. For that, a multi-slice method is introduced: each frame is partitioned into independent slices, and at the same time the wave-front method is adopted for parallelizing the MBs in the same slice. Some dependencies among MBs are not respected in order to maximize the parallelism. Moreover, the CAVLC coding and filtering processes are also parallelized by decomposing these modules into several tasks. Experimental results show that a speedup of 20 can be obtained with the proposed parallel method compared to the reference program. The presented parallel H.264/AVC encoder meets the real-time constraint for HD resolution. On the other hand, this implementation affects the visual quality by inducing a PSNR degradation ranging from 0.14 dB to 0.77 dB and a slight increase in bitrate, because it uses multi-slice parallelism and does not respect some dependencies.
Adeyemi et al. [26] presented 4K UHD video streaming over wireless 802.11n. They implemented the entire encoding chain, including 4K camera capture, YUV color space conversion, H264/AVC encoding using CUDA on an NVIDIA Quadro 510 GPU, and real-time live streaming. To speed up the encoding, several modules are parallelized, such as the intra and inter prediction modules, by exploiting a dynamic parallel motion algorithm and a novel intra prediction mode. They also used a zero-copy memory allocation technique to reduce the memory copy latencies between the host memory (CPU) and the GPU memory. Experiments confirm that 4K UHD real-time encoding for live streaming at low bitrates is possible. Despite these results, GPUs are better suited to massively parallel computing than to an algorithm such as this one, which includes many dependencies. A GPU implementation of H264/AVC requires a lot of synchronization, inter-core communication, and considerable programming time to respect all the encoder dependencies. Finally, to achieve a worthwhile encoding speedup, some dependencies are not respected in order to expose more parallelism, which can induce quality degradation or a bitrate increase.
Elhamzi et al. [27] presented a configurable H264 motion estimator dedicated to video codecs on a smart camera accelerator based on a Virtex6 FPGA. They proposed a flexible solution to adjust the video stream transferred by the smart camera. The accelerator supports several search strategies at the IME (Integer Motion Estimation) stage and different configurations for the FME (Fractional Motion Estimation) stage. Experiments show that the resulting FPGA-based architecture can process IME on 720 × 576 video streams at 67 f/s using the full search strategy. An FPGA solution remains an interesting way to achieve real-time processing, but implementing the whole H264/AVC encoder requires a huge FPGA area and a lot of design and compilation time with considerable VHDL expertise, which may not be compatible with time-to-market constraints. Finally, the low frequency of hardwired blocks and the limited bus bandwidth for data transfers between processor and accelerator are the major drawbacks of FPGA implementations.
Jo et al. [28] used the OpenMP programming model to parallelize the H264/AVC encoder, exploiting the TLP and DLP approaches. For the TLP approach, they executed the motion estimation modes (16 × 16, 16 × 8, 8 × 16 and 8 × 8), the intra prediction modes (intra 4 × 4 and intra 16 × 16), and the de-blocking filter for the previous macroblock (MB) in parallel as seven different tasks on an ARM Quad MPCore. Mode decision and entropy coding are processed thereafter in a serial manner. The speedup obtained on four cores is 1.67 for QCIF resolution, which is modest given the number of cores used. For the second approach, they applied the wave-front approach by processing several MBs of the same frame in parallel. Experiments show that this approach achieves a speedup factor of 2.36 using four threads, but real-time performance is not reported in this work either.
DSP platform description
Software flexibility, low power consumption, reduced time to market, and low cost make DSPs an attractive solution for embedded system implementations and high performance applications. Motivated by these merits and encouraged by the great evolution of DSP architectures, we chose to implement the H264/AVC encoder on a low cost multicore DSP, the TMS320C6472, to benefit from its high processing frequency and optimized architecture and so achieve a real-time embedded video encoder. The TMS320C6472 DSP [29] belongs to the latest generation of multicore DSPs made by Texas Instruments. Low power consumption and a competitive price make the TMS320C6472 DSP well suited to high performance applications and to many embedded implementations. Texas Instruments has run several benchmarks to compare DSPs, General Purpose Processors (GPPs) and Graphics Processing Units (GPUs) [30, 31]. These benchmarks show that the C6472 consumes 0.15 mW/MIPS (million instructions per second) at an aggregate 3 GHz (all six cores running at 500 MHz each). Also, at 3.7 W per device, it offers even greater power savings compared to a GPP in the same performance range. In terms of performance per watt, the DSP is four times better than the GPU and 18 times better than the GPP.
As presented in Fig. 4, our platform consists of six C64x+ DSP cores with a very long instruction word (VLIW) architecture and a Single Instruction Multiple Data (SIMD) instruction set, together with 4.8 MB of on-chip memory. With each core running at 700 MHz, the device delivers 33,600 MIPS of peak performance and 4.2 GHz of aggregate processing capability (6 × 700 MHz). At 40 °C and 99 % CPU utilization for the six cores, the power consumption estimated with the spreadsheet estimator [32, 33] is 7 W. Each C64x+ core integrates a large amount of on-chip memory organized as a two-level memory system. The level-1 (L1) program and data memories of each C64x+ core are 32 KB each. This memory can be configured as mapped RAM, cache, or any combination of the two. The level-2 (L2) memory of each core is 608 KB and is shared between program and data space. L2 memory can also be configured as mapped RAM, cache, or any combination of the two. In addition to the L1 and L2 memory dedicated to each core, the six cores also share 768 KB of L2 memory. This shared L2 memory is managed by a separate controller and can be configured as either program or data memory. This large amount of on-chip memory can avoid accesses to the external DDR2 memory, thereby reducing power dissipation and accelerating processing, since internal memory is faster than external memory. Performance is also enhanced by the EDMA (Enhanced Direct Memory Access) controller, which manages memory transfers independently of the CPU. Therefore, no additional CPU overhead is incurred when large data blocks are moved between internal and external memory. The TMS320C6472 DSP supports different communication peripherals, such as Gigabit Ethernet for Internet Protocol (IP) networks, UTOPIA 2 for telecommunications, and Serial RapidIO for DSP-to-DSP communications. This DSP includes all the components (DMA, RAM (Random Access Memory), input/output management) required to communicate with a camera sensor.
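The headline figures quoted for the platform can be cross-checked with a few lines of arithmetic. The sketch below assumes that each C64x+ core issues up to eight instructions per cycle (its eight functional units), which is how the peak MIPS figure is obtained; it also assumes that the 3.7 W figure applies to the six-core 500 MHz configuration, which the text does not state explicitly. Variable and function names are illustrative.

```python
# Cross-checking the platform figures: peak MIPS, aggregate clock,
# on-chip memory and power per MIPS.

CORES = 6
CORE_MHZ = 700
INSTR_PER_CYCLE = 8        # assumption: eight functional units per C64x+ core

L1P_KB = 32                # per-core L1 program memory
L1D_KB = 32                # per-core L1 data memory
L2_LOCAL_KB = 608          # per-core L2 memory
L2_SHARED_KB = 768         # L2 memory shared by the six cores


def peak_mips(cores, mhz, ipc=INSTR_PER_CYCLE):
    """Peak million instructions per second for the whole device."""
    return cores * mhz * ipc


def mw_per_mips(watts, mips):
    """Power efficiency in milliwatts per MIPS."""
    return watts * 1000 / mips


device_mips = peak_mips(CORES, CORE_MHZ)
aggregate_mhz = CORES * CORE_MHZ
on_chip_kb = CORES * (L1P_KB + L1D_KB + L2_LOCAL_KB) + L2_SHARED_KB

# Assumed pairing: 3.7 W with six cores at 500 MHz each.
efficiency = mw_per_mips(3.7, peak_mips(CORES, 500))

print(device_mips, aggregate_mhz, on_chip_kb, round(efficiency, 3))
```

Under these assumptions the totals come out at 33,600 MIPS, 4,200 MHz and 4,800 KB of on-chip memory, matching the figures above, and the power efficiency rounds to the 0.15 mW/MIPS quoted from the benchmarks.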
Note also that VLIW architectures are deterministic and well suited to embedded real-time applications. This must be taken into account when comparing them to superscalar GPPs (General Purpose Processors), which rely on expensive and power-hungry memory management units, out-of-order execution units, etc.
Finally, the computing power and features of this DSP family fit the needs of intelligent vision systems embedded in smart cameras. They should also allow designers to build highly scalable cameras that can follow the latest advances in video compression.
Optimized implementation on a single DSP core
Our choice of this multicore DSP platform enables us to develop an academic H264/AVC codec [34] in our LETI laboratory (Laboratory of Electronics and Information Technologies) for future research and development targeting embedded video applications. The standard-compliant LETI codec was first developed and tested in a PC environment for validation and then migrated to the TMS320C6472 DSP platform. It can be considered an optimized version of the JM reference software dedicated to DSP implementation. This work will also benefit our future H265 implementation.
To take full advantage of the multicore architecture and of the potential parallelism present in the H264/AVC standard, we must first elaborate an optimized H264/AVC architecture on a single DSP core and then move to a multicore implementation. This step consists of designing a data model that exploits the DSP core architecture, and especially the internal memory, which is faster than the external SDRAM memory. Each core of the TMS320C6472 DSP has 608 KB of internal memory (LL2RAM) shared between program and data. As far as possible, both program and data should be loaded into LL2RAM. For that reason, two implementations are proposed [35].
MB level implementation
This implementation follows the conventional data processing structure of the H264/AVC standard. It encodes one MB after another until all the MBs of the frame are processed. The principle of this first architecture is as follows: the program is loaded into the internal memory LL2RAM. The current frame, the reconstructed frame, the reference frame, and the bitstream are stored in external SDRAM memory because of their large sizes for HD resolution. To avoid working directly with the slow external memory, some data are moved into internal memory, namely the current MB, the search window, and the reconstructed MB for the three YCrCb components. The design of the MB level implementation is presented in Fig. 5. It highlights the memory allocations for the luminance component. The DSP core transfers the current MB (16 × 16) and the search window (48 × 48) from the current and the reference frames, respectively, from external to internal memory. Consequently, the data processing can be performed by the DSP core without external memory accesses. The reconstructed MB (20 × 20), extended by four pixels at the left and the top as needed by the MB filtering, is transferred from the local memory to the reconstructed frame buffer in external memory. This process is repeated until all the MBs of the current frame are completed. The most important advantage of this architecture is its adaptability to any DSP, even one with a small internal memory. In fact, only 55.54 KB of internal memory space is required for HD resolution. The major drawback of this architecture is the large number of external memory accesses needed to transfer each current or reconstructed MB. It also needs to store, after each MB is processed, the left and top neighboring pixels used in the prediction and filtering of the next MBs.
MBs row level implementation
To avoid the drawbacks of the first architecture, a second implementation is proposed. Its principle, illustrated in Fig. 6, consists of loading one MBs row (16 × frame_width) from the current frame and 3 MBs rows [48 × (16 + frame_width + 16)] for the search window from the reference frame into the appropriate buffers created in internal memory. The DSP core encodes the whole current MBs row without external memory access. Then, the reconstructed MBs row [20 × (16 + frame_width + 16)] is transferred from LL2RAM to SDRAM memory into the reference frame. Thus, it is not necessary to create another memory buffer for the reconstructed frame.
The reference frame buffer can be reused to store the reconstructed MBs row, since the overwritten data will not be used again (they have already been copied into the 3 MBs rows of the search window). When moving to the second current MBs row, it is not necessary to load 3 MBs rows for the search window from the reference frame: the last two MBs rows of the search window are shifted up in internal memory, and the third is loaded from the fourth MBs row of the reference image.
This approach considerably reduces the accesses to external memory. Only one external memory access is required to read one MBs row, instead of the 80 accesses needed to read the 80 MBs (1,280/16 = 80) that form a MBs row for HD resolution, which is 1,280 pixels wide. The same benefit is obtained when saving the reconstructed MBs row. In addition, with the MBs row level architecture, all the left boundaries required for the prediction and filtering of the next MB are already available in internal memory, so the backup of the left neighboring pixels is removed. Moreover, this implementation reduces the backup of the top boundaries, since they need to be stored only after the whole MBs row has been processed, whereas the MB level implementation needs to store the top neighboring pixels after processing each current MB.
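To make the access-count argument concrete, the following minimal sketch counts the current-frame loads per frame for both designs and sizes the luminance buffers of the MBs row level design. It assumes 8-bit samples, considers the luminance component only, and uses hypothetical function names.

```python
MB = 16  # macroblock edge in pixels

def transfers_per_frame(width, height):
    """Return (MB level, MBs row level) current-frame loads per frame."""
    mbs_per_row = width // MB
    mb_rows = height // MB
    return mbs_per_row * mb_rows, mb_rows

def row_level_luma_bytes(width):
    """Luma bytes held on chip by the MBs row level design (8-bit samples)."""
    padded = MB + width + MB                 # 16 + frame_width + 16
    current_row = MB * width                 # 16 x frame_width
    search_window = 3 * MB * padded          # 48 x padded width
    reconstructed_row = (MB + 4) * padded    # 20 x padded width
    return current_row + search_window + reconstructed_row

print(transfers_per_frame(1280, 720))        # (3600, 45)
print(row_level_luma_bytes(1280) / 1024)     # 107.125 (KB)
```

For HD, the row level design replaces 3,600 per-MB loads of the current frame with 45 per-row loads, and its luminance buffers (about 107 KB under these assumptions) still fit within the 608 KB of LL2RAM.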
Experimental results for the mono-core implementation
In this preliminary work, the two proposed architectures are implemented on a single core of the TMS320C6472 DSP running at 700 MHz using the LETI H264/AVC codec. Experiments are performed on the most commonly used video test sequences at CIF resolution, downloaded from the referenced website [36]. Table 1 shows the performance of the two implementations in terms of encoding speed. Experiments show that the second architecture can save up to 18.71 % of encoding run-time. The average encoding speed obtained on a single DSP core exceeds 24 f/s, which is very close to real-time. Having established the superiority of the MBs row level architecture, we then evaluate it at higher resolutions: SD (720 × 480) and HD (1280 × 720). Table 2 presents the encoding speeds achieved when applying the MBs row level architecture on a single DSP core for several SD and HD sequences recommended by the ITU-T and ISO/IEC organizations, using the YCrCb 4:2:0 format. The SD sequences are obtained by resizing the downloaded HD sequences [36] using the OpenCV library [38]. The number of processed frames is 300, the QP is 30, and the GOP size is 8.
It is clear that mono-core processors with a low CPU frequency cannot meet the real-time requirement for high resolution video sequences. Thus, moving to a multicore implementation and exploiting H264/AVC parallelism is mandatory to reach real-time encoding for VGA and SD resolutions and to improve the encoding speed for HD resolution. First, a large amount of data transfer among processors demands a large system bandwidth for inter-processor communication. Second, the functions of the H.264/AVC encoder have different loads, so it is hard to map them evenly among processors. Thus, the final performance is always restricted by the processor with the heaviest load. Based on these observations, the frame level parallelism (FLP) approach is applied in order to enhance the encoding speed and obtain a low latency without inducing any rate distortion (PSNR degradation or bitrate increase). Our multicore implementation using the FLP approach exploits the optimized mono-core architecture, namely the MBs row level implementation. Our real-time video encoder demo is described in Fig. 7.
Since our platform does not yet have a frame grabber interface, and since the main idea of our work is to evaluate H264/AVC encoder performance for smart cameras, direct interfacing with a camera sensor is not studied in this work.
As a preliminary step, the sensor is simulated by a personal computer (PC) connected to a Universal Serial Bus (USB) HD webcam to capture raw video, with a TCP/IP (Transmission Control Protocol/Internet Protocol) stack over Gigabit Ethernet to transfer the raw video data to the DSP in real time.
This strategy is mainly used for validation purposes and performance evaluation. Commonly used video test sequences in YCrCb 4:2:0 format are used for encoding. Then, the output of our DSP implementation is checked against that of the PC implementation.
Once the evaluation results meet the specifications of the target encoder, interfacing with a real sensor will be the next step. In fact, our DSP includes enough configurable I/O (GPIO, SRIO (Serial RapidIO), etc.) to support several devices, which makes it easy to interface with a camera sensor. Also, several camera sensors on the market include an Ethernet port. We can use these cameras and connect them directly to our DSP. The PC, in this case, will no longer be required.
As our DSP platform includes six DSP cores, the first core, "core0", is assigned as the master. It executes a TCP server program and is devoted to establishing the TCP/IP connection with the client (PC) using the Texas Instruments (TI) NDK library (Network Developer's Kit [39]). In a first step, it receives the current frames sent by the PC after camera capture and stores them in the external memory, which is shared among all DSP cores. The five remaining DSP cores are used to encode the five received frames. For each core, a memory section is reserved to store the current frame (SRC), the reconstructed frame (RECT, which will be the reference frame for the next core), and a bitstream buffer where the bitstream will be stored. After encoding, the core0 server sends the bitstream of all encoded frames to the client (PC) for storage or display.
A TCP server program is loaded into the internal memory of core0 to establish the connection between the DSP and the PC. The H264/AVC executable file is loaded into the internal memory of each of the five remaining cores. On the PC side, a C++ project is developed and executed to capture video from the camera. This program is based on the OpenCV library, which is used to convert the captured frames from RGB to YCrCb 4:2:0 format. A TCP socket (IP address, port number) is created to transmit data between core0 (server) and the PC (client).
Fig. 7 Encoding demo using frame level parallelism algorithm
When applying frame level parallelism on top of the MBs row level implementation, core i starts encoding its frame only once core i-1 has finished encoding at least 3 MBs rows of the previous frame. These 3 MBs rows are used as the search window for the motion estimation of the first MBs row of the current frame processed by core i (see Sect. 5.2). Thus, the inter-frame data dependency is respected and, consequently, no rate distortion is introduced. The steps of encoding a video sequence using FLP are detailed as follows (Cf. Fig. 8):
• After establishing connection between the PC and the DSP, core0 receives five frames from the PC as five cores are devoted to encoding. Each frame is loaded into the SRC buffer of each remaining core (1-5).
• When the reception of the five current frames is completed, core0 sends five inter-processor communication (IPC) interrupt events to cores 1-5, which are waiting for an interrupt event from core0, to indicate that the SRC frames are already in external memory and that encoding can start.
• Core1 is the first core to begin encoding. Upon completing the encoding of the first 3 MBs rows of its SRC frame, it sends an IPC to the next core (core2), which is itself waiting for an interrupt from core1 before encoding its own frame. The same procedure is repeated from core3 to core5.
• Core i must not overtake core i-1, which could otherwise happen because the load is not uniform between successive frames and which would give an erroneous result. To prevent this, the encoding of each next MBs row is conditioned on the reception of an IPC from the previous core. Thus, each core sends an IPC to its next core after encoding each MBs row whose index is higher than three. Since each core starts encoding after its previous core has finished 3 MBs rows, it must not wait for an IPC from the previous core to encode the last 2 MBs rows of each SRC frame; otherwise, encoding would block waiting for an IPC that never arrives. As a result, each core sends a total of Max_MBs_rows-2 interrupts to the next core. When all cores have finished encoding their current frames (core5 being the last to finish), cores 1-5 send five IPCs to core0, which is in a wait state, to indicate that the bitstreams of the five frames are ready in external memory and have to be transferred to the PC.
• When receiving these five IPCs, core0 sends the bitstream of the five frames to the PC via the Gigabit Ethernet link.
• Once the bitstream has been received, the PC captures another five frames and sends them to core0, and the same procedure is repeated.
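The row-level dependency between consecutive cores described in the steps above can be sketched as a small timing model. This is an illustration under simplifying assumptions (every MBs row takes the same time, IPC and cache maintenance costs are ignored, and core1's reference frame is fully available); `flp_schedule` and its parameters are hypothetical names, not part of the codec.

```python
def flp_schedule(num_cores, mb_rows, row_time=1.0, lead=3):
    """Finish time of every MBs row on every encoding core for one batch.

    finish[k][r] is when core k (0-based) completes row r of its frame.
    Core k may start row r only once core k-1 has completed
    min(r + lead, mb_rows) rows of the previous frame.
    """
    finish = [[0.0] * mb_rows for _ in range(num_cores)]
    ipcs_sent = [0] * num_cores          # IPCs sent to the next encoding core
    for k in range(num_cores):
        for r in range(mb_rows):
            ready = finish[k][r - 1] if r > 0 else 0.0
            if k > 0:
                needed = min(r + lead, mb_rows) - 1   # last reference row used
                ready = max(ready, finish[k - 1][needed])
            finish[k][r] = ready + row_time
            # One IPC after the first `lead` rows, then one per further row.
            if k < num_cores - 1 and r + 1 >= lead:
                ipcs_sent[k] += 1
    return finish, ipcs_sent

finish, ipcs = flp_schedule(num_cores=5, mb_rows=18)   # CIF: 288 / 16 rows
print(ipcs)             # [16, 16, 16, 16, 0]
print(finish[-1][-1])   # 30.0
```

In this model each core sends Max_MBs_rows-2 interrupts to its successor, and the last core finishes a CIF batch after 30 row times instead of the 90 row times a single core would need for five frames.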
Cache coherency
Multicore processing often leads to a cache coherency problem. It arises when two or more cores, each with its own cache memory, simultaneously access the same location in a shared memory. On general purpose multiprocessors, programmers do not face this problem because coherency is handled automatically by complex hardware. On our multicore DSP architecture, however, designers have to manage it themselves, since there is no such automatic controller. To deal with cache coherency, the Chip Support Library (CSL) [40] from TI provides two API commands:
• CACHE_wbL2((void *)XmtBuf, bytecount, CACHE_WAIT) to write back the cached data from the cache memory to its location in the shared memory.
• CACHE_invL2((void *)RcvBuf, bytecount, CACHE_WAIT) to invalidate the cache lines and force the CPU to read data from its location in the shared memory.
In our case, when core0 receives the current frames from the PC, it should write back the cached data to external memory. On the other side, core1-core5 should invalidate the addresses of the current SRC frames in their cache memory before starting to encode, so as to use the updated data. Also, when core1-core5 complete encoding, they should write back the bitstreams from the cache memory to the external memory to avoid a coherency problem with core0, which sends the bitstream from the external memory to the PC. Furthermore, the cache coherency problem also exists among core1 to core5, because core i processes data (the search window) that were written by core i-1 (the reconstructed MBs row). So the same principle is applied: before sending an IPC to the next core, a core must write back the reconstructed MBs row; the next core, in turn, should invalidate the cached data of the search window before starting to encode, so that it processes the data updated by the previous core.
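The write-back and invalidate discipline can be illustrated with a toy model of two cores that have private caches and no hardware coherency. This is a conceptual sketch only: the classes and method names are hypothetical, real caches operate on cache lines rather than dictionary keys, and on the DSP the two operations correspond to the CSL calls named above.

```python
class SharedMemory(dict):
    """External memory visible to all cores (address -> value)."""

class Core:
    """Toy core with a private write-back cache and no hardware coherency."""
    def __init__(self, shared):
        self.shared = shared
        self.cache = {}

    def write(self, addr, value):
        self.cache[addr] = value            # the new value stays in the cache

    def read(self, addr):
        if addr not in self.cache:          # miss: fetch from shared memory
            self.cache[addr] = self.shared.get(addr)
        return self.cache[addr]

    def write_back(self, addr):             # role played by CACHE_wbL2
        if addr in self.cache:
            self.shared[addr] = self.cache[addr]

    def invalidate(self, addr):             # role played by CACHE_invL2
        self.cache.pop(addr, None)

shared = SharedMemory()
producer, consumer = Core(shared), Core(shared)
shared["row0"] = "old"
consumer.read("row0")              # consumer now caches the old row
producer.write("row0", "new")      # e.g. a freshly reconstructed MBs row
producer.write_back("row0")        # must happen before the IPC is sent
stale = consumer.read("row0")      # "old": cached copy was not invalidated
consumer.invalidate("row0")
fresh = consumer.read("row0")      # "new": read from shared memory again
print(stale, fresh)                # old new
```

The stale read shows why the producer's write-back alone is not sufficient: the consumer must also invalidate its cached copy before reading the search window.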
Experimental results for the Frame Level Parallelism implementation on five DSP cores
Several experiments are performed using the same video sequences that were tested in the mono-core implementation, in order to compare the two implementations fairly. Tables 3, 4 and 5 respectively illustrate the encoding speeds (f/s) for CIF, SD and HD resolutions for the mono-core and the multicore implementations. The encoding speedup is also computed for each video sequence. Experiments are performed using different GOP sizes (8 and 16) and different QPs (QP = 30 and QP = 37). The number of encoded frames is equal to 300.
Experiments on five DSP cores show that speedup factors of 2.92, 3.27, and 3.79 are achieved for CIF, SD and HD resolutions, respectively. The experimental results approximately match the theoretical results: the obtained speedup factors are slightly lower than the maximum speedups. This is due to the inter-core communications, the write-backs, and the cache invalidations. The proposed FLP implementation achieves an encoding speed of about 70 f/s for CIF resolution, exceeding the real-time constraint of 25 f/s. The encoding speed is also significantly improved for SD and HD resolutions compared to the mono-core implementation.
For SD resolution, the average encoding speed is 23 f/s, which is not far from real-time video encoding.
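As a rough cross-check of these speedups, one can compute an idealized upper bound for FLP with a three-row lead between consecutive cores. The model below is an assumption made for illustration (uniform time per MBs row, no communication or cache maintenance cost, frame reception not counted) and is not necessarily the theoretical model used for the maximum speedups mentioned above.

```python
def ideal_flp_speedup(num_cores, mb_rows, lead=3):
    """Upper bound on FLP speedup when every MBs row takes the same time."""
    serial = num_cores * mb_rows                  # one core encodes all frames
    parallel = mb_rows + lead * (num_cores - 1)   # last core starts 3*(K-1) rows late
    return serial / parallel

RESOLUTIONS = {"CIF": 288, "SD": 480, "HD": 720}  # frame heights in pixels
MEASURED = {"CIF": 2.92, "SD": 3.27, "HD": 3.79}  # speedups quoted in the text

bounds = {name: ideal_flp_speedup(5, h // 16) for name, h in RESOLUTIONS.items()}
for name in RESOLUTIONS:
    print(f"{name}: bound {bounds[name]:.2f}, measured {MEASURED[name]:.2f}")
# CIF: bound 3.00, measured 2.92
# SD: bound 3.57, measured 3.27
# HD: bound 3.95, measured 3.79
```

Under these assumptions the measured speedups stay below the bound for all three resolutions, and the bound grows with the number of MBs rows per frame, which is in line with the larger speedup observed for HD.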
Enhanced Frame level parallelism approach: hiding communication overhead
The classic FLP implementation improves the encoding speed compared to the mono-core implementation but does not exploit the DSP cores efficiently. A lot of time is wasted (processors waiting for data), which reduces the efficiency of our multicore implementation. Moreover, the communication overhead is not optimized. To avoid these drawbacks, this part presents an enhanced version of the FLP approach based on hiding the communication overhead. In the first version of the FLP approach, core1-core5 wait until core0 completes the reception of five frames, although encoding could start immediately after the reception of the first frame. Furthermore, core0 waits until core1 to core5 finish encoding their respective frames before it starts sending the bitstreams, although it could send any available bitstream to the PC as soon as it is ready. In addition, core0 is in a wait state during encoding; this time can be exploited to prepare the next five frames, so that frame encoding and frame reading overlap. To realize these optimizations, a ping pong buffer is used for each SRC frame instead of a single buffer, as shown in Fig. 9. A multithreading approach is employed on the PC side. Three threads are used to manage reading raw frames, sending them via Ethernet, receiving the encoded bitstream, and saving it in a file. The strategy of our implementation is described in Fig. 10 and consists of the following steps:
• The first thread, "thread1", captures the first frame from the camera and sends it to core0, which stores it into the ping buffer SRC[0] of core1. Core0 then sends an IPC to core1 to indicate that it can start encoding its current frame.
• When receiving an IPC from core0, core1 triggers the encoding. At the same time thread1 moves to read and send the second frame to core0 which will store it into the ping buffer of core2. This step is repeated until receiving the five frames. Thus, each core immediately starts encoding after core0 receives its current frame without waiting for all the frames.
• While core1-core5 encode their frames following the same principle as in the first FLP implementation, thread1 sends the next five frames to core0, which stores them into the pong buffers SRC[1] of each core. Because the encoding process takes more time than the reading process, communication delays are hidden and do not contribute to the parallel run-time.
• When encoding is completed on a core i, the bitstream is stored into the ping buffer bitstream[0]. Core i then sends an IPC to core0 to inform it that it can forward its bitstream to the PC. After that, core i starts encoding its pong frame stored in SRC[1]. At this time, "thread3" writes the bitstreams to a file and thread1 sends the next five frames to core0, which stores them into the ping buffers SRC[0] of each core. With this technique, the writing of the ping bitstreams, the encoding of the pong SRC frames, and the capturing and sending of the next five ping SRC frames are processed in parallel.
• The processing is then looped in a reverse order for SRC frames and bitstreams through ping pong buffers.
• When looking at Fig. 10, no significant delays have occurred. All cores process their respective data.
For HD resolution, the enhanced FLP implementation saves up to 77 % of encoding time and processes about 12 f/s instead of the 2.6 f/s obtained on a single core. The cost of data transfer is taken into account in our computation of the enhanced FLP encoding speed. Experimental results show that our proposed data transfer scheduling technique completely hides the communication overhead. The time spent capturing frames, transferring them to the DSP, receiving them on core0, and loading them into DSP memory does not contribute to the encoding run-time, thanks to the ping pong buffer technique and the multithreading algorithm.
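The ping pong mechanism used in the steps above can be sketched with two threads and a two-slot buffer. This is a minimal illustration, not the DSP implementation: semaphores stand in for the IPC interrupts, a receiver thread stands in for core0 filling the SRC buffers, a single encoder thread stands in for one encoding core, and all names are hypothetical.

```python
import threading

class PingPongBuffer:
    """Two slots: the writer fills one while the reader drains the other."""
    def __init__(self):
        self.slots = [None, None]
        self.empty = [threading.Semaphore(1), threading.Semaphore(1)]
        self.filled = [threading.Semaphore(0), threading.Semaphore(0)]

    def put(self, seq, frame):
        slot = seq % 2                  # even -> ping (0), odd -> pong (1)
        self.empty[slot].acquire()      # wait until the reader released it
        self.slots[slot] = frame
        self.filled[slot].release()     # signal "frame ready" (IPC analogue)

    def get(self, seq):
        slot = seq % 2
        self.filled[slot].acquire()     # wait until the writer filled it
        frame = self.slots[slot]
        self.empty[slot].release()      # slot can be refilled
        return frame

def run(num_frames):
    buf = PingPongBuffer()
    encoded = []

    def receiver():                     # stands in for core0 storing SRC frames
        for n in range(num_frames):
            buf.put(n, f"frame{n}")

    def encoder():                      # stands in for one encoding core
        for n in range(num_frames):
            encoded.append(buf.get(n).upper())   # placeholder for encoding

    threads = [threading.Thread(target=receiver),
               threading.Thread(target=encoder)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return encoded

print(run(4))   # ['FRAME0', 'FRAME1', 'FRAME2', 'FRAME3']
```

The receiver can run at most one frame ahead of the encoder, so the next frame is already in place when the current one is finished, which is how the transfer time is hidden behind the encoding time.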
Our optimizations based on hiding communication overhead allow the enhancement of achieved speedup factors and encoding speeds compared to the non-optimized FLP implementation.
For low and medium video resolutions such as CIF, VGA (640 × 480) and SD, real-time is achieved on fewer than five cores, which allows the remaining cores to be used for other tasks (biometric recognition, access control, texture detection, video surveillance applications, etc.). This gives an important advantage to our multicore DSP if it is integrated into a smart camera system.
It may be noted that several factors have contributed to the achievement of this performance despite the complexity of the encoding steps detailed above and especially the simultaneous accesses to the external memory by different cores which may cause a significant latency.
First, our encoding implementation is based on the "MB row level architecture", so each core reads a MB row from the external memory into the internal L2 memory. The processing is then performed by the CPU between the L1 and L2 memory levels, which reduces the external memory bottleneck. Second, 128 KB of L2 memory are configured as cache for each core. Thus, an access to a memory location triggers a prefetch of a memory "line" into the cache memory by the cache controller. This reduces cache misses and so accelerates the encoding run-time. The reconstructed fraction and the bitstream are not copied directly into the external memory after their processing but are kept in the cache memory, which reduces the external memory accesses. Third, in addition to the eight processing units of each core, which allow eight instructions to be performed per cycle, the Code Composer Studio IDE (Integrated Development Environment for DSP programming) generates optimized assembly code that exploits pipelining. Thus, the different cores may not perform the same load instruction from the external memory at the same time: one core can perform prefetch instructions while another performs a load instruction and yet another executes ADD instructions, for example.
For a deeper evaluation of our encoder, the encoding performance of our LETI video codec is compared to that of the reference software JM 18.6 in terms of PSNR, bitrate and encoding speed. As noted before, the LETI codec is an optimized version of the JM to which we have applied several software and architectural optimizations on different encoder modules (intra, inter, transform, filter) to adapt this software to the DSP architecture and improve the encoding speed. Some functions have also been programmed in assembly language to exploit the internal resources of our DSP efficiently.
The reference software JM 18.6 is run on an Intel Core 2 Quad CPU at 2.33 GHz. Our codec is evaluated on the multicore DSP TMS320C6472 running at 700 MHz for each core. Simulation parameters are detailed in Table 9. Table 10 shows a comparison of the encoding performance of the JM 18.6 and the LETI codec. This comparison is based on three criteria:
• ΔPSNR (dB): the PSNR decrease of our codec compared to the JM reference software.
• ΔBitrate (%): the percentage increase in bitrate of our codec compared to the JM reference software.
• Encoding speed (f/s): it depends on the CPU frequency and applied optimizations.
The above criteria are detailed by the following equations:
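A formulation consistent with the three definitions above is sketched here. The sign conventions (a positive value meaning a loss relative to JM) and the subscript names are assumptions made for illustration:

```latex
\Delta\mathrm{PSNR}\ (\mathrm{dB}) = \mathrm{PSNR}_{\mathrm{JM}} - \mathrm{PSNR}_{\mathrm{LETI}}

\Delta\mathrm{Bitrate}\ (\%) = \frac{\mathrm{Bitrate}_{\mathrm{LETI}} - \mathrm{Bitrate}_{\mathrm{JM}}}{\mathrm{Bitrate}_{\mathrm{JM}}} \times 100

\mathrm{Encoding\ speed}\ (\mathrm{f/s}) = \frac{\text{number of encoded frames}}{\text{encoding time (s)}}
```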
Experiments show that the reference software gives better encoding performance in terms of PSNR and bitrate. In fact, our encoder induces an average PSNR degradation of 1 dB and an average bitrate increase of 3 % compared to JM. This is due to the various software optimizations applied in our encoder to improve the encoding run-time.
On the other hand, regarding encoding speed, the JM reference software is not optimized compared to our codec. Real-time is not achieved even for low resolutions.
Our encoder is able to process 99 f/s for CIF resolution, while the reference software encodes only 10 f/s on average. For SD sequences, our encoder meets the real-time encoding requirement, while the reference software encodes just 2.7 f/s. For HD resolution, our encoder is 11 times faster than the reference JM. Our parallel implementation of the H264/AVC encoder is also evaluated and compared to previous parallel implementations performed on different platforms, as presented in Table 11. Experiments show that some implementations have not satisfied the real-time constraint. In fact, the reference software, as noted before, is not an optimized codec, which makes it hard to meet the real-time encoding constraint.
Other works have reached real-time for low resolutions but not yet for higher resolutions. The GPU implementation achieves real-time encoding for HD resolution thanks to its large number of processing cores but, on the other hand, this scheme induces some rate distortion (PSNR degradation and bitrate increase). The GPU platform remains an interesting solution that is able to meet the real-time requirement for computationally intensive applications.
Regarding our parallel H264/AVC multicore implementation, we can note that our optimized FLP approach enhances the encoding speed and achieves real-time encoding performance without inducing any rate distortion compared to the mono-core implementation. Our solution has satisfied our specifications concerning the smart camera video encoder in terms of encoding speed, PSNR degradation and power consumption.
Finally, we can note that our H264/AVC video encoder demo can also be reused for the recent HEVC video encoder. In fact, this video standard adopts almost the same hierarchical data video structure of H264/AVC encoder (GOPs, frames, slices, MB). Practically, the same dependencies of the H264/AVC encoder exist among HEVC data units. Consequently, the principal modification will only affect the algorithm and some internal data allocations but the external skeleton remains the same (frame capture, frames encoding by the different cores and bitstream sending and saving).
Conclusion
In this paper, an optimized H264/AVC encoder implementation on a multicore DSP TMS320C6472 was presented. The Frame Level Parallelism approach was used to accelerate the encoding speed. Hiding the communication overhead allowed us to enhance the FLP implementation and improve the speedup factors. Experiments with the enhanced FLP on five DSP cores running at 700 MHz showed that the real-time constraint was met, with encoding speeds of 99 f/s for CIF resolution and 30 f/s for SD resolution. Our parallel implementation saved up to 77 % of encoding time for HD resolution and provided significant speedup factors ranging from 4.10 to 4.49 without inducing any quality degradation or bitrate increase. Our work validated the capability of smart camera systems to perform real-time processing, even for high complexity applications, if they are based on an embedded multicore DSP. As future work, we will try to reach real-time encoding for HD resolution by implementing our approach on the latest generation of Texas Instruments DSP (TMS320C6678). It includes eight DSP cores, each running at 1.25 GHz, which gives a good chance of meeting the real-time constraint for HD resolution. Also, two partitioning methods can be combined to improve encoding efficiency. The power consumption of our multicore implementation will be taken into account to evaluate our embedded encoder further. All this work will be reusable to implement the new H265/HEVC video standard on the TMS320C6678 DSP.
