Abstract-The short time-to-market cycle and the goal of reducing design and verification costs are driving forces behind programmable implementations of video processing algorithms. We present two processor architectures: the first is an application-specific instruction-set processor (ASIP) design, whereas the second is a domain-specific instruction-set processor (DSIP) architecture with a more general-purpose instruction set. In this work, we present results for the H264 and VP8 in-loop deblocking algorithms. The processors are based on the transport triggered architecture, which provides scalable instruction-level parallelism and, thanks to its simple structure, lends itself to cost-effective designs. Both designs are programmed in the C language with minimal additional parallelism markup. The designs fulfill real-time requirements for filtering macroblocks in high-definition video. The first architecture, based on special function units, filters a high-definition stream (1920 × 1080) at 75 fps, whereas the second architecture, which provides better programmability, filters the stream at 53 fps. The processors run at a 200 MHz clock frequency, and the areas vary from 146k to 373k gate equivalents depending on the processor architecture.
I. INTRODUCTION
A conventional approach for implementing embedded multiformat video decoders is a monolithic fixed-function hardware design. However, in this approach each supported format introduces silicon area overhead, not to mention the laborious design and verification effort caused by the error-prone hardware design flow. Leading video codec algorithm developers wish to release a new, improved format every two years, which requires video devices to be upgradable and forward-compatible. Device manufacturers want to shorten the time-to-market cycle in order to decrease engineering expenses and to improve the quality of service. A solution to the problem is to use programmable implementations which are tailored to support existing formats, but are also forward-compatible with future formats and updates.
Forward-compatibility adds new challenges because the programmable hardware running the software has to support, e.g., varying block lengths without an intolerable performance drop. Thus, in practice, strictly application-specific processor designs are not attractive. Because a programmable design has to partially sacrifice application-specific hardware optimizations, and perhaps add resources to improve forward-compatibility, a clear challenge in programmable designs is energy consumption. Energy consumption is an important design criterion in general, but in mobile devices, especially small hand-held devices, the limited power supply challenges architecture designers [1]. Graphics Processing Units (GPUs) are designed for parallel computing in the graphics processing domain and can nowadays be used for general-purpose computing [2]. However, their efficient utilization requires data-parallel algorithms, and the power consumption of GPUs with enough computing power limits their use in small hand-held devices. In order to tackle these challenges, we propose two novel customized processor designs based on a compiler-supported processor template called the Transport Triggered Architecture (TTA) [3].
Section II briefly describes the function of an in-loop deblocking filter in video processing and the processor template used in the proposed design. Related work on programmable architectures is presented in Section III. Section IV presents two processor architectures for in-loop deblocking filters; the first one relying more on special function units, and the second one having SIMD-style function units with a wider applicability for future algorithms. Section IV-D compares the proposed designs against a filter implementation on an ARM processor. The results are summarized in Section V.
II. DEBLOCKING FILTER
A block-based transform and quantization can cause artifacts at block edges in coded images, which impair visual quality. Deblocking filters aim to smooth the borders between blocks to improve the appearance of the decoded video. An in-loop deblocking filter is applied to a frame after reconstruction of the residual and prediction blocks, before the frame is used as a reference frame or displayed on the screen.
The same processing resources can be applied to different formats. The VP8 format [4], for instance, defines normal and simple filtering modes whose computation is rather similar to that of the H264 [5] filtering modes. In fact, the two are so similar that the same hardware can be designed to support both of these deblocking filters [6]. The challenge in many video standards is their fine-grained control over how the blocks are processed. Deblock filtering in the H264 and VP8 standards is a good example: its fine-grained control of edge filtering [6] leads to conditional execution, which makes efficient resource allocation harder.
We focus on luma component filtering since it is more demanding than processing the chroma component. Latencywise, it is reasonable to do filtering for the chroma component in parallel with another similar or less complex core. Filtering the chroma component requires roughly half of the processing power compared to filtering the luma component.
III. RELATED WORK
Citro et al. [7] proposed a programmable architecture for deblocking filtering. The architecture consists of RISC style Processor Engines (PEs) which control microprogrammable hardware accelerators (HWA) optimized for deblocking filtering. In the architecture, the picture data and macroblock (MB) headers are stored into the header memory. A PE gets a synchronization signal to execute filtering, decodes the headers and issues commands to the HWA to read a buffer which includes luma and chroma pixel data. The PE sends a command to the HWA to upload filtered pixel data through DMA into the memory. The architecture runs at 300 MHz clock frequency and the latency per MB is 812 clock cycles which is enough for filtering a 1920 × 1080 high-definition stream at 45 fps.
Major et al. [8] propose a dynamically reconfigurable instruction cell architecture processor [9] in which the instruction control is similar to the control in Very Long Instruction Word (VLIW) architectures. The designed architecture consists of a coarse-grained reconfigurable fabric of interconnected instruction cells. The instruction cells are configured based on the algorithm so that the processor supports Instruction-Level Parallelism (ILP) without constraints from dependencies between operations or from small branches. Neither detailed hardware complexity nor performance figures are presented in the results.
In the previous works, the implementations do not rely on conventional compiler-targeted programmable architectures, but have specialized hardware to accelerate the execution. In order to provide true gains of programmable designs, our design goal was to preserve high level language programmability without sacrificing the performance.
A. Transport Triggered Architecture
Transport Triggered Architectures (TTA) differ from conventional processor architectures by executing operations as side effects of programmer-controlled data transports [3]. In TTAs, there is only one programmer-visible instruction type: move. TTA can be used as a basis for VLIW-style processor designs with static Instruction-Level Parallelism (ILP), where multiple operations are executed in multiple function units in parallel under a programmer-defined schedule. Fine-grained programmer-controlled ILP and a customizable bypass network make the TTA well suited for computationally intensive signal processing applications which contain statically exploitable parallelism. The main advantage of the TTA approach in comparison to conventional VLIWs is its scalability; the register file complexity is not as large a bottleneck for very wide designs as it is for the "operation triggered" VLIWs [10]. Fig. 1 illustrates a toy example of a TTA processor. Function Units (FUs) are shown as boxes including operations such as addition and multiplication. The processor includes a Register File (RF) and an I/O unit which can be, for instance, a load-store interface to a random access memory, or access to a hardware FIFO. FUs have input and output ports for input operands and results. The port with a cross illustrates a trigger port. A write (data move) to the trigger port provides both the operand data and the operation code (opcode) of the desired operation. In the addition FU, for example, the opcode defines whether an addition or subtraction operation should be executed on the operands. Operand values and a result remain in the input and output registers of the FU until they are overwritten by the programmer. The example processor has four transport buses. The FU ports are connected to transport buses (horizontal black lines) via sockets (vertical white bars). The black dots illustrate the connection of an FU to a particular bus.
A bus can simply connect the two FU ports or the bus can be connected to all ports. Thanks to its programmer-visibility, the interconnection network can be customized. The number of connections presents a design tradeoff between area complexity, energy consumption, cycle length, and programmability.
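The trigger-port behavior described above can be illustrated with a toy software model of the addition FU. This is only a sketch under stated assumptions: the names `add_fu`, `fu_write_operand`, `fu_write_trigger`, and the opcode constants are hypothetical and are not part of TCE or of the processors in this work.

```c
#include <stdint.h>

/* Hypothetical opcodes of the toy addition FU. */
enum fu_opcode { FU_ADD, FU_SUB };

/* Toy model of an FU: one operand register and one result register. */
struct add_fu {
    int32_t operand; /* keeps its value until it is overwritten */
    int32_t result;  /* keeps its value until the next trigger  */
};

/* A move to the operand port only stores the value. */
static void fu_write_operand(struct add_fu *fu, int32_t value)
{
    fu->operand = value;
}

/* A move to the trigger port carries the second operand and the opcode,
 * and starts the operation as a side effect of the transport. */
static void fu_write_trigger(struct add_fu *fu, enum fu_opcode op,
                             int32_t value)
{
    fu->result = (op == FU_ADD) ? fu->operand + value
                                : fu->operand - value;
}
```

Because the operand register retains its value, a second trigger move can reuse the same operand without another operand move, which is one way such a model saves transports.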
TTA processors are programmed by controlling data moves through the interconnection network. Data moves can be assembly coded, but in practice the feasibility of the architecture-software co-design flow requires that a high-level language and an efficient, automatically generated compiler are used. Therefore, programming a TTA processor does not differ from programming conventional processors when a higher-level language compiler is used.
New TTA processors can be co-designed using an open source toolset called TTA-based Co-Design Environment (TCE) [11] . TCE helps in the TTA processor design by providing, among other tools, an instruction cycle accurate simulator that enables fast architecture exploration iterations and a high-level language compiler that automatically adapts to the modifications in the designed TTAs. The toolset contains a library of function units which have verified RTL descriptions. If the processor requires new function units not available in the database, the designer has to implement them before the processor generation. The toolset generates a processor RTL implementation and a test bench which can be simulated and synthesized with the third party tools.
IV. PROPOSED ARCHITECTURES
We propose two alternative processor architectures for deblocking filters. Architecture I includes special function units customized for a specific deblocking filter format, and can be categorized as an Application-Specific Instruction-set Processor (ASIP). The motivation for the Architecture I was to execute several bit-level operations in a single clock cycle using special function units (SFUs) and to remove conditional branches from the program code by including the conditional execution inside the SFUs. This way, the latency caused by numerous bit operations and the bottlenecks caused by the conditional branches could be removed at the price of somewhat worsened programmability.
Architecture II, on the other hand, relies on more generally applicable SIMD style FUs supporting basic operations in subword parallel fashion. Thus, Architecture II supports better general purpose processing and can be categorized as a Domain-Specific Instruction-set Processor (DSIP) useful for a wider range of video applications, but still providing a good performance. As later shown, Architecture II consists of fundamental FUs accelerated with very few SFUs making it almost a general purpose digital signal processor (DSP).
A. Architecture I
Originally, Architecture I was designed for the VP8 deblock filtering. The FUs included in the architecture are summarized in Table I . The table summarizes the number of FUs in the architecture, latency and complexity in gate equivalents (GEs). The 16-bit fixed point architecture is based on scalar FUs and SFUs. Even though the architecture has application-specific SFUs it still contains several Arithmetic Logic Units (ALUs) including basic arithmetic operations such as addition, subtraction, shift, comparison and the conventional bit operations.
All the SFUs in the architecture are optimized for H264 and VP8 deblock filtering. An SFU called mask computes whether the absolute difference of two pixel values is less than or equal to the given edge or interior thresholds. Based on the output of the mask SFU, the filter decision SFU computes whether the filtering is executed or not. The filter decision SFU performs multiple comparisons between absolute pixel value differences and thresholds, which would be time consuming if done with several bit operations. Thus, we use a simple single-cycle SFU to solve the problem. Architecture I has SFUs for the simple and normal filtering types. The SFUs execute all the operations needed in the actual pixel filtering in a single clock cycle. The filter SFUs include operations such as shift, add, saturate and clip (clip a value z so that x <= z <= y). The mask, filter decision, normal filter and simple filter SFUs execute multiple operations, including short conditional branches inside the unit, which enables branch-free software execution. The data is streamed into the processor through a hardware FIFO using a stream in FU and written out using a stream out FU, one 8-bit data element at a time. Efficient utilization of the stream FUs requires ordered data in a FIFO storage block, which means that there has to be a data sorter before the deblock filtering block. Otherwise, there has to be an efficient data ordering mechanism inside the processor, such as the one we introduce later for Architecture II.
The architecture has 29 data buses with reduced connections to support the required parallelism. The bus utilization, as percentages and as numbers of writes, is tabulated in Table II. On average, the bus utilization is approximately 85 percent over the algorithm execution time. In our experience, the achieved bus utilization indicates efficient resource allocation for a high-level language implementation. In general, somewhat better bus utilization can be achieved with hand-written assembly assuming the same resources. When evaluating bus utilization, a designer should make sure that data are mostly transported between function units without stalls and without spilling intermediate results to memory or register files. Spilling data to memories increases the bus utilization but naturally wastes resources. Approximately the same bus utilization efficiency is achieved with Architecture II.
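The clip and mask operations named above can be written in plain C as a reference for what the SFUs compute in a single cycle. This is a minimal sketch: the function names `clip` and `mask_le` are hypothetical, and the exact threshold handling of the actual SFUs is not specified here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Clip z into the closed range [lo, hi], so that lo <= result <= hi. */
static int clip(int z, int lo, int hi)
{
    if (z < lo) return lo;
    if (z > hi) return hi;
    return z;
}

/* Hypothetical mask test: returns 1 when the absolute difference of two
 * 8-bit pixel values is less than or equal to the threshold, else 0. */
static int mask_le(uint8_t p, uint8_t q, int threshold)
{
    return abs((int)p - (int)q) <= threshold;
}
```

In software each of these costs a subtraction, comparisons and possibly branches, which suggests why folding them into single-cycle units removes both latency and branches from the program.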
The processor complexity in total is 146 k gate equivalents synthesized with 180 nm CMOS technology. The maximum achieved clock frequency for the old CMOS technology is 200 MHz. If data for a vertical and horizontal filtering are streamed into the processor in correct order, the processing is executed in 322 clock cycles. When the vertical data are streamed in and stored in to the memory to wait for the horizontal filtering, the memory access overhead increases the latency to 501 clock cycles. Thus, the incoming data order has a significant effect on the latency. 
B. Architecture II
Maintaining generality was the motivation for the Architecture II design. The architecture is based on 32-bit fixed-point Single Instruction Multiple Data (SIMD) operations. A SIMD architecture is a natural choice for video processing because many video processing algorithms can be vectorized. As video algorithms can easily exploit SIMD units that are 128 or even 256 bits wide, there is a rather easy way to improve parallelism in the current design. Fig. 2 illustrates the FUs used in Architecture II.
The figure also presents the number of input and output ports of the FUs, showing that the highest numbers of ports are in the FUs (e.g., the load/store and pack FUs) which operate on scalar values. Architecture II accesses memory with a scalar or a vector load/store unit (LSU). The scalar unit accesses a 32-bit word with a single address, whereas the vector operation accesses four consecutive 32-bit words with a single address. The vector LSU is based on a banked memory structure, which in this case consists of four banks. The vector LSU enables loading or storing a single 4 × 4 block of pixel data at once. The unit can load data in three clock cycles and store data in a single clock cycle. The vector LSUs are essential for removing memory access limitations, which can be a significant restriction for a video processing algorithm. No stream in or stream out FUs are included in the architecture.
The pixel data are stored in the memory and are mostly accessed with the vector LSUs to load or store a 4 × 4 block of pixels at once. When filtering horizontal edges, the data are in the correct order, but for vertical edge filtering the data need to be rearranged. If the accessed 4 × 4 block of data is thought of as a matrix, the matrix needs to be transposed to get the data in the correct order. Due to the very strict latency requirements, it is not feasible to do the ordering with logical operations. Thus, we included an efficient pack SFU which unpacks, arranges and packs the vectorized data. The same pack SFU is also applied to pack the boundary strength (BS) values for the blocks. Packing the BS values reduces the interconnection network traffic. With the help of the SFU, performance similar to that of the ordered FIFO streams proposed in Architecture I can be provided. However, when comparing the data access mechanisms in Architecture I and Architecture II, streaming vector data through FIFO buffers may actually be a feasible solution when the latency restriction becomes even tighter than it was in this study. Memory loads combined with data arrangement may easily become a bottleneck for an architecture which should support high-volume data streams.
The SIMD select FU controls the conditional execution. It has a significant role in latency reduction because program branches are avoided. We formulated conditional structures as precomputed predicates from which the select FU selects the right program path to execute. The hardware complexity of the FU is insignificant compared to the benefit it provides. In the current TCE, the select FU is called using an intrinsic, but in a future release the compiler will automatically support the use of such a select operation.
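A four-bank memory with single-address vector access can be modeled in software as follows. This is a sketch only: word-interleaved bank mapping, the bank depth `BANK_WORDS`, and the names `banked_mem`, `vector_load` and `vector_store` are assumptions made for illustration, not details of the actual LSU.

```c
#include <stdint.h>

#define NUM_BANKS  4    /* four banks, as in the described vector LSU */
#define BANK_WORDS 256  /* assumed bank depth, for illustration only  */

struct banked_mem {
    uint32_t bank[NUM_BANKS][BANK_WORDS];
};

/* Load four consecutive 32-bit words starting at word_addr. With
 * word-interleaved banks, each of the four words comes from a different
 * bank, so hardware could fetch them in parallel. */
static void vector_load(const struct banked_mem *m, unsigned word_addr,
                        uint32_t out[NUM_BANKS])
{
    for (unsigned i = 0; i < NUM_BANKS; ++i) {
        unsigned a = word_addr + i;
        out[i] = m->bank[a % NUM_BANKS][(a / NUM_BANKS) % BANK_WORDS];
    }
}

/* Store four consecutive 32-bit words starting at word_addr. */
static void vector_store(struct banked_mem *m, unsigned word_addr,
                         const uint32_t in[NUM_BANKS])
{
    for (unsigned i = 0; i < NUM_BANKS; ++i) {
        unsigned a = word_addr + i;
        m->bank[a % NUM_BANKS][(a / NUM_BANKS) % BANK_WORDS] = in[i];
    }
}
```

With 8-bit pixels, the four 32-bit words returned by one `vector_load` hold the 16 pixels of a 4 × 4 block.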
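The rearrangement needed for vertical edge filtering amounts to a 4 × 4 byte transpose across four 32-bit words. The following C reference is a sketch of that data movement, not the pack SFU itself; the function name `transpose4x4_u8` and the byte layout (column 0 in the least significant byte of each row word) are assumptions.

```c
#include <stdint.h>

/* Transpose a 4x4 block of 8-bit pixels. Each 32-bit word holds one row,
 * with column 0 in the least significant byte (assumed layout). */
static void transpose4x4_u8(const uint32_t in[4], uint32_t out[4])
{
    for (int r = 0; r < 4; ++r) {
        uint32_t word = 0;
        for (int c = 0; c < 4; ++c) {
            /* Output (row r, column c) takes input (row c, column r). */
            uint32_t pixel = (in[c] >> (8 * r)) & 0xFFu;
            word |= pixel << (8 * c);
        }
        out[r] = word;
    }
}
```

Written with shifts, masks and ORs, the transpose costs dozens of scalar operations per block, which is consistent with the observation that logical operations are too slow for the ordering.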
In addition, the processor architecture has three different types of scalar arithmetic logic units (ALUs) which include conventional operations but also combined arithmetic operations, such as two consecutive additions, a shift combined with an addition, and consecutive AND operations. These ALUs were found to be very useful for the video processing domain, particularly in control logic processing. The architecture also includes SIMD-type FUs for addition, shift and comparison operations. Table IV summarizes the architecture area with three different interconnection network (IC) complexities. The different IC variations are produced using the automatic Connection Sweeper tool [12] available in TCE. The tool gradually reduces the interconnection network until a given cycle count threshold is reached. The threshold is given as a percentage, and the clock cycle counts of the different configurations are compared to that of the original configuration. The connections are first reduced from the register files as they tend to be more expensive. In addition to reducing connections, the tool can remove complete buses if the given threshold allows it. As the automated tool can generate tens or hundreds of different configurations, the explorer tool has a Pareto set finder which can print the interesting configurations for the designer. An important feature of the Connection Sweeper tool is that it preserves the programmability of the processor because the program is recompiled for each generated architecture.
The presented results are for a 200 MHz clock frequency. In Architecture II, all the buses are fully connected between the function units, which enables efficient software bypassing [3], but register connections are reduced to decrease the IC complexity. The IC in all designs contains a single signed 32-bit and two signed 5-bit short immediate buses for moving constants. The rest of the buses are reserved for regular data moves.
The reduced number of interconnection network connections naturally affects the cycle count because of the reduced compiler freedom. For example, IC Version I with 23 buses has a latency of 460 clock cycles per macroblock, Version II with 15 buses executes the same program in 526 clock cycles, and Version III with ten buses in 697 clock cycles. Thus, the IC complexity is a significant design parameter when a tradeoff between area, energy consumption and latency is analyzed. The effect on the cycle length is clear as well. For example, while Version I can be synthesized to run at a 200 MHz clock frequency (180 nm CMOS technology), Version III with ten buses can achieve a clock frequency of up to 256 MHz.
C. Software
The presented results are for C programs without fine-grained assembly optimizations. There are at least two well-known bottlenecks for programmable platforms that exploit static parallelism, such as the proposed ones. The first one is exposing the parallelism in a program written in the sequential C language. The TCE toolset also supports the OpenCL language, which has native support for parallel program description, but it was not applied in this work.
In our proposed solution, we relied fully on ILP and a single core to avoid the need for a multithreaded multicore implementation, which always has additional overheads due to, e.g., synchronization. The program was described in mostly standard C. Unfortunately, due to the unrestricted pointers and sequential semantics of C, it often becomes impossible for the compiler to extract the inherent parallelism of the described program at compile time. Therefore, we improved the ILP utilization with a simple parallelism description capability supported by the TCE compiler. We introduce a mechanism called parallel region markers which can be used to denote parallel sections in the C program, enabling static parallelization of the independent regions without relying on a fragile compiler analysis. In other words, the responsibility for detecting dependency-free sections is placed on the programmer, which is similar to the motivation behind parallel programming languages and the definition of threads in OpenCL. This simple mechanism enabled the compiler to efficiently exploit the resources of the designed wide machines at the instruction level, leading to a highly efficient fine-grained parallel design with no synchronization overheads.
We applied the parallel region markers over independent function calls as shown in Algorithm 1, which shows how the parallel region marker is used in the C function. In this instance, all four function calls are allowed to be scheduled in parallel without the compiler having to consider data dependencies between them. Fig. 3 illustrates the filtered inner horizontal boundary (bolded, red line) in the macroblock. A single memory load operation reads the pixel data of a 4 × 4 block containing 16 8-bit pixels. To filter the macroblock boundary, eight 4 × 4 blocks are read. Since the horizontal boundary can be filtered in parallel, we apply the parallel region marker to inform the compiler that there is no data dependency between the four FilterHor function calls. In this example, the four regions are executed in parallel if enough computing resources are available. The second bottleneck is related to the conditional execution of the program code, i.e., if-else structures in program execution. Conditional branches easily cause serial execution if the architecture and the compiler do not support branch predication [13] or some other mechanism to handle branches.
Another, rather simple way to replace branches is to execute both branches of the program code speculatively and then use a select operation to pick the results from the right branch according to a predicate. The technique is applied in the presented applications in Architecture II. Aggressive speculative execution of short branches was found to be an efficient method for producing large basic blocks for the instruction scheduler, which improves the ILP.
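The speculate-then-select pattern can be sketched in scalar C as follows. The names `select_u32` and `filter_or_pass`, and the averaging used as the "filtered" path, are hypothetical placeholders chosen for illustration; they are not the TCE intrinsic or the H264/VP8 filter equations.

```c
#include <stdint.h>

/* Branch-free select: returns a when pred is non-zero, otherwise b. */
static uint32_t select_u32(int pred, uint32_t a, uint32_t b)
{
    /* All-ones mask when pred is true, all-zeros mask otherwise. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(pred != 0);
    return (a & mask) | (b & ~mask);
}

/* Hypothetical edge step: both outcomes are computed speculatively and
 * the predicate picks one, so no if-else is needed on the data path. */
static uint32_t filter_or_pass(uint32_t p, uint32_t q, uint32_t threshold)
{
    uint32_t diff       = (p > q) ? (p - q) : (q - p); /* |p - q|        */
    uint32_t filtered   = (p + q + 1u) >> 1;           /* placeholder    */
    uint32_t unfiltered = p;                           /* pass-through   */
    int      pred       = diff <= threshold;           /* the predicate  */
    return select_u32(pred, filtered, unfiltered);
}
```

The cost is that both paths always execute, which pays off only when the branches are short, as noted above.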
D. Comparison
We compared the TTA implementations to ARM Cortex-A8 and monolithic hardware implementations. The ARM has NEON technology providing 128-bit single-instruction multiple-data (SIMD) data engine and a branch prediction unit to accelerate conditional execution. A single core ARM Cortex-A8 is a high performance processor that has the ability to scale the clock frequency from 600 MHz up to 1 GHz and operate in less than 300 mW. The ARM with assembly optimized program for the in-loop deblocking filter can process a single MB including boundary strength computation in 1500-3000 clock cycles with average latency of 2500 clock cycles.
The area complexity of a monolithic hardware for the VP8 and H264 deblocking filters is approximately 30 k gate equivalents. Supporting more formats with a monolithic hardware increases the area complexity approximately to 120 k gate equivalents. The latency for processing a single MB is 400 clock cycles.
The most interesting comparison can be made with the proposed Architecture II, which processes a macroblock in 460-697 clock cycles. The area complexity of Version I is 373 kGE, whereas Version III has an area complexity of 228 kGE. Table IV shows that in a high-performance programmable design the extra cost is mainly caused by the interconnection network and the instruction processing overhead. The interconnection network is left fully connected between FUs, which is not strictly necessary to provide good programmability. Thus, the architecture complexity can still be reduced with IC optimization. Both TTA architectures have been synthesized with an old 180 nm CMOS technology, which unfortunately does not show the best performance the processors can achieve. However, Kultala et al. [14] present TTA results with newer CMOS technologies showing higher clock frequencies for the designs.
V. CONCLUSIONS
We emphasize the significance of co-optimizing the processor architecture and the software for domain-specific algorithms in order to meet the low energy consumption and latency requirements set for programmable designs. Critical design items for a hardware architecture are sufficient data access, support for SIMD units with wide buses, and carefully picked special function units. Since algorithms often include conditional execution, hardware support for branch predication that secures execution without stalls is also important.
When considering high-throughput software designs, it is fundamentally important that the compiler can extract the parallelism from the written program. For this reason, we introduced a novel mechanism called parallel region markers. Another solution is to apply, for example, the OpenCL language, which has native support for parallel program description.
Architecture II is mostly built on basic FUs, which supports reusability. So far, we have studied the performance which can be reached with a single-core programmable processor. SIMD units are a powerful way to exploit data-level parallelism in video processing implementations. Thread-level parallelism and a multicore implementation are the way to further increase performance.
The current TTA implementation based on Architecture II with a 200 MHz clock frequency is sufficient for filtering a 1920 × 1080 high-definition stream at 53 fps. On the other hand, the reduced IC in Version III of Architecture II provides a higher 256 MHz clock frequency and a less complex processor design, and filters the stream at 44 fps.
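The relation between cycles per macroblock and frame rate can be checked with a short calculation. This is a back-of-the-envelope sketch that assumes 16 × 16 macroblocks, partial macroblock rows rounded up, and no per-frame overhead beyond the per-macroblock cycle count; the function names are hypothetical.

```c
/* Number of 16x16 macroblocks in a frame, rounding partial ones up. */
static unsigned mbs_per_frame(unsigned width, unsigned height)
{
    return ((width + 15u) / 16u) * ((height + 15u) / 16u);
}

/* Estimated frame rate when every macroblock takes cycles_per_mb cycles
 * and nothing else is counted (an idealizing assumption). */
static double frames_per_second(double clock_hz, unsigned cycles_per_mb,
                                unsigned width, unsigned height)
{
    double cycles_per_frame =
        (double)cycles_per_mb * (double)mbs_per_frame(width, height);
    return clock_hz / cycles_per_frame;
}
```

Under these assumptions a 1920 × 1080 frame has 8160 macroblocks, and `frames_per_second(200e6, 460, 1920, 1080)` evaluates to roughly 53 fps, in line with the figure reported for Architecture II. Small deviations from the other reported rates would be expected, since overheads are not modeled.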
