Abstract-Systems relying on fixed hardware components with a static level of parallelism can suffer from an underuse of logical resources, since they have to be designed for the worst-case scenario. This problem is especially important in video applications due to the emergence of new flexible standards, like Scalable Video Coding (SVC), which offer several levels of scalability.
I. INTRODUCTION
Current multimedia applications demand flexible and high performance systems in order to be implemented. The typical solution is to design hardware-software partitions, where the most computationally intensive tasks are done in hardware and the remaining tasks are done in software. It is also necessary to adjust the performance of each hardware component of the system, by selecting the combination of design parameters, such as pipelining depth or parallelism level, that optimizes the whole system behavior. However, achieving an optimal design remains a challenge, since most real task workloads are dependent on run-time system conditions, such as the application working point and the input data to be processed [10]. Accordingly, traditional design techniques opt to oversize all the blocks for the worst-case scenario. In all but the worst case, hardware logic will be underused. Conversely, in our approach, a flexible and scalable scheme is explored.
In this paper, the Dynamic and Partial Reconfiguration (DPR) of commercial Xilinx FPGAs is leveraged [1] [2]. DPR allows the design of IP cores offering run-time adaptable parallelism, achieving a flexible assignment of resources. In order to take advantage of this, the proposed approach is based on scalable architectures that can change their size at run-time, following a highly modular and regular 2D array structure template previously proposed in [3]. Hence, the scaling process can be carried out by adding or removing regular blocks belonging to the architecture. This feature drastically reduces reconfiguration time and partial bitstream storage needs. Specifically, the explored architecture consists of an arrangement of special purpose Functional Units (FUs) working in parallel. In addition, the regularity of communication patterns through the array reduces the number of distributed memory accesses.
Current video standards allow different levels of scalability and profiles [4]. However, static parallelization approaches have to be designed to support the worst cases, since they are not capable of adapting their behavior to the variable system demands. Furthermore, there exists an interest from both the Joint Video Team of the ITU and ISO/IEC standardization organizations in developing scalable media formats, which has led to the Scalable Video Coding (SVC) standard, an extension of H.264/AVC [5] [6]. An SVC video bitstream might contain all the hierarchical information needed for the three supported scalabilities (quality, spatial and temporal resolutions). However, this bitstream can be decoded in several devices simply by considering different amounts of information. Considering the large number of combinations in the allowed SVC scalability levels, flexible mechanisms are required to adapt the characteristics of the decoder to the requirements of each client during run-time. The use of the run-time reconfigurable scalable architecture proposed in this paper allows for an adaptable scenario. It might be composed of different blocks in charge of video decoding tasks, with the capability of dynamically adapting its size, and accordingly, its performance, to the type and levels of scalability selected by the users. At the same time, this architecture is able to balance, at run-time, the area-performance-energy consumption of each block. This scheme maximizes the use of reconfigurable resources.
Going further in the profiling of the SVC decoding process, the Deblocking Filter is not only one of the most computationally intensive tasks of the standard, but it is also highly dependent on the selected video profile [7]. This is the motivation for the analysis and implementation of a scalable Deblocking Filter fully compliant with H.264/AVC and SVC standards, following the scalable architecture approach. In addition, a parallelization scheme is proposed that performs the scaling process consistent with the data locality restriction desired in the architecture.
The rest of this paper is organized as follows. In Section 2, the role of the Deblocking Filter in H.264 [...] focusing on the importance of flexibility [...]. Then, a review of the state of the art [...] architectures is shown in Section 3. [...] system architecture is detailed, [...] parallelization strategy. Section 5 goes [...] design issues regarding run-time scalability. In Section 6 some implementation details are shown. Finally, in Section 7 some conclusions and future work are discussed.
II. DEBLOCKING FILTER
Recent video coding standards like SVC provide high performance and flexibility in comparison with their predecessors, at the cost of a higher level of complexity and [...] cost. Due to their constraints and [...] unfeasible to ensure real-time restrictions [...] software devices. In order to solve this [...] developments combine hardware/software [...] solutions by moving computationally intensive [...] to hardware (HW); whereas the remaining [...] implemented in software (SW). Thanks to [...] such as [20], [21], [22] and [23]; it is possible [...] which functions deal with HW or SW [...]. Within the encoding and decoding process [...] 30% of the whole computation time [...] deblocking filter (DF). In this sense, the [...] a perfect candidate for being implemented [...]. DF reduces the visual perfection of [...] introduced by other encoding or decoding [...] effect which is particularly characteristic [...] coding standards. The inclusion of DF [...] the bit rate by 5%-10%, while retaining [...] quality [13]. A detailed description of [...] can be found in [14]. According to the [...] SVC standards, the filtering operation is [...]wise. Each MB is composed of 16 basic [...] comprising of an arrangement of 4 x 4 pixels [...] in Figure 1, vertical edge V0 of the block [...] filtered horizontally first from top to bottom [...] edge V1, edge V2, and edge V3. Vertical [...] performed in a similar way. The edge H0 is vertically filtered from left to right, followed by the edge H1, H2, and H3. [...] edge, the filter takes four pixels of each [...] of the edge as its inputs. Consequently [...] MB there exist direct dependencies with neighbors, which are located in different [...]. In addition, the standard imposes [...] filtering over vertical. The [...] applied to each MB, as well as [...] coding mode of each MB, [...] video sequence, as well as some [...]
III. RELATED WORK
Adapting parallelism's levels of processing cores at run-time has received researchers' attention from different perspectives, most of them taking advantage of the unique computation-on-demand possibilities offered by FPGAs including DPR features. For example, some existing approaches explore scheduling and optimal parallelism selection issues, rather than considering architectural challenges. For instance, PARLGRAN [9] is a framework that deals with mapping and scheduling aspects derived from using dynamic parallelism selection. Its purpose is to maximize the performance of an application task chain by selecting the optimal parallelism granularity for each application task. The variable levels of granularity are achieved by instantiating several copies of each task implementation working concurrently. As a consequence, the task workload is equally divided among all instances, reducing proportionally the total execution time. The main limitation of this work is its restricted applicability to data-parallel tasks. Thus, it only considers tasks without dependences between disjoint data blocks, an assumption that is not valid for DF and many other algorithms. In [10], a similar problem is tackled, but this case considers dynamically adaptive reconfigurable accelerators. They propose balancing area and execution time while unrolling application loops, dependant on the inputs and tasks running concurrently in the reconfigurable area. The more the loop is unrolled, the higher the parallelism. Since this work is focused on accelerator selection, no architectural novelties on how different accelerators are built are provided.
After having reviewed existing DPR state-of-the-art approaches working under dynamic conditions, some proposals regarding dynamically scalable architectures will be analyzed next. Achieving scalability by means of DPR allows freely reusing configurable areas for other cores within the reconfigurable logic. However, most existing scalability related works are oriented to adapt core functionality or operation quality, rather than to set area-performance tradeoffs. For instance, the scalable FIR filter provided in [11], offers the capability of adapting the number of taps to adjust the filter order, offering a compromise between filtering quality and required resources. Furthermore, a scalable DCT implementation in [12] allows varying the number of DCT coefficients that are calculated by the core, in order to adapt the number of terms that will be subsequently quantified, and therefore adjust video coding quality. Also a scalable Deblocking filter architecture is proposed in [7], utilizing both DPR and the H.264/AVC deblocking filter. However, this work explores block level parallelism, where the variation of the number of processing units working in parallel affects the execution time of a single macroblock (MB), without dealing with MB dependences. This approach, while still being interesting for flexibility purposes, is limited by the maximum number of 4×4 blocks that can be processed simultaneously, in one MB, with no further scalability levels. In addition, the designed floorplanning does not allow for easy reuse of the area released when shrinking the filter, so real area-performance tradeoffs cannot be easily achieved.
Compared with previously reported architectures, in this work a dynamically scalable Deblocking filter is proposed in order to be integrated within run-time flexible decoders. It exploits a coarser granularity, compared with state of the art approaches, and it allows reusing the released area, in order to balance area-performance tradeoffs. Scalable DF follows a parallelization strategy previously proposed in [8], fully compatible with the scaling process of the architecture, incorporating data dependences between MBs, and allowing scalability at any level, from just one functional unit (FU) up to the maximum number of resources available in the reconfigurable logic area.
IV. GLOBAL SYSTEM ARCHITECTURE
In this section, both the proposed algorithm parallelization and the global architecture are described, focusing on general design issues, while reconfiguration details are offered in the next section.
A. Proposed Algorithm Parallelization Strategy
The number of elements working in parallel in the proposed scalable architecture can be modified at run-time by means of DPR. Larger architectures will be able to process more MBs at a time; however, in order to obtain consistent results, the parallelization strategy has to be scalable itself, since it must respect the data dependences between MBs independently of the size of the architecture. In this sense, all MBs are directly dependent on their left and top filtered neighbor MBs. A detailed analysis of dependencies, as depicted in Figure 2, reveals that the horizontal filtering of the current MB depends on the vertical filtering of its previous neighbor MB, and that the vertical filtering of the current MB depends on the horizontal filtering of its top neighbor. In Figure 2, MB_H refers to an MB after having been filtered horizontally and MB_HV to an MB after it has been filtered both horizontally and vertically. To finish processing an MB, MB_HV has to be filtered horizontally and vertically together with its right and top neighbors, respectively. Once MB_HV is filtered horizontally again, it is depicted as [MB_HV]. A possible solution to exploit MB-level parallelism might entail using a wavefront order in the same way as state-of-the-art multiprocessing solutions [15]. However, this approach needs to wait twice the number of clock cycles required to filter a full MB (an MBcycle), that is, until [MB_HV] is available, before the subsequent core starts processing. Considering this fact, an optimized wavefront order scan has been proposed by the authors in [8], in which horizontal and vertical filtering are separated in sequential stages. As a result, an MB can be filtered one MBcycle earlier, as compared with previous approaches. This is because when the MB vertical filtering begins, the top MB horizontal filtering has already finished, assuming [MB_HV] is available.
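As a rough illustration of the scheduling argument above, the following Python sketch models both scan orders with a single parameter: the number of MBcycles by which a row lags behind the row above it. The function names and the simplified timing model (one FU per MB row, one MBcycle per MB, a lag of 2 for the conventional wavefront and a lag of 1 for the optimized one) are assumptions made for illustration and are not taken from the implementation in [8].

```python
def start_cycle(col, row, row_lag):
    """MBcycle at which the MB at (col, row) starts being filtered.

    Simplified model (an assumption, not the paper's exact timing):
    one FU per MB row and one MBcycle per MB, so an MB waits for its
    left neighbor and for the row above to be row_lag MBcycles ahead.
    """
    return col + row_lag * row


def frame_cycles(width_mbs, height_mbs, row_lag):
    """Total MBcycles needed to filter a whole frame under this model."""
    last_start = start_cycle(width_mbs - 1, height_mbs - 1, row_lag)
    return last_start + 1


# Hypothetical 4x3 MB frame: conventional (lag 2) vs optimized (lag 1).
conventional = frame_cycles(4, 3, row_lag=2)
optimized = frame_cycles(4, 3, row_lag=1)
print(conventional, optimized)
```

Under these assumptions, the saving grows with the frame height, since every additional MB row adds one MBcycle of lag instead of two.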
B. Proposed DF Architecture
The core of the proposed architecture is a coarse grain homogeneous array of FUs, as depicted in Figure 3. Each unit is able to carry out a complete filtering operation on an MB, such that the full array can process a region of the image in parallel. A more detailed description of each FU can be found in [24]. The main strengths of the proposed structure are its inherent parallelism, regular connections and data processing capabilities. In order to feed each FU with the required MBs, as well as to synchronize the array, modules in charge of controlling input and output FIFO memories have also been included. The input controller receives the corresponding MBs sequentially from the memory, and sends them to the array of FUs. In addition, in order to parallelize data provision to the FUs, other modules named Input Memory (IM) have been included at the top of each column of the processing array. The main components of these blocks are FIFO memories that distribute the suitable MBs vertically across each column. In order to capture the MB associated with each unit, a module called Router has been attached to each FU. Once all FUs have been fed with the necessary MBs, they are processed in parallel. MB routers are also in charge of transmitting the processed MBs to the elements after the FU, called Output Memories (OMs). These blocks are based on FIFO memories that store the elements received from the vertical connection, and transmit them again in sequential order to the Output Controller (OC). The OC sends the processed MBs back to the external memory. Data sending, processing and results transmission stages have been pipelined and overlapped iteratively in two phases, as shown in Table I.
All the blocks of the architecture include distributed control logic to manage data transmission. Distributed control makes the architecture fully scalable, by means of the addition or removal of modules. These modules automatically communicate only with their neighbors using shared signals, without having to implement a centralized control designed ad hoc for each possible architecture size, which would reduce scalability.
[...] (Figure 2) have to be shared between the FU filtering an MB, and the FUs in charge of its top and left neighbors. To reduce this overhead, an MB will always be filtered in the same unit as its left neighbor, and its top neighbor will be filtered in the unit below the MB. As explained in the implementation section, specific connections between FUs have been created to allow the exchange of this semi-filtered information, without involving routers. The final allocation sequence for filtering all MBs within a frame depends on the total number of FUs in the array and also on the frame height in MBs. Thus, each FU filters all MBs contained in a particular row of the frame while always respecting data dependencies. However, if the number of FUs is smaller than the frame height in MBs, the filtering process is modified: the frame is processed in stripes whose height equals the number of FUs in the array.
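The row-to-FU allocation and the stripe-based processing described above can be sketched as follows. This is a minimal illustration in Python: the function names and the numbering of FUs from 0 in row order within a stripe are assumptions, not details taken from the implementation.

```python
def allocate(mb_col, mb_row, num_fus):
    """Return (stripe, fu_index) for the MB at (mb_col, mb_row).

    All MBs of one frame row go to the same FU, so mb_col does not
    affect the result; it is kept only to make call sites explicit.
    The FU numbering is a hypothetical convention.
    """
    stripe = mb_row // num_fus    # which horizontal band of the frame
    fu_index = mb_row % num_fus   # FU in charge of this row in the band
    return stripe, fu_index


def stripes_needed(frame_height_mbs, num_fus):
    """Number of stripes required to cover the whole frame height."""
    return -(-frame_height_mbs // num_fus)  # ceiling division


# A frame 9 MBs high (such as QCIF) on an array of 6 FUs.
print(stripes_needed(9, 6))   # two stripes: rows 0-5 and rows 6-8
print(allocate(4, 7, 6))      # row 7 falls in the second stripe
```

In this sketch the last stripe of a 9-row frame uses only three of the six FUs, which is one way to see why arrays taller than the frame, or poorly matched to its height, leave units idle.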
This proposal not only reduces the amount of information transferred among FUs, but also, unlike state-of-the-art proposals, requires only the current MB to be requested from the external memory for each filtering process, as the information related to the neighboring MBs is received from other FUs during the horizontal and vertical filtering stages, as explained above. Consequently, data transferred between the external memory and the DF is also reduced.
V. ARCHITECTURAL DESIGN FOR RUN-TIME SCALABILITY
One of the main advantages of using this kind of highly modular and parallel architecture, when considering exploiting DPR, is the straightforward nature of generating the design partitioning following the Modular Design Flow [16]. Thus, each module will be treated as a reconfigurable element on its own. In consequence, their VHDL descriptions are synthesized, mapped, placed and routed independently. However, since most of the components of the architecture are equal, only a single version of each type (IM, OM and FU) has to be generated, and afterwards it can be reused in different positions of the array. In this section, design issues corresponding to each module type, as well as the main floorplanning decisions taken to achieve scalability, will be described.
Communications between reconfigurable elements, using bus-macros (BMs) [17], will also be detailed. The latest Xilinx dynamic reconfiguration design flow, from the v12 release onward, avoids the use of these fixed macros to guarantee the correctness of communications across module boundaries [18]. Instead of BMs, elements called Partition Pins are automatically inserted by the tool in all boundary pins of every reconfigurable module type to guarantee communication validity. Even though it reduces DPR overhead, this approach does not allow module relocation in different positions of the FPGA, which makes it unsuitable for this kind of scalable architecture, since it relies on module replication. In the following subsections, the implementation of each basic module of the architecture is described.
A. Router and FU
Since each FU is always attached to its router, both have been implemented inside a single reconfigurable module. As mentioned before, the role of the router is to capture the first MB received from the IM in each processing stage, and then to pass the subsequent MBs unchanged to the FUs below. Both data and control vertical connections among routers and the IM/OM have been included in the module border, as shown in Figure 4. Additionally, specific point-to-point connections with the FU below have been created to exchange semi-filtered information, as described in the previous section. In the case of the last FU of the column, the next FU is located at the top of the next column. To tackle this issue, bypass logic was included in both the OM and the FU blocks to send this information upward. In the case of the FU, the bypass connection is highlighted in Figure 4.
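The capture-and-forward behavior of the routers can be illustrated with a small behavioral model. The sketch below is written in Python rather than VHDL, and the function name and the list-based representation of the MB stream are assumptions made for illustration. It only models which MB ends up in which FU during one processing stage, not the control signals or timing.

```python
def distribute_column(mbs, num_fus):
    """Simulate one processing stage of a column of routers.

    Each router keeps the first MB that reaches it and passes every
    later MB unchanged to the router below. Returns the MB captured
    by each FU, top to bottom (None if no MB reached that FU).
    """
    captured = [None] * num_fus
    stream = list(mbs)              # MBs leaving the IM, in order
    for fu in range(num_fus):
        if not stream:
            break                   # nothing left for the lower FUs
        captured[fu] = stream[0]    # router captures the first MB it sees
        stream = stream[1:]         # the rest continues downward
    return captured


# Hypothetical column of three FUs fed with three MBs by its IM.
print(distribute_column(["MB0", "MB1", "MB2"], 3))
```

Because each router applies the same local rule, no router needs to know the height of the column, which is consistent with the distributed control approach described earlier.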
B. Input Memory (IM)
Unfiltered MBs coming from the external video frame memory are transmitted across the IMs' FIFOs, with each IM retaining the MBs to be processed by the column of FUs immediately below it. Consequently, the memory size of the IM limits the maximum vertical size of the architecture. In the final implementation, due to physical restrictions of the FPGA, the architecture was limited to a 2×3 size. Once this memory has been filled, the IM distributes the MBs in the vertical direction, to the FUs of the same column. For this reason, both horizontal and vertical BMs have been included, as shown in Figure 5. Lines to transmit semifiltered MBs have been included for the worst case scenario, as explained in subsection A. Thus, the IM receives semifiltered data from the FU bypass output, and sends it to the IM on the right. In addition, this module sends semifiltered data from the left IM to the first unit of the column below. Furthermore, in the case of the last column, a special communication module has been implemented to send this data back to the IC. To make this feasible, a special horizontal semifiltered bypass has been included through the IM. To tackle the heterogeneity problem of the FPGA, two versions of this block have been implemented to allow for 2D reconfigurations, as shown in Figure 6.
C. Output Memory (OM)
This block includes the logic in charge of receiving the completely filtered MBs from the FU column above, as well as transmitting data to the output controller. It also includes inputs and outputs for transmitting semifiltered MBs. Specifically, it bypasses data coming from the semifiltered MB output of the FU immediately above to its semifiltered bypass input, so it can be transmitted to the next column, as already detailed.
D. Input Controller (IC)
The IC is the input block of the architecture, and it is the communication point with the static part of the system. This block is not dynamically reconfigured when the DF is scaled; only certain registers are configured from the external embedded processor, in order to indicate the current dimensions of the DF and the size of the video frame. With these dimensions, the IC generates the correct MB reading address sequence for any architecture size. Bus-macros corresponding to MB data and control signals have been included, to communicate with the adjacent IM.
E. Output Controller (OC)
The OC is also part of the static design. It receives data from OMs, and sends it back to the video memory. Consequently, BMs have to be located across the static-reconfigurable borders to allow for different size architectures. However, future work will be carried out to connect OC outputs with IMs, in order to have a single communication module with the static area.
The scalability of the full architecture is shown in Figure 6. Because the columns are different, 8 independent bitstreams have been generated: one for the IC, one for the OC, and two each for the FU, IM and OM. In addition, the communication modules have been included to close open connections. Thus, the maximum size achieved is 2×3, using the right half of a medium size Virtex-5 FPGA (xc5vlx110t).
VI. IMPLEMENTATION DETAILS AND RESULTS
This section collects the implementation results obtained considering both the architecture itself and the DPR issues.
A. Architectural Details
Within the reconfigurable stage of this architecture, the array might be formed by different numbers of FUs. In this section, the architecture has been implemented without considering dynamic and partial reconfiguration issues. In other words, different sizes are achieved after a new synthesis process of the full architecture. The number of units varies according to the performance or the on-demand environment constraints. Depending on how the FUs are distributed into the array (m×n), different configurations are obtained, where m and n are the width and height of the array. For a specific number of FUs there exist several configurations, characterized by having the same computational performance, but requiring different HW resources and having different data transfer delays. In this scenario, the resource utilization of each basic block of the architecture is shown in Table II. With regard to synthesis results, which are detailed further in [8], the FU limits the maximum operation frequency of the whole architecture to 124 MHz. Performance variations are shown in Table III. This data outlines the minimum operation frequency necessary for processing different video formats at 30 frames per second (fps), which corresponds to the real-time constraint. Each column refers to a specific number of FUs, while each row expresses frequency values for a specific video format (M×N), where M and N are its width and height in MBs respectively.
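The observation that a given number of FUs admits several m×n configurations can be made concrete with a short enumeration. The Python helper below is only an illustrative sketch with a hypothetical function name; it lists the possible array shapes and says nothing about their resource usage or data transfer delays, which are reported in the tables.

```python
def array_configurations(num_fus):
    """List all (m, n) array shapes with m * n == num_fus.

    m is the width (number of columns) and n the height (number of
    rows) of the array, following the m x n notation used in the text.
    """
    return [(m, num_fus // m)
            for m in range(1, num_fus + 1)
            if num_fus % m == 0]


# Six FUs can be arranged in four ways, including the 2x3 array.
print(array_configurations(6))
```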
Using many FUs is not efficient for processing the smallest video formats, since some units will remain idle during filtering execution. The limit on the number of FUs is determined by the height of the video format expressed in MBs. As an example, the SQCIF and QCIF formats are 6 and 9 MBs high respectively; as a consequence, configurations with more than 6 or 9 FUs, respectively, are not appropriate for these formats. On the other hand, using a low number of FUs is not possible when processing the highest multimedia format, UHDTV, since its associated configurations need more than 200 MHz for real-time performance, whereas the maximum frequency of this architecture is 124 MHz.
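The upper bound on the useful number of FUs follows directly from the frame height in MBs, as the short Python sketch below illustrates. It assumes 16×16-pixel MBs and the standard SQCIF (128×96) and QCIF (176×144) resolutions; the function names are hypothetical.

```python
MB_SIZE = 16  # an MB covers 16x16 luma pixels (16 blocks of 4x4)


def size_in_mbs(width_px, height_px):
    """Frame dimensions (M, N) in MBs, rounding partial MBs up."""
    return (-(-width_px // MB_SIZE), -(-height_px // MB_SIZE))


def useful_fus(requested_fus, height_px):
    """Number of FUs that can actually be kept busy for this frame height."""
    _, height_mbs = size_in_mbs(0, height_px)
    return min(requested_fus, height_mbs)


print(size_in_mbs(128, 96))    # SQCIF
print(size_in_mbs(176, 144))   # QCIF
print(useful_fus(12, 144))     # a 12-FU array on QCIF
```

In this model, any FU beyond the frame height in MBs has no row to work on, which matches the limits of 6 and 9 FUs stated above for SQCIF and QCIF.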
B. Dynamic Reconfiguration Details
In this section, details regarding the reconfigurability of the architecture are shown. Adapting the architecture to be dynamically scalable implies the addition of BMs, as well as the design of an independent floorplanning for each module. As previously shown, each module has been designed to occupy extra area, in order to be able to route all internal signals within the reconfigurable region. Thus, even though not all logical resources within the region are occupied, the entire region is necessary to obtain a fully routed design. Consequently, resource utilization is increased, as can be seen in Table IV, regarding each element, and in Table V, concerning different sizes of the full core. Information in Table IV can be compared with Table II, to see the overhead of designing for DPR. Only the main reconfigurable modules have been included. The area used by each element directly impacts the size of the corresponding bitstreams, and consequently, the DF reconfiguration time itself. Details about the reconfiguration process are outside the scope of this work. It is carried out by means of an Internal Configuration Access Port (ICAP) controller described in [25], which includes specific features to allow module relocation, specifically addressing this kind of scalable application. Thanks to the regularity of the architecture, and making use of DPR, only 8 different bitstream files are enough for configuring any m×n size. These bitstreams can be replicated in different positions of the FPGA to scale the core to any valid size. Consequently, to add an extra row to the DF, only five new modules have to be configured (two corresponding to the new FU row, the output memories and the communication module in the new positions), while the others will remain unchanged.
Furthermore, all these bitstreams are already configured in other positions of the FPGA, so that they can be directly replicated from the inside of the configuration memory, without accessing external memories.
VII. CONCLUSIONS AND FUTURE WORK
This work addresses the design of spatially scalable architectures which are able to adjust their size at run-time, by means of dynamic and partial reconfiguration of commercial FPGAs. Exploiting this feature, a variable parallelism level adjustable to changing application requirements is achieved. Thus, the area-performance trade-off can be balanced at run-time. Furthermore, these benefits are applied to the design of a dynamically scalable Deblocking Filter architecture, where size can be adapted on-line to fulfil the variable requirements of the different profiles and scalability levels of H.264/AVC and SVC video coding standards.
Future work is being carried out in order to improve both the architecture design and its dynamically reconfigurable implementation. Regarding the architecture, performance will be enhanced by optimizing the allocation strategy of MBs to FUs. At the moment, this strategy is fixed. Given a frame size as well as certain DF dimensions, each FU will always process the same MBs. This strategy simplifies DF control, but a new video frame cannot be processed until the previous one has been completely filtered, introducing an extra overhead. In addition, memory consumption and its distribution within the FU will be also optimized, reducing the area of each FU module. This improvement will also impact results of designing for DPR, since routing inside each module will be simplified. Accordingly, FUs will be floorplanned in narrower regions, looking for the homogeneity of both FU columns.
