Abstract. This paper presents a scalable core architecture based on a generic systolic array. The size of such cores can be adapted in real time to meet changing application requirements or to fit the available area in a reconfigurable device. In this paper, the core is scaled by replicating a single processing element using run-time partial reconfiguration. Furthermore, rather than restricting the proposed solution to a given application, it is based on a generic systolic architecture, which is adapted using a design flow that is also proposed. The paper includes a discussion of related work; the proposal and definition of a systolic array communication approach, which does not require specific macro structures and therefore achieves higher flexibility; and a design flow used to adapt the generic architecture. The paper also includes an image filter application as a simple use case, along with implementation results for a Virtex-5 FPGA.
Introduction
Current multimedia applications are offered on heterogeneous terminals with a broad range of features, using different communication networks with variable bandwidth availability [1]. As a result, single devices are expected to deal with multiple coding standards, which emerge and evolve quickly. Consequently, device lifetime is shortened, and replacing devices with new ones that offer advanced features requires a shorter time-to-market.
This challenge can be addressed by making devices more flexible through adaptability. Device adaptation can be based on different parameters, such as the battery level, the available computational power, the target coding standard, or even a profile within a standard. The need for adaptability could easily be fulfilled by software implementations. However, most multimedia-related tasks are compute-intensive and demand high performance and fast execution, which can be achieved in hardware. In this context, reconfigurable computing can fulfill both the performance and the flexibility requirements.
Among the flexibility requirements, there is wide interest in providing solutions that allow the functionality of a hardware block to be scaled in real time. Functional scaling is achieved by modifying the size of the operation performed by a core, depending on the application requirements at a given moment. Such solutions can be advantageous in many domains. Examples include coding standards, where scaling is oriented to variable-size hardware operations, such as the Discrete Wavelet Transform (DWT) presented in [2], the variable-size Discrete Cosine Transform (DCT) in [3], or motion estimation and filters [4], as well as task scaling for multi-standard communication systems [5], [6].
This paper addresses a solution where the functional scalability of a hardware block is achieved by spatially scaling the physical implementation of a core, that is, by modifying the area occupied by the core inside a reconfigurable system. In addition, this kind of solution allows different tradeoffs to be set between the area occupied by a core and its performance. Examples of such systems are the scalable window-based image filter proposed in [7] and the scalable DCT presented in [8].
A direct approach to creating variable-size scalable cores is to implement the same task in several cores, with different performance and area requirements, and to load a suitable one into the system depending on the available hardware resources. Alternatively, highly parallel, modular and regular architectures have been studied as scalable core architectures that reduce the overhead of the adaptation process. These architectures can be scaled by adding and removing parallel blocks, resulting in lower adaptation times. Among the architectures with these characteristics, scalable cores based on systolic arrays and distributed arithmetic are the most common, such as the ones presented in [9] and [10]. Distributed arithmetic provides scalable solutions for arithmetic operations, while systolic architectures can solve entire compute-intensive tasks in a broad range of fields. An interesting summary of different systolic arrays, each one for a specific application field, can be found in [11].
There are several alternatives for implementing the process of scaling a core, such as using parameterizable HDL code, which results in different core implementations once synthesized [12], or applying clock gating to the unused elements [3]. However, the first solution does not permit real-time adaptation, while the second one does not release the unused logic. Therefore, exploiting the partial run-time reconfiguration capabilities of state-of-the-art Field Programmable Gate Arrays (FPGAs) is the widely adopted solution, since it overcomes these limitations. This paper focuses on systolic-array-based scalable cores that permit run-time adaptability. However, unlike the related work discussed in the paper, it presents a general systolic architecture that can be customized, using a proprietary design flow, to solve concrete problems. By replicating a single processing element, the proposed solution makes it possible to scale the functionality of a mapped core at run time, or even to create a new one. The replication of the basic element is carried out by means of dynamic reconfiguration. Additionally, an approach for interconnecting the individual reconfigurable elements of the array, which does not require specific macro structures, is provided. The proposed communication approach makes the systolic array more general and flexible, and also reduces the implementation area overhead of run-time reconfigurable systems.
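As an informal illustration of scaling by replication, the following Python sketch models a linear array of identical processing elements computing a FIR-style weighted sum, where the array grows or shrinks by one element at a time. It is only a behavioural software model written under our own assumptions: the names `ProcessingElement`, `LinearArray`, `grow`, `shrink` and `clock` are hypothetical, the FIR operation is chosen purely for simplicity, and nothing here represents the partial reconfiguration mechanism or the communication approach proposed in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProcessingElement:
    """Hypothetical cell: holds one coefficient and one registered sample."""
    coeff: float
    sample: float = 0.0

    def step(self, sample_in: float, partial_in: float) -> Tuple[float, float]:
        """Latch the incoming sample and extend the partial sum.

        Returns the previously held sample (forwarded to the next cell)
        and the partial sum including this cell's contribution.
        """
        sample_out = self.sample
        self.sample = sample_in
        return sample_out, partial_in + self.coeff * self.sample


@dataclass
class LinearArray:
    """Hypothetical linear array built only from identical cells."""
    cells: List[ProcessingElement] = field(default_factory=list)

    def grow(self, coeff: float) -> None:
        # Stand-in for replicating one more processing element.
        self.cells.append(ProcessingElement(coeff))

    def shrink(self) -> None:
        # Stand-in for removing the last element and freeing its area.
        self.cells.pop()

    def clock(self, x: float) -> float:
        """Advance one cycle: samples shift right, the sum accumulates."""
        sample, partial = x, 0.0
        for cell in self.cells:
            sample, partial = cell.step(sample, partial)
        return partial
```

With two cells of coefficient 1.0, feeding the samples 1, 2, 3, 4 yields the running two-sample sums 1, 3, 5, 7. Calling `grow(1.0)` and then feeding 5 yields 12 (5 + 4 + 3), since the new cell receives the sample forwarded by its neighbour. The point of the sketch is that the operation size is set solely by the number of identical cells, which is the property that makes replication of a single element sufficient for scaling.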
The rest of the paper is organized as follows. Section 2 describes the related work, highlighting the main differences with the proposed solution, which is presented in detail in section 3. Section 4 provides implementation results and a use case of the proposed architecture. Finally, conclusions can be found in section 5.
Related Work
This section reviews run-time scalable cores based on systolic arrays. Some representative works on this specific topic have been selected
