Abstract-Dynamically reconfigurable architectures are emerging as a viable design alternative to implement a wide range of computationally intensive applications. At the same time, an urgent necessity has arisen for support tool development to automate the design process and achieve optimal exploitation of the architectural features of the system. Task scheduling and context (configuration) management become very critical issues in achieving the high performance that digital signal processing (DSP) and multimedia applications demand. This article proposes a strategy to automate the design process which considers all possible optimizations that can be carried out at compilation time, regarding context and data transfers. This strategy is general in nature and could be applied to different reconfigurable systems. We also discuss the key aspects of the scheduling problem in a reconfigurable architecture such as MorphoSys. In particular, we focus on a task scheduling methodology for DSP and multimedia applications, as well as the context management and scheduling optimizations.
Index Terms-Clustering, partitioning, performance tradeoffs, reconfigurable computing, system level.
I. INTRODUCTION

RECONFIGURABLE computing systems combine a processor with reconfigurable hardware. This area of computing is consolidating itself as a real alternative to application-specific integrated circuits (ASICs) and general-purpose processors. The main advantage of reconfigurable computing derives from its unique combination of broad applicability, provided by the reconfiguration capability, and achievable performance, through the exploitation of potential parallelism. The role of the reconfigurable hardware is to implement a large set of applications that have traditionally been suited to ASICs, while providing the required performance. The processor orchestrates the whole system, and it may also be used for computations so as to contribute to the overall throughput. This emerging research field has produced many different architectures [1]-[5], as well as the consequent development of CAD tools [6]-[16].
A reconfigurable device may be mainly characterized by the granularity of the configurable units, the number of configuration levels (contexts), the type of reconfiguration (dynamic or static), the computation model and the memory organization. The features we have mentioned are the deciding factors that will shape the application development framework.
The granularity influences the hardware performance for a particular application. A coarse grain device is not usually suitable for applications with many bit-level manipulations. On the contrary, it is usually a good choice for arithmetic operations at the byte level, as the device is typically composed of reconfigurable basic units such as unsophisticated processors.
Dynamic reconfiguration allows the device configuration to be changed on the fly during system operation. Static reconfiguration, on the other hand, implies a computation stall during chip configuration. Dynamic reconfiguration thus adds a new dimension to the design space: the time multiplexing of logic resources. The Xilinx XC6200 family and the MorphoSys architecture [1] are examples of these devices.
Recently, multicontext architectures have appeared to minimize the reconfiguration overhead [1]. They can store a set of different contexts (configurations) in a context memory. When a new configuration is needed, it is loaded from the context memory if available. As the context memory is on chip, this operation is much faster than reconfiguration from an external memory, which makes it possible to perform a full reconfiguration dynamically. For example, the MorphoSys architecture [1] has a context memory that can store 32 different contexts, and a full or partial change of one context takes only one clock cycle, whereas a single context transfer from the external memory takes more than 200 cycles. Hence, reconfiguration time may be dramatically reduced if the context memory is carefully managed.
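To make these cycle figures concrete, the following sketch counts the cycles spent on reconfiguration for a sequence of context requests. The function name, the FIFO eviction policy, and the flat miss cost of 200 cycles are assumptions made for illustration only; the text states just that a change from the context memory takes one cycle and that a transfer from external memory takes more than 200.

```python
from collections import OrderedDict

def reconfiguration_cycles(requests, cm_size=32, hit_cycles=1, miss_cycles=200):
    """Count cycles spent on reconfiguration for a sequence of context ids.

    Assumptions (not from the paper): FIFO eviction when the context memory
    is full, and a flat miss cost of 200 cycles (the text says "more than 200").
    """
    resident = OrderedDict()  # contexts currently held in the context memory
    cycles = 0
    for ctx in requests:
        if ctx not in resident:
            if len(resident) >= cm_size:
                resident.popitem(last=False)  # evict the oldest context
            resident[ctx] = True
            cycles += miss_cycles             # transfer from external memory
        cycles += hit_cycles                  # switch to the resident context
    return cycles

# Four contexts requested in ten loop iterations.
looped = [0, 1, 2, 3] * 10
fits = reconfiguration_cycles(looped)                 # only the first pass misses
thrashes = reconfiguration_cycles(looped, cm_size=2)  # every request misses
print(fits, thrashes)
```

Under these assumptions, the gap between the two results shows why careful management of the context memory matters: once the working set of contexts no longer fits, every request pays the external transfer cost.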
The computation model of most configurable systems is single instruction multiple data (SIMD) or multiple instruction multiple data (MIMD). Some of them may follow the very long instruction word (VLIW) model. Finally, the memory can be organized as shared address space (SAS) or distributed address space (DAS), and each address space may be arranged into one or multiple banks.
Obviously, the development of computer-aided design (CAD) tools that speed up the design process, exploit the architectural features, and achieve the throughput requirements, becomes extremely necessary. This paper describes a CAD framework which produces executable code from a given high-level input description of an application. It considers all possible optimizations that may be carried out at compilation time, through the scheduling of the context/data transfers and their allocation.
We address the above problem for a multicontext reconfigurable SAS architecture. In particular, MorphoSys is the target architecture. In this paper, we focus especially on the task scheduling methodology, taking into account the exploitation of the architectural features. Moreover, we describe a heuristic for context scheduling and allocation so that the impact of reconfiguration on the latency is minimized.
Within the applications that are commonly addressed in reconfigurable computing, digital signal processing (DSP) and multimedia applications (like image or video processing) are probably the most attractive ones. They have usually been implemented with ASICs since they are computationally intensive with a large amount of potential parallelism, which makes these applications good candidates to be implemented with reconfigurable computing. On the other hand, they have an internal structure with some features that may be exploited. Our approach takes into account these considerations in order to simplify the problem solution.
The paper begins with a detailed discussion of the existing compilation tools, as well as the research related to task scheduling for reconfigurable systems and context management. Section III describes the general development framework and introduces each of the steps that compose it. Section IV then describes the MorphoSys architecture, and summarizes the architectural features that are relevant to this problem. Section V is devoted to the analysis of the methodology and techniques used to solve one of the most important steps of the application development framework, task scheduling. Following this section we discuss the fundamental aspects of context management and present a new methodology to face this issue. Furthermore, we propose a heuristic method that quickly provides a quasi-optimal solution. In Section VII, the theoretical assumptions are validated through a set of experimental results. Finally, Section VIII concludes the paper.
II. RELATED WORK
In this section, we present separately the previous work related to the three major contributions of this paper: compilation framework, task scheduling and context management.
A. CAD Synthesis Tools and Compilers for Dynamically Reconfigurable Hardware
We distinguish two basic approaches to this problem. One of them derives from the research on compilers for parallel machines [6] - [11] , whereas the other has evolved from the background of traditional synthesis tools, including high-level synthesis (HLS) and logic synthesis [12] - [16] . Thus, we will simply refer to them as compilers and CAD tools, respectively.
Compilers for reconfigurable computing usually borrow the techniques commonly used for parallelism extraction. They take advantage of the research for MIMD parallel machines [7] , or VLIW [7] , [8] . On the other hand, [6] considers a SIMD model and proposes some new optimizations. Most of them are developed for a particular architecture (commonly coarse grain), although CAMERON [9] and DEFACTO [11] projects target a general reconfigurable architecture. All these compilers share a very important feature: they use C or very similar notations as the input format. This facilitates the work for programmers who are not familiar with hardware description languages (HDLs), which makes it a very desirable condition.
The granularity of the optimizations is usually fine grain [6] - [8] . However, the compiler for Sassy [17] (a C subset developed in the CAMERON project) is the only work that explicitly mentions the ability to exploit fine and coarse grain parallelism.
Some of these compilers, [7] , [8] , [10] , [11] , use the Stanford University intermediate format (SUIF) compiler infrastructure [18] , which provides a general environment for the exploitation of machine independent optimizations.
The design flow for a typical CAD tool [12]-[16] can be represented as in Fig. 1. As in HLS, the input to these tools is usually specified in an HDL. In the initial steps, they incorporate temporal and spatial (only for multichip systems) partitioning, which divides the computation into partitions that fit in one device. Partitioning relies on the quality of HLS estimations. The goal of these steps is to minimize the transfers among partitions and to maximize the parallelism. The partitions are fed to the logic synthesis tools, which may be provided by the vendors of commercial devices. [16] and [13] are good representatives of this kind of approach. The latter work proposes a genetic algorithm to solve the spatial partitioning for multichip systems. Additionally, the temporal partitioning problem is formulated as an integer linear programming (ILP) model and solved with an ILP solver. The work in [12] additionally addresses important issues such as incremental reconfiguration, whereas [16] considers the control synthesis separately. The environment presented in [15] has a different sequence: the logic synthesis is carried out prior to the partitioning, floorplanning, and routing phases. On the other hand, [14] uses a library with VHDL descriptions of typical functions and algorithms. The behavioral input is synthesized to a netlist that is later partitioned, floorplanned, and routed.
B. Task Scheduling for Reconfigurable Hardware
The scheduling problem in reconfigurable computing is relatively new. Most approaches are versions of existing HLS techniques, extended to consider specific features of reconfigurable systems such as the reconfiguration time [19] - [22] . A heuristic technique based on static-list scheduling, enhanced to consider dynamic area constraints is proposed in [21] , while [22] presents a level-based scheduling algorithm. A new approach to the problem is presented in [23] , [24] , where an ILP model is applied to the temporal partitioning of a task graph. Additionally, [25] proposes a technique for loop fission that reduces the configuration overhead. In [26] , we presented a new approach that finds the task schedule with the optimal execution time. As far as we know, it is the only work that considers some important architectural issues such as multiple levels of reconfiguration, overlapping of computation and data/context transfers and multiple data caches. The work targets MorphoSys, a coarse grain dynamically reconfigurable system.
C. Context Management
One of the major bottlenecks in a reconfigurable system is the reconfiguration time, even for dynamic reconfiguration. Consequently, it is possible to obtain substantial performance improvements through a correct management of the context memory. To the best of our knowledge, [27] and [28] are the only investigations that explicitly tackle this issue. [27] addresses the reconfiguration overhead reduction through configuration prefetch. However, its applicability is restricted to single context architectures. On the contrary, [28] addresses context management for multicontext reconfigurable architectures. It presents a heuristic technique to efficiently minimize the configuration overhead.
III. GENERAL COMPILATION FRAMEWORK
This paper focuses on two different scheduling tasks: "kernel scheduler" and "context scheduler and allocator." The methodologies proposed to solve these tasks (Sections V and VI) represent the main contributions of this work. However, these tasks have been conceived to be integrated within a complete compilation framework, which defines the input information and the goals of every task. Consequently, the overall description of this compilation framework is presented in this section before analyzing the particular tasks that have been solved in this work.
From our point of view, there are two major issues that, unlike in our approach, are only superficially considered by previous work. First, the compilers based on research for parallel machines are focused on parallelism extraction and, therefore, they explore local regions of code, i.e., basic blocks, traces, or hyperblocks [8]. This makes it difficult to consider such issues as the impact of reconfiguration time and data transfers on performance, as well as incremental configuration. All these parameters should be considered from a global point of view. Second, they neither propose any optimization technique regarding reconfiguration management nor explain how to weigh up the reconfiguration overhead that may be produced by different solutions.
On the other hand, the design problem for tools that evolved from HLS basically becomes one of partitioning since it is the only portion of the design process that has yet to be widely investigated. In this approach, the considerable amount of fine grain tasks that compose an application may produce a huge partitioning space, which could result in a suboptimal solution for most cases.
DSP and multimedia applications are usually composed of a group of typical macro tasks, which usually contain an important amount of code. One application example is moving picture experts group (MPEG), a standard for video compression and decompression. The encoding sequence consists of motion estimation, motion compensation, discrete cosine transform (DCT), quantization, inverse quantization, inverse discrete cosine transform, and inverse motion compensation. These functions are also used in a lot of other applications. For example, DCT is widely used in the field of video processing. We will use the term kernel to refer to one of these macro tasks.
The general framework that we propose is presented in Fig. 2. It is a library-based approach. The application description is provided in C code, and we assume it is written in terms of kernels. The kernel library stores the code that specifies not only the functionality, but also the mapping to the reconfigurable units. This mapping is, so far, performed by hand (see [1] for some examples). This approach supports an environment with the following advantages:
- reuse of kernel code, which distributes the initial development costs and enables refined optimizations to get optimal results;
- design modularity, which enables fast application development cycles and easy adaptation to a graphical environment;
- global optimizations may be exploited more easily above the kernel level;
- design space manageability, since the number of kernels is low.
The first step of the framework is information extraction, which generates from the input code all the parameters that the following tasks need. This information includes kernel execution time, data reuse among kernels, data size, data dependencies (as data flow graphs), and context size of each kernel. Data reuse refers to the data that may be used by different kernels if they are stored in the internal memory. This will avoid future reloading of data. Thus, if a kernel produces or uses some data that are used by another kernel, both kernels reuse these data.
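The parameters listed above can be pictured as a small record per kernel plus an annotated data flow graph. The sketch below is only an illustration of such a container: the class names, field names, units, and the numeric values are hypothetical and do not come from the paper; only the kinds of parameters and the MPEG kernel names do.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class KernelInfo:
    name: str
    exec_time: int      # kernel execution time (cycles, assumed unit)
    data_size: int      # size of the data the kernel works on (bytes, assumed unit)
    context_words: int  # context size of the kernel

@dataclass
class ApplicationInfo:
    kernels: Dict[str, KernelInfo] = field(default_factory=dict)
    # DFG edges (producer, consumer) annotated with the amount of reusable data
    reuse: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def add_kernel(self, kernel: KernelInfo) -> None:
        self.kernels[kernel.name] = kernel

    def add_dependency(self, producer: str, consumer: str, shared: int) -> None:
        if producer not in self.kernels or consumer not in self.kernels:
            raise KeyError("both kernels must be registered first")
        self.reuse[(producer, consumer)] = shared

    def predecessors(self, name: str) -> List[str]:
        """Kernels whose results the given kernel depends on."""
        return sorted(p for (p, c) in self.reuse if c == name)

# Hypothetical values for three MPEG kernels.
app = ApplicationInfo()
app.add_kernel(KernelInfo("ME", exec_time=900, data_size=1024, context_words=8))
app.add_kernel(KernelInfo("MC", exec_time=300, data_size=512, context_words=4))
app.add_kernel(KernelInfo("DCT", exec_time=400, data_size=512, context_words=6))
app.add_dependency("ME", "MC", shared=256)
app.add_dependency("MC", "DCT", shared=512)
print(app.predecessors("DCT"))
```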
Later, the kernel scheduler explores the design space to find a sequence of kernels that minimizes the overall execution time. Although each application imposes some constraints on the execution flow, there may be a lot of valid kernel sequences with different execution times.
At this point we know the best sequence of kernels, but there are other events that should be carefully scheduled such as context loading and data transfers. Furthermore, the allocation of context and data will strongly influence the memory utilization and so the scheduling and the final execution time. The context-data scheduler/allocator will simultaneously accomplish both tasks, taking into account some other optimization conditions such as the simplicity of the code and the overhead introduced by the scheduling.
Finally, the code generator will encode the results of the previous steps so that the optimized version of the code is generated. This includes the code of all the kernels, as well as the scheduling specifications. Then, the executable code is generated by the native compiler.
It is obvious that this methodology does not explore some design space regions, which would be obtained by dividing the kernels. However, these solutions would rarely improve the result since subkernels would introduce a lot of data transfers among them. In order to consider the nonexplored regions, it would be possible to introduce another task before kernel scheduling, which would obtain a suitable granularity. In this work, we consider optimizations at the kernel level.
Since an application is always executed as an ordered sequence of configurations and data transfers and the application structure can always be provided in terms of kernels, the proposed compilation framework can be applied to any reconfigurable architecture. However, the specific tasks that compose the compilation framework as well as the kernel mappings stored in the library must take into account the issues of the target architecture. Thus, it would be necessary to know some parameters such as reconfiguration and data transfer time and some architectural features such as architecture reconfigurable blocks, memory organization, data buses, etc. In our case, we have chosen MorphoSys as the real world example of this research. Thus, the different tasks in the framework can only be applied efficiently to architectures with similar features, which are presented below.
Now that the compilation framework has been described, the technical contributions of the paper can be enumerated. We have addressed two of the main tasks of this compilation flow (Fig. 2). The first one is the kernel scheduler, which is described in Section V. We present a technique that explores the design space considering performance optimization. It provides an execution sequence of kernels, which in turn is the input to the next task in the compilation flow, the context-data scheduler and allocator. Section VI is devoted to the analysis of context scheduling and allocation issues, whereas the study of data scheduling is left for future work.
IV. MORPHOSYS OVERVIEW
The MorphoSys architecture is an integrated coarse-grain multicontext reconfigurable system. M1 is its first implementation and, as shown in Fig. 3, it consists of an array of reconfigurable cells (RC), a control processor [a reduced instruction set computer (RISC)], a frame buffer (FB), and a direct memory access (DMA) controller.
The core of this reconfigurable chip is the RC array, which is composed of 64 RC arranged as a grid. The architecture of each RC is similar to the data path of a processor. However, there is no control unit, and the control information is instead included in the context. This context is the binary code that specifies the functionality implemented by the RC array, as well as the active interconnections among cells. The context word controls the arithmetic logic unit (ALU) function, the internal multiplexors, the usage of registers, etc. Context information is stored in the context memory (CM). The CM is a two port (one read, one write) memory that can store 32 different 256 bit context words; 16 corresponding to rows and 16 corresponding to columns. The particular context to be implemented is loaded from the respective CM location to the context register in the RC. The CM allows dynamic configuration. At the same time that a context is being executed, future configurations can be loaded into the context memory.
The frame buffer is composed of two sets, each having two banks. The RC array is able to simultaneously access two independent data words stored in different banks of the same set. While the RC array works on data from one set of the frame buffer, the DMA controller enables concurrent data transfer between the other set and the external memory. Consequently, computation and data movements (RAM ↔ FB) can overlap in time if they are carried out over different sets of the frame buffer. The DMA controller also enables transfers between the external RAM and the context memory.
The MorphoSys system operation is controlled by TinyRISC, a RISC processor whose instruction set has been extended with some instructions for specific control of this chip (change of configuration, data movements). The main architectural issues of MorphoSys that are relevant for scheduling are summarized as follows:
1) computation on the RC array can overlap with:
- context loading on CM row (column) positions, while computation is accessing column (row) positions;
- data transfers (RAM ↔ FB) to/from one set of the FB if computation is using the other set;
2) context loading and data transfer can never overlap.
A typical application can be represented as a loop that iterates many times over a sequence of kernels. The first and the last iterations take different times to execute than the iterations inside the loop body. As the number of iterations is always large (396 for MPEG), only a very small error is introduced if we consider the execution of one iteration in relation to the following and the previous ones. Based on the above considerations, our execution model can be represented as in Fig. 4. The superscripts in the figure identify the iteration. Two kernels use data from set 0. At the same time, the data of the remaining kernel (the results from the previous iteration and the input data for the current iteration) are transferred between the external RAM and set 1 of the FB. A similar reasoning may be applied to the remaining kernel and the rest of the kernel data. Of course, in the next iteration, computations will be carried out on the other set of the frame buffer. The figure also illustrates the fact that context loading cannot overlap with any data transfer operation, but may overlap with computation during a limited interval. This interval is defined as shown in the figure, and it is constrained by the reconfiguration accesses to a particular set of the CM, which make that set unavailable for context loading. For example, Fig. 4 includes a case in which the corresponding set of the CM is continuously accessed for reconfiguration, and it is therefore not available for any context loading.
Note that the execution model that we have defined establishes the features of the target architecture that can be addressed by the kernel scheduling tasks. It has to be dynamically reconfigurable and overlapping of computation with data/context transfers has to be possible.
V. KERNEL SCHEDULER
The target applications of this work are usually described by a data flow graph with loops, whose nodes are kernels. However, data dependencies allow the applications to be scheduled in many different ways. The performance requirements are usually so demanding that nearly optimal solutions are needed for many applications. Therefore, the goal of the kernel scheduler is to find the optimal kernel sequence to minimize the execution time. However, in applications with defined time constraints, the solution only has to meet them.
We will discuss the scheduling issues with the example in Fig. 5. A given application can usually be scheduled in many different ways. Fig. 5(a) and (b) show two extreme schedules of a hypothetical application. If each kernel is executed only once before the execution of the next one (case a), each kernel's context has to be loaded as many times as the total number of iterations. On the other hand, if some data that have been used or produced remain in the FB, they can be reused by other kernels, and data reloading is avoided. At the other extreme, if each kernel is executed for all the iterations before the next one is executed (case b), each kernel's context has to be loaded only once. However, as the size of the data produced and used by any kernel is likely to exceed the size of the internal memory resources, data reuse may be low. Furthermore, in case a data transfers for a kernel can potentially overlap with the computation of any other kernel, and the possibilities of overlapping are maximal. In case b data/context transfers for a kernel can only overlap with its own computation, and the possibilities of overlapping are minimal. The optimal solution usually lies between these two extreme alternatives. Intermediate solutions [Fig. 5(c)] can be obtained if the application is partitioned into sets of kernels that can be scheduled independently of the rest of the application. We use the term partition to refer to one of those kernel sets. In these solutions, context and data reuse, as well as the possibilities of overlapping computation and data/context transfers, lie somewhere between those of the two extreme sequences; nevertheless, the overall effect may produce a better result.
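The effect of the two extreme schedules on context loading can be illustrated with a small count. The sketch below assumes, for illustration, that a kernel's context must be reloaded every time the kernel starts a new run of consecutive executions, that is, that it does not stay resident in the context memory between runs. The function and parameter names are hypothetical; the figure of 396 iterations is the MPEG value quoted earlier in the text.

```python
import math

def context_loads_per_kernel(iterations, consecutive_runs):
    """Number of times a kernel's context is loaded when the kernel is
    executed `consecutive_runs` times in a row before the next kernel."""
    if not 1 <= consecutive_runs <= iterations:
        raise ValueError("consecutive_runs must lie in [1, iterations]")
    return math.ceil(iterations / consecutive_runs)

n = 396                                        # loop iterations for MPEG
case_a = context_loads_per_kernel(n, 1)        # one execution, then next kernel
case_b = context_loads_per_kernel(n, n)        # all executions, then next kernel
intermediate = context_loads_per_kernel(n, 4)  # an intermediate schedule
print(case_a, intermediate, case_b)
```

The count says nothing about data reuse or overlapping, which move in the opposite direction; it only isolates the context reloading criterion.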
From the above discussion, it is clear that there are three optimization criteria, which are summarized as follows:
- context reloading (minimize);
- data reuse (maximize);
- computation and context/data transfer overlapping (maximize).
The scheduling of kernels within a partition is not just the simple ordering of kernels. In architectures with multiple memory sets such as MorphoSys, the scheduling of kernels can be viewed as an assignment of computations to one of these memory sets. The assignments will influence the reuse of data among kernels, as well as the possible temporal overlapping of computation and data movements. For example, in Fig. 4, the computations of two kernels are assigned to set 0, whereas the computation of the remaining kernel is assigned to set 1. Kernels assigned to different sets cannot reuse data, whereas the computation of a kernel can overlap with the data transfers of a kernel assigned to the other set. We use the term cluster to designate a group of kernels that are assigned to the same set of the FB. In Fig. 4 there are two clusters, one for each set of the FB. So scheduling may be reduced to cluster generation.
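The relation between set assignment and clusters can be sketched as follows. The kernel names K1 to K3, the helper names, and the choice to treat a cluster as a run of consecutive kernels on the same FB set are assumptions for illustration; the overlap rule restates the architectural constraint that computation and data movement can proceed in parallel only on different sets.

```python
from itertools import groupby

def clusters(sequence, fb_set):
    """Group consecutive kernels of `sequence` assigned to the same FB set."""
    return [(s, list(group))
            for s, group in groupby(sequence, key=lambda k: fb_set[k])]

def can_overlap(computing, transferring, fb_set):
    """Computation of one kernel can overlap with the data transfers of
    another only if the two kernels use different sets of the FB."""
    return fb_set[computing] != fb_set[transferring]

# Hypothetical three-kernel partition: two kernels on set 0, one on set 1.
sequence = ["K1", "K2", "K3"]
fb_set = {"K1": 0, "K2": 0, "K3": 1}
print(clusters(sequence, fb_set))
print(can_overlap("K3", "K1", fb_set), can_overlap("K1", "K2", fb_set))
```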
The optimization criteria are conflicting, therefore it is necessary to do an exhaustive search in order to find the optimal solution, which would imply the generation of all possible partitions and then the scheduling of every partition. It is obvious that this approach will not be efficient. However, if we find a way to estimate and bound the quality of a partition, it will only be necessary to schedule a few selected candidates. So, our proposal divides the problem into the following two independent tasks: 1) partitioning of the application; 2) scheduling within each partition of the best solution/s found in the previous step. Both tasks have to generate sets of kernels (partitions and clusters) so a single algorithm has been contrived in order to solve them. Below, we provide a description of the exploration algorithm and later both tasks are separately analyzed.
A. Exploration Algorithm
One of the desirable properties of the exploration algorithm is the potential generation of any possible solution, which ensures that the optimal solution can always be found. Hence, we propose a recursive algorithm with backtracking. The starting set is the whole application for partitioning (a partition for scheduling). Partitioning of this initial solution will allow all possible partitions (clusters) to be visited. As stated in the previous section, this solution [Fig. 5(a)] also maximizes two of the optimization criteria: data reuse and possibilities of computation-data transfer overlap. Nevertheless, if the solution has to meet time constraints and we find a way to generate potentially good solutions during the first algorithm iterations, the exploration algorithm could be finished before the exploration is completed. This can be achieved by guiding the search with a criterion. As the starting solution maximizes two criteria, context loading has to be improved with subsequent iterations, while producing the smallest possible negative impact on the other criteria. The possibilities of overlapping can only be taken into account when a partition (cluster) is visited; however, data reuse can be estimated easily, as explained later. Additionally, any reduction of the size of the partitions (clusters) will improve context loading. Consequently, data reuse is the best search criterion. Note that data reuse is now considered to be independent of memory assignment, since the assignment is what the second task of kernel scheduling (scheduling within a partition) will look for and it is not yet known.
The exploration procedure is implemented as follows. Each edge of the data flow graph (DFG) is numbered in ascending order according to the amount of data reuse between the kernels connected by that edge [Fig. 6(a)]. If data reuse is the same for several edges, any order is valid, but numbers are not repeated. Then, the edges are removed in ascending order, generating the branches shown in the exploration tree (Fig. 6). Note that if the tree is visited from the top to any leaf, it is not possible to find a descending edge sequence. This fact guarantees that every solution is generated only once. For example, in Fig. 6(f), edge 2 cannot be removed, as the last removed edge is 3 and removing it would imply a descending order. This process results in the formation of groups of kernels that have no joining edges [Fig. 6(c)-(f)]. Each separated group of kernels forms a potential partition (cluster) whose feasibility has to be checked. For example, in Fig. 6(c) the generated subset is not a partition, since one of the kernels involved needs the results from another. Every time a new partition appears, a different cover of the DFG is built, and therefore a different solution is explored.
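The edge-removal scheme can be sketched as a recursive generator. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel names and reuse values are hypothetical, the feasibility check of each group is omitted, and `prune` is a hypothetical callback standing in for a bound test that stops the exploration below a node.

```python
def components(nodes, edges):
    """Groups of kernels connected by the remaining edges (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

def explore(nodes, reuse, prune=lambda groups: False):
    """Yield the cover obtained for every set of removed edges, once each.

    Edges are numbered by ascending data reuse and may only be removed in
    ascending order along a branch, so no edge set is generated twice.
    """
    ordered = sorted(reuse, key=lambda e: (reuse[e], e))  # edge numbering

    def visit(kept, last):
        groups = components(nodes, kept)
        yield groups
        if prune(groups):
            return  # backtrack: do not split this cover any further
        for i in range(last + 1, len(ordered)):
            yield from visit([e for e in kept if e != ordered[i]], i)

    yield from visit(ordered, -1)

# Hypothetical chain of four kernels with the data reuse of each edge.
nodes = ["K1", "K2", "K3", "K4"]
reuse = {("K1", "K2"): 10, ("K2", "K3"): 30, ("K3", "K4"): 20}
covers = list(explore(nodes, reuse))
print(len(covers), covers[0], covers[1])
```

The first cover visited is the whole application, and the first split removes the edge with the least data reuse, which mirrors the idea of guiding the search by data reuse.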
This algorithm facilitates the pruning of the exploration tree, which is handled by the evaluation of a bounding check (BC). This condition ensures that no subset of the current partition will improve the actual results. So, if BC is false [ Fig. 6(d) ], we perform backtracking to continue the exploration of other potentially better regions.
B. Partitioning of the Application
From the point of view of performance, the quality of a partition can be estimated without performing scheduling. If we assume that parallel computation with data transfer is always possible, the execution time can be directly estimated through the evaluation of an expression. Furthermore, this expression gives a lower bound for the real execution time (see the equation at the bottom of the next page). For a given partition, the expression involves the following quantities:
- the overall minimum number of context memory words that need to be loaded per loop iteration;
- the time spent to load these context words;
- the number of consecutive executions of a kernel before the execution of the next one [Fig. 5(d)];
- the execution time of each kernel;
- the portion of computation time that can overlap with context loading (Fig. 7);
- the time to load the input data from the RAM to the FB, given that a kernel can reuse the data of the previous kernels;
- the time to store the results from the FB into the RAM.
The expression is graphically illustrated in Fig. 7. In the first case of this expression, computation completely overlaps with context and data transfers [Fig. 7(b)]. In the second case, the whole available portion of computation time that can overlap with context loading is used for context loading [Fig. 7(b)], and only the remaining computation time can overlap with data transfers. Sometimes the FB can store the data for several iterations of the considered partition. In this case, every kernel could be executed several consecutive times before the next kernel [Fig. 5(d)]. This is usually called loop fission, and it reduces the context loading rate by the same factor. We have assumed that only one kernel can be executed at a time. The reason is that typical kernels are complex enough to exploit the whole RC array, and simultaneous execution of several kernels would not improve the resulting performance since it would complicate the context and data management, as they would belong to different kernels.
However, as the technology enables more complex devices, the possibility of multiple kernels may be an attractive alternative that will require further research.
When the total number of context words in a partition is bigger than the CM size, the number of context words to be loaded per iteration is computed from the context words that cannot remain in the CM during all the iterations.
Every time the exploration algorithm generates a new partition, its lower bound is computed, and the lower bounds for all partitions in the application cover are added to obtain the overall lower bound. To check whether this cover contains the best partitioning solution, its overall lower bound can be compared with a lower bound for the whole search space (SS), LB(SS). LB(SS) assumes that the three optimization criteria are optimized: overlapping and data reuse are maximum, and context loading is zero. If the overall lower bound and LB(SS) are equal, the best candidate has been found.
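The comparison between the overall lower bound of a cover and LB(SS) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the per-partition lower bounds are taken as given inputs, and times are assumed to be integers (for instance, clock cycles) so that the equality test is exact.

```python
def cover_lower_bound(partition_bounds):
    """Overall lower bound of a cover: the sum of the lower bounds of its partitions."""
    return sum(partition_bounds)


def is_best_candidate(partition_bounds, lb_search_space):
    """Return True when the cover reaches LB(SS), so no other cover can be better.

    Assumption: all times are integers, which makes the equality test exact.
    """
    total = cover_lower_bound(partition_bounds)
    if total < lb_search_space:
        raise ValueError("a cover cannot beat the lower bound of the search space")
    return total == lb_search_space
```

When the function returns False, the exploration simply continues with the next candidate cover.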
Let us now address the bounding check, which was introduced in the previous subsection. Imagine a partition whose data transfer and context loading completely overlap with the computation time . Any subpartition cannot reduce the lower bound for the total execution time. Additionally, if " ", no subpartitioning improves . These checks are jointly expressed in (1) .
Condition (1) is stated for a partition and a set of its subpartitions.

Now, suppose that the size of the data used by a given partition is smaller than the size of one set of the FB. In this case, the best schedule assigns all the computations to one set of the FB and all data movements to the other set, so computation and data overlap is maximum [as in Fig. 7(a)] and the total execution time for this partition equals LB. On the contrary, if the total data exceed the size of one set of the FB, this kind of schedule is not possible and scheduling within the partition has to be performed (Fig. 4). Therefore, if a partition meets condition (1) and its data fit into one set of the FB, no further partitioning will improve its execution time. These two conditions form the bounding check for partitioning:
Let be a partition and a set of subpartitions such that (see equation at the bottom of the page), where stands for the size of the data used by partition .
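The two conditions that form the bounding check can be combined as in the following sketch. It is an illustration only: the function and parameter names are hypothetical, and condition (1) is passed in as an already evaluated Boolean rather than computed.

```python
def bounding_check(meets_condition_1, data_size, fb_set_size):
    """Bounding check (BC) for partitioning.

    Returns False when further partitioning cannot improve the execution time,
    that is, when condition (1) holds and the data fit into one set of the FB.
    A False result tells the exploration algorithm to backtrack.
    """
    fits_in_one_set = data_size <= fb_set_size
    return not (meets_condition_1 and fits_in_one_set)
```

A True result means that the subpartitions of the current partition are still worth exploring.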
The size of the data can be obtained by the sum of all the input data and the results generated by all the kernels. However, if a kernel has already been executed, the data that are not used by the following kernels can be replaced by the generated results (Fig. 8). Thus, if every kernel is executed once before the following one, the data size may be expressed as shown in Fig. 8, in terms of two quantities per kernel:
- the size of the input data for the kernel, except those shared with other kernels;
- the size of the results of the kernel.

When a kernel is consecutively executed more than once, the general expression may be derived from the previous one (see Fig. 8). In this expression, the data used are multiplied by the number of consecutive executions for all the kernels except one of them. To explain this, consider the successive executions of that kernel: each execution requires a different amount of space in the FB, and the maximum value obtained is the real size of the data used by the partition.
The expression for the data size provides a way to compute the number of consecutive executions of each kernel. Its value has to be as high as possible provided that the data of the partition fit into the FB. An important consideration is that, within a partition, the reordering of kernels (when possible) can change the data size. In order to minimize it, the kernel with the largest amount of data used is executed first. Therefore, the data that have already been used can be replaced by the generated results.

In terms of these quantities, the bounding check for partitioning is false if condition (1) holds and the data of the partition fit into one set of the FB; it is true otherwise.
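The effect of kernel ordering on the data size can be illustrated with a small sketch. It does not reproduce the expression of Fig. 8. It assumes a simplified occupancy model in which, while a kernel runs, the FB holds the inputs of that kernel and of all later kernels plus the results of all kernels executed so far. The function names and the sizes are hypothetical, and each kernel is executed once.

```python
def peak_data_size(kernels):
    """Peak FB occupancy for kernels given as (input_size, result_size) in execution order.

    Assumed model: while kernel i runs, the FB holds the inputs of kernels i..n
    and the results of kernels 1..i; inputs of finished kernels have been overwritten.
    """
    peak = 0
    for i in range(len(kernels)):
        pending_inputs = sum(d for d, _ in kernels[i:])
        results_so_far = sum(r for _, r in kernels[: i + 1])
        peak = max(peak, pending_inputs + results_so_far)
    return peak


def largest_data_first(kernels):
    """Order kernels by decreasing input size (only valid if dependencies allow it)."""
    return sorted(kernels, key=lambda k: k[0], reverse=True)


kernels = [(2, 5), (4, 5), (8, 5)]  # hypothetical (input, result) sizes
print(peak_data_size(kernels))                      # 23
print(peak_data_size(largest_data_first(kernels)))  # 19
```

Under this model, executing the kernel with the largest input first frees the most space for the results that follow, which is why the peak drops in the example.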
C. Scheduling Within a Partition
Given an ordered partition (OP), its execution time, ET(OP), may be expressed as shown in the equation at the bottom of the page. The expression is written per cluster and uses the time to load the minimum number of context words of that cluster only, together with the kernel-level variables introduced above, with their usual meaning. This expression has a similar structure to the lower bound and represents the maximal horizontal dimension of Fig. 4. It should be noted that we have assumed that the execution of a cluster always overlaps with the storing of the results generated by the previous cluster and the loading of the input data for the next cluster. The exploration algorithm is used to generate clusters. However, a cluster is only feasible when its data fit into one set of the FB.
For scheduling within a partition, the bounding check is the same as the stopping criterion. If the execution time for a given partition equals its lower bound, there is no other clustering that can improve the result. Moreover, if the execution time of a scheduled partition is no greater than the lower bound of every partition that has not been scheduled and no greater than the execution time of every partition that has been scheduled, then the best scheduling has been found.
VI. CONTEXT-DATA SCHEDULER AND ALLOCATOR
The kernel scheduler generates a solution that can be represented for a partition as in Fig. 7(a). It only imposes a high-level order (sequence of kernels); the context loading and data movements have not yet been scheduled (compare Figs. 4 and 7). During the kernel scheduling task, these events are assumed to be optimally scheduled. So, if the scheduling of these actions is not carefully performed, the performance predicted by the kernel scheduler will not be achieved. Therefore, context-data scheduling becomes a crucial task whose goal is to achieve the degree of performance that the kernel scheduler assumes.
In the type of applications we deal with, the data requests and sizes are known before execution. This enables all the possible optimizations to be carried out at compilation time, without any penalty on the execution time.
Similarly to the kernel scheduler, the context-data scheduler and allocator has to maximize the overlapping of context and data transfers with computation so that the final execution time is minimized. The selection of the positions for the data in the frame buffer and the context words in the context memory (so-called memory allocation) is closely related to the scheduling. First, a bad use of the memory resources may force some additional data loading that otherwise could be avoided. Second, it may generate memory fragmentation, which has two negative effects on the execution time: memory access initialization time and the generation of more complex code (next step in the framework). Every time a loading process has to be started, a new loading statement is introduced in the code. This overhead may delay the execution of the following instructions and decrease the final system performance.
There are more considerations regarding the complexity of the final code. As stated before, the schedule of a typical application can be represented as a loop of a series of kernels. This facilitates the translation into a programming language. Similarly, from the data and context point of view, it is desirable to obtain a periodical schedule in order to minimize the complexity of the final code. We will say that this schedule is periodical in time.
The above reasoning can also be applied to spatial recurrence. If data are placed in the same positions (periodical in space) as they were in the previous iteration, the form of the statements will be the same. Otherwise, different statements may be needed for loading of the same data in different iterations.
Of course, there may be solutions whose ideal execution time is lower than the periodical solution. However, the increase in code and the resulting processor overhead that the nonperiodical solution introduces will make it a worse solution.
To summarize, it is desirable to find a solution such that
- overlapping of context and data transfers with computation is maximized;
- data and context memory fragmentation is minimized;
- schedule is periodical in time and space.

To address this problem we propose a methodology that divides it into the following three tasks:
1) distribution of overlapping time between context and data transfers;
2) context scheduling;
3) data scheduling.

The first step is taken so that the two following tasks may be addressed as separate problems. It distributes, between context and data transfers, the time available for overlapping with computation. This decision divides the problem into two subproblems and is based on the small interaction between both tasks.
Of course, context scheduling is only necessary when the total number of context words exceeds the size of the context memory.
The next two subsections describe the first and second step in this methodology. The case of data scheduling will be addressed in future work.
A. Distribution of Overlapping Time
There is an important difference between context and data storage. A kernel always uses the same context words, whereas the data may change from one execution to another. If some context words are kept in the context memory until the execution of the next iteration, their reloading will be avoided. Furthermore, if time for overlapping is available during the execution of a kernel, it can be used to load context words of other kernels in advance. That loading is then skipped later, which saves time in case there is no time for overlapping in future executions. We could say that context loading gives more "optimization choices" than data transfers. Consequently, it is good to maximize overlapping between data transfers and computation first, and then try to overlap context loading with the remaining free computation time. Therefore, the total computation time available for overlapping with context loading corresponding to rows (columns) is limited to the time that is not used by data movements. Furthermore, it is also limited by the free time of the row (column) CM locations. Finally, the number of free positions in the context memory also limits the number of context words that can be loaded, so the time available depends on the size of the context memory corresponding to rows (columns). All these ideas can be simultaneously expressed in a single expression, in which a maximum is used to ensure that this time is never lower than zero.
In summary, data movements use as much time as they need and the free time (if available) is used by context loading.
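The three limits discussed in this subsection can be combined as in the sketch below. It is not the paper's expression: the function name, the parameter names, and the way the number of free positions is converted into time (free words multiplied by the loading time of one word) are assumptions made for illustration.

```python
def time_available_for_context(computation_time, data_transfer_time,
                               cm_free_time, free_cm_words, word_load_time):
    """Computation time that context loading may overlap with (hypothetical model).

    Data movements are served first; context loading gets what is left, further
    limited by the free time of the CM locations and by the number of free
    CM positions. The outer max keeps the result from going below zero.
    """
    left_after_data = computation_time - data_transfer_time
    capacity_limit = free_cm_words * word_load_time
    return max(0, min(left_after_data, cm_free_time, capacity_limit))
```

For example, with 100 time units of computation, 60 of data transfers, 50 of free CM time, and 8 free words that take 4 units each to load, the result is 32, because the number of free positions is the tightest limit.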
Let us define TAOC as the total computation time that can overlap with context loading, regardless of whether it corresponds to rows or columns. The experimental results have shown that the way TAOC is split between rows and columns has no impact on the overall execution time (see Table III in Section VII). Because of this fact and for the sake of simplicity, in the next sections we will assume a single type of CM locations without distinguishing between rows and columns. Each of them will need a separate and equivalent process. Hence, we will use TAOC to refer to either the row or the column quantity.
B. Context Scheduling
As context words are reused in each iteration, it is important to keep as many words as possible in the memory, while performing the minimum number of context reloadings that do not overlap with computation. Therefore, we first have to establish which context words have to be loaded, and which ones may stay in the memory, so that the total execution time is minimized. At the same time, the decision over which loadings will overlap with computation is made. The process just described will be called context selection, its goal being to minimize the total number of context loads that do not overlap with computation. This is a better criterion than avoiding as many context loadings as possible since, for example, when some computation time is available for overlapping, it may be possible to reduce or even avoid context loadings that do not overlap with computation.
Once the specific context words that will be reloaded have been chosen, it is necessary to select the memory locations that these words will occupy. This context allocation process takes into account the initialization time that results when a block is placed in nonconsecutive locations (memory fragmentation). Context selection has a bigger impact on the overall execution time than context allocation, so it will be considered first.
The context scheduling task can be applied to any multicontext reconfigurable architecture, since multicontext is the only architectural issue that we will assume in this task.
1) Context Selection:
A mathematical model has been devised to state the problem, and it has been solved through two different approaches. The first one explores all feasible solutions in search of an optimal result, and its role is to validate the results reached by the second one, which is a heuristic.
Let us consider the example shown in Fig. 9 to illustrate several definitions that will be used later. It shows a partition whose kernels have been scheduled and whose context loadings we now want to schedule. As the total number of context words is 42, which exceeds the size of the CM taken in this example, it is necessary to reload some of them.
We present our method through a matrix, PREP EX. Each column represents the number of context words in the CM that correspond to one kernel. Different rows stand for different instants. The first row takes into account the context words that are in the CM just before the execution of the first kernel (the context words have been prepared for execution), while the second one represents the context words loaded in the CM just at the end of that kernel's execution. Similarly, the third and fourth rows and the fifth and sixth rows have the same role with respect to the second and third kernels. This matrix is split into PREP (odd rows) and EX (even rows), which group the rows concerning preparation (before execution) and those related to execution itself. Since the whole context of a kernel has to be in the CM during its execution, the entry of that kernel in its EX row (the number within a circle) always equals its total number of context words. Similarly, the corresponding PREP entry (within a rectangle) always equals the same number, as the context words of the kernel are "prepared" for its imminent execution. Therefore, the increase of a kernel's entry from an EX row to the next PREP row is the number of its context words that have been loaded without any overlap with execution, just after the execution of the previous kernel. On the contrary, the increase from a PREP row to the following EX row is the number of context words of a kernel that have been loaded overlapping with the execution of the current kernel. Thus, the goal of the context selection is to minimize the number of context words that are reloaded without overlapping with computation.

a) Mathematical Model: We present here a mathematical model whose solutions (EX and PREP arrays) are valid context distributions. Two cost functions will be used to quantify the quality of the solutions. From the discussion in the previous subsection, it is clear that all the context words have to be in the CM during the whole execution of a kernel. Thus, the PREP and EX entries of the executing kernel both equal its total number of context words.
The total number of words in the CM cannot exceed its size, SCM. Moreover, there are no free positions, since they would be used in order to minimize the number of context loads. So every row of PREP and every row of EX adds up to SCM. The number of context words allocated to a kernel will never exceed its total number of context words, so no entry of PREP or EX is greater than the total for its column. Only a portion of TAOC is assigned to each kernel. It may be used for context loading during the execution of that kernel, and it limits the number of context words loaded in overlap with that execution. Moreover, the portions assigned to all the kernels cannot add up to more than TAOC. Any proposed solution that meets all these expressions is a feasible solution. In order to evaluate the quality of this solution, we will use a cost function given by the number of context loads that do not overlap with computation.
If several solutions have the same value of this cost function, the best one has the minimum number of context loadings that overlap with computation. Hence, the secondary cost function is the number of context loadings that overlap with computation.

b) Heuristic Approach: During the complete execution of a kernel, all its context words are in the CM. So, if it is possible to overlap context loading with computation, new context words have to be stored in positions that are not occupied by the current kernel. If a kernel has just been executed, its context words have the highest probability of being replaced, since all other kernels have to be executed before its next execution. If there are not enough locations, the previous kernel in the execution flow will also be used. The rest of the context words remain in the CM. Consequently, we can compute each row of EX from the corresponding row of PREP, taking into account the portion of TAOC assigned to the kernel that has not yet been used. The same reasoning is applied if the context loading does not overlap with computation. However, all the CM locations are then candidates. The first locations replaced belong to the kernel that has just been executed, and the previous ones are used if necessary. Consequently, we can compute the next row of PREP from the current row of EX, taking into account the number of locations that are currently being used. Both expressions are used within two iterative procedures, so that one EX row and one PREP row are generated. They are shown in Figs. 10 and 11. Notice that each row is generated from the previous row. Therefore, it is necessary to produce a starting row. The process is repeated until a periodical solution is found or, otherwise, it stops after a number of iterations and selects the best solution found.
The starting row is computed by allocating the CM positions so that the number of loadings for each kernel is the same. We can imagine that there is a free block in the CM and that, every time a kernel is going to be executed, its own context words occupy the free block. In a particular case, this kind of solution directly leads to a periodical one, because the same number of context words are loaded into the same positions in each iteration. In the general case, the heuristic builds the solution.

Regarding the distribution of TAOC among the kernels, we have found from the experiments that the best solution is obtained when TAOC is divided as equally as possible among all the kernels if the total number of kernels (NT) is even. Otherwise (NT odd), the division is carried out as in Fig. 12. TAOC is expressed as an integer number of one-word context loading times, and in PREP EX the portion assigned to a kernel is represented as a superscript associated with the circled elements.
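The replacement rule of the heuristic approach, for context loading that does not overlap with computation, can be sketched as follows. The sketch does not reproduce the procedures of Figs. 10 and 11. The function names are hypothetical, and the per-kernel context sizes (16, 14, and 12 words, adding up to the 42 words of the example) and the CM size of 32 words are assumed values chosen for illustration.

```python
def victims_after(executed, target, n):
    """Kernels whose words may be replaced, starting with the one just executed
    and moving backwards through the execution flow; the target is excluded."""
    order = [(executed - step) % n for step in range(n)]
    return [k for k in order if k != target]


def prepare_next(cm, context_sizes, executed):
    """Build the next PREP row from the current EX row (non-overlapped loading).

    cm[k] is the number of CM words currently holding context of kernel k.
    The next kernel receives its missing words by replacing words of the kernel
    that has just been executed first, and of earlier kernels only if necessary.
    """
    n = len(cm)
    target = (executed + 1) % n
    missing = context_sizes[target] - cm[target]
    row = list(cm)
    for victim in victims_after(executed, target, n):
        take = min(row[victim], missing)
        row[victim] -= take
        row[target] += take
        missing -= take
    return row


context_sizes = [16, 14, 12]   # hypothetical, 42 words in total
ex_row = [16, 10, 6]           # hypothetical EX row after kernel 0, CM size 32
print(prepare_next(ex_row, context_sizes, executed=0))  # [12, 14, 6]
```

Each call moves exactly the missing words, so the row total stays equal to the CM size and the prepared kernel always ends up with its whole context in the CM.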
2) Context Allocation: The context selection task obtains a solution that provides information about the optimal configuration of context words in the memory (periodical "in time"), as well as whether the context loads overlap with computation or not. The next step is to decide where to place each context word so that memory fragmentation is minimized, while providing a periodical solution "in space."
In the context selection solution, there are some context words that remain in the CM (static context words) during all the iterations and some others that are reloaded (dynamic). We can always choose to allocate the dynamic positions in a single block (see Fig. 13 ). Therefore, the problem is reduced to studying the allocation of the dynamic context words.
The specification of a context selection configuration includes complete information about which context words have to be replaced and which ones have to be placed. This fact constrains the context allocation task. For example, in Fig. 13 the number of context words of one kernel decreases by four words from the fourth row to the fifth row of the dynamic PREP EX. Hence, these free positions are taken by four context words of the third kernel. However, there could be a better solution that uses a position belonging to a different kernel. In order to consider all the possibilities, we will not specify how many context words disappear from one row to the next one. Once a kernel has been executed, all its dynamic positions are free and the context allocation task decides which ones to use. We use a representation of all the possibilities as presented in Fig. 13 (set of equivalent dynamic PREP EX).
The allocation process now has to avoid the fragmentation of the context blocks that are loaded as a whole in one step. Thus, a graph (Fig. 14) representing the dependencies among the context blocks is generated, so that all possibilities can be explored in an orderly manner. A node in the graph represents a dynamic block. An edge is added between the nodes if the source node disappears from the set of dynamic PREP EX and the destination node appears later.
Let us consider the set of dynamic PREP EX array in Fig. 13 . If we suppose that the context loadings that are necessary in each row are loaded consecutively, but independently of the other rows, there are eight context blocks. These blocks are identified with letters from " " to " " and the number indicates how many context words the block has. The graph of the context block dependencies is represented in Fig. 14 . For example, the block " " is connected to " " and " ." However, for " ," the dependencies cannot go further than the first row of PREP EX (i.e., " ") because all its dynamic positions are occupied.
The process is now divided into two steps. First, we look for a periodical solution, because it is the most important condition. Then, we try to place the context blocks in consecutive positions.
In order to find a periodical solution, the graph is explored from the sources until it reaches a context word that belongs to the starting kernel. If this is not possible, we perform backtracking.
The context words that are added to a path are removed from the graph. If a nonperiodical path is built, we perform backtracking to look for a periodical path. The whole process is applied until there are no context words in the graph. Afterwards, we try to place the paths in the context memory, so that context words that belong to the same block are in consecutive positions. The final solution for our example is shown in Fig. 14. All the blocks except one are placed in consecutive positions. Sometimes, the optimal solution implies some fragmentation. This usually happens when (as in Fig. 14) the TAOC distribution among the kernels is not uniform. If it were uniform, the sizes of the dynamic blocks would be similar and, therefore, fragmentation could be avoided.
VII. EXPERIMENTAL RESULTS
In this section, we present the experimental results for MPEG and ATR, as well as a group of synthetic experiments, in order to demonstrate the quality of the proposed methodologies for kernel scheduling and for context scheduling and allocation. MPEG is a standard for video compression. ATR stands for automatic target recognition, and we analyze the main tasks that compose it, final identification (FI) and second level detection (SLD). Experimental input data have been obtained by manual analysis (we have not yet implemented the information extractor), except for a few parameters, among them TAOC, which have been estimated. The experiments differ in data dependencies, number of kernels and/or kernel information (execution time, number of context words, input data, DFG, etc.).
A. Kernel Scheduler
The experimental results for kernel scheduling are summarized in Table I. As we expected, the best solution is found during the first iterations. So, although we should explore the whole search space in order to guarantee with absolute certainty that the best solution found so far is optimal, it is only necessary to explore a small fraction of the search space (less than 20% for partitioning and 10% for scheduling within a partition in most cases). Thus, not only good solutions but even the optimal solution is typically generated during the first algorithm iterations (Table I, columns 3 and 5). This fact could considerably reduce the exploration time for applications where near-optimal solutions are enough.
The results also show the validity of the methodology. Splitting kernel scheduling into two steps improves the algorithm performance, since the scheduling within a partition only has to be performed a few times. Otherwise, it would be necessary to carry out both steps, generating a partition and scheduling within it, for the whole search space. The quality of the lower bound as an estimation of the execution time (Table I, columns 2 and 4) allows the best candidates to be selected. This, together with the bounding check, leads to a quick pruning of the search space (Table I, exploration time column).
B. Context Scheduling and Allocation
The heuristic context selection (Tables II and III) can find a nearly optimal solution in most cases in a few algorithm iterations. As a matter of fact, although Table II presents the exploration time for context selection and allocation together, the contribution of the former is the smaller one and is always less than 0.1 s. The optimal values are provided through the exhaustive exploration of the mathematical model and are only used to validate our heuristic technique. However, this exhaustive method cannot be used for practical purposes, due to its huge exploration time.
Table II presents the results for both uniform and nonuniform TAOC distribution. The latter distribution has been generated in all experiments so that at least one of the portions assigned to the kernels differs from the average by 30%. As shown, the uniform distribution usually provides better results. In some experiments (Table II) the absolute error is bigger than usual. This leads to high relative errors for experiments with a low number of context loadings. However, in all the experiments a very important conclusion may be drawn from the results: the quality of the solutions obtained strongly supports the assumption of the kernel scheduler that context scheduling is performed in an optimal way. This supports our decision to split up kernel scheduling and context scheduling. It should be noted that the column labeled C i gives an upper bound of the number of context loads, which could be obtained by a trivial algorithm (all the context words are loaded in every iteration). In all cases the solution found (NCL) is much closer to the optimal value (optimal NCL) than to the upper bound C i.
The context allocation task always finds a solution with very low fragmentation. In order to consider the quality of the algorithm, we present the exploration time that is necessary to generate low fragmentation solutions. As shown, in all cases a fragmentation lower than three or four is generated in less than one second. It should be noted that further optimizations produce very little performance improvement, or even none, in most cases.
Finally, Table III illustrates that there is no dependence between the total number of context loadings and the TAOC distribution between row and column positions of CM.
VIII. CONCLUSION
In this work, we have presented a general strategy to face the scheduling problem in reconfigurable computing. Two of the most important tasks in this framework, task scheduling and context scheduling and allocation, have been discussed in detail, and we have proposed techniques to solve both problems. In particular, we have addressed the problem for the MorphoSys architecture. The task scheduler finds the optimal solution through the exploration of a pruned search space. The guided search algorithm generates good solutions in the first iterations in order to reduce the exploration time.
The problem of context loading management is divided into two tasks: context selection and allocation. We proposed a mathematical model for context selection and two different ways to solve it. The first one is a heuristic for context selection, which provides a nearly optimal solution in just a few iterations. The second alternative is used to validate the heuristic through the exhaustive exploration of the feasible solutions. The comparison of the experimental results shows the good quality of the solutions that can be obtained. Finally, we have solved the context allocation through the exploration of a block dependency graph, so that the memory fragmentation is minimized. Future work will address data management at compilation time.
