Accelerator cores in low-power heterogeneous processors have on-chip local memories to enable parallel data access. Because the capacities of these local memories are very small, data must be transferred from the global memory to the local memories many times, and these transfers greatly increase the total processing time. A memory allocation technique that increases data sharing is a good solution to this problem. When reconfigurable cores are used, however, the data must be shared among multiple contexts, and conventional context partitioning methods only consider how to reuse limited hardware resources in different time slots. They do not consider data sharing. This paper proposes a context partitioning method that shares both the hardware resources and the local memory data. According to the experimental results, the proposed method reduces the processing time by more than 87% compared to conventional context partitioning techniques.
Introduction
Image processing has become a key area in the fields of consumer appliances, high-safety vehicles, security systems, etc. Since mobile appliances such as digital cameras, mobile computers and vehicles are used in these fields, power consumption is a critical factor. Conventional CPU-based image processing systems consume a lot of power and are very difficult to use in mobile appliances. Therefore, an effective way to implement image processing in mobile appliances is to use low-power heterogeneous multicore processors that contain different cores such as CPUs and accelerators. Examples of heterogeneous multicore processors are given in [1] and [2]. The former has multiple CPU cores and dynamically reconfigurable ALU arrays. The latter has multiple CPU cores, a micro-controller and SIMD (single-instruction multiple-data) type multi-bank matrix processors.
In heterogeneous multicore processors, the area of a core has to be reduced to integrate many cores. Using dynamically reconfigurable accelerators is an effective way of reducing the area. Dynamically reconfigurable accelerators have multiple contexts that share the same resources in different time slots [3]. A context represents the circuit configuration that belongs to a particular time slot. To share the resources, a large application has to be divided into several small parts in such a way that each part is small enough to be mapped onto a context. This process is called context partitioning.
In traditional context partitioning, the most important goal is to share and reuse the same processing elements (PEs) among different time slots. However, in heterogeneous multicore processors, data sharing between contexts is more important than PE sharing since the data transfer time is usually larger than the computation time. Figure 1 shows the hierarchical memory architecture in reconfigurable accelerators. It contains a large memory module (global memory) placed outside the accelerator core and several small memory modules (local memory) placed inside the accelerator core. Data are often transferred between the global memory and the local memory, and these transfers greatly increase the total processing time. To reduce the data transfer time, the amount of data transferred to the local memory must be reduced. To reduce this amount, more data should be shared among the contexts in the accelerator.
To explain the data-access-driven context partitioning, let us consider the example given in Fig. 2. Figure 2 (a) shows the image data accessed in the computations of window 0 and window 1. Figure 2 (b) shows the computations of each window. According to Fig. 2 (b), the computations of both windows are the same. Therefore, conventional context partitioning requires only a single context. In the conventional context partitioning, we have to allocate the data to the memory modules as shown in Fig. 2 (d) to realize the computations of the two windows. In the memory allocation result in Fig. 2 (d), the data (1,0) and (1,1) are allocated to memory modules 0 and 1. Although only six data are required, we have to transfer eight data to the memory modules. Figure 2 (e) shows the result of the proposed context partitioning. There are two partitions with different interconnection networks between the memory and the PEs. The computations of windows 1 and 2 are done in contexts 1 and 2 respectively. To realize the computations, we have to allocate the data to the memory modules as shown in Fig. 2 (f). To implement this memory allocation result, only six data have to be transferred. The data in memory 1 are shared between the two contexts. When the amount of transferred data decreases, the data transfer time also decreases.
Memory allocation methods based on data sharing to reduce the data transfer time have already been proposed in previous works [4], [5]. The work in [4] proposes a method to accelerate optical-flow extraction by reducing the data transfer time. The work in [5] proposes a memory allocation method for image processing that reduces the data transfer time by increasing the data sharing. To get the maximum advantage from such works, we need an efficient context partitioning method to access the shared data. However, such a method has not been discussed in previous works.
The main contribution of this paper is the study of an efficient implementation of image processing in heterogeneous multicore processors. For this purpose, we extend the approach of the work in [5] to propose an efficient context partitioning method to access the shared data and to reuse the hardware resources. We discuss how to divide a large task into partitions, how to map the partitions to the reconfigurable accelerator by assigning them to contexts and how to schedule the contexts to effectively access the shared local memory data.
The remainder of this paper is organized as follows. Section 2 discusses the related work on memory allocation and context partitioning. Section 3 explains the proposed context partitioning method for reconfigurable accelerators. Section 4 shows a mapping example to explain the proposed context partitioning. Sections 5 and 6 present the evaluation and the conclusions respectively.
Related Works

Previous Partitioning Methods and Their Problems in Heterogeneous Multicore Processors
Context partitioning techniques for reconfigurable processors have already been proposed in [6] and [7]. A force-directed scheduling algorithm to partition sequential circuits is proposed in [6]. A temporal partitioning technique for reconfigurable processors is proposed in [7] to reduce the execution time of contexts. Although these methods discuss how to divide a large application to reuse the PEs, they do not consider how to access the shared memory. In heterogeneous processors, efficient use of memory is more important than reusing PEs since the biggest problem in heterogeneous processors is the large data transfer time.
To use memory effectively, memory allocation techniques are proposed in [8]-[10]. A hierarchical matching approach for stereo matching to reduce the computation amount is proposed in [8]. The parallel access of multiple memory modules is discussed in [9] and [10]. These methods assume that random access is possible and that the allocated data can be accessed at any time from any memory address. In recent heterogeneous multicore processors [1], address generation units (AGUs) are included in the accelerator cores to increase the address generation speed. AGUs contain only simple hardware units such as adders and counters to reduce the accelerator core area. Therefore, the memory access is restricted by the AGUs to the most common memory access patterns such as stride-access, and the traditional memory allocation techniques cannot be applied.
To explain the restricted memory access using AGUs, let us consider Fig. 3. It shows the memory access order of a stride-access. The horizontal axis is the time and the vertical axis is the memory address. When the memory access order is given as in Fig. 3, the memory address accessed at time t is given by Eq. (1).
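Equation (1) is not reproduced in this copy of the text. Judging from the description of Fig. 3, it plausibly takes the form below, where the base address stays constant within each run of P control steps. This is a reconstruction under that assumption, not the authors' typeset equation, and the subscripted form of c(t) is introduced here only for illustration.

```latex
\begin{equation}
A(t) = c(t) + r \times (t \bmod P), \qquad c(t) = c_{\lfloor t / P \rfloor} \tag{1}
\end{equation}
```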
The relationship between the memory address and time given by Eq. (1) is called the addressing function. The terms r, P and c(t) are the "address-increment", the "stride-width" and the "base-address" at time t respectively. As shown in Fig. 3, the address equals the base address c 0 at time 0. The address increases by r in each control step. That is, when the time equals 1, the address equals c 0 + r. Similarly, the address increases linearly by r for P control steps. When the time is P − 1, the address is c 0 + (P − 1) × r. When the time is P, the address does not increase by r but takes a different value c 1 as shown in Fig. 3. During the next P control steps, the address again increases linearly by r in each control step.
Figure 4 shows the addressing-function-constrained memory allocation. Figures 4 (a) and 4 (b) show the coordinates of the pixels of an image and the control steps where a set of pixels are accessed respectively. To access these pixels, we use a simple addressing function. Figure 4 (c) shows one possible memory allocation. In this example, the pixel [0,1] is copied to two memory locations: 0x01 and 0x05. Similarly, the pixel [1,2] is copied to 0x04 and 0x08. Even though we need to access only 8 pixels, we have to transfer 10 pixels to the local memory modules, two of which are duplicates. This is called "the data duplication problem".
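The stride-access addressing function described above can be sketched in a few lines of Python. The sketch assumes the address has the form c(t) + r × (t MOD P), with the base address held constant for each run of P control steps. The function name `stride_addresses` and the sample values are hypothetical and are not taken from Fig. 3.

```python
from typing import List, Sequence


def stride_addresses(r: int, P: int, bases: Sequence[int]) -> List[int]:
    """Return the addresses generated over len(bases) strides of P steps each.

    Assumed form: A(t) = c(t) + r * (t mod P), with c(t) = bases[t // P].
    """
    addresses = []
    for t in range(P * len(bases)):
        base = bases[t // P]  # c(t): constant within one stride
        addresses.append(base + r * (t % P))
    return addresses


if __name__ == "__main__":
    # Hypothetical parameters: increment 1, stride width 3, bases 0 and 4.
    print(stride_addresses(r=1, P=3, bases=[0, 4]))
```

Because only an adder, a counter and a table of base addresses are needed, a generator of this kind cannot produce arbitrary access orders, which is why a pixel may have to be stored at more than one address.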
Memory Allocation Used for the Proposed Context Partitioning
To solve this problem, addressing-function-constrained memory allocation is proposed in [5]. Since this paper is an extension of the work in [5], we briefly describe the addressing-function-constrained memory allocation. The targeted application in [5] is window-based image processing. Window-based image processing refers to the processing of data in blocks. Such blocks are called windows. Many image processing applications such as mathematical morphology [11] and stereo matching based on SAD calculations [8] contain window-based processing. The data are accessed in windows from left-to-right and top-to-bottom as shown in Fig. 5 (a). This window access is similar to the "raster scan" in image processing. The segment of data accessed by moving a window left-to-right is called a "horizontal-block". The scanning order of the pixels inside a window is shown in Fig. 5 (b). The pixels are scanned in columns from left-to-right.
The memory allocation in [5] is defined as follows. The image data are allocated to m memory modules. The value of m is defined by Eq. (2) such that the window height H window is a multiple of m.
Note that N is the set of natural numbers. For example, when H window is 12, the number of memory modules should be one of 1, 2, 3, 4, 6 or 12 and b should be one of 12, 6, 4, 3, 2 or 1 respectively.
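The admissible values of m and b in the example above can be enumerated with a short Python sketch. It assumes that Eq. (2), which is not reproduced in this copy, amounts to H window = m × b with m and b natural numbers. The function name `module_choices` is hypothetical.

```python
from typing import List, Tuple


def module_choices(h_window: int) -> List[Tuple[int, int]]:
    """List every (m, b) with m * b == h_window, m and b natural numbers."""
    return [
        (m, h_window // m)
        for m in range(1, h_window + 1)
        if h_window % m == 0
    ]


if __name__ == "__main__":
    for m, b in module_choices(12):
        print(f"m = {m:2d}, b = {b:2d}")
```

For a window height of 12 this lists the six pairs given in the text, from (1, 12) to (12, 1).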
Let A 0 , A 1 , . . . , A m−1 denote the addresses of the m memory modules. Equation (3) gives the address of memory module A (y MOD m) .
The scan area width is denoted by W scan area . The variables y and k in Eq. (3) are determined by Eqs. (4) and (5) respectively where the scan area height is denoted by H scan area .
The memory address A (y MOD m) accessed at time t is given by Eq. (7). 
The window width is denoted by W window . The control step of the addressing function of horizontal-block number h is given by t h . Figure 6 shows an example of the memory allocation. Figure 6 (a) shows the coordinates of the pixel data in the scan area. The scan area width and height are both 10. A window of size 4 × 4 is used for the scanning. Two memory modules are used to allocate the pixel data. The value k is determined by substituting 10, 4, and 2 for H scan area , H window and m respectively in Eq. (5). The range of k values is 0 ≤ k < 8. For each k value, we can determine the q k values from Eq. (6). For example, if k equals 0, then q k is 0 or 2. For each q k value, we can determine the y values from Eq. (4). For example, if q k equals 0, then y is 0. Substituting the x, y and k values in Eq. (3), we can determine the memory address where the pixel (y, x) is allocated. Figure 6 (b) shows the allocated data on the memory modules. Table 1 shows all the k and y values and their respective memory allocation equations.
Let us consider the pixel (y, x) = (2, 1), which is allocated to the memory addresses 03 and 22 of memory module 0. According to Eq. (5), the k values are 0 and 2. Therefore, according to Table 1, we can determine the memory address for each set of y, x, k values. When y = 2, x = 1 and k = 0, the memory address is 03. When y = 2, x = 1 and k = 2, the memory address is 22. The data duplication is reduced by sharing the data among horizontal-blocks. As shown in Fig. 6 (b), the data in memory module 1 are shared between blocks 0 and 1. Similarly, the data in module 0 are shared between blocks 1 and 2, and so on. The reason for the data duplication is the restriction of memory access by the addressing-function constraint as explained in Sect. 2.1.
Context Partitioning
Partitioning and Scheduling of Contexts
This section explains how to partition an application to access the data allocated according to the memory allocation method in [5] . After the memory allocation, the image data are accessed using the addressing function given by Eq. (7).
To implement this addressing function on AGUs, we need to set the parameters of the addressing function. Addressing function parameters m, n, P and c(t) are given by Eqs. (8), (9), (10) and (11) respectively.
According to Eq. (11), the parameter c(t) changes every W window × b cycles. Therefore, the context must be changed every W window × b cycles to reconfigure the AGUs. For this reason, the number of clock cycles per context (N cycle ) is defined by Eq. (12).
During the period of W window × b clock cycles, one window is processed. Therefore, the computation of a window is assigned to a context. The number of contexts (N context ) equals the number of windows in the scan area. The number of contexts is given by Eq. (13).
Since the processing of a single window is assigned to a context, the scheduling of the contexts is the same as the scheduling of the window accesses. As explained in Sect. 2.2, each horizontal-block contains W scan area − W window + 1 windows. The relationship between the context number N context and the block number h is given by Eq. (14).
Successive windows belonging to each horizontal-block are assigned to successive contexts and executed one-by-one. The scheduling of the data access within a context is done by the AGUs. The AGUs are configured using the parameters obtained from Eqs. (8), (9), (10) and (11). The parameters m, n and P are the same in all the contexts. The parameter c(t) is obtained by substituting the memory module number M d and the context number N context (given by Eq. (14)) in Eq. (11). A detailed example of mapping and scheduling of contexts is given in Sect. 4.
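The bookkeeping described in this section can be illustrated with a minimal Python sketch. Equations (12) to (14) are not reproduced in this copy, so the sketch only encodes what the prose states: one context per window, W window × b clock cycles per context, and W scan area − W window + 1 windows per horizontal-block. Two further assumptions are made here and are not confirmed by the text: horizontal-blocks advance one row at a time, and contexts are numbered from 0 in raster order. The class name `ScanConfig` and its members are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanConfig:
    w_scan: int    # scan area width (W scan area)
    h_scan: int    # scan area height (H scan area)
    w_window: int  # window width (W window)
    h_window: int  # window height (H window)
    m: int         # number of memory modules; must divide h_window

    def __post_init__(self) -> None:
        if self.h_window % self.m != 0:
            raise ValueError("h_window must be a multiple of m")

    @property
    def b(self) -> int:
        return self.h_window // self.m

    @property
    def cycles_per_context(self) -> int:
        # One window is processed in W window x b clock cycles.
        return self.w_window * self.b

    @property
    def windows_per_block(self) -> int:
        return self.w_scan - self.w_window + 1

    @property
    def num_blocks(self) -> int:
        # Assumption: horizontal-blocks advance one row at a time.
        return self.h_scan - self.h_window + 1

    @property
    def num_contexts(self) -> int:
        # One context per window in the scan area.
        return self.windows_per_block * self.num_blocks

    def block_of_context(self, context: int) -> int:
        # Assumption: contexts are numbered from 0 in raster order.
        return context // self.windows_per_block


if __name__ == "__main__":
    cfg = ScanConfig(w_scan=10, h_scan=10, w_window=4, h_window=4, m=2)
    print(cfg.cycles_per_context, cfg.num_contexts, cfg.block_of_context(7))
```

With the 10 × 10 scan area, 4 × 4 window and two memory modules of the Fig. 6 example, this sketch yields 8 cycles per context and 49 contexts, and places context 7 in horizontal-block 1.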
Configuration of Contexts
This section explains how to assign a partition of an algorithm to a context. To map the partition, we have to decide the connections among PEs and memory modules as shown in Fig. 7 . At this stage, we know the degree of parallelism of the partition. Therefore, we can draw the data flow graph (DFG) and use direct allocation to map it to the PE array. Such mapping has been discussed in previous works [12] . In this section, we explain how to define the connections between memory modules and PEs.
To define the connections, we consider m memory modules as shown in Fig. 7. Since all m memory modules feed data in parallel, they must be connected to m PEs. Let PE 0 , PE 1 , . . . , PE m−1 denote the m PEs and M 0 , M 1 , . . . , M m−1 denote the m memory modules. The connection between a memory module and a PE is denoted by the symbol −→. Using this notation, the PE d connected to memory module M d is given by Eq. (15).
At this stage, we know the degree of parallelism, which is decided by m from Eq. (2). Knowing the degree of parallelism, we can define the DFGs of the contexts that can be directly allocated to the PE array. We also know the connections between the memory and the PE array from Eq. (15). The addressing functions of the AGUs are also known. Therefore, we can generate the configuration data of all the contexts automatically. A detailed description of how to map a window-based application is given in Sect. 4.
Mapping Example: Block Matching
Processor Architecture
Block matching is a very common example of window-based image processing. Implementations of block matching in VLSI processors are presented in [13] and [14]. VLSI architectures for block matching using one-dimensional and two-dimensional systolic array processors are presented in [13]. A VLSI architecture for block matching based on a systolic array processor and shift register arrays is proposed in [14]. Such architectures are proposed for application-specific processors and not for reconfigurable processors. In this paper, we discuss how to implement block matching on heterogeneous processors with reconfigurable accelerators. We use the heterogeneous multicore processor called "RP1" proposed in [1]. The block diagram of the RP1 processor is shown in Fig. 8. It has 4 CPU cores and 2 FE-GA (Flexible Engine/Generic ALU Arrays) accelerator cores. All the cores are connected through a bus called "SuperHyway". An off-chip SDRAM is connected to the processor through the SuperHyway. Figure 9 shows the architecture of the FE-GA accelerator. It has an array of 32 PEs called "ALU" cells and "MLT" cells. The FE-GA has 10 local memory modules of 4 kByte each. The AGUs are included in the FE-GA as shown in Fig. 9 for the address generation. Let r 0 , r 1 , . . . , r (N context −1) denote the "address increment", P 0 , P 1 , . . . , P (N context −1) denote the "stride width" and c 0 , c 1 , . . . , c (N context −1) denote the "base address" of N context contexts. The memory address accessed at time t is given by Eq. (16).
where e ∈ {0, 1, . . . , N context − 1}. This is the addressing function of the AGUs. Equation (16) is the general format of a simple addressing function. Although it contains a multiplication, the multiplication is implemented by repeated additions. Division is done by counters and adders. To execute more complex addressing patterns, we have to reconfigure the addressing function parameters in each context. The FE-GA contains 256 contexts which are dynamically reconfigurable. The sequence manager shown in Fig. 9 controls the dynamic reconfiguration. Figure 10 shows an example of dynamic reconfiguration done by the sequence manager. The processing starts with context 1. After context 1 runs for 2 clock cycles, the sequence manager changes the context to context 2. In this example, context 2 evaluates a condition. If the condition is met, the sequence manager changes the context to context 4. Otherwise, context 3 is executed. Similarly, context 3 or 4 runs for the given number of clock cycles. After context 3 or 4 has been executed, the sequence manager changes the context to context 5. The scheduling of the contexts, that is, the order of execution of the contexts and the number of clock cycles of each, is defined by the user. The sequence manager dynamically changes the contexts according to the schedule. A more detailed explanation of the functions of the sequence manager is given in [15] and [16]. In each context, we can change the operations in the PEs and AGUs. We can also change the interconnection network. The dynamic reconfiguration can be used to dynamically change the addressing function to generate more complex addressing patterns.
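The context sequence of the Fig. 10 example can be mimicked with a small Python sketch. It is only an illustration of a user-defined schedule with one conditional branch and is not the FE-GA sequence manager interface. The names `Schedule`, `run_schedule` and `EXAMPLE` are hypothetical, and every cycle count other than the 2 cycles of context 1 is a placeholder, since the text does not give them.

```python
from typing import Callable, Dict, List, Optional, Tuple

# context id -> (cycles to run, function that picks the next context)
Schedule = Dict[int, Tuple[int, Callable[[bool], Optional[int]]]]


def run_schedule(
    schedule: Schedule, start: int, condition: bool
) -> List[Tuple[int, int]]:
    """Follow the schedule from `start`; return (context, cycles) in order."""
    trace: List[Tuple[int, int]] = []
    current: Optional[int] = start
    while current is not None:
        cycles, next_context = schedule[current]
        trace.append((current, cycles))
        current = next_context(condition)
    return trace


# Placeholder cycle counts, except for context 1 (2 cycles in the text).
EXAMPLE: Schedule = {
    1: (2, lambda cond: 2),
    2: (1, lambda cond: 4 if cond else 3),  # branch on the condition
    3: (4, lambda cond: 5),
    4: (4, lambda cond: 5),
    5: (1, lambda cond: None),              # end of the sequence
}

if __name__ == "__main__":
    print([ctx for ctx, _ in run_schedule(EXAMPLE, start=1, condition=True)])
    print([ctx for ctx, _ in run_schedule(EXAMPLE, start=1, condition=False)])
```

When the condition is met the visited contexts are 1, 2, 4, 5; otherwise they are 1, 2, 3, 5.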
Context Partitioning of the Block Matching Algorithm
We consider the block matching algorithm used in optical-flow extraction as an example to show how the context partitioning is done. In optical-flow extraction, corresponding pixels between two images taken at time t and t + δt are searched. Figure 11 shows two images taken with a time difference of δt. Figure 11 (a) shows a reference window for a pixel in the image at time t. Figure 11 (b) shows a scan area in the image at time t + δt. Different candidate windows are selected from the scan area and the SAD (sum of absolute differences) with the reference window is calculated. The more similar the candidate window is to the reference window, the smaller the SAD becomes. Therefore, the candidate window with the minimum SAD value is selected as the window corresponding to the reference window. A detailed description of block matching is given in [17]. The specifications of the block matching example are given in Table 2. Figure 12 shows the DFG of the computation of one corresponding point in the block matching algorithm. The same DFG is repeated for all the search areas to find all corresponding points. However, it is impossible to map such a large DFG to the accelerator core without dividing it into several partitions. We use the partitioning method proposed in Sect. 3 for the context partitioning.
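The SAD-based search described above can be written as a short software reference in Python. This is a plain sequential sketch of block matching, not the partitioned FE-GA implementation. The names `sad` and `best_match` and the tiny sample images are hypothetical.

```python
from typing import Sequence, Tuple

Image = Sequence[Sequence[int]]


def sad(reference: Image, scan: Image, top: int, left: int) -> int:
    """SAD between the reference window and the candidate at (top, left)."""
    return sum(
        abs(ref_px - scan[top + dy][left + dx])
        for dy, row in enumerate(reference)
        for dx, ref_px in enumerate(row)
    )


def best_match(reference: Image, scan: Image) -> Tuple[int, int, int]:
    """Return (minimum SAD, top, left) over all candidate windows."""
    h, w = len(reference), len(reference[0])
    candidates = [
        (sad(reference, scan, top, left), top, left)
        for top in range(len(scan) - h + 1)
        for left in range(len(scan[0]) - w + 1)
    ]
    # Ties are broken by the smallest (top, left) position.
    return min(candidates)


if __name__ == "__main__":
    reference = [[1, 2], [3, 4]]
    scan_area = [[9, 9, 9], [9, 1, 2], [9, 3, 4]]
    print(best_match(reference, scan_area))
```

In this toy input the reference window appears unchanged at row 1, column 1 of the scan area, so the minimum SAD is 0 at that position.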
According to the memory allocation explained in Sect. 2.2, we use Eq. (2) to find the degree of parallelism (m). Since the number of memory modules in the FE-GA is 10 and the window height (H window ) is 16, the maximum parallelism is 8. After knowing the parallelism, we can define the DFG for each partition. The DFG of a single partition is given in Fig. 13. The DFGs of all partitions are the same as Fig. 13. Note that the term "AD" in Fig. 13 represents the absolute difference calculation. We directly allocate the nodes of the DFG to the PE array. The allocation result is shown in Fig. 14. According to Eq. (13), the number of contexts (N context ) equals 81. The connections between the PEs and the memory modules in each context are obtained by Eq. (15). The first row of PEs is connected to 8 memory modules. These connections are shown in Fig. 15. As shown in Fig. 15, the connections between the PEs and the memory modules are changed after 8 successive contexts. The addressing function parameters and the number of cycles per context can be determined by Eqs. (8), (9), (10), (11) and (12). Table 3 summarizes the AGU parameters and the scheduling of the contexts. According to Table 3, the addressing functions are different for all contexts.
Figure 16 shows the processor board used for the evaluation. It has the RP1 heterogeneous multicore processor [1] that contains 4 CPU cores and 2 FE-GA cores. Figure 17 shows the flow-chart of the scheduling and the allocation of the SAD computation process. The CPU is used to transfer data to the FE-GA local memories. Each time, the data of one reference window and one search area are transferred. After the data are transferred, the SAD calculation is done in the FE-GA. Then all the SAD values between the reference window and the candidate windows are transferred back to the CPU. In the CPU, the minimum SAD value is searched and the corresponding points between the two images are calculated. This process continues for all the corresponding points as shown in Fig. 17.
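The degree of parallelism quoted for this mapping, m = 8 for a window height of 16 with 10 local memory modules, follows from choosing the largest divisor of the window height that does not exceed the number of available modules. A minimal Python sketch of that selection is given below. It assumes this is the only constraint on m, and the function name `max_parallelism` is hypothetical.

```python
def max_parallelism(h_window: int, available_modules: int) -> int:
    """Largest m that divides h_window and is at most available_modules."""
    for m in range(min(h_window, available_modules), 0, -1):
        if h_window % m == 0:
            return m
    raise ValueError("h_window and available_modules must be positive")


if __name__ == "__main__":
    m = max_parallelism(h_window=16, available_modules=10)
    print(m, 16 // m)  # m and the corresponding b
```

For the FE-GA figures above, the sketch selects m = 8, which gives b = 2.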
The processing time data in this evaluation are obtained by dividing the number of clock cycles by the clock frequency. The number of clock cycles is counted using a performance counter in the RP1 processor. Table 4 shows the comparison of the memory-access-aware context partitioning with other context partitioning methods. The application is the block matching explained in Sect. 4. The specifications of the images are given in Table 2. Method 1 maps the DFG directly to the PE array without considering the memory access. It connects 8 memory modules to 8 PEs to get parallel memory access. Since the memory access is not considered at the stage of context partitioning, the connections between the PEs and the memory modules do not change. Therefore, it has only 1 context. Method 2 partly considers the memory access. It shares the data among different horizontal-blocks using 9 contexts. However, it does not share the data within a horizontal-block. The proposed method shares the maximum amount of data using 81 contexts. This implementation is explained in Sect. 4. According to the results, the total processing time of the proposed method is 95% and 87% lower than that of methods 1 and 2 respectively. According to the results in Table 4, the data transfer time is very large in methods 1 and 2 compared to the proposed method. Since methods 1 and 2 do not fully consider the memory access, new data have to be transferred every time the scanning window moves. Therefore, the data transfer time increases. In the proposed method, however, data are shared across the whole scan area. Therefore, the data transfer time is very small, which reduces the total processing time.
Evaluation
The computation time of window-based image processing t compute depends on the computation amount N compute , the frequency of the accelerator f and the degree of parallelism. The degree of parallelism equals the number of memory modules m. Therefore, the computation time is given by Eq. (17) .
Since the computation amount is the same for all three methods, the computation time is also the same. The data transfer time between the global and local memories depends on several factors such as the amount of data and the frequencies of the global memory, the local memories, the system bus and the CPU. The global memory, the local memories, the system bus and the CPU have different clock networks, and these must be synchronized with each other to transfer data. The processing time required for this synchronization is very difficult to estimate. Therefore, to estimate the data-transfer time as accurately as possible, we assume that the data-transfer time per 16-bit word is a constant for a given application. Note that the data transfers between the global memory and the accelerator's local memory in the RP1 processor are done in 16-bit words. The data-transfer time per word is denoted by α and is measured experimentally. According to the experimental results, α is 0.0475 (ns
The total processing time is the sum of the computation time and the data transfer time. According to Eqs. (18), (19) and (20), the data transfer time depends heavily on the number of contexts. As the number of contexts increases, the data transfer time decreases. Method 1 has only 1 context and no data are shared. Therefore, its data transfer time is large. Method 2 has 9 contexts and some data are shared among the contexts. Therefore, its data transfer time is smaller. The proposed method (method 3) has 81 contexts and a lot of data are shared among the contexts. Therefore, its data transfer time is very small.
Figure 18 shows the processing time comparison of heterogeneous and homogeneous multicore computing. In heterogeneous computing, we used up to two FE-GA cores with one CPU core. In homogeneous computing, we used two SH-4A CPU cores. The clock frequencies of the FE-GA and the SH-4A are 300 MHz and 600 MHz respectively. The context partitioning of the FE-GA is done according to the proposed method. According to the results, heterogeneous computing increases the processing speed by 6 to 11 times compared to homogeneous computing. This shows that if the application mapping is done effectively, heterogeneous processing can achieve considerable speed-ups. However, if the mapping is not done effectively, the processing time in heterogeneous computing can be worse than that in homogeneous computing. As shown in Fig. 18, if the task partitioning of method 2 is used, the processing time is larger than that of the homogeneous implementation.
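The timing model used in this section can be summarized in a short Python sketch. Equations (17) to (20) are not reproduced in this copy, so the sketch rests on two assumptions: the computation time is the computation amount divided by the product of the accelerator frequency and the degree of parallelism, and the data transfer time is the per-word constant α multiplied by the number of 16-bit words transferred. The function names and the sample numbers are hypothetical and are not the measured values of the paper.

```python
def computation_time(n_compute: float, frequency_hz: float, m: int) -> float:
    """Assumed form of Eq. (17): work spread over m lanes at frequency f."""
    return n_compute / (frequency_hz * m)


def transfer_time(n_words: float, alpha: float) -> float:
    """Assumed model: constant time alpha (seconds) per 16-bit word."""
    return alpha * n_words


def total_time(
    n_compute: float, frequency_hz: float, m: int, n_words: float, alpha: float
) -> float:
    """Total processing time = computation time + data transfer time."""
    return computation_time(n_compute, frequency_hz, m) + transfer_time(
        n_words, alpha
    )


if __name__ == "__main__":
    # Hypothetical workload: the same computation, fewer words transferred.
    for words in (1000, 100):
        print(words, total_time(2400, 300e6, 8, words, alpha=1e-8))
```

Under this model the computation term is identical for all partitioning methods, so any difference in total time comes from the number of words transferred, which is what increased data sharing reduces.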
Conclusion
We have proposed a method to automatically partition a large window-based application into several small parts. We have also discussed how to assign the small partitions to contexts and how to schedule the contexts. The proposed method partitions an application in such a way that the data are shared among multiple contexts. In contrast, conventional context partitioning methods only consider how to share the limited hardware resources. They do not consider memory access and data sharing. Therefore, the proposed method achieves a smaller data transfer time, which reduces the total processing time. According to the results, the proposed method reduces the processing time by more than 87% compared to conventional context partitioning techniques.
