SUMMARY For an FPGA-based heterogeneous multicore platform, we present a design methodology that reduces the total processing time by taking data transfer into account. The reconfigurability of recent FPGAs with hard CPU cores allows us to realize a single-chip heterogeneous processor optimized for a given application. The major problem in designing such heterogeneous processors is the data transfer between CPU cores and accelerator cores. The total processing time, including data transfers, is modeled by considering the overlap of computation time and data-transfer time, and the optimal design parameters are found by searching the design space.
Introduction
Applications ranging from low-power embedded processing to high-performance computing contain different kinds of tasks, such as data-intensive tasks and control-intensive tasks. Therefore, the optimal architecture differs from application to application. Heterogeneous multicore architectures are one promising way to execute such applications power-efficiently. They use different processor cores, such as CPU cores and accelerator cores, as shown in Fig. 1. Examples of such processors are [1] and [2], which contain multiple CPU cores and accelerator cores. The CPU cores are suitable for control-intensive and complex computations, while the accelerator cores are suitable for data-intensive and regular computations. When the tasks of an application are allocated to the most appropriate processor cores, all the cores work together to increase the overall performance power-efficiently.
Current heterogeneous processors have a fixed number of cores, and each core has a fixed number of processing elements (PEs). Since there are many different applications, some may work well on a particular heterogeneous processor while others may not. Moreover, the large data-transfer time between multiple cores is a serious problem. To solve these problems, we consider an FPGA-based heterogeneous multicore architecture model. The speed and power consumption of FPGAs have improved greatly in recent years, so it is practical to use an FPGA-based platform for real applications. FPGAs also contain hard CPU cores, as seen in the Xilinx Zynq-7000 [3] and the Altera Cyclone V SoC [4]. Therefore, CPU cores and accelerator cores can be efficiently implemented on a single FPGA. Moreover, recent FPGAs are large enough to hold hundreds of processor cores. In our earlier work [5], we proposed an FPGA-based heterogeneous multicore processor platform. However, the data-transfer time between the CPU core and the accelerator cores is significantly high. One popular method to reduce the data-transfer time is double buffering, where two data buffers are used. While one buffer is accessed for the computation, data are transferred to the other buffer. After the computation is finished, the buffers are interchanged. However, this requires a large memory, only 50% of which is used for the computation at any time. Since the internal (on-chip) memory is a scarce resource in an FPGA, it is desirable to use most of the memory resources for the computation. Another method is to hide the data transfer of one core behind the computations of the other cores. Since recent FPGA-based processors can contain many accelerator cores owing to the large number of LUTs, we use this method to reduce the data-transfer time. Since an FPGA is reconfigurable, we can design the optimal architecture for each application to reduce the processing time.
However, designing the optimal architecture that has the smallest processing time for a given application is a difficult problem that requires a long design time. To solve this problem, we propose a very basic architecture model that has hard CPU cores and FPGA-based accelerator cores. The architecture model is based on our previous work [5]. Unlike in [5], the proposed architecture model does not contain a fixed number of accelerator cores, PEs or internal memory modules. Instead, we define design parameters such as the number of cores and the degree of parallelism. We optimize our architecture model for a given application by choosing the optimal design parameters. The optimal number of accelerator cores is chosen to hide the data-transfer overhead.
Copyright © 2015 The Institute of Electronics, Information and Communication Engineers
In this paper, we propose a heterogeneous multicore processor design methodology to reduce the total processing time under a resource constraint. We propose a parameterized architecture model and introduce an evaluation methodology to find the optimal architecture in terms of the design parameters. In the optimization problem, we focus on window-based processing, which has many applications such as stereo matching [6], feature detection [7], scale-invariant feature transform (SIFT) [8], histogram of oriented gradients (HOG) [9], matrix processing and filtering. The evaluation, which uses filter computation as an example, demonstrates that the processing time estimated by the proposed design methodology is sufficiently accurate compared with the actual measurement on the FPGA architecture. Moreover, the optimal architecture changes for different applications, and it is possible to derive such architectures using the proposed method.
Heterogeneous Multicore Architecture Model
The heterogeneous multicore architecture model is based on the proposal in [5]. Figure 2 shows the overall architecture. It consists of FPGA-based custom accelerator cores, a hard CPU core and an on-chip memory. An external memory is connected to the CPU core through the FPGA board. The accelerator architecture used in this paper is based on the FE-GA (flexible engine/generic ALU array) accelerator proposed in [1]. FE-GA is a 16-bit coarse-grained MIMD accelerator. It has a very simple architecture and a simple interconnection network. Since the interconnection network is a critical part of FPGA-based designs, an FE-GA-based MIMD architecture is well suited to FPGAs. It is easy to implement and easily scalable by changing the number of PEs and memories. Moreover, it has been studied extensively for memory allocation [10], data transfers [11], context partitioning [12], etc., and many efficient techniques have been proposed. It has already been used to implement various applications in many prior works, such as audio encoding [1], feature extraction [13] and optical-flow extraction [14]. Therefore, we choose FE-GA as the base for the MIMD accelerator used in the proposed design.
Figure 3 shows the architecture of an MIMD accelerator core. It consists of a 2-dimensional array of PEs, local memory modules and address generation units (AGUs). In order to simplify the interconnection network, only the leftmost PEs can directly retrieve data from local memory modules, and only the rightmost PEs can directly write data to local memory modules. The PEs, the AGUs and the interconnection network are dynamically reconfigurable. A PE consists of a 16-bit fixed-point ALU and a multiplier, as shown in Fig. 4. It is capable of operations such as addition, accumulation, subtraction, comparison, absolute difference computation and multiplication. Since the data path is fully pipelined, it takes only one clock cycle to complete any operation.
The address calculation in the proposed architecture is explained in Fig. 5. In CPUs and GPUs, the address calculation and the data processing are done on the same ALU, as shown in Fig. 5(a). To reduce the address processing time, AGUs (address generation units) are employed. The address calculation is done on the AGUs in parallel with the data processing, which is done on the ALUs, as shown in Fig. 5(b). Several address patterns and AGU architectures for image processing have been discussed in previous works [15], [16]. Since accelerator cores use multiple AGUs to access multiple memory modules, a relatively large area is required. However, considering the benefits of power efficiency and high performance, it is worth spending resources on AGUs. In this work, the address function proposed in previous works [7], [10] is used. This address function is simple, and the resource usage of the AGUs is small.
To increase the performance of an FPGA-based heterogeneous multicore platform, it is important to consider not only the architecture of the accelerator core but also the number of accelerator cores. As shown in Fig. 6(a), if there is one accelerator core, the data transfer and the computation on the accelerator core are done serially. On the other hand, if there are two accelerator cores, as shown in Fig. 6(b), the data transfer to core 1 and the computation on core 2 are done in parallel. Therefore, we can reduce the total processing time by changing the number of cores and the number of PEs per core while keeping the total area of all cores constant. This technique is used in many multicore processors such as GPUs [17], [18] and the Cell/B.E. processor [19]. Since the number of cores and the number of PEs per core are fixed in these processors, the advantages of this technique are limited. In contrast, an FPGA-based architecture can reduce the total processing time efficiently by choosing the optimal number of cores and PEs to hide most of the data-transfer time.
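The effect illustrated in Fig. 6 can be sketched with a small timeline simulation. This is only an illustrative model, not the scheduler used in the paper: the function name `simulate_shared_bus`, the round-robin transfer order and the example timings are assumptions. It models a single shared bus, so transfers never overlap each other, while a core's computation may overlap the transfers to other cores.

```python
def simulate_shared_bus(num_cores, t_trans, t_comp, rounds):
    """Return the finish time for `num_cores` cores sharing one bus.

    Every core repeats `rounds` times: receive data (exclusive bus use for
    t_trans), then compute for t_comp. Transfers are served round-robin.
    """
    bus_free = 0.0                      # time at which the bus becomes idle
    core_free = [0.0] * num_cores       # time at which each core becomes idle
    for _ in range(rounds):
        for core in range(num_cores):
            # A transfer needs both the bus and the target core to be idle.
            start = max(bus_free, core_free[core])
            bus_free = start + t_trans
            core_free[core] = bus_free + t_comp
    return max(core_free)


# One large core: 4 jobs, each with 1 unit of transfer and 1 unit of compute.
print(simulate_shared_bus(1, 1.0, 1.0, 4))   # 8.0
# Two cores with half the PEs each: compute takes twice as long per job,
# but each core only handles 2 of the 4 jobs.
print(simulate_shared_bus(2, 1.0, 2.0, 2))   # 7.0
```

In this toy setting the two-core configuration finishes earlier even though each core is slower, because part of the transfer time is hidden behind computation.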
Total Processing Time Minimization

Window-Based Processing Model
We use window-based processing as an example to minimize the total processing time of an FPGA-based multicore platform. Window-based processing repeatedly accesses the same data, which belong to multiple overlapping windows. Therefore, it is important to maximize the data sharing while allowing parallel processing. Such a data sharing and scheduling scheme is proposed in [10], and we use it on the proposed architecture. The work in [10] proposes an off-line scheduling scheme where a part of the data is transferred to the accelerator core and then the computation is performed. During the computation, no data are transferred to the accelerator core. Similarly, during a data transfer, the accelerator core pauses its computations. The transferred data are stored in multiple local memory modules in the accelerator core in such a way that the data can be accessed in parallel. Therefore, no data collision occurs inside an accelerator core. Please refer to [10] for a detailed discussion of how the data access is done inside an accelerator core. In this paper, we generalize this off-line scheduling scheme to multiple accelerator cores. The data transfer to one accelerator core starts only after the data transfers to all the other cores are finished. Therefore, no data collision occurs among the data transfers from the CPU core to the accelerator cores. Similar to [10], no data collision occurs inside an accelerator core either.
Figure 7 shows the window-based processing model proposed in [10]. As shown in Fig. 7(a), an image is divided into $N_{partial}$ partial images. The width and the height of a partial image are given by $P_W$ and $P_H$ respectively. Data are not shared between different partial images. A batch of $W_P$ partial images is processed in parallel. The term $W_P$ is called the degree of window parallelism, and $N_{partial} \geq W_P$. The pixels in a window are accessed in a pixel-parallel, column-serial manner as shown in Fig. 7(b).
The data in a column are accessed in parallel. This parallelism is called the pixel parallelism and is denoted by $P_P$.
As shown in Fig. 8, a partial image contains multiple scan areas. After the first scan area is accessed, the next scan area, which is one pixel below, is accessed. The data in a scan area are accessed by sliding a window from left to right.
The different scan areas of a partial image are processed sequentially in the accelerator cores as shown in Fig. 9. The processing of a scan area is assigned to a sequence. In the first sequence, all the pixel data belonging to scan area one are transferred. In the second sequence, only the difference between the first and second scan areas is transferred. The rest of the data are shared. Moreover, the new data overwrite the memory addresses that hold obsolete data, which are not required for further processing. This method minimizes the data-transfer time since there is no data duplication, and it also reduces the required memory capacity. The width and the height of a scan area are equal to the partial image width $P_W$ and the window height $W_H$ respectively. Therefore, one partial image contains $(P_H - W_H + 1)$ scan areas, and the last sequence is $S_{(P_H - W_H + 1)}$.
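The saving from sharing data between consecutive scan areas can be checked with a few lines of arithmetic. The sketch below only counts words for one partial image; the function names and the example sizes are illustrative assumptions, and the "without sharing" variant is a hypothetical baseline that retransfers every scan area in full.

```python
def scan_area_count(p_h, w_h):
    """Number of scan areas (sequences) in a partial image of height p_h."""
    return p_h - w_h + 1


def words_with_sharing(p_w, p_h, w_h):
    """Whole first scan area, then one new row of p_w pixels per sequence."""
    first = p_w * w_h
    rest = (scan_area_count(p_h, w_h) - 1) * p_w
    return first + rest


def words_without_sharing(p_w, p_h, w_h):
    """Hypothetical baseline: every scan area is transferred in full."""
    return scan_area_count(p_h, w_h) * p_w * w_h


# Example sizes (assumed): 8x6 partial image, window height 3.
print(scan_area_count(6, 3))            # 4
print(words_with_sharing(8, 6, 3))      # 48, i.e. each pixel is sent once
print(words_without_sharing(8, 6, 3))   # 96
```

With sharing, the total equals $P_W \times P_H$, which is consistent with the statement that no data are duplicated.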
In the window-based processing model explained above, the following relationships exist. The degree of pixel parallelism ($P_P$) must satisfy the relationship given by Eq. (1), where $C_M$ is a natural number and $W_H$ is the window height.
The number of accelerator cores ($N_C$) and the degree of window parallelism ($W_P$) must satisfy the relationship given by Eq. (2), where $N_W$ is the number of windows processed in parallel in one accelerator core.
The parallelism of operations is constrained by the resources available in the FPGA. The MIMD architecture model shown in Fig. 3 contains columns of PEs, where each column has $n$ PEs. Only the first column is connected to the memory, while the remaining columns use the computation results of their preceding columns. If we consider direct mapping, the DFGs of most window-based applications have a tree-like structure. To implement this structure, we need $n \times (\log_2 n + 1)$ PEs, where $n$ and $(\log_2 n + 1)$ correspond to the depth and height of the tree. Therefore, we can have $(\log_2 n + 1)$ columns of PEs in the MIMD architecture. This kind of architecture model has been proposed in many works such as [1]. In window-based processing, the pixel parallelism $P_P$ equals the number of PEs in the first column. Therefore, the number of PEs required to process at this degree of pixel parallelism equals $P_P \times (\log_2 P_P + 1)$. Since $W_P$ partial images are processed in parallel, the resource constraint is given by Eq. (3), where the maximum number of PEs available is $PE_{MAX}$.
Moreover, to implement window parallelism, the scan areas of $W_P$ partial images must be stored in the FPGA internal memory. This constraint is given by Eq. (4), where $IM_{max}$ is the maximum amount of internal memory in the FPGA. From Eqs. (3) and (4), we can see that the degree of parallelism $W_P \times P_P$ is limited by the number of PEs and the amount of internal memory of the FPGA.
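A feasibility check for a candidate design can be sketched as follows. The equations themselves are not reproduced in this text, so the forms used here are assumptions inferred from the surrounding description: $W_P = N_C \times N_W$ for Eq. (2), $W_P \times P_P \times (\log_2 P_P + 1) \leq PE_{MAX}$ for Eq. (3), and $W_P \times P_W \times W_H$ words not exceeding the internal memory for Eq. (4). The function names, the power-of-two restriction on $P_P$ and the memory unit (words) are also assumptions.

```python
import math


def pes_required(w_p, p_p):
    """PEs needed for w_p windows, each using a p_p-wide tree of PE columns."""
    if p_p < 1 or p_p & (p_p - 1):
        raise ValueError("this sketch assumes P_P is a power of two")
    return w_p * p_p * (int(math.log2(p_p)) + 1)


def is_feasible(w_p, p_p, n_c, n_w, p_w, w_h, pe_max, im_max_words):
    """Check the assumed forms of Eqs. (2), (3) and (4) for one design."""
    cores_match = (w_p == n_c * n_w)                 # assumed Eq. (2)
    pes_fit = pes_required(w_p, p_p) <= pe_max       # assumed Eq. (3)
    memory_fits = w_p * p_w * w_h <= im_max_words    # assumed Eq. (4)
    return cores_match and pes_fit and memory_fits


# Example with assumed resource limits (16 PEs, 10000 words of memory).
print(is_feasible(16, 1, 4, 4, p_w=94, w_h=3, pe_max=16, im_max_words=10000))
```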
Processing Time Estimation
In this section, we explain the formulation of the processing time minimization problem. The processing of each sequence (a scan area) in Fig. 9 is divided into the following three phases.
Phase 1: Data-transfer from the CPU cores to the accelerator cores
Phase 2: Computation on the accelerator cores
Phase 3: Data-transfer from the accelerator cores to the CPU cores
In this section, we discuss the processing time estimation of each phase and the total processing time.
Data-Transfer Time from CPU Cores to Accelerator Cores (Phase 1)
The data are transferred from the CPU core to the accelerator core through the AXI (Advanced eXtensible Interface) bus connected to the ARM processor. The bus width of the accelerator core's input memory and the word width of the input data are given by $B_B$ and $B_{CA}$ respectively. If $B_B \geq B_{CA}$, we can transfer several words in parallel. On the other hand, if $B_B < B_{CA}$, one word is divided into several segments and the segments are transferred serially. Therefore, the number of words transferred from the CPU core to the accelerator core at a time ($N_{CA}$) is given by Eq. (5).
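The two cases described above can be sketched as a small helper. Eq. (5) is not reproduced in this text, so the exact rounding is an assumption: the sketch uses a floor when several words fit on the bus and a ceiling for the number of segments when a word is wider than the bus. The name `words_per_transfer` is illustrative.

```python
from fractions import Fraction


def words_per_transfer(bus_bits, word_bits):
    """Words moved per bus transfer (a fraction if a word needs several)."""
    if bus_bits >= word_bits:
        # Several whole words fit side by side on the bus.
        return Fraction(bus_bits // word_bits)
    # One word is split into segments that are sent one after another.
    segments = -(-word_bits // bus_bits)   # ceiling division
    return Fraction(1, segments)


print(words_per_transfer(32, 8))    # 4
print(words_per_transfer(32, 16))   # 2
print(words_per_transfer(16, 32))   # 1/2
```

The first two calls use the widths quoted in the evaluation (a 32-bit bus with 8-bit input data and 16-bit output data) and agree with the four-way and two-way parallel transfers described there.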
The data transfer between the CPU cores and the accelerator cores is shown in Fig. 10. The frequencies of the external memory, the CPU cores and the accelerator cores may not be the same. Even when the frequencies do not match, the CPU cores and the data bus have the necessary hardware, such as memory controllers and data buffers, for an efficient data transfer. However, we cannot determine the transfer speed from parameters such as the bus width and the frequencies. Therefore, we measure the data-transfer time between the CPU and the accelerator cores using sample data. The term α is the average time per word transfer from the CPU cores to the accelerator cores.
The number of words transferred in sequence $S_1$ is different from that in the other sequences. In sequence $S_1$, the number of words transferred from a CPU core to an accelerator core is $P_W \times W_H$, as shown in Fig. 9. The data-transfer time from a CPU core to an accelerator core in sequence $S_1$ ($t_{CA1}$) is given by Eq. (6).
Note that $N_W$ is the number of windows processed in an accelerator core, as described for Eq. (2). In the other sequences ($S_2$ to $S_{(P_H - W_H + 1)}$), the number of words transferred from a CPU core to an accelerator core is $P_W$. The data-transfer time from a CPU core to an accelerator core in each of these sequences ($t_{CA2}$) is given by Eq. (7).
Computation Time (Phase 2)
We estimate the computation time ($t_{comp}$) of each sequence on the accelerator core. The architecture of the accelerator is fully pipelined. After the pipeline is filled, a computation is completed in every clock cycle. As shown in Fig. 8, the size of the window is $W_H \times W_W$. Each scan area includes $(P_W - W_W + 1)$ windows. When processing a window, $P_P$ pixels are calculated in parallel, as shown in Fig. 7(b). Therefore, $t_{comp}$ is given by Eq. (8), where $f_A$ is the clock frequency and $t_{pipe}$ is the pipeline latency of the accelerator core. Note that the address generation time does not appear in Eq. (8), since the address processing time is completely overlapped with the data processing time. This is because the address processing and the data processing are done in parallel in the AGUs and the PEs respectively.
Data-Transfer Time from Accelerator Cores to CPU Cores (Phase 3)
The number of words transferred from an accelerator core to a CPU core is $(P_W - W_W + 1)$ in each sequence. The data-transfer time from an accelerator core to a CPU core in each sequence ($t_{AC}$) is given by Eq. (10), where β is the average time per word transfer from the accelerator cores to the CPU cores.
Estimation of the Total Processing Time
We process $W_P$ partial images in parallel, and the processing time required for this is denoted by $t_{partial}$. Figure 11 shows the time chart of the processing in an accelerator core. The time $t_{init}$ consists of the initial data-transfer time from the CPU cores to the accelerator cores and the computation time in sequence $S_1$. These initial data transfers to different cores cannot be done in parallel, since there is only one bus and it has a limited bandwidth. Therefore, $t_{init}$ is given by Eq. (11).
During $t_{mid}$, $t_{trans}$ and $t_{comp}$ are repeated as shown in Fig. 11. The time $t_{trans}$ shown in Fig. 11 is defined by Eq. (12),

$$t_{trans} = t_{AC} + t_{CA2} + t_{ctrl} \tag{12}$$

where $t_{AC}$ is the data-transfer time from the accelerator cores to the CPU core, $t_{CA2}$ is the data-transfer time from the CPU cores to the accelerator cores and $t_{ctrl}$ is the control overhead due to starting and stopping the accelerator cores. To estimate $t_{mid}$, we have to consider the overlap between the data transfers and the computations. This overlap can be classified into the following two cases.
Case A1: The computation time of one core partially hides the data transfers of the other cores, as shown in Fig. 12(a). This is represented by $t_{comp} < (N_C - 1) \times t_{trans}$. During $t_{mid}$, the time period $(N_C \times t_{trans})$ is repeated $(P_H - W_H)$ times, since there are $(P_H - W_H + 1)$ sequences as explained in Sect. 3.1 and $(P_H - W_H)$ of those sequences belong to $t_{mid}$.
Case A2: The computation time completely hides the data-transfer time of the other cores, as shown in Fig. 12(b). This is represented by $t_{comp} \geq (N_C - 1) \times t_{trans}$. Similar to case A1, the time period $(t_{comp} + t_{trans})$ is repeated $(P_H - W_H)$ times during $t_{mid}$.
According to these two cases, $t_{mid}$ is given by Eq. (13).
Fig. 12 The data-transfers and the computations during $t_{mid}$.
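The two cases for $t_{mid}$ can be written down directly from the description above. Eq. (13) itself is not reproduced in this text, so the sketch below is a reading of the prose for cases A1 and A2 rather than the authors' exact formula; the function name `estimate_t_mid` and the example values are assumptions.

```python
def estimate_t_mid(t_comp, t_trans, n_c, p_h, w_h):
    """Return (case, t_mid) following the prose description of A1 and A2."""
    repeats = p_h - w_h                   # sequences that belong to t_mid
    if t_comp < (n_c - 1) * t_trans:
        # Case A1: the bus is the bottleneck; one period is N_C transfers.
        return "A1", repeats * n_c * t_trans
    # Case A2: computation hides the other cores' transfers completely.
    return "A2", repeats * (t_comp + t_trans)


# Assumed example: 4 cores, partial image height 10, window height 3.
print(estimate_t_mid(1.0, 1.0, 4, 10, 3))   # ('A1', 28.0)
print(estimate_t_mid(5.0, 1.0, 4, 10, 3))   # ('A2', 42.0)
```

At the boundary $t_{comp} = (N_C - 1) \times t_{trans}$ both branches give the same value, so under this reading the estimate is continuous across the two cases.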
The time $t_{final}$ is the data-transfer time from the accelerator cores to the CPU cores in the sequence $S_{(P_H - W_H + 1)}$. During $t_{final}$, the data transfers from the accelerator cores to the CPU cores ($t_{AC}$) overlap with the computations, as shown in Fig. 13. This can be classified into three cases.
In case B1, $t_{comp} \geq (N_C - 1) \times t_{trans}$. Since $t_{AC}$ is smaller than $t_{trans}$, $t_{comp}$ of one core hides $t_{AC}$ of all the other cores, as shown in Fig. 13(a). When $t_{comp} < (N_C - 1) \times t_{trans}$, $t_{comp}$ can be either greater than or equal to $(N_C - 1) \times t_{AC}$, or smaller than $(N_C - 1) \times t_{AC}$. In the former case (B2), $t_{comp}$ of one core completely hides $t_{AC}$ of the other cores, as shown in Fig. 13(b). In the latter case (B3), $t_{comp}$ of one core partially hides $t_{AC}$ of the other cores, as shown in Fig. 13(c). According to these three cases, $t_{final}$ is given by Eq. (14).
The processing time required for $W_P$ partial images ($t_{partial}$) is given by

$$t_{partial} = t_{init} + t_{mid} + t_{final} \tag{15}$$

The total processing time, denoted by $T_{image}$, is the time required to process a whole image. As explained in Sect. 3.1, an image is divided into $N_{partial}$ partial images, and $W_P$ partial images are processed in parallel. Equation (15) gives the processing time of $W_P$ partial images. After processing $W_P$ partial images, another $W_P$ partial images are processed. Therefore, the total processing time required to process a whole image is given by Eq. (16).
Note that, for smaller images where $N_{partial} = W_P$, $T_{image}$ equals $t_{partial}$. From Eqs. (11), (13), (14) and (15), we can see that the total processing time is determined by the combination of the design parameters $W_P$, $P_P$, $N_C$, $N_W$, $P_W$ and $P_H$. These parameters define the architecture of the FPGA-based heterogeneous multicore platform and its scheduling. Therefore, it is very important to find the optimal combination of design parameters that minimizes the total processing time. The design parameter optimization is discussed in Sect. 4.
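The batch-wise processing of partial images can be sketched as follows. Eq. (16) is not reproduced in this text; the sketch assumes that the number of batches is $N_{partial} / W_P$ rounded up, which is an assumption for the case where $N_{partial}$ is not a multiple of $W_P$. The function name and the example values are illustrative.

```python
def total_image_time(n_partial, w_p, t_partial):
    """Time for a whole image, processed in batches of w_p partial images."""
    if n_partial < w_p:
        raise ValueError("the model requires N_partial >= W_P")
    batches = -(-n_partial // w_p)    # ceiling division (assumed rounding)
    return batches * t_partial


# Assumed example values for t_partial (arbitrary time unit).
print(total_image_time(16, 16, 2.5))   # 2.5, i.e. T_image equals t_partial
print(total_image_time(32, 16, 2.5))   # 5.0
```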
Evaluation
We use the Xilinx Zynq-7000 EPP ZC702 board [20] for the evaluation. Zynq integrates dual Cortex-A9 CPU cores and an FPGA equivalent to Artix-7 on a single chip. In addition, the evaluation kit has a DDR3 SDRAM as an external memory. The proposed heterogeneous platform is designed using Xilinx PlanAhead 14.2. The CPU core is programmed in C using Xilinx EDK 14.2. Figure 14 shows the implemented architecture. It contains accelerator cores, one Cortex-A9 hard CPU core, the AXI4 bus and a DDR3 SDRAM as the external memory. The processing time of the proposed heterogeneous platform is measured by the AXI Timer IP. The clock frequency of the CPU core is 667 MHz.
To estimate the processing time of the proposed heterogeneous platform given by Eq. (15), we measured the values of α, β and $t_{ctrl}$ in Eqs. (6), (10) and (12) respectively. We measured the data-transfer times between the external memory and the memory modules of the accelerator cores. From the experimental results, the values of α, β and $t_{ctrl}$ are 186.06 ns, 213.02 ns and 430.00 ns respectively. The maximum clock frequency of the accelerator cores is measured to be 100 MHz. Table 1 shows the difference between the estimated processing time and the measured processing time for different window sizes. The estimated processing time is calculated using Eq. (15) for a given set of design parameters. We implement the architecture described by the same set of parameters on the FPGA and measure the processing time. This is called the measured processing time. According to the results, the error rate calculated using Eq. (17) is less than 1%. This small error shows that the estimated processing time is sufficiently accurate to optimize the processor architecture.
Exploration of the Design Parameter Space to Find the Minimum Total Processing Time
In this section, we show how the design parameter space is explored to obtain the optimal parameters that give the minimum processing time for filter computation. The specifications of the filter computation are given in Fig. 15. We assume that the maximum degree of parallelism ($W_P \times P_P$) is limited to 16 by the resource constraints in Eqs. (3) and (4). Based on the specifications, the ranges of the design parameters are determined as shown in Fig. 15(b). We estimate the total processing time for all the combinations of the design parameters in order to find the optimal parameters that give the minimum processing time.
Figure 16 shows an example of the problem formulation for a given set of design parameters. The design parameters are shown in Fig. 16(a). According to the parameters $P_W$ and $P_H$, we partition the image into 16 partial images as shown in Fig. 16(b). Since the number of cores ($N_C$) is 4, we assign four different partial images to each core. The scan areas of a partial image are shown in Fig. 16(c). Since $N_W = 4$, one core processes four scan areas belonging to four different partial images in parallel.
Figure 17 shows the time chart of the processing. As explained in Sect. 3.2, the total processing time is the sum of $t_{init}$, $t_{mid}$ and $t_{final}$. These terms are calculated using Eqs. (11), (13) and (14) respectively. During $t_{init}$, the first scan area of a partial image is transferred to the accelerator cores. Since the bus width ($B_B$) is 32 bits and the input data width $B_{CA}$ is 8 bits, the data of four windows are transferred to the accelerator core in parallel. However, the output data width $B_{AC}$ is 16 bits, so the processing results of only two windows are transferred in parallel during $t_{mid}$ and $t_{final}$. During $t_{mid}$, the data transfers and the computations of scan areas 2 to 117 are done. During $t_{final}$, the remaining computations and the output data transfer corresponding to the last scan area are done.
We estimated the total processing time for all combinations of the design parameters by an exhaustive search. The search was completed in a few minutes on an Intel CPU at 3.2 GHz. Table 2 shows the processing time for some of those combinations. The total processing time is minimized when $W_P = 16$, $P_P = 1$, $N_C = 4$, $N_W = 4$, $P_W = 94$ and $P_H = 246$. Usually, it would take several days of design and compilation time to design an FPGA architecture.
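An exhaustive search of this kind can be sketched as an enumeration followed by a minimum over an estimated cost. The sketch below is not the authors' search program: it enumerates only $(W_P, P_P, N_C, N_W)$, restricts $W_P$ and $P_P$ to powers of two, assumes $W_P = N_C \times N_W$, and leaves the cost function (the processing-time estimate) to the caller. All names are illustrative.

```python
from itertools import product


def candidate_designs(max_parallelism, max_cores):
    """Yield (W_P, P_P, N_C, N_W) tuples with W_P * P_P <= max_parallelism."""
    # Assumption: only power-of-two degrees of parallelism are considered.
    powers = [1 << k for k in range(max_parallelism.bit_length())]
    for w_p, p_p in product(powers, powers):
        if w_p * p_p > max_parallelism:
            continue
        for n_c in range(1, min(w_p, max_cores) + 1):
            if w_p % n_c == 0:            # N_W must be a whole number
                yield (w_p, p_p, n_c, w_p // n_c)


def best_design(designs, cost):
    """Return the design with the smallest estimated cost."""
    return min(designs, key=cost)


designs = list(candidate_designs(16, 16))
print(len(designs))
print((16, 1, 4, 4) in designs)   # the reported optimum is a candidate
```

A caller would pass a function that evaluates the processing-time model for each tuple as `cost`; the partial image size ($P_W$, $P_H$) would add two more dimensions to the enumeration.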
In the proposed method, we can design a reasonably good FPGA-based heterogeneous processor architecture just by searching for the optimal design parameters. Even with an exhaustive search, the optimal design parameters are found in a very short time. Therefore, the proposed method dramatically reduces the effort and time needed to design the FPGA architecture. When exploring the design parameter space, we consider all possible situations, which belong to the two different cases shown in Fig. 12. For example, in the last column of Table 2, $t_{comp} = 0.4020$ and $(N_C - 1) \times t_{trans} = 0.994$. Therefore, $t_{comp} < (N_C - 1) \times t_{trans}$, so this design belongs to case A1, which is shown in Fig. 12(a).
Evaluation of the Optimized Design for Different Filter Sizes
Table 3 shows the optimized design parameters for different filter sizes. Note that the window size equals the filter size. We assume that the maximum degree of parallelism is 16 due to the resource constraints. According to the results, the optimized design parameters vary with the filter size. This means that the partitioning, the scheduling and the hardware design differ according to the specifications of the application. However, in conventional heterogeneous multicore architectures such as [1], we cannot optimize the design parameters, since the number of accelerator cores, the number of PEs, etc. are fixed. In contrast, the proposed FPGA-based heterogeneous platform has a high degree of flexibility and can be optimized for different applications. Each optimized design (determined by the design parameters) in Table 3 belongs to either case A1 in Fig. 12(a) or case A2 in Fig. 12(b). This shows that considering both cases is important to obtain the optimal design. Although not shown in Table 3, each of the above cases is further divided into three cases, B1 to B3, as shown in Fig. 13. When exploring the design parameter space, we considered these cases as well.
Tables 4 and 5 show the optimized design parameters when the maximum degree of parallelism is 32 and 64 respectively. When the degree of parallelism increases, the computation time decreases. As a result, the data-transfer time can become larger than the computation time. Therefore, most of the optimized designs belong to case A1, where the data-transfer time is not fully hidden by the computation time.
For some optimized designs, the degree of parallelism is smaller than the maximum value. For example, when the window size is 8 × 8 in Table 5, the maximum degree of parallelism allowed is 64. However, the degrees of window and pixel parallelism in the optimized design are 8 and 2 respectively. This gives a total degree of parallelism of 16, which is just 25% of the maximum available. In this case, the computation time is much smaller than the data-transfer time due to the parallel computations. Therefore, the total processing time is determined by the data-transfer time, as shown in Fig. 12(a), so that reducing the computation time further has no impact on the total processing time.
Table 6 compares the total processing time of the proposed method with that of the method given in [10]. The window-based processing model used in both methods is the same. However, the work in [10] does not consider the overlap of data transfers with the computation. In [10], the total processing time is calculated simply by adding the total computation time, the total data-transfer time and the total control time together. It uses one large accelerator core that can process the data in parallel using multiple PEs. The comparison is done for three different resource constraints. The maximum degree of parallelism ($W_P \times P_P$) is 16, 32 and 64 under the three constraints respectively. According to the results in Table 6, the total processing time is reduced by up to 37% using the proposed method. The amount of computation increases with the window size. For small window sizes, the computation time is smaller than the data-transfer time, so that most of the data-transfer time is not hidden. For larger window sizes, the computation time is larger than the data-transfer time due to the large amount of computation. In this case, the data-transfer time is hidden, so that the total processing time is determined by the computation time.
Therefore, the largest total processing time reduction is achieved when the computation time nearly equals the data transfer time. When we increase the degree of parallelism, the computation time decreases. However, both the data amount and the number of parallel data transfers are unchanged (the maximum amount of parallel data transfers is already reached), so that the data transfer time remains the same. Nevertheless, the total processing time is reduced in both methods due to the reduction of the computation time.
According to the results in Table 6, the total processing time of the proposed method is smaller than that of [10]. Even in the extreme cases where either the data-transfer time or the computation time is negligibly small compared with the other, the total processing time of the proposed method would be no larger than that of [10]. During the design parameter exploration, the design corresponding to the method in [10] is also one of the more than a thousand combinations considered by the proposed method. Therefore, the proposed method would never be worse than that in [10].
Conclusion
We have proposed a design methodology for an FPGA-based heterogeneous multicore platform with custom accelerator cores. The proposed approach optimizes the total processing time by considering the overlap of data transfers and computations. According to the evaluation, the processing time estimation is sufficiently accurate, so that the architecture of the FPGA-based heterogeneous platform can be optimized. FDTD (finite-difference time-domain) computation, which is basically a stencil computation, has already been implemented on FPGAs [21], [22]. Therefore, we believe that the proposed architecture model could be optimized for grid-based HPC (high-performance computing) applications such as stencil computation [23] in future work.
