Abstract-This paper studies the loose integration of application accelerators, each consisting of an array of tightly coupled lightweight reconfigurable processors, into a system-on-a-chip. In order to explore a multitude of design variations, a C++ simulation model of the accelerator has been integrated with a system-on-a-chip environment consisting of a general purpose processor, a DMA controller, an interrupt controller and a memory module. Depending on the application, different kinds of I/O buffers are designed around the processor array, and the effects of the buffer size on the overall execution time are evaluated. The evaluations are based on new mathematical estimation models derived from the system and application constraints. The estimations are validated against experimental results with an error of less than 1%. Exploring several design points shows that using our architecture with suitable buffer sizes can improve the system execution time by one to two orders of magnitude for the selected algorithms.
I. INTRODUCTION
Steady improvements in the semiconductor industry and the growing demand for real-time or near real-time speeds in applications call for more and more processing elements on a system-on-a-chip (SoC). With the need to execute different applications on such hardware, a new generation of hardware architectures, called reconfigurable architectures, was introduced.
These architectures allow the customization of reconfigurable processing units in order to meet the specific computational requirements of different applications [1]. Consequently, reconfigurable architectures combine nearly the flexibility of general purpose processors with performance, speed, and power consumption close to those of application specific integrated circuits (ASICs). As a result, integrating a reconfigurable architecture with a general purpose processor may offer better performance and flexibility than a general purpose processor alone.
Considering the functionality of a single processing unit, the bit width, and the configurability of the interconnections, reconfigurable architectures can be classified into two major categories. The first category comprises fine-grained reconfigurable architectures based on look-up tables (LUTs). Together with an extremely flexible interconnection network, look-up tables form modern reconfigurable architectures such as field programmable gate arrays (FPGAs). These architectures suffer from a huge reconfiguration data stream [2], inefficient area usage [3], and computationally complex placement and routing algorithms. Examples of reconfigurable systems using FPGAs can be found in cryptographic applications [4], video communications [5], and neural computing [6].
The second category comprises coarse-grained reconfigurable architectures based on several functional units (FUs), which execute word-level or subword-level operations instead of the bit-level operations found in common FPGAs. The coarse granularity reduces the delay, area, power consumption, and especially the reconfiguration time compared with FPGAs, but at the expense of flexibility.
In this paper we propose a system integration process for a new class of coarse-grained reconfigurable architectures called weakly programmable processor arrays (WPPA) [7]. A mathematical performance analysis is introduced to study the effects of the buffer size on the system execution time. The analysis also assesses the influence of the general purpose processor (as the master controller of the system) on the system performance. The analysis is applied to selected case study applications and its estimates are compared with experimental results. Finally, the speedup gained by using our architecture over a pure software implementation of the applications on the general purpose processor is evaluated.
Section II gives an overview of related work. Section III presents the WPPA architecture and the other system components. In Section IV, the system operation is explained. Based on the system operation, a mathematical model for estimating the execution time is presented in Section V, and its results are compared with experiments in Section VI. Finally, the paper is concluded in Section VII.
II. RELATED WORKS
With the increasing interest in reconfigurable computing, many coarse-grained architectures have been proposed [2]. Some examples are Morphosys [1], MATRIX [8], RaPiD [9] and REMARC [10]. Given their great flexibility and their ability to execute different applications, reconfigurable architectures can greatly improve the performance and flexibility of a SoC when they are coupled with a general purpose processor. Since configuration information and input/output data have to be transferred to and from the reconfigurable module (under the supervision of the main processor), an efficient interaction platform between the reconfigurable module and the main processor is required. Some approaches have been presented for the efficient system integration of coarse-grained architectures. In [1], Morphosys, a coarse-grained reconfigurable system, is presented. The system consists of an array of reconfigurable cells, a configuration memory, a control processor (tiny RISC), data buffers and a DMA controller. In order to increase the system performance, a double buffering mechanism is used, but the effects of this mechanism and its constraints are not evaluated. The REMARC system is presented in [10]. The design of this system is similar to Morphosys and targets the same class of data-parallel, high-throughput applications. Like Morphosys, REMARC uses a modified MIPS-like ISA for the RISC processor that controls the reconfigurable components. That paper studies the system performance on multimedia applications. The performance of SRC-6E, a reconfigurable computer, is evaluated in [11]. The platform consists of two general purpose processor boards and a reconfigurable processor board. That work presents a theoretical and experimental study of the I/O transmission time, an important system performance bottleneck. With the help of mathematical evaluations, the conditions under which double buffering mechanisms can be applied to the system are investigated. The mathematical formulations of the problem are then verified experimentally. Although the reconfigurable module in that work is coupled with general purpose processors, the constraints imposed by these processors on the system performance are not taken into account in the mathematical evaluations.
Fig. 1. Weakly programmable processor array and a processor element structure [7].
III. SYSTEM COMPONENTS
In this section the system integration of the WPPA architecture is explained. As discussed above, the WPPA architecture is a coarse-grained reconfigurable architecture consisting of an array of tightly coupled lightweight reconfigurable processor elements, so-called weakly programmable processor elements (WPPEs), see Fig. 1. The PEs are called weakly programmable because their control overhead is optimized and kept small. These processor elements contain several functional units with very little memory and a regular interconnection structure. In order to implement an algorithm efficiently, each PE may implement only a certain functional range. The instruction set is limited and parameterizable at compile-time. The interconnection network between the WPPEs is reconfigurable at run-time. In order to execute an application on a WPPA, the programs and the interconnection configurations have to be loaded into the array. This reconfiguration phase can also be performed at run-time. Consequently, it provides great flexibility to run different applications on the architecture and to switch between them. To make use of this feature, we integrated the WPPA architecture into a SoC in which the configuration data and input/output data are transferred under the supervision of a general purpose processor. For this purpose, a simulation model of the WPPA architecture was implemented in C++ [12]. This simulation model has been integrated into a SoC that is specified by a virtual prototyping system from VaST [13]. As shown in Fig. 2, the system consists of an ARM processor (ARM926e), a DMA controller, an interrupt controller, a memory module and the WPPA module. The components are connected by a bus with a generic standard protocol. In addition, a data loader/writer module loads a data file from the host computer as a test bench and stores the results after the computations.
The ARM processor is the master controller of the system. The system controlling software executed on this processor controls all activities in the system. This software initializes the modules in the system and configures them to cooperate with the WPPA module. The configuration includes defining and dedicating the transmission channels in the DMA controller for the input/output transmissions and configuring the interrupt controller for the different incoming interrupt signals. After triggering the first input transmission, the system controlling software enters a wait state, in which it waits for incoming interrupt signals. When an interrupt is received, the interrupt handler procedure identifies the interrupt sender (the WPPA module or the DMA controller) with the help of the interrupt controller and initiates a suitable data transmission depending on the WPPA state. The data transmissions between the WPPA module and the memory (or the data loader/writer module) are performed by the DMA controller. This device has 16 prioritized transfer channels; lower-numbered channels have higher priority. Different transfer channels can be dedicated to different devices (or different transmission paths) in the system, which decreases the transmission initiation delay during system operation.
IV. SYSTEM OPERATION
As mentioned above, the system operation is controlled by an ARM processor. Since only a limited amount of buffer memory is available inside the WPPA module, the system controlling software has to initiate the data transmissions for the input/output buffers whenever a buffer event occurs (input buffer empty or output buffer full).
Inside the WPPA module, as shown in Fig. 2, a buffer controller supervises the input/output buffers. Depending on the buffer size and the number of values read or written (which are kept in buffer counters), this controller requests data transmissions by sending an interrupt signal to the ARM processor. The system components access the input/output buffers through the I/O controller of the WPPA module. When a buffer event happens, two strategies can be followed: halting the computations and waiting for the corresponding data transmission, or continuing the computations without any halt by using a double buffering mechanism.
In the first case, after observing an input buffer empty or output buffer full event, the WPPA module halts the computations and sends an interrupt signal to the ARM processor. After the I/O data transfer has completed, the WPPA module automatically resumes the computations (see Fig. 3(a)). The double buffering mechanism makes it possible to read and write separate buffer blocks simultaneously. That is, the WPPA can read from one input block while another block is being prefetched (or write to one output block while the previous output block is being transferred) without halting the computations. Toggling between the two buffer blocks is done by the buffer controller in cooperation with the I/O controller. With this mechanism, the computation time and the transmission time can be overlapped (see Fig. 3(b)), which can improve the system performance significantly. However, as will be explained in our mathematical analysis, there are some constraints on using this mechanism.
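The difference between the two strategies can be illustrated with a small timeline model. The following Python sketch is only an illustration under simplifying assumptions that are not taken from the paper: only input blocks are modeled, every block has the same transfer time `t_xfer` (with the interrupt handling delay folded into it) and the same computation time `t_comp`, and a single DMA channel is used. The function names are hypothetical.

```python
def finish_time_halting(n_blocks, t_xfer, t_comp):
    """Single buffer: each block is transferred, then computed, strictly in turn."""
    return n_blocks * (t_xfer + t_comp)


def finish_time_double_buffered(n_blocks, t_xfer, t_comp):
    """Two buffer blocks: block i+1 is prefetched while block i is being computed."""
    xfer_done = t_xfer  # block 0 must arrive before any computation can start
    comp_done = 0
    for i in range(n_blocks):
        # Computation on block i needs its data and the end of the previous computation.
        comp_start = max(xfer_done, comp_done)
        comp_done = comp_start + t_comp
        if i + 1 < n_blocks:
            # The other buffer block is free once block i starts computing,
            # so the prefetch of block i+1 can begin at that moment.
            xfer_done = comp_start + t_xfer
    return comp_done


if __name__ == "__main__":
    # Computation-bound case: transfers are hidden behind the computation.
    print(finish_time_halting(4, 2, 5), finish_time_double_buffered(4, 2, 5))
    # Transfer-bound case: the array stalls while waiting for data.
    print(finish_time_halting(4, 5, 2), finish_time_double_buffered(4, 5, 2))
```

In this model the transfers are fully hidden only when `t_comp` is at least as large as `t_xfer`; otherwise the computation stalls on every block, which is the kind of constraint the analysis in Section V formalizes.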
V. SYSTEM EXECUTION TIME
Our main objective in this paper is to find a tradeoff between the buffer size and the system performance. Several constraints affect the system performance; some are imposed by the system specification and some by the nature of the application [11]. The system constraints are the system bus bandwidth, the number of concurrent DMA channels, and the delay of the interrupt handler procedure on the ARM processor. The input reading and output writing bandwidths of the application are the two major factors imposed by the application. We assume that the application is periodic, that is, data are fed into the array and generated by it periodically in fixed-size blocks. This assumption is met by a wide range of streaming applications, including encryption [14], compression, and multimedia (image, sound, ...).
One-dimensional transmission:
In this mode the data are transmitted as single-thread streams. Depending on the data dependencies and the data reuse characteristics of the application, a data overlap factor can be considered. The overlapped data is the amount of data that is retransmitted in two successive data streams. Let $L_D$ be the original data length, $L_p$ the partial data transmission length (equal to the buffer size), and $L_{ovrlp}$ the overlapping data length. Then the total data transmission length is:
where $N_{tr}$ is the number of data transmissions:
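Equations (1) and (2) are not legible in this copy. As a sketch only, the Python function below computes one form that is consistent with the definitions above. It assumes (this is not confirmed by the source) that every transmission after the first resends `l_ovrlp` items, so that each transmission delivers `l_p - l_ovrlp` new items, and that the last transmission may be shorter than `l_p`. The function name is hypothetical.

```python
def one_dim_transmissions(l_d, l_p, l_ovrlp=0):
    """Return (n_tr, total_len) for a one-dimensional data stream.

    Assumed model: n_tr = ceil((l_d - l_ovrlp) / (l_p - l_ovrlp)) and
    total_len = l_d + (n_tr - 1) * l_ovrlp.
    """
    if not 0 <= l_ovrlp < l_p:
        raise ValueError("need 0 <= l_ovrlp < l_p")
    if l_d <= l_p:
        return 1, l_d  # everything fits into a single transmission
    new_items = l_p - l_ovrlp
    n_tr = -(-(l_d - l_ovrlp) // new_items)  # integer ceiling division
    total_len = l_d + (n_tr - 1) * l_ovrlp
    return n_tr, total_len


if __name__ == "__main__":
    print(one_dim_transmissions(1024, 256))  # no overlap
    print(one_dim_transmissions(10, 4, 1))   # one item shared between neighbors
```

Without overlap the count reduces to the data length divided by the buffer size, rounded up, which matches the intuition that halving the buffer doubles the number of transactions.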
Two-dimensional transmission:
In this mode the data are partitioned into rectangular tiles; examples of applications using this transmission scheme are image processing and matrix operations. Here too, depending on the application, tiles may overlap vertically or horizontally, e.g., for the border treatment of adjacent tiles in the edge detection algorithm. Let $H_D$ and $W_D$ be the height and width of the original data, $H_p$ and $W_p$ the height and width of a tile, and $H_{ovrlp}$ and $W_{ovrlp}$ the vertical and horizontal overlap factors, respectively. Then the total height and width of the transmitted tiles will be:
where $N_{row\_tiles}$ is the number of rows of tiles and $N_{col\_tiles}$ is the number of columns of tiles:
And, the total number of tiles is:
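Equations (3) to (7) are likewise not legible in this copy. The sketch below applies the same assumed overlap model as in the one-dimensional case independently to the height and the width, and takes the total number of tiles as the product of the two counts. All names are hypothetical, and the formulas are an assumption consistent with the definitions above rather than the authors' confirmed expressions.

```python
def tiles_along(extent, tile, overlap=0):
    """Number of tiles needed to cover `extent` when neighbors share `overlap` items."""
    if not 0 <= overlap < tile:
        raise ValueError("need 0 <= overlap < tile")
    if extent <= tile:
        return 1
    return -(-(extent - overlap) // (tile - overlap))  # integer ceiling division


def two_dim_tiles(h_d, w_d, h_p, w_p, h_ovrlp=0, w_ovrlp=0):
    """Return (n_row_tiles, n_col_tiles, n_tiles, total_height, total_width)."""
    n_row = tiles_along(h_d, h_p, h_ovrlp)
    n_col = tiles_along(w_d, w_p, w_ovrlp)
    total_h = h_d + (n_row - 1) * h_ovrlp  # overlapped rows are sent twice
    total_w = w_d + (n_col - 1) * w_ovrlp  # overlapped columns are sent twice
    return n_row, n_col, n_row * n_col, total_h, total_w


if __name__ == "__main__":
    # Illustrative numbers: a 66x66 image, 18x18 tiles, 2 shared rows and columns
    # (the overlap a 3x3 window needs between adjacent tiles).
    print(two_dim_tiles(66, 66, 18, 18, 2, 2))
```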
As described above, two different strategies are considered for the system operation. Without overlapping, the execution time of the system is the total time spent on input/output data transactions plus the computations inside the WPPA. Since each data transaction consists of the time taken by the interrupt handler procedure on the ARM processor plus the DMA data transmission time, the total execution time of the system is:
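Equation (8) is not legible in this copy. One form that follows directly from the verbal description above is given below as an assumption, not as the authors' exact expression. The symbols $T_{int}$ (average delay of the interrupt handler procedure per transaction), $T_{comp}$ (total computation time) and $T_{exec}$ are names introduced here for illustration.

```latex
T_{exec} = N^{in}_{tr}\,\bigl(T_{int} + T^{in}_{p}\bigr)
         + N^{out}_{tr}\,\bigl(T_{int} + T^{out}_{p}\bigr)
         + T_{comp} \qquad (8)
```

In this form the interrupt delay is paid once per transaction, so the first two terms grow as the buffer shrinks while the computation term stays constant.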
where $N^{in}_{tr}$ and $N^{out}_{tr}$ are the numbers of input and output data transmissions, which can be calculated by Eq. (2) for one-dimensional transmissions or by Eq. (7) for two-dimensional transmissions. Note that usually no data overlap factor is considered for output data transmissions. $T^{in}_p$ and $T^{out}_p$ are the DMA transfer times for input and output data and can be calculated using the following equations:
$$T^{in}_p = \frac{D^{in}_{buff}}{B_{bus}} \quad (9)$$
$$T^{out}_p = \frac{D^{out}_{buff}}{B_{bus}} \quad (10)$$
In fact, the system operation consists of several partial computation phases. Each of them lasts as long as the computations on one complete input data parcel. The number of partial computations and their duration vary with the buffer size, but the total computation time for a given application and a given amount of input data is fixed, so Eq. (8) uses the total computation time instead of the partial computation durations.
In the non-overlapping mode, the system operates sequentially, and for every buffer event the computation is halted until the end of the data transmission. This has a negative effect on the system performance, especially for small buffer sizes, which need many data transactions (each including the interrupt procedure delay and the transmission time). Using a secondary buffer allows the system to transfer data while it is using the other buffer for reading or writing. The objective of this scheme is to overlap the computation time with the data transaction time and, consequently, to reduce the total execution time. This means that the partial computation time should cover the transaction time for one input parcel and one output parcel. Depending on the number of concurrent transmission channels for input or output transmissions, the partial computation time $T^{comp}_p$ should fulfill the following conditions:
where the inequality in Eq. (11) applies to a single channel and the inequality in Eq. (12) to multiple channels. If the partial computation time fulfills these conditions, the system execution time will be:
The system execution time in this equation consists of the system initialization and the first input data transmission, the computations inside the WPPA in parallel with the data transmissions, and finally the last output data transmission.
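Equations (11) to (13) are not legible in this copy. The Python sketch below encodes one reading of the surrounding text and should be treated as an assumption. The single-channel condition (partial computation time at least the sum of one input and one output transaction) follows the wording used for the FIR filter in Section VI. The multi-channel condition (at least the longer of the two transactions) and the parameters `t_int` and `t_init` are hypothetical additions for illustration.

```python
def transaction_time(t_int, t_p):
    """One data transaction: interrupt handler delay plus DMA transfer time."""
    return t_int + t_p


def double_buffering_ok(t_comp_p, t_int, t_in_p, t_out_p, channels=1):
    """Check whether one partial computation can hide the data transactions."""
    t_in = transaction_time(t_int, t_in_p)
    t_out = transaction_time(t_int, t_out_p)
    if channels == 1:
        # Single channel: input and output transactions are serialized.
        return t_comp_p >= t_in + t_out
    # Assumed multi-channel case: the two transactions run concurrently.
    return t_comp_p >= max(t_in, t_out)


def exec_time_double_buffered(t_comp, t_int, t_in_p, t_out_p, t_init=0.0):
    """Initialization + first input transaction + computation + last output transaction."""
    return (t_init + transaction_time(t_int, t_in_p) + t_comp
            + transaction_time(t_int, t_out_p))


if __name__ == "__main__":
    # Illustrative values in seconds; they are not measurements from the paper.
    print(double_buffering_ok(10e-6, 1e-6, 2e-6, 2e-6))
    print(exec_time_double_buffered(100e-6, 1e-6, 2e-6, 2e-6))
```

Because the total computation time is fixed, this estimate grows with the buffer size only through the first input and last output transfers, which is why the smallest buffer that still satisfies the condition is the interesting design point.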
VI. EXPERIMENTAL RESULTS
In order to evaluate the system, a platform is set up using the devices explained in Section III. The system core clock and the bus clock both run at 100 MHz ($T_{clk}$ = 10 ns). The devices are connected by a standard bus that transfers 4 bytes every 4 cycles. A single transmission channel for both input and output transmissions is provided for the WPPA module.
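The bus parameters above determine the bandwidth used in the transfer time equations: 4 bytes every 4 cycles at 100 MHz corresponds to 100 MB/s, that is, one byte per 10 ns cycle on average. The short Python sketch below works through this arithmetic. It ignores arbitration and DMA setup overhead, which the text does not quantify, and the function names are hypothetical.

```python
F_BUS_HZ = 100e6       # bus clock from the text (T_clk = 10 ns)
BYTES_PER_BEAT = 4     # the bus moves 4 bytes ...
CYCLES_PER_BEAT = 4    # ... every 4 cycles


def bus_bandwidth_bytes_per_s():
    """Sustained bus bandwidth implied by the parameters above."""
    return BYTES_PER_BEAT * F_BUS_HZ / CYCLES_PER_BEAT


def transfer_cycles(buf_bytes):
    """Bus cycles needed to move one buffer, rounded up to whole 4-byte beats."""
    beats = -(-buf_bytes // BYTES_PER_BEAT)  # integer ceiling division
    return beats * CYCLES_PER_BEAT


def transfer_time_s(buf_bytes):
    """Raw DMA transfer time for one buffer, without interrupt or setup overhead."""
    return transfer_cycles(buf_bytes) / F_BUS_HZ


if __name__ == "__main__":
    for size in (4, 64, 256, 1024):
        print(size, transfer_cycles(size), transfer_time_s(size))
```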
As case studies, two applications are implemented on the WPPA architecture: a 6-tap FIR filter and an edge detection algorithm. These applications were selected because they need different kinds of input/output buffers and different transmission modes. The FIR filter is implemented on 6 WPPEs connected as a single pipeline. The single-thread input data are fed into the pipeline at one input value per cycle. Once the pipeline is filled, one output value is generated per cycle. Single-thread FIFOs are used as input/output buffers for the input and output data streams; consequently, one-dimensional data transmission is used as the transmission scheme. In order to evaluate the effects of the FIFO size on the system performance, the application has been executed with different FIFO sizes. The resulting system execution times are shown in Fig. 4.
Because the number of data transactions grows as the buffer shrinks, the execution time rises steeply for small buffer sizes. The cause is the interrupt delay incurred in each data transaction: the more data transactions there are, the longer the total delay spent in the interrupt procedure. Conversely, for buffer sizes greater than 256 bytes, which need fewer transactions, the execution time does not change significantly.
In the same diagram, the estimated system execution time for different buffer sizes is also shown (using Eq. (8)). The estimated time for small buffer sizes, especially 4 or 8 bytes, differs from the experimental results. The reason is that we have used an average value for the interrupt delay; for small buffer sizes, which need many data transactions, any small difference between this value and the actual delay can affect the final result significantly. The average estimation error of the presented equation is about 0.93%. The 6-tap FIR filter has also been implemented purely in software on the ARM processor. The system execution time using this implementation is about 35 milliseconds. Using the hardware approach with buffer sizes greater than 256 bytes, the system executes more than 10 times faster than the pure software implementation. On the other hand, for buffer sizes smaller than 8 bytes, the hardware approach is not worthwhile. For the FIR filter, the application and system constraints do not fulfill the inequality in Eq. (11), because the partial computation time for any buffer size is less than the sum of one input and one output data transaction time.
Fig. 5. Diagram of the system execution time for the edge detection algorithm without using double buffering mechanism
As the second case study, an edge detection algorithm (Sobel operator) has been implemented on the WPPA architecture. The algorithm is computed over 3×3 pixel windows; consequently, 9 inputs must be read for each output. This window slides over the input picture to generate all output pixels. Every 3 cycles one input window is read (3 pixels per cycle), and every 3 cycles one output pixel is generated. For this application, RAM buffers and two-dimensional transmissions are used because of its data reuse. Since adjacent tiles share data at their borders, the vertical and horizontal data overlap factors have to be considered for this application. The execution times for this application are shown in Fig. 5 for different buffer sizes. The tile sizes are chosen such that the picture is partitioned into complete tiles. As the diagram shows, for buffers larger than 1024 bytes the execution time does not change significantly. The diagram also presents the estimated execution time (calculated by Eq. (8)). As for the FIR filter, there is a gap between the estimated and the measured time due to the use of an average delay value for the interrupt handler. Here, the average estimation error is about 1.29%.
To investigate whether the double buffering mechanism can be used, the inequality in Eq. (11) must be solved. The partial computation time for a given buffer size is equal to:
So, the inequality in Eq. (11) becomes:
Fig. 6 shows the execution times after applying this mechanism, along with the mathematical estimations. In this diagram the total size of the double buffers is considered, and the results are compared with the corresponding buffer sizes without double buffering. Using this mechanism improves the execution time of the edge detection algorithm by about 45% on average. Unlike in the single-buffer results, the execution time increases slightly with the buffer size. This apparent contradiction can be explained using Eq. (13), where the execution time consists of three parts: the first input tile transmission, the computation time, and the last output tile transmission. The total computation time for a given application and amount of input data is fixed, so changing the tile size has no effect on it. However, the first input and the last output tile transmission times increase with the tile size. Consequently, the total execution time increases. This shows the importance of the inequalities in Eq. (11) and Eq. (12), which can be used to calculate the smallest buffer size that is suitable for the double buffering mechanism. As an example, as shown in Fig. 6, the best execution time is obtained for a buffer size of 1024 bytes.
In addition, the edge detection algorithm has been implemented in software on the ARM processor. The total system execution time for this implementation is 412 milliseconds. Even with the smallest buffer size, the hardware approach computes the application about 6 times faster than the software approach. The speedup is about 50 for a buffer size of 100 bytes, 84 for a buffer size of 512 bytes, and 92 for a buffer size of 1024 bytes. Using double buffering with a total buffer size of 1024 bytes, the hardware approach computes the application 145 times faster than the software approach.
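The reported speedups can be turned back into approximate hardware execution times by dividing the 412 ms software time by each factor. The Python sketch below does only this arithmetic; since the speedups in the text are rounded, the resulting times are approximations and not measured values. The dictionary labels and function name are hypothetical.

```python
SW_TIME_MS = 412.0  # software execution time of the edge detection on the ARM processor

# Speedup factors as reported in the text (rounded values).
REPORTED_SPEEDUPS = {
    "100 bytes, single buffer": 50,
    "512 bytes, single buffer": 84,
    "1024 bytes, single buffer": 92,
    "1024 bytes total, double buffering": 145,
}


def implied_hw_time_ms(sw_time_ms, speedup):
    """Hardware execution time implied by a speedup over the software version."""
    return sw_time_ms / speedup


if __name__ == "__main__":
    for label, speedup in REPORTED_SPEEDUPS.items():
        print(f"{label}: about {implied_hw_time_ms(SW_TIME_MS, speedup):.2f} ms")
```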
VII. CONCLUSION
In this paper the system integration of weakly programmable processor arrays has been studied. The considered system consists of an ARM processor, a DMA controller, an interrupt controller, memory modules, a standard bus, a WPPA module, and a test bench image loader/writer module. Depending on the application, FIFO buffers or RAM buffers were implemented as I/O buffers. In order to evaluate the effects of the buffer size on the system performance, a mathematical model was introduced for estimating the system execution time. As case studies, two applications were implemented: a 6-tap FIR filter and an edge detection algorithm. For the FIR filter, changing the buffer size had a great influence on the execution time for buffer sizes less than 256 bytes, while it had no significant effect for buffer sizes greater than 256 bytes. The mathematical model estimated the execution time of this application with an average error of about 0.93%. As with the FIR filter, the system execution time for the edge detection algorithm reached a steady value for buffer sizes larger than 1024 bytes. The estimation error of the mathematical model was 1.29%. The system was also evaluated using the double buffering mechanism, which improved the execution time by 45% on average. Our mathematical analysis, together with a little knowledge about the timing delay imposed by the general purpose processor (the delay of the interrupt handler procedure), helps to explore numerous system designs in a very short time. In addition, for the double buffering mechanism, our method helps to find the smallest buffer size with the best execution time. Our future work focuses on extending the double buffering idea to a wider range of buffer sizes, executing multiple applications on a single array, and arranging efficient buffering systems for different applications on an array.
Fig. 6. Diagram of the system execution time for the edge detection algorithm using double buffering mechanism
VIII. ACKNOWLEDGEMENT
This work has been supported in part by the German Science Foundation (DFG) in project under contract TE 163/13-2.
