Abstract: Reconfigurable Computers can leverage the synergism between conventional processors and FPGAs to provide both hardware functionality and the flexibility of general-purpose computers. In a large class of applications on these platforms, the data-transfer overheads can be comparable to or even greater than the useful computations, which can degrade the overall performance. In this paper, we perform a theoretical and experimental study of this specific limitation. The mathematical formulation of the problem has been experimentally verified on a state-of-the-art reconfigurable platform, the SRC-6E. We demonstrate and quantify a possible solution to this problem that exploits the system-level parallelism within reconfigurable machines.
Introduction
Reconfigurable computers (RCs) combine the flexibility of traditional microprocessors with the power of FPGAs. The programming model aims to shield programmers from the details of the hardware description, allowing them to focus on the function being implemented. This approach allows software programmers and mathematicians to take part in the development of the code, and substantially decreases the time to solution. The SRC-6E RC is one example of this category of hybrid computers (SRC-6E C-Programming Environment Guide, 2003).
In this paper we will discuss the existing limitations on the performance of RCs, and propose an optimisation technique that improves this performance. Our experimental results confirm the efficiency of the proposed solution.
SRC-6E Reconfigurable Computer

Hardware architecture
The SRC-6E platform consists of two general-purpose microprocessor boards and one MAP reconfigurable processor board. Each microprocessor board is based on two 1 GHz Pentium 3 microprocessors. The SRC MAP board consists of two MAP reconfigurable processors. Overall, the SRC-6E system provides a 1:1 microprocessor to FPGA ratio. Microprocessor boards are connected to the MAP board through the SNAP interconnect. The SNAP card plugs into the DIMM slot on the microprocessor motherboard (SRC-6E C-Programming Environment Guide, 2003).
The hardware architecture of the SRC MAP processor is shown in Figure 1. This processor consists of two programmable User FPGAs, six 4 MB banks of on-board memory (OBM), and a single Control FPGA.
In the typical mode of operation, input data is first transferred through the Control FPGA from the microprocessor memory to OBM. This transfer is followed by computations performed by the User FPGA, which fetches input data from OBM and transfers results back to OBM. Finally, the results are transmitted back from OBM to microprocessor memory.
Programming model
The SRC-6E has a compilation process similar to that of a conventional microprocessor-based computing system, but it needs to support additional tasks in order to produce logic for the MAP reconfigurable processor, as shown in Figure 2. There are two types of application source files to be compiled. Source files of the first type are compiled targeting execution on the Intel platform. Source files of the second type are compiled targeting execution on the MAP reconfigurable processor. A file that contains a program to be executed on the Intel processor is compiled using the microprocessor compiler. All files containing functions that call hardware macros, and thus execute on the MAP, are compiled by the MAP compiler. MAP source files contain MAP functions composed of macro calls. Here, a macro is defined as a piece of hardware logic designed to implement a certain function. Since users often wish to extend the built-in set of operators, the compiler allows users to integrate their own VHDL/Verilog macros.
Current performance limitations
The total execution time for any application on a reconfigurable machine consists of the computations time and the total I/O time, as shown in Figure 3.
Figure 3 Execution time without overlapping
In a large class of applications, the total I/O time is comparable to or even greater than the computations time. As a result, the rate of the DMA transfer between the microprocessor memory and the OBM becomes the performance bottleneck.
One possible solution is to redesign the system hardware so that it supports a higher data transfer rate. Taking into account the cost of such a hardware upgrade, this solution may not be practical. Additionally, even with a higher data transfer rate, there might still be applications in which the DMA time is comparable to or even longer than the computations time.
Therefore, our goal has been to find a general solution to speed up a large class of applications running on an RC without any changes to the system hardware. Our solution exploits the system-level parallelism within the SRC machine, and requires only small changes in the application code.
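As a minimal sketch of the accounting in Figure 3, the non-overlapped execution time is the sum of the three sequential phases. The function and argument names below are illustrative only and are not part of the SRC programming environment; the timings are hypothetical and may be in any consistent unit.

```python
def no_overlap_time(t_dma_in, t_comp, t_dma_out):
    """Total execution time when the input transfer, the computations, and
    the output transfer run strictly one after another (no overlapping)."""
    return t_dma_in + t_comp + t_dma_out


def io_fraction(t_dma_in, t_comp, t_dma_out):
    """Fraction of the non-overlapped execution time spent on DMA transfers."""
    total = no_overlap_time(t_dma_in, t_comp, t_dma_out)
    return (t_dma_in + t_dma_out) / total


# Hypothetical timings: the total I/O time equals the computations time.
print(no_overlap_time(4.0, 10.0, 6.0))
print(io_fraction(4.0, 10.0, 6.0))
```

When the I/O fraction approaches or exceeds one half, the DMA transfers rather than the FPGA computations dominate the run time, which is the situation the text describes.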
The proposed optimisation technique
Model formulation
The objective of our optimisation technique is to overlap computations with the data transfer, which substantially reduces the total execution time.
This technique is constrained by both the machine and the nature of the application.
The machine constraints can be expressed in terms of the I/O bandwidth, the total number of concurrent DMA channels, the capability of overlapping the input with the output DMA channels, and the asymmetry between the input and output DMA channel bandwidths. In our model, we assume a generic hypothetical machine that has all the above-mentioned constraints (see Figure 4). In other words, we assume asymmetric I/O transfers, a non-equal number of concurrent input and output DMA channels, and varying overlapping ability among the DMA channels.
Figure 4 Model architecture
On the other hand, the application imposes some constraints, depending on its nature, which makes it difficult to model all the possible variations. Two essential variations in this context are the nature of data acceptance and data processing by the application. For the data acceptance, our model assumes that the application is periodic, i.e., data are fed into the application sequentially in fixed-size blocks. Periodicity, in general, accommodates the special nature of pipelined applications as a subset of the range of applications it covers. For the nature of processing, we assume concurrent processing of multiple blocks of data as well as linear dependency between the computations time and the amount of data being processed. These assumptions are met by a large class of applications, including encryption (Fidanci et al., 2002, 2003; Michalski et al., 2003), compression, and selected image and data processing algorithms (Parhi, 1999; El-Ghazawi and Le Moigne, 1994; Mallat, 1989).
The details of the presented technique are illustrated in Figure 5. Both the DMA-IN and DMA-OUT transfers are divided into a sequence of n data transfers each. Each of these transfer parcels is further divided into a number of concurrent transfer parcels equal to the number of the DMA channels available in each direction. The computation period is divided into a number of partial computation periods spanning the time interval between the end of the first DMA-IN transfer and the beginning of the last DMA-OUT transfer. The first and the last data parcels are special, as no computations can be performed in parallel with these data transfers.
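The two-level division described above can be sketched as follows. This is only an illustration of the bookkeeping, assuming the data can be cut at arbitrary word boundaries; `split_evenly` and `make_parcels` are hypothetical helper names, not functions of the SRC environment.

```python
def split_evenly(total, parts):
    """Split `total` items into `parts` contiguous (offset, length) ranges
    whose lengths differ by at most one."""
    base, extra = divmod(total, parts)
    ranges, offset = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def make_parcels(total_words, n, k_channels):
    """Divide a transfer of `total_words` into n sequential parcels, each
    further divided into k_channels concurrent sub-parcels (one per channel).
    Returns a list of n lists of (offset, length) pairs."""
    parcels = []
    for start, length in split_evenly(total_words, n):
        subs = [(start + off, ln) for off, ln in split_evenly(length, k_channels)]
        parcels.append(subs)
    return parcels


# Hypothetical example: 16 words, n = 4 parcels, 2 DMA channels per direction.
for i, parcel in enumerate(make_parcels(16, 4, 2)):
    print(i, parcel)
```

The same routine would be applied once to the DMA-IN data with K DMA-IN channels and once to the DMA-OUT data with K DMA-OUT channels.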
Analysis
The following notation will be used in our mathematical model:
• n DMA-IN is the number of input data parcels
• n COMP is the number of partial computations
• n DMA-OUT is the number of output data parcels
• K DMA-IN is the input transfer concurrency (multiplicity) factor, i.e., the number of concurrent input channels
• K DMA-OUT is the output transfer concurrency (multiplicity) factor, i.e., the number of concurrent output channels
• K DMA is the total DMA concurrency (multiplicity)
• K c is the computations concurrency (multiplicity) factor, i.e., the number of concurrent processing units; it is also the number of independent data channels between the OBM and the computations on the FPGA in either direction (e.g., number of OBM memory banks)
• D COMP-IN is the total input data size for the computations and it is equal to the total data transferred in by the DMA
• D COMP-OUT is the total output data size from the computations and it is equal to the total data to be transferred out by the DMA
• β is the data production-consumption factor; i.e., β > 1 for data-producing applications, and β < 1 for data-consuming applications
• D DMA-OUT is the total data size for the output transfer
• T DMA-IN is the single-channel DMA transfer time from the microprocessor memory to the OBM
• T DMA-OUT is the single-channel DMA transfer time from the OBM to the microprocessor memory
• T DMA is the single-channel total DMA transfer time
• T COMP is the total computations time for the case of no-overlapping
• T No Overlap is the total execution time for the case of no-overlapping
• T Overlap is the total execution time for the case of overlapping
• V is the DMA channel-overlapping factor; i.e., V = 0 for no overlapping between input and output DMA transfers, V = 1 for maximum overlapping between input and output DMA transfers (see Figure 5).
We also introduce the following notation for the ratios of respective times.
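A minimal sketch of how such ratios can be computed, assuming each ratio is normalised by the total single-channel DMA time T DMA = T DMA-IN + T DMA-OUT. This normalisation is an assumption, inferred from the later use of X in = 0.4 and X out = 0.6 (which sum to one) and of X c > 1 for computations that take longer than the DMA transfers. The function name and the timings are illustrative.

```python
def time_ratios(t_dma_in, t_dma_out, t_comp):
    """Return the ratios X_in, X_out, X_c under the ASSUMED normalisation
    by the total single-channel DMA time T_DMA = T_DMA-IN + T_DMA-OUT."""
    t_dma = t_dma_in + t_dma_out
    return {
        "x_in": t_dma_in / t_dma,    # share of DMA time spent on input
        "x_out": t_dma_out / t_dma,  # share of DMA time spent on output
        "x_c": t_comp / t_dma,       # computations time relative to DMA time
    }


# Hypothetical timings chosen to reproduce X_in = 0.4, X_out = 0.6, X_c = 1.
print(time_ratios(4.0, 6.0, 10.0))
```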
Equations (1)-(9) show the sources of asymmetry in DMA transfer times. Asymmetry can be caused by a difference between the transfers in the number of channels, K DMA-IN and K DMA-OUT, and in bandwidths, B DMA-IN and B DMA-OUT, which are machine constraints. It can also be caused by transferring different data sizes in each direction, depending on whether the application is data-producing (D DMA-OUT > D DMA-IN; i.e., β > 1) or data-consuming (D DMA-OUT < D DMA-IN; i.e., β < 1). In addition to these factors, asymmetry can be caused by a difference between the number of input parcels, n DMA-IN, and the number of output parcels, n DMA-OUT. To limit the asymmetry conditions to those factors that are forced by the machine and/or application constraints, and not by the proposed technique itself, we can deliberately select the number of transfer parcels in both directions to be equal; i.e., n DMA-IN = n DMA-OUT = n. In general, asymmetry can be caused by any one or all of the above three constraints. The resultant effects of all these asymmetry constraints are collectively modelled in equations (2) and (7)-(11).
As a conceptual representation (Hwang and Xu, 1998) of the model, Figure 5 suggests some bounds on the number of input channels, K DMA-IN, and the number of output channels, K DMA-OUT, when related to the number of concurrent processing units, K c. The existence of OBM, which is very common in almost all types of RCs, serves as a buffering mechanism that relaxes any bounding limits on the relation between K DMA-OUT and K c. In other words, these two factors can be changed independently without bound. On the other hand, there can be a lower bound for the relation between K DMA-IN and K c. This lower bound is forced by the fact that the first block of transferred data should be at least the minimum amount of data necessary to start the processing; i.e., (
Equations derived to assess the effectiveness of the proposed overlapping technique are grouped together in Table 1. Based on Figures 3 and 5, equations (12) and (13) have been derived to describe the total execution time for both cases of no-overlapping and overlapping. To evaluate the effectiveness of the technique, the speedup in the total execution time, S, is defined by equation (14).
Based on equations (12)-(14), equation (15) gives a simplified formula for S for the different values of the X c ratio. The upper limit on the speedup, and the asymptotic behaviour of this limit, for both cases of non-overlapped DMA channels, V = 0, and maximally overlapped DMA channels, V = 1, are given in equation (16) under the conditions of symmetric DMA transfers.
In Figure 6, the asymptotic dependence between the speedup, S, and X c is plotted for different values of the system parameters K DMA-IN, K DMA-OUT, K c, and V. Based on equation (15) and Figure 6, our technique, for a given K c, gives the best results for the case of X in = X out and K DMA-IN = K DMA-OUT, i.e., symmetric data transfer, where data transfer-in and data transfer-out take the same amount of time. If this is not the case, the speedup, S, shifts downward from the peak values (shaded areas in Figure 6) when X c varies between X cmin and X cmax, where X cmin and X cmax are defined in Table 1, equation (15). In other words, when the DMA transfer-in time differs from the DMA transfer-out time, the maximum performance degrades from the peak value, i.e., the DMA asymmetry introduces some speedup losses.
An asymmetry between the DMA-IN throughput and DMA-OUT throughput exists in the current version of the SRC-6E system. The case of X min = X in = 0.4 and X max = X out = 0.6, shown in Figure 6 , corresponds to the experimentally measured difference between the DMA-IN and DMA-OUT times for the SRC system.
The effects of the machine constraints can be studied from Figure 6, specifically by considering the reference point (R) with the points (1)-(3), and equation (16). It can be seen that the change in the asymptotic maximum in speedup, S ∞max, is in direct proportion to the change in the number of channels, K DMA, while its location, i.e., the value of X c at which this maximum is achieved, shifts left to smaller X c (faster computations) in inverse proportion to the change in the number of channels. It can also be seen that as the level of DMA channel-overlapping, V, increases, the effect of the number of channels on the speedup increases, and the speedup loss (the shaded areas in Figure 6) increases. In other words, for applications with certain I/O requirements and computational characteristics (X c, K c), the design space for the machine parameters can be explored in the direction indicated by the solid arrow.
Table 1 Equations describing the performance of the proposed technique
On the other hand, the effect of the application constraints can be studied by considering the reference point (R) with point (4) shown in Figure 6. It can be seen that the change in the asymptotic maximum in speedup, S ∞max, is in direct proportion to the change in the number of concurrent processing units, K c, and the shift in its location, a right shift to larger X c (slower computations), is also directly proportional to the change in K c. In other words, for machines with certain constraints on the DMA bandwidths, multiplicity, and channel overlapping capability, the design space for the application parameters can be explored in the direction indicated by the dashed arrow.
SRC-6E case study
Model parameters
The SRC-6E reconfigurable computer has been used as our testbed to verify our model. To apply our model to the SRC-6E system, we set some model parameters to experimentally measured values and others to values taken from the machine specifications.
The machine parameters are set to the following values:
• V = 0 (non-overlapped DMA channels).
This figure shows the effect on speedup of the DMA asymmetry, as well as the effect of computations concurrency, i.e., the number of concurrent processing units and/or the number of independent data channels between the OBM and the FPGA (see Figure 4). The peak speedup, S ∞max, can be calculated from the following equation:
In the experimental verification of this model we investigated only applications with the parameter K c set to 1.
The design problem
The design problem can be stated as follows: given the machine constraints and the application constraints, what is the minimum number of transfer parcels that achieves a speedup as close as possible to the asymptotic maximum for that application? In other words, given X in, X out, K DMA-IN, K DMA-OUT, V, and K c, we are trying to find the minimum n, n min, that gives a speedup S very close to S ∞ with an efficiency E close to 1, where S ∞ is the asymptotic value of S for this specific application, and E is the ratio between S and S ∞. From Figure 8, the design problem can be broken down into two cases, namely when 2X min > X max and when 2X min < X max. Table 2 serves as a guideline for finding the required n min.
Table 2 Design values for the minimum number of transfer parcels, n min
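The design problem stated above can be sketched as a simple search over candidate values of n. In this sketch the speedup curve is passed in as a function, and `toy_speedup` is a purely illustrative placeholder that starts at 1 for n = 1 and saturates at S ∞; it is not the paper's equation (13) or (15), and the target efficiency is a hypothetical choice.

```python
def find_n_min(speedup, s_inf, target_efficiency, candidates):
    """Return the smallest n in `candidates` whose efficiency
    E = speedup(n) / s_inf reaches `target_efficiency`, or None."""
    for n in sorted(candidates):
        if speedup(n) / s_inf >= target_efficiency:
            return n
    return None


def toy_speedup(n, s_inf=1.83):
    """Placeholder curve (an assumption for illustration only): equals 1
    at n = 1 and approaches s_inf as n grows."""
    return s_inf - (s_inf - 1.0) / n


# Hypothetical search over the parcel counts used in the experiments.
n_min = find_n_min(toy_speedup, s_inf=1.83, target_efficiency=0.95,
                   candidates=[1, 2, 4, 8, 16, 32])
print(n_min)
```

Replacing `toy_speedup` with a model or a table of measured speedups turns the same loop into a direct way of reading off n min for a chosen efficiency.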
Experimental results
The experimental work has been performed, as mentioned earlier, on the SRC-6E.
In our experiments, we selected a certain value of X c, and then repeated the experiment multiple times with different numbers of transfer parcels, n. We started with n = 1 (no overlap, speedup = 1), then used n = 2, 4, 8, 16, 32. Then, we repeated the experiments for different values of X c.
The results of the experiments are summarised in Figure 9. All curves, for any value of X c, start with unity speedup when n = 1 (no overlap case); as n increases, the speedup increases. Beyond a certain number of parcels, the speedup starts to saturate. In our experiments, we obtained the maximum speedup when X c was equal to one and n was equal to 16. The speedup obtained for these parameters was equal to 1.78, and was consistent with our theoretical predictions as in equation (17) for the case of X min = 0.4, X max = 0.6 and K c = 1.
We also confirmed experimentally that the maximum performance could be achieved when X c was close to one. When X c was larger or smaller than one, the speedup deteriorated.
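As a small worked check, the efficiency E defined in the design problem (the ratio between S and S ∞) can be evaluated for the measured speedup of 1.78 against the theoretical maximum of 1.83 reported for the current SRC system. The function names are illustrative; the speedup is taken in its usual sense as the ratio of the non-overlapped to the overlapped execution time.

```python
def speedup(t_no_overlap, t_overlap):
    """Speedup S: non-overlapped execution time over overlapped time."""
    return t_no_overlap / t_overlap


def efficiency(s, s_inf):
    """Efficiency E: achieved speedup relative to the asymptotic speedup."""
    return s / s_inf


# Values reported in the text: measured S = 1.78, theoretical maximum 1.83.
e = efficiency(1.78, 1.83)
print(f"E = {e:.3f}")
```

The measured result therefore reaches roughly 97% of the theoretical maximum.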
For the case of X c larger than one, the only gain in speedup is to hide the DMA time within the computations time. When the DMA transfer is very short relative to the computations time, the gain will also be very small. Similarly, when X c is smaller than one, the gain is to hide the computations time within the DMA time. So, the idea is always to hide the shorter time within the longer time, and when both times are close to each other, we can obtain a speedup close to 2.
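The intuition of hiding the shorter time within the longer one can be sketched with an idealised bound. The sketch assumes X c = T COMP / T DMA, a single processing unit, and perfect overlap, and it ignores both the exposed first and last parcels and the DMA asymmetry. It is therefore an optimistic simplification and not the paper's equation (15) or (16); for the asymmetric SRC-6E the reported maximum is lower, at 1.83.

```python
def ideal_overlap_speedup(x_c):
    """Idealised speedup when the shorter of (DMA time, computations time)
    is completely hidden within the longer one.

    Without overlap the time is T_DMA + T_COMP = T_DMA * (1 + x_c);
    with ideal overlap it is max(T_DMA, T_COMP) = T_DMA * max(1, x_c)."""
    if x_c <= 0:
        raise ValueError("x_c must be positive")
    return (1.0 + x_c) / max(1.0, x_c)


# The bound peaks at 2 for x_c = 1 and falls off on either side.
for x_c in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(x_c, ideal_overlap_speedup(x_c))
```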
Conclusions
In this paper, a technique for optimising the performance of an RC is introduced. A mathematical model for this technique has been derived for a generic reconfigurable machine, taking into account the constraints imposed by both the system and the application. This technique depends on overlapping the computations on the User FPGAs with the I/O transfer. This overlapping requires dividing data transfers into multiple transfer parcels that can be overlapped with partial computations.
The presented technique has been implemented and experimentally verified on the SRC-6E RC. Both the theoretical analysis and the experimental results show that this technique is effective in reducing the execution time. The maximum theoretical speedup was shown to be 2 for an application with one processing unit and a system with a single DMA channel perfectly balanced for DMA-IN and DMA-OUT transfers. For the current generation of the SRC system, the theoretical maximum speedup was shown to be 1.83, and the corresponding experimental maximum speedup was 1.78.
