Abstract. The problem of processing images, i.e., two-dimensional data arrays, is solved by implementing a two-dimensional fast Fourier transform (FFT) using single-type hardware modules (IP-cores) in the Virtex-6 FPGA architecture. We show that each stage of the two-dimensional FFT can be implemented in parallel, based on four "butterfly"-type transforms (BTr) over four elements of the data array being processed. Estimates are obtained of the time and hardware complexity of the IP-core that implements BTrs and is used in implementing the one-dimensional FFT. The results can be used to estimate hardware and time consumption when performing a two-dimensional FFT over an array of a predefined size on existing and forthcoming distributed programmable-architecture systems.
Introduction
Real-time image processing is a topical problem today. Software implementations of the relevant algorithms on a general-purpose computer are limited by the features of the von Neumann architecture.
One way out is to use special-purpose computers, particularly those with embedded hardware accelerators, both ASIC and FPGA. This paper discusses the distributed implementation of a two-dimensional fast Fourier transform (FFT) based on single-type IP-cores in the FPGA architecture. From estimates of the hardware complexity and operating delay time of the IP-cores, we obtain estimates of the hardware and time complexity of executing a two-dimensional FFT for an image of a given size.
There is a known algorithm for computing a two-dimensional FFT that is based on the one-dimensional FFT procedure [1-4]. Its weak point is that it is executed in two stages. At the first stage, the transforms are computed for rows, and at the second stage for columns (or vice versa, columns first and rows second). In either case, the second stage cannot start until all operations of the first stage have been completed. The algorithm proposed in this paper does not have this weak point: operations are performed over the elements of a two-dimensional data array in parallel, without being divided into stages.
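The two-stage scheme described above can be illustrated with a minimal sketch. It is not the implementation from [1-4]: a naive one-dimensional DFT stands in for the FFT routine, the forward kernel sign exp(-2πi...) is an assumption, and the function names are hypothetical. The sketch only shows why the column stage has to wait for the row stage to finish.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) 1D DFT, standing in for a 1D FFT routine (assumed kernel sign)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
            for u in range(n)]

def dft_2d_row_column(image):
    """Two-stage 2D transform: all rows first, then all columns."""
    # Stage 1: transform every row.
    rows_done = [dft_1d(row) for row in image]
    # Stage 2: each column needs one element from every transformed row,
    # so it cannot start before stage 1 has finished for all rows.
    cols_done = [dft_1d(list(col)) for col in zip(*rows_done)]
    # Transpose back to row-major order.
    return [list(row) for row in zip(*cols_done)]

def dft_2d_direct(image):
    """Direct evaluation of the 2D DFT double sum, used as a reference."""
    n_rows, n_cols = len(image), len(image[0])
    return [[sum(image[m][k]
                 * cmath.exp(-2j * cmath.pi * (u * m / n_rows + v * k / n_cols))
                 for m in range(n_rows) for k in range(n_cols))
             for v in range(n_cols)]
            for u in range(n_rows)]

# Small 4 x 4 test image (arbitrary values chosen for illustration).
image = [[float(4 * m + k + 1) for k in range(4)] for m in range(4)]
two_stage = dft_2d_row_column(image)
reference = dft_2d_direct(image)
max_error = max(abs(two_stage[u][v] - reference[u][v])
                for u in range(4) for v in range(4))
```

Both routines produce the same spectrum up to rounding; the difference is only in the order of operations, which is what matters for a parallel hardware implementation.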
The results obtained can be used for distributed real-time image processing with special-purpose graphics accelerators based on both existing and forthcoming FPGAs.
Furthermore, a promising approach to distributed image processing is to use distributed programmable-architecture systems [5] whose architectural elements are FPGAs, such as in [6]. FPGAs allow distributed data processing to be organized at the level of binary data operations, which makes this hardware platform a good match for the distributed implementation of a two-dimensional FFT.
We show this in the present paper. 
Two-dimensional fast Fourier transform

Note 1.
In computing a two-dimensional FFT, low frequencies are concentrated in the corners of the above matrix, which is inconvenient for further processing of the information obtained. To get a representation of the two-dimensional FFT in which low frequencies are concentrated in the center of the matrix, a simple procedure can be performed: the initial data array is multiplied element-wise by $(-1)^{m+n}$. Figures 1 and 2 show the initial image and its Fourier pattern computed according to (1), taking Note 1 into consideration.
System (1) can be represented, similarly to the one-dimensional FFT, as follows:
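The centering procedure of Note 1 can be checked numerically. The sketch below is an illustration only: it assumes a square array of even size N, indices m and n counted from zero, and a forward kernel exp(-2πi(um+vn)/N); the function names are hypothetical. Under these assumptions, multiplying by $(-1)^{m+n}$ shifts the spectrum by N/2 along both axes, so the zero-frequency term moves from the corner to the center.

```python
import cmath

def dft_2d(image):
    """Direct 2D DFT of a square array (naive reference, assumed kernel sign)."""
    n = len(image)
    return [[sum(image[m][k] * cmath.exp(-2j * cmath.pi * (u * m + v * k) / n)
                 for m in range(n) for k in range(n))
             for v in range(n)]
            for u in range(n)]

def modulate(image):
    """Multiply element (m, k) by (-1)^(m + k), as described in Note 1."""
    return [[value * (-1) ** (m + k) for k, value in enumerate(row)]
            for m, row in enumerate(image)]

n = 4                      # even size assumed
half = n // 2
image = [[float(m * n + k + 1) for k in range(n)] for m in range(n)]

plain = dft_2d(image)                 # low frequencies in the corners
centered = dft_2d(modulate(image))    # low frequencies in the center

# The modulated spectrum equals the plain one cyclically shifted by N/2.
max_diff = max(abs(centered[u][v] - plain[(u + half) % n][(v + half) % n])
               for u in range(n) for v in range(n))
```

Here `centered[half][half]` holds the same value as `plain[0][0]`, i.e., the sum of all array elements.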
Note 2. Number of matrices D sized 22
dd  was found to be
log aN  .
Let us represent system (3) in matrix form as a set of single-type "butterfly" transforms (BTr) used in computing the one-dimensional FFT:
This operation is basic to performing an FFT, both one- and multi-dimensional. Therefore, the complexity of performing BTrs determines the complexity of implementing one- and multi-dimensional FFTs. In this study, we confine ourselves to the implementation of a two-dimensional FFT based on BTr. The following is involved in the Module implementation:
Figure 3. Single-type module diagram to perform a single-type "butterfly" transform.
During the first and second clock periods, operand $X_2$ is multiplied by the predefined complex constant $W$, while operand $X_1$ is saved for the complex addition. Y, are saved in registers. As a result, the BTr is computed in a pipelined manner within three clock periods, at a clock frequency not exceeding 95.8 MHz. The slowest operation is the multiplication of complex numbers; it is therefore performed over two clock periods, with intermediate results saved to support pipelined data processing (figure 4).
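A software sketch of the butterfly operation may help fix ideas. It assumes the standard radix-2 decimation-in-time form, $Y_1 = X_1 + W X_2$ and $Y_2 = X_1 - W X_2$; the output names and the recursive driver `fft_radix2` are illustrative assumptions and do not model the Module's registers or clocking.

```python
import cmath

def butterfly(x1, x2, w):
    """One radix-2 butterfly (assumed form): returns (x1 + w*x2, x1 - w*x2)."""
    product = w * x2          # complex multiplication, the slowest step
    return x1 + product, x1 - product

def fft_radix2(x):
    """Recursive decimation-in-time FFT built only from butterfly() calls.

    The input length must be a power of two.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out

impulse_spectrum = fft_radix2([1, 0, 0, 0])    # flat spectrum expected
constant_spectrum = fft_radix2([1, 1, 1, 1])   # all energy at index 0 expected
```

Every arithmetic step of `fft_radix2` goes through `butterfly`, which mirrors the observation that the complexity of the BTr determines the complexity of the whole transform.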
Discussion
According to [7, 8], an FPGA-based combinational circuit close to the optimal implementation uses at most 0.5 of the resources of each type, i.e., D flip-flops, LUTs, Slices, and input/output units. According to the complexity estimates obtained, no more than two Modules can be placed on one FPGA of the Virtex-6 family. The limiting factor is the number of I/O units. 2
aN 
FPGAs of the Virtex-6 family are required to calculate a two-dimensional FFT for a number array sized N by N. Moreover, to calculate the operation described by (4), one FPGA of the above family is required, each of which accepts four elements at its input and returns four elements ( DPASes [5], one stage of the two-dimensional FFT is implemented within 517 clock periods of 10.438 ns each, and 10 stages within 5,170 clock periods, which gives about 5.40 μs to process one stage and about 54 μs to process the entire array. About 18.5 thousand arrays sized 1,024 by 1,024 can be processed within one second.
In case of processing an array sized 2,048 by 2,048 on a DPAS comprising 512 Virtex-6 FPGAs, the lower estimate of processing time, according to Statement 3, is 235 μs, while about 4,255 arrays of the above size can be processed within one second.
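The timing figures quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that the clock period is the reciprocal of the 95.8 MHz maximum frequency reported for the Module (about 10.438 ns) and, in line with Note 3, ignores communication delays between FPGAs; the function names are hypothetical.

```python
F_MAX_HZ = 95.8e6                    # maximum Module clock frequency from the text
CLOCK_PERIOD_S = 1.0 / F_MAX_HZ      # about 10.438 ns (assumed: period = 1 / f)

def fft2_time_s(clocks_per_stage, stages, period_s=CLOCK_PERIOD_S):
    """Lower estimate of the 2D FFT time: stages run one after another."""
    return clocks_per_stage * stages * period_s

def arrays_per_second(time_s):
    """Number of arrays processed per second for a given per-array time."""
    return 1.0 / time_s

# 1,024 by 1,024 array: 517 clock periods per stage, 10 stages.
stage_time_1024 = fft2_time_s(517, 1)      # about 5.40 microseconds
total_time_1024 = fft2_time_s(517, 10)     # about 54 microseconds
rate_1024 = arrays_per_second(total_time_1024)

# 2,048 by 2,048 array: the text quotes a lower estimate of 235 microseconds.
rate_2048 = arrays_per_second(235e-6)
```

With these inputs, `rate_1024` comes out at roughly 18.5 thousand arrays per second and `rate_2048` at roughly 4,255, matching the estimates in the text.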
Let us compare the proposed two-dimensional FFT implementation with the known implementation based on one-dimensional FFTs [1, 2] aN  Virtex-6 FPGAs are required, each of which implements two BTrs. As a result, executing a two-dimensional FFT with the proposed algorithm requires half as many Virtex-6 FPGAs. This is achieved by connecting the data to be processed at the FPGA input to the output data according to (4), which allows four BTrs instead of two to be implemented within the logic resources of one FPGA of this family, such as D flip-flops, LUTs, and Slices.
Due to the parallel-serial input of number array X into the FPGA, the number of the IP-cores implementing the Module and configured on the Virtex-6 FPGA can be increased significantly.
For the Module implemented on FPGA XC6VLX240t-1FF1156, an additional 64-bit register must be allocated to store the elements of number array X. The limiting factor is still the number of Slices involved in implementing the Module. , is about 77.4 μs. Within one second, about 12.9 thousand arrays sized 1,024 by 1,024 can be processed. When processing an array sized 2,048 by 2,048, the lower estimate of the operation time is 238 μs, and about 4,202 arrays of that size can be processed within one second.
Note 3. The delay-time estimate for the operations represented by (4) and performed on a DPAS is a lower estimate: it does not take into account the delays of the communication lines between the FPGA chips within the DPAS.
Conclusion
Currently, much attention is paid to solving a wide variety of problems on multiprocessor computer systems whose elements are general-purpose processors [9-13]. At the same time, the implementation of distributed algorithms on DPASes, whose elements are FPGAs, has not been studied sufficiently. This is particularly true for the two-dimensional FFT algorithms widely used in image processing, including real-time processing. It should also be noted that DPASes are originally intended for the distributed implementation of various algorithms at different times. The present study fills this gap.
Based on estimates of the time and hardware complexity of the Module as a single-type IP-core in the FPGA architecture of the Virtex-6 family, we have evaluated the delay time and the hardware complexity of a device that performs pipelined computation of a two-dimensional FFT with decimation in time on a DPAS. The estimates for the Module were obtained using a special-purpose FPGA CAD tool, ISE Design Suite 14.7.
A set of single-type IP-cores placed on the DPAS elements allows the two-dimensional FFT to be implemented. Estimates of the two-dimensional FFT implementation time have been obtained as a function of the number of FPGAs included in a given DPAS, the number of Modules implemented on a single FPGA, the number of clock periods within which the Module implements the transform represented by (4), and the duration of a clock period.
The results obtained in this study allow us to estimate the potential implementation of a two-dimensional FFT on DPASes, both existing and forthcoming ones.
