Abstract
Introduction
One characteristic property of applications such as image recognition, digital signal processing and video stream operations is that they are typically computationally demanding and that the same operations are repeated. Many of these applications have structured data which lie in blocks or in a regular pattern in memory. Some computer algorithms use nested loops with array indices, where the sequence of data addresses shows a regular pattern. Such properties can be exploited by letting the applications execute in one or more computation modules that receive the data in streams. This type of data processing is also known as stream-based computation. Examples of stream-based algorithms are various filtering operations on video and multimedia data (for example, discrete wavelet transforms). Other typical applications are regular numerical computations where multidimensional arrays of data have to be processed in various dimensions.
A lot of work has been done on developing new machines which are controlled by a data stream rather than by an instruction stream. Systems such as Chidi are used to accelerate multimedia applications. Likewise, the SCORE system and [6] are systems for stream-based computation. Another system is a data sequencing machine which computes the data streams in reconfigurable hardware on an FPGA. Its address stream is generated by a data sequencer in which the data is organized in 2 dimensions.
When using a stream of data, the sequence must be generated correctly according to what the computational function demands. Typically the data have to be delivered every clock cycle to avoid time gaps and idle cycles (no-operations). In our case the address patterns are generated in advance for different processing functions and stored in a context memory as parameters. Our address generator (AG) can switch between various addressing algorithms on the fly, without stopping. No empty time slots are introduced, neither within one addressing scheme nor when switching to another scheme. The design is written in VHDL (VHSIC Hardware Description Language). The address generator is specified, simulated and synthesized using Xilinx ISE 6.1. During the process, different reports were generated, giving an overview of the resource usage and speed of the design. The VHDL design has been done in a hierarchical way, a procedure that lends itself well to alterations and to evolution towards a final solution. The functionality has been tested on a Memec Design development module containing a Virtex-II Pro FPGA. The design has also been synthesized for other FPGAs from Xilinx and the results compared.
The implemented prototype unit demonstrates a framework for 3-dimensional addressing schemes. One feature of our design is that the functionality of the AG can be expanded to cope with more complex addressing schemes. New algorithms can be added by partial reconfiguration of the central state machine.
The functionality and implementation of the address generator are described in more detail in section 2. The test results are given in section 3. Finally, the paper is concluded in section 4.
The Address Generator
The AG is located on a PCI card with a connection to the Main Memory via the local bus. From the Main Memory the data are routed to a processing unit performing a function F. The processing unit is also reconfigurable and can be implemented on an FPGA on the same card. The data stored in the memory are typically changed and updated if the function F changes (figure 1). Each function in the processing unit has its associated addressing function, f (addressing).
Many algorithms exhibit a regular nature: they use the same data in many iterations, and the data are deliberately stored regularly in the memory, in blocks. The AG is characterized by a set of parameters which specifies the addresses to be generated. The set of parameters is the result of a transfer from the linear address space of the memory to a virtual data space, the mapping space. This mapping defines the set of parameters (figure 2).
We have limited our first implementation to an address generator which can handle regular mappings in 3 dimensions. As we see in figure 3, the distance between blocks in the memory is constant for each data set. The one shown has prescribed spacings between the lines and between the planes. The address generator can also handle 2-dimensional mappings consisting of a plane in the data space. One line in the data space is then a regular line in the linear memory. To generate an address stream according to a 1-, 2- or 3-dimensional mapping, we need the following 6 parameters in the set:
Base address: This is the first address. The parameters are sent to the memory successively as 108 bits of data. For the current implementation the parameters were stored in parallel in 3 FIFO queues. This was done to achieve parallel loading in minimum time. In a normal processing stream there will always be enough time between the switching of addressing algorithms for a more relaxed loading. The parameters can then be loaded, one after another, into a single buffer.
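A regular 3-dimensional mapping of this kind can be sketched in software as three nested counters. This is only an illustrative sketch, not the VHDL implementation: apart from the base address, the parameter names below (`line_length`, `line_spacing`, `lines`, `plane_spacing`, `planes`) are assumptions chosen to give six parameters, since the text names only the base address.

```python
from typing import Iterator, NamedTuple


class ParameterSet(NamedTuple):
    """Hypothetical six-parameter set; only `base` is named in the text."""
    base: int           # first address of the data set
    line_length: int    # consecutive addresses per line (>= 1)
    line_spacing: int   # distance between the starts of two lines
    lines: int          # lines per plane (>= 1)
    plane_spacing: int  # distance between the starts of two planes
    planes: int         # number of planes (>= 1)


def addresses(p: ParameterSet) -> Iterator[int]:
    """Yield the linear memory addresses of a regular 1-, 2- or 3-D mapping."""
    for plane in range(p.planes):
        for line in range(p.lines):
            start = p.base + plane * p.plane_spacing + line * p.line_spacing
            for offset in range(p.line_length):
                yield start + offset


# Two planes of two lines, each line holding two consecutive addresses.
example = ParameterSet(base=100, line_length=2, line_spacing=10,
                       lines=2, plane_spacing=100, planes=2)
print(list(addresses(example)))
# [100, 101, 110, 111, 200, 201, 210, 211]
```

Under these assumptions a 2-dimensional mapping is the special case `planes=1`, and a 1-dimensional mapping additionally sets `lines=1`.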
The address to Main Memory is 36 bits wide and is generated in the Addgenerator module. This bus width is chosen to be compatible with the newest processor address buses (Intel Pentium and beyond). The Addgenerator module consists of a state machine and several accompanying, designated "processes" connected to it. The parameters from the FIFO queues are loaded into the Addgenerator. While one set is being used, the Addgenerator prepares for the next set to be loaded, so the next set is written in the background. The next set is ready to be used for reconfiguration when the current set has finished, and it can be loaded in one clock period. This feature is essential for the address stream to have no gaps when switching parameters. The implementation shows that this can be obtained if the sets span 1, 2 or 3 dimensions. The state machine can handle successive parameter sets of arbitrary dimensionality and still deliver a correct stream without empty bubbles. The Reg-contr module is the top-level control unit for parameter loading. The boldface lines in figure 4 are the data transmitting lines, whereas the thin lines are the control lines. Hardware tests show correct functionality with a 3-dimensional parameter set.
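The background loading described above can be illustrated with a small behavioural model, in which one call to `tick()` stands for one clock cycle. It is a sketch under stated assumptions, not the state machine of the Addgenerator module: the class name `GaplessGenerator`, the `shadow` register and the six-value parameter layout are all hypothetical, and every count is assumed to be at least 1.

```python
from typing import Optional, Tuple

# Hypothetical layout:
# (base, line_length, line_spacing, lines, plane_spacing, planes)
Params = Tuple[int, int, int, int, int, int]


class GaplessGenerator:
    """Behavioural model: one tick() call represents one clock cycle."""

    def __init__(self) -> None:
        self.active: Optional[Params] = None  # set currently in use
        self.shadow: Optional[Params] = None  # next set, loaded in background
        self.offset = self.line = self.plane = 0

    def load_shadow(self, params: Params) -> None:
        """Prepare the next parameter set while the current one is running."""
        self.shadow = params

    def tick(self) -> Optional[int]:
        """Return the address for this cycle, or None for an empty slot."""
        if self.active is None:
            # Take over the prepared set in the same cycle.
            self.active, self.shadow = self.shadow, None
            self.offset = self.line = self.plane = 0
        if self.active is None:
            return None  # nothing was prepared: a bubble in the stream
        base, length, line_sp, lines, plane_sp, planes = self.active
        address = base + self.plane * plane_sp + self.line * line_sp + self.offset
        # Advance the three nested counters.
        self.offset += 1
        if self.offset == length:
            self.offset, self.line = 0, self.line + 1
            if self.line == lines:
                self.line, self.plane = 0, self.plane + 1
                if self.plane == planes:
                    self.active = None  # exhausted; next tick swaps in shadow
        return address


gen = GaplessGenerator()
gen.load_shadow((0, 2, 4, 2, 0, 1))      # a 2-dimensional set
trace = [gen.tick()]
gen.load_shadow((100, 3, 0, 1, 0, 1))    # a 1-dimensional set, loaded early
trace += [gen.tick() for _ in range(7)]
print(trace)
# [0, 1, 4, 5, 100, 101, 102, None]
```

In this model an empty slot (`None`) appears only once no further set has been prepared; as long as the next set is loaded before the current one finishes, the switch between sets of different dimensionality costs no cycle.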
Results
Some results from the synthesis report and the post-place-and-route timing report generated in ISE from the implementation are presented in Table 1. All the circuits are devices from Xilinx. The main difference is the family: the first three rows are Virtex-II Pro devices, and the rest are from the Virtex and Spartan families. The main difference between the FPGA packages is the number of gates and input-output buffers. A higher speed grade indicates a faster device; -7 is the highest for the Virtex-II Pro. The speed-1 estimate is from the synthesis report and the second one is from the post-place-and-route timing report.
From row one it can be seen that the implementation with the highest maximum clock speed is on the Virtex-II Pro. The clock speed has been rounded down to the nearest integer, and the clock speed in the last column is the most accurate. Since the data bus from the memory is 64 bits wide, the reported clock speed of 144 MHz gives the processing unit an ideal performance estimate of 1.152 GB/s (8 bytes per transfer at 144 MHz). This is valid when a new address is presented to the memory every clock cycle and no gaps are present in the address and data streams. The 5th column shows the number of slices used for one addressing unit as a percentage, indicating how many instances of the address generator could be put onto each one of the devices. The total number of logic cells for the devices ranges from about 7,000 to about 70,000, and a single implementation uses about 1,300 cells (one of the devices has 11,088 logic cells, of which 1,330 cells are used). We can reduce the slice usage of the design on this device by optimizing the FIFO usage. Instead of writing the whole parameter set in parallel to 3 FIFOs, the set can be written serially to one FIFO. This will reduce the number of slices used by approximately 20%.
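The figures quoted above can be checked with a few lines of arithmetic. The sketch assumes that the reported clock speed is 144 MHz and that the performance figure is in bytes per second, which is the reading under which a 64-bit bus gives 1.152. The function names are illustrative only, and the instance count is an upper bound that ignores routing and I/O constraints.

```python
def ideal_throughput_bytes(bus_width_bits: int, clock_hz: int) -> int:
    """Bytes per second when one bus word is delivered every clock cycle."""
    return (bus_width_bits // 8) * clock_hz


def max_instances(total_cells: int, cells_per_instance: int) -> int:
    """Upper bound on generator instances, ignoring routing and I/O limits."""
    return total_cells // cells_per_instance


throughput = ideal_throughput_bytes(bus_width_bits=64, clock_hz=144_000_000)
print(throughput)                   # 1152000000 bytes/s, i.e. 1.152 GB/s

print(max_instances(11_088, 1_330))         # 8
print(round(100 * 1_330 / 11_088, 1))       # 12.0 (percent of logic cells)
```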
For our design the state machine has been implemented to cope with the 6 parameters needed for up to 3-dimensional addressing. The mapping could be even more complicated, with more dimensions, step values etc. In addition, numerous other types of scanning patterns exist, such as zig-zag scanning, 2-dimensional FIR filtering etc. Because it is reconfigurable, the AG can be expanded to support more general types of patterns. One way of doing this is to extend the current state machine in the Addgenerator module with additional states. This complicates the state machine, and adding more states generally decreases the obtainable speed. However, for minor expansions, for instance when introducing space between each data sample instead of using consecutive samples, no new states are needed. That feature can be implemented just by adding a specified step value instead of incrementing the output of the AG by one. A more general approach to support quite different scan patterns is to use "context switching", whereby a part of the generator is changed according to the addressing algorithm. For our architecture, the state machine is the unit to be changed, and each version of the state machine will represent a context. The other modules of the AG, such as the ones dealing with the loading and changing of parameters, can be kept. One way of doing this reconfiguration is by first defining a basic, virtual FPGA. This makes fast reconfiguration of new, inner designs possible. The framework introduces overhead logic circuits, but also gives the designer greater flexibility to switch between different versions of the AG. All versions are then stored internally and can be made active in one clock period. Augmenting the address generator with context switching thus requires extra logic to define the virtual FPGA. On the other hand, there will be no need to stop the address generator for a full reconfiguration.
The details of this feature are left for future work.
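As a purely software analogy of the two extensions discussed above, the sketch below shows a step value replacing the unit increment and a table of interchangeable scan patterns standing in for contexts. None of this is taken from the design itself: the function names, the `CONTEXTS` table and the choice of a square, row-major block for the zig-zag pattern are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterator


def strided(base: int, count: int, step: int) -> Iterator[int]:
    """Add a step value instead of incrementing by one; no extra states."""
    for k in range(count):
        yield base + k * step


def zigzag(base: int, n: int, _unused: int = 0) -> Iterator[int]:
    """Zig-zag scan of an n x n block stored row by row from `base`."""
    for s in range(2 * n - 1):                 # s = row + column (diagonal)
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Odd diagonals run downwards, even diagonals run upwards.
        for row in (rows if s % 2 else reversed(rows)):
            yield base + row * n + (s - row)


# Each entry plays the role of one stored context (one state machine version).
CONTEXTS: Dict[str, Callable[[int, int, int], Iterator[int]]] = {
    "strided": strided,
    "zigzag": zigzag,
}

print(list(CONTEXTS["strided"](0, 4, 3)))   # [0, 3, 6, 9]
print(list(CONTEXTS["zigzag"](0, 3, 0)))    # [0, 1, 3, 6, 4, 2, 5, 7, 8]
```

Selecting an entry from the table corresponds to activating a stored context, while the surrounding parameter handling stays the same for every pattern.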
A reconfigurable discrete wavelet transform architecture for advanced multimedia systems has been described previously. The proposed architecture consists of a reconfigurable processing element array and a reconfigurable address generator, which is comparable with our solution. Compared to our solution, the dual address generators are more closely adapted to the problem at hand, the wavelet decomposition operation, and generate only 2-dimensional or row addresses. Our solution therefore seems to be more flexible, both in the number of dimensions which can be used and in the more general expansion possibilities given by the suitable frame construction.
Conclusion
An address generator for stream-based computation has been designed and implemented on a Virtex-II Pro FPGA. It can handle mappings in three dimensions and runs at approximately 144 MHz. The structure of the address generator allows reconfigurations with different parameter sets without introducing gaps into the processing and data streams. Future work might show that the address generator lends itself to reconfigurations for even more complex addressing schemes. If some of the circuits on the FPGA are used to incorporate context switching, the address generator will be still more flexible and the time for performing demanding algorithms will be reduced.
