Mbyte which has to be transmitted via network, is reduced by a factor of two, and the required network bandwidth is reduced similarly.
Introduction
This paper describes and analyses different ways of designing and implementing the arithmetic functions $X^2 + Y^2$ and $\sqrt{Z}$, where X and Y are 16-bit integers and Z is a 32-bit integer, in FPGA¹ technology.
In a prototype setup, there is 2.5 µs available for calculating the length of one vector (X,Y), i.e. $\sqrt{X^2 + Y^2}$, and a number of implementation alternatives which can meet this timing constraint are studied. This also includes a novel implementation that is highly optimised for a speed-efficient realization in the chosen FPGA technology. Some of the implementations are demonstrated using an experimental setup with one Altera epf81188 chip, and speed measurements are compared to simulated values.
It turns out that the traditional trade-offs in ASIC circuit design regarding optimisation for area or speed may also be applied in utilising the resources of the FPGA. The primary difference is the hard limit, i.e. the Altera epf81188 chip has 1008 basic logic building blocks, LE's, and a fixed amount of routing capabilities.
*E-mail: ik(Dit.dtu.dk and spait .dtu.dk
†The name was previously: Department of Computer Science
¹In this context the term FPGA, Field Programmable Gate Array, primarily refers to the FLEX8000 series of devices from Altera [3].
For example, the specially developed fast implementation of a 16-bit squaring unit, which utilises the considered FPGA technology optimally, requires 30 % of the available LE's in one epf81188 chip. At the same time, however, it occupies all the routing resources of this chip, thereby leaving more than two thirds of the logic cells unused. On the other hand, the use of a high-level input description (in fact an arithmetic equation in VHDL) and a commercial logic synthesis tool leads to an implementation with much better fitting ability. Three copies of this considerably less efficient synthesised implementation of the 16-bit squaring unit will fit in one chip, and thereby provide more total computational power by using 99 % of the LE's.
The work presented here is part of a larger codesign case study, which is performed at the Department of Information Technology in collaboration with the Department of Mathematical Modelling, both at the Technical University of Denmark. This case study deals with implementing a combined hardware/software prototype system for an advanced image analysis method in Optical Flow analysis called In Betweening [1, 9]. In this paper the codesign aspects are discussed further in section 4.
The vector length calculation, which is the topic of this paper, is initially situated right on the hardware/software boundary, as the first calculation in software. The aim is to investigate the price/performance relation for the overall hardware/software system by implementing this vector length calculation in hardware. Besides the detailed hardware implementation considerations, this also in- 
The Arithmetic Functions
Different ways of implementing a squaring function for 16 bit integers and a square root function for a 32 bit integer are presented in this section, and a number of designs are compared in the following section.
The Array Squaring Unit
A matrix based squaring unit, which is described in The basic building block in the squaring unit (CAF), which is a combination of a full adder and a multiplexer, is also shown in Figure 1a. In this structure each bit of the input operand is broadcast in parallel to all cells in a row. The carry propagation starts at the imaginary upper right corner and proceeds towards the lower left corner, and a7 is the last computed output bit.
The Sliced Squaring Unit
The array squaring unit does not, however, lead to an efficient FPGA-based solution, as will be seen in section 3. Therefore, we consider an alternative 
where we assume $A_1 = X_1 + 4X_2$, $A_2 = Y_1 + 4Y_2$, and $X_1$, $X_2$, $Y_1$, and $Y_2$ are 2-bit integers.
The Square Root Extractor
A non-restoring square root extractor array, see [6], is shown in Figure 1b. It accepts $A = a_{n-1} \ldots a_1 a_0$ as input and calculates $Q = \sqrt{A} = q_{(n/2)-1} \ldots q_1 q_0$. This block diagram is corrected compared with the original figure in [6] by adding the triangle with six cells situated in the lower left corner of the matrix. The main building block is a "Controlled Adder-Subtracter" cell (CAS), as shown in the figure, and $N = (n/2)^2 + (n/2)$ cells are required.
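The arithmetic behind a non-restoring extractor can be sketched in software. The block below is a textbook bit-serial formulation, not a cell-level model of the CAS array in [6]; the function name `nonrestoring_isqrt` and its parameters are hypothetical and chosen only for illustration. Each loop iteration corresponds to one output bit: two radicand bits are brought down, and the partial remainder is either reduced or increased depending on its sign, with no restoring step in between.

```python
def nonrestoring_isqrt(a, n=32):
    """Return (q, r) with q = floor(sqrt(a)) and a = q*q + r.

    a is an n-bit non-negative integer and n is assumed to be even.
    """
    q = 0  # root bits produced so far
    r = 0  # signed partial remainder
    for i in range(n // 2 - 1, -1, -1):
        two_bits = (a >> (2 * i)) & 3          # next two radicand bits
        if r >= 0:
            r = (r << 2) + two_bits - ((q << 2) | 1)   # subtract trial value
        else:
            r = (r << 2) + two_bits + ((q << 2) | 3)   # add instead of restoring
        q = (q << 1) | (1 if r >= 0 else 0)    # sign of r gives the new root bit
    if r < 0:
        r += (q << 1) | 1                      # final correction of the remainder
    return q, r
```

For example, `nonrestoring_isqrt(10, n=4)` walks through two iterations and returns the root 3 with remainder 1.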
The data flow in this calculation is sequential. The input bits are applied simultaneously, one bit to each column, and the computation starts from the rightmost cell in the upper row. When the first output bit, $q_3$, is produced, this value is also broadcast to all the cells in the next row, and the computation proceeds from the rightmost cell towards the left. This procedure is repeated until the last output bit, $q_0$, is calculated.
FPGA Implementation Results
The two arithmetic functions described above are implemented in FPGA technology using four different hardware design strategies, which are described briefly here and in more detail together with the obtained results in the following section.
Synthesised: The high-level VHDL description is synthesised using Synopsys [7] and transferred to the Altera software [4], which does the placement and routing for the FPGA.
Altera-optimised: The squaring array is compiled and optimised by the Altera software [2].
Altera-direct: The squaring array is compiled without optimisation.
Sliced design: Manually sliced design according to Figure 2.
Results from the Squaring Unit
The squaring unit is implemented using all of the above mentioned methods, and a comparison is shown in Table 1.
The Synthesised implementation of the squaring unit without splitting was carried out from a VHDL description using variables of type natural, constrained to the desired data range. The splitting operation, mentioned in Table 1, means splitting of the input port, so that the equation will look like (1) when splitting in two is desired. Splitting in four then assumes the input value to be $X = X_1 + 16X_2 + 256X_3 + 4096X_4$ with 16-bit input width.
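The effect of splitting can be illustrated with a short sketch. It only shows the algebra of squaring a value written as a sum of weighted slices, using the standard identity that the square equals the sum of the slice squares plus twice every cross product; it makes no claim about how the synthesis tool maps the result to logic. The function names `split_slices` and `sliced_square` are hypothetical.

```python
def split_slices(x, slice_bits=4, n_slices=4):
    """Split x into n_slices fields of slice_bits bits, least significant first."""
    mask = (1 << slice_bits) - 1
    return [(x >> (i * slice_bits)) & mask for i in range(n_slices)]


def sliced_square(x, slice_bits=4, n_slices=4):
    """Square x from its slices: slice squares plus doubled cross products."""
    s = split_slices(x, slice_bits, n_slices)
    total = 0
    for i in range(n_slices):
        # Square of slice i, weighted by 2**(2 * i * slice_bits).
        total += (s[i] * s[i]) << (2 * i * slice_bits)
        for j in range(i + 1, n_slices):
            # Cross product appears twice, hence the extra shift by one.
            total += (s[i] * s[j]) << ((i + j) * slice_bits + 1)
    return total
```

With the defaults this corresponds to splitting a 16-bit input in four; `sliced_square(x, slice_bits=8, n_slices=2)` corresponds to splitting in two. In both cases only small multiplications of slice width are needed.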
In the Altera-optimised approach each CAF block is represented as a set of equations only. This definition lets the Altera compiler eliminate the array structure and consider the squaring unit as a set of boolean equations without predefined structure. This means, that the compiler is allowed to insert logic elements.
In the Altera-direct implementation the design specification is followed more directly. The resulting structure is very close to the source definition.
It is clear from Table 1 that the Sliced design, which is optimised to utilise the LE structure of the FLEX device, is the most efficient implementation. It is also obvious that the Altera-direct definition produces a less efficient design than the pure logical equations used in the Altera-optimised design. This means that the original block diagram is not well suited for the FLEX architecture, because the requirements of the CAF-block realization do not match the basic LE. The CAF block is too large for one LE, but too small for two LE's, so some resources in each LE are still unused.
The Synthesised solutions have two nice features: the highest level of input description (in fact, the arithmetic equation), and a good fitting ability. For instance, though the Sliced design is less area expensive, due to the use of carry chains, it occupies the entire epf81188 chip while using only 30 % of the available LE's. On the other hand, three copies of the considerably less efficient Synthesised design (split in four) could fit in one chip and provide more total computational power by using 99 % of the LE's.
Results from the Square Root Unit
We were not able to produce a working solution to the square root function from a high level VHDL specification using the Synthesised method. This is due to the more complicated algorithm of the square root extraction. The synthesis tool does not have any directives for implementing this function in an efficient way. Also, the Sliced design method, which was developed especially for the squaring function, is not considered here. Table 2 presents six implementations of the square root function. It is seen from the table, that there is some correlation between input width and design area. The Altera-optimised implementation is still more area efficient due to the redundancy of the Altera-direct implementation. On the other hand, the opposite relationship is found for the delay, as the Altera-direct design gives a better speed performance for large square root extractors. 
Measurements
A 16-bit Sliced design squaring unit and a 32-bit Altera-optimised square root unit are chosen for hardware implementation in a prototype system with a PCB containing one FPGA chip (Altera epf81188) and an interface for downloading configuration data to the FPGA chip. Due to fitting problems, some additional buffers are added to the squaring unit design, which increases the expected delay by 26 %.
At the same time, some timing logic is implemented as shown in Figure 3. It contains two framing registers and a programmable delay generator, which provides a controlled delay of the clock signal for the output register relative to the input register clock. With an oscilloscope this setup allows time interval measurements with an accuracy better than ±3 ns. The prototype board is installed in a VME bus based computer system as a slave device, and the FPGA implementations are tested at different speed rates. Two patterns of random data are applied to the input of the device, and by comparing the output with software generated values, the maximal operational speed of the implementation is obtained. Table 3 shows the experimental result from using a random pattern of length $10^7$ numbers.
Codesign Aspects
The computations in the In Betweening case study [1, 9], which was briefly mentioned in the introduction, fall into three separate parts.
First is the most computationally intensive part by far, the 3D convolution, which is currently being implemented in a dedicated hardware prototype using ASIC's (the 3D convolution engine) [8]. Second is an eigenvector analysis to determine a local flow vector for each pixel [5]. This task is performed using a programmed solution on a traditional high performance workstation. Finally, the resulting local flow vectors are optimised globally by solving a large array of linear equations, also on the workstation. All in all, the combined hardware/software prototype setup gives a speed-up from around seven days of CPU time in a pure software solution to around 20 minutes in the combined system. This part of the work now considers the increase in system performance obtained by moving the hardware/software border line one step into the eigenvector analysis, by implementing the calculation of vector lengths in FPGA technology.
With the current speed requirements (2.5 µs per vector) and 16-bit input and output data, the implementation of the vector length calculation requires two epf81188 chips. The first contains two 16-bit Synthesised squaring units (with splitting in four) and one 32-bit ripple carry adder. Here the propagation delay will be 217 ns + 54 ns = 271 ns. The second chip contains an Altera-direct implementation of the square root extractor. The total propagation delay in the two chips will then be 271 ns + 734 ns ≈ 1 µs per vector.
These simulated numbers are typical values, but the actual physical implementations have shown that the FPGA chip is about 20 % faster than these simulated values.
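The delay budget can be checked with a few lines of arithmetic. The sketch below uses only the simulated figures quoted in the text; the constant names are hypothetical, and reading "20 % faster" as a 20 % shorter delay is an assumption made here for illustration.

```python
# Simulated propagation delays quoted in the text, in nanoseconds.
SQUARING_NS = 217      # 16-bit Synthesised squaring unit, split in four
ADDER_NS = 54          # 32-bit ripple carry adder
SQRT_NS = 734          # Altera-direct square root extractor
BUDGET_NS = 2500       # 2.5 us available per vector

chip1_ns = SQUARING_NS + ADDER_NS      # first chip: squaring followed by addition
total_ns = chip1_ns + SQRT_NS          # both chips in series

# Assumption: "about 20 % faster" is read as a 20 % shorter delay.
measured_estimate_ns = total_ns * 80 // 100

margin_ns = BUDGET_NS - total_ns       # slack against the simulated total
print(chip1_ns, total_ns, measured_estimate_ns, margin_ns)
```

Under these assumptions the simulated total leaves roughly 1.5 µs of slack against the available 2.5 µs.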
To compare this FPGA solution to a pure software solution, a C program is written and run on different computers. It includes two 16-bit squaring operations, an addition, a 32-bit square root extraction and one disk access for each vector length calculation. The fastest execution was found to require 2.6 µs per vector, on a 150 MHz DEC Alpha workstation. However, a detailed comparison also has to take communication aspects into account.
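The arithmetic performed by such a software reference can be sketched as follows. This is not the C program used in the study; it is a minimal Python equivalent of the computation only, without the disk access. The function name `vector_length` is hypothetical, and the inputs are assumed to be signed 16-bit values, so that the sum of squares fits in 32 bits.

```python
import math


def vector_length(x, y):
    """Integer length of the vector (x, y): floor(sqrt(x*x + y*y)).

    Assumes x and y are signed 16-bit integers, so the sum of the two
    squares is at most 2**31 and fits in a 32-bit unsigned word.
    """
    if not (-32768 <= x <= 32767 and -32768 <= y <= 32767):
        raise ValueError("inputs must be signed 16-bit integers")
    z = x * x + y * y          # two squarings and one addition
    return math.isqrt(z)       # 32-bit integer square root


print(vector_length(3, 4))     # 5
```

Under the stated input assumption the result always fits in 16 bits, which matches the 16-bit output width used in the hardware.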
There are two scenarios: A) The workstation receives data directly from the 3D convolution engine, 150 Mwords (16-bit) in three minutes, or B) The workstation receives 75 Mwords (16-bit) in three minutes from the FPGA module.
In scenario A), the workstation is fully occupied during the three minutes by receiving and storing the data, and no other processing can be performed simultaneously. When the calculation of the vector length has to be performed, it will require reading the data from the disc and the calculation itself, which in this case will take 150 Mwords × 2.6 µs ≈ 6 minutes. Hereafter, the next steps in the image analysis, the eigenvector analysis and the global optimisation [9], will take about 12 minutes in both situations. In scenario A) the total computation time will then be 3 + 6 + 12 = 21 minutes.
In scenario B) it is possible to store the vector length data directly in the memory of the workstation due to the reduction by a factor of two of the total image data size. Therefore, the 6 minutes used in A) for disc operation and calculations are not required here, so the total computation time is reduced to 15 minutes.
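The comparison of the two scenarios amounts to a small piece of arithmetic, sketched below. All figures are taken from the text; the constant names are hypothetical. The product of 150 Mwords and 2.6 µs is computed in integer nanoseconds and gives 390 s, which the text quotes as about 6 minutes.

```python
WORDS_FROM_ENGINE = 150_000_000   # 16-bit words delivered by the 3D convolution engine
SOFTWARE_NS_PER_ITEM = 2_600      # 2.6 us per software vector length calculation

RECEIVE_MIN = 3                   # time to receive the data
SOFTWARE_CALC_MIN = 6             # rounded value used in the text
ANALYSIS_MIN = 12                 # eigenvector analysis and global optimisation

# Disc read plus calculation in scenario A, in whole seconds.
calc_seconds = WORDS_FROM_ENGINE * SOFTWARE_NS_PER_ITEM // 1_000_000_000

scenario_a_min = RECEIVE_MIN + SOFTWARE_CALC_MIN + ANALYSIS_MIN  # software only
scenario_b_min = RECEIVE_MIN + ANALYSIS_MIN                      # FPGA computes lengths

print(calc_seconds, scenario_a_min, scenario_b_min)
```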
Conclusions
Two main conclusions are drawn from this work. The first is about the utilisation of FPGA's in arithmetic calculations, and the second deals with codesign aspects in moving a specific computation from software to hardware.
Arithmetic Functions
This design and implementation study has shown that for limited word size, 16/32-bit, functions of the type $X^2 + Y^2$ and $\sqrt{Z}$ can be implemented in the Altera FLEX FPGA in a variety of ways, leaving room for speed/area optimisation. However, if the fastest and often also smallest solution is chosen, this may lead to a low overall utilisation of the hardware resources in the FPGA, due to the high internal communication requirements. Less than 30 % utilisation is observed. If more modest speed requirements are present, a considerably larger part of the resources can be utilised. This means that the design process should include a step where the degree of parallelisation is investigated.
The considered vector length calculation with 16-bit input and output data can be implemented in two epf81188 chips. The total simulated propagation delay is close to 1 µs, and the measured values are around 20 % faster. This result has a good margin to the available 2.5 µs in the current setup.
The consequence is that one storage cycle of the complete data set is omitted. The input data rate to the workstation is thereby reduced, and the workstation is capable of doing other processing while receiving the data. All in all this will reduce the total processing time from 21 to 15 minutes.
Of course, the memory size in the workstation could be increased by around 200 Mbyte, so that the full set of image data could be stored directly in memory. However, this investment is about 20 times the price of the FPGA solution.
Finally, the investigated calculations could be made considerably faster using the same FPGA technology. A solution for real time image data, which leaves only 25 ns for the computation of one vector length, is within reach by introducing a high degree of pipelining in the considered array structures. The pipelining registers are present, one in each LE, but the overall timing and the required number of units in parallel have not yet been fully investigated.
