Some image processing applications (e.g. computer graphics and robot vision) require the rotation, scaling and translation of digitized images in real-time (25-30 images per second). Today's standard image processors cannot meet this timing constraint, so other solutions have to be considered. This paper describes a multi-ASIC solution which is capable of doing the image processing tasks in real-time. The first ASIC is a so-called affine transformer which calculates a one-dimensional coordinate every 25 ns. The second ASIC is a bilinear interpolator which calculates an interpolated value from four known surrounding values, again every 25 ns. This ASIC is designed in a modular setup which results in a flexible accuracy of the interpolation. If more accurate interpolation is required, another ASIC (containing an interpolation stage) is used. In this way a proper accuracy is implemented for each application, reaching optimal silicon area utilization and the desired accuracy of interpolation. Using two affine transformers (for obtaining a two-dimensional coordinate pair) and an interpolator, one can build a system which can translate, rotate and scale an image of size 1024 × 1024 in real-time (25-30 images per second). In this paper the system as well as the design of the ASICs are presented.
Introduction
Digital image processing has a fast-growing range of applications ([1], [2]), mainly due to the strongly increasing performance of computers. However, real-time applications are still difficult to implement, so we have to look at dedicated solutions. Here we address the problem of the real-time manipulation of images. This means we want to rotate, translate and scale an image of size 1024 × 1024 at a rate of 25-30 images per second. Figure 1 shows an example of image manipulation.
This technique of manipulating an image in real-time can be used in many applications, like medical applications (e.g. real-time X-ray), industrial applications (e.g. robot vision) and also in consumer electronics (e.g. digital video camera). This paper consists of six sections. Section 2, which follows this introduction, presents the mathematical background of the system. The efficient implementation of the algorithm in VLSI is discussed in section 3. In section 4 we discuss our design methodology. We consider this section to be very important because it will present a methodology which we think will result in a first-time-correct design. In section 5 the results will be shown. We end this paper with a discussion. 
Mathematical Background
In this section the mathematical background of the problem is addressed. The system to be designed must be capable of rotating, translating and scaling an image. This process can be described as a coordinate transformation. If (i,j) are the coordinates of a pixel in the input image and (x,y) the corresponding coordinates in the output image, the desired operations can be described by a matrix equation with six constants (equation 1).
Some applications require the constants to be recalculated for every frame. Because the maximum frame rate is 30 frames per second, these 6 constants have to be calculated every 33 ms. The calculation of these constants can easily be done by a general purpose processor (PC or low performance workstation). The actual image addresses, however, cannot be computed in real-time with such a general purpose processor (see section 3).
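A form of equation 1 that is consistent with the six constants mentioned above, and with the four multiplications and two additions per pixel counted in the discussion of source scanning, can be sketched as follows. The symbol names A through F are assumptions made for this illustration; the text only states that there are six constants.

```latex
\begin{equation*}
\begin{pmatrix} x \\ y \end{pmatrix}
=
\begin{pmatrix} A & B \\ C & D \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
+
\begin{pmatrix} E \\ F \end{pmatrix}
\end{equation*}
```

In such a form the 2 × 2 matrix carries the rotation and scaling, and the vector (E, F) carries the translation.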
In equation 1 the input image is scanned, yielding pixel information for the output image. This method is often referred to as source scanning. Source scanning is seldom used, since it leaves holes in the output image (output pixels that receive no value) while other output pixels are written more than once because input pixels overlap. Another disadvantage is the necessary implementation of four multiplications and two additions for every pixel in the input image, which would require a large area in a VLSI implementation.
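A more efficient method is screen scanning. In screen scanning the output image is scanned. In equation 3 the inverse transformation is given:
$$ i = A' x + B' y + E', \qquad j = C' x + D' y + F' \qquad (3) $$
The constants A' through F' follow from the zoom factors $S_x$ and $S_y$, the rotation angle and the translation (see equation 5).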
Again these constants can easily be computed by the computer. Here the difference is that coordinates in the input image are calculated instead of in the output image. The way to create a complete output image is to calculate a corresponding coordinate pair in the input image for each coordinate pair in the output image. A raster-scan sequence is used, which gives some advantages for the design of the transformer. We start at (x,y) = (0,0). The following points are (0,1), (0,2), and so on, which results in an incremental scheme for the i coordinate: every coordinate follows from a previously calculated one by adding a constant. In this way the coordinates (i,j) can be calculated incrementally.
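The incremental scheme for the i coordinate can be sketched as follows, assuming the i coordinate has the form i = A'x + B'y + E' used in the later sections. The layout of the scheme is an assumption made for illustration.

```latex
\begin{align*}
i(0,0) &= E' \\
i(0,1) &= i(0,0) + B' \\
i(0,2) &= i(0,1) + B' \\
       &\;\;\vdots \\
i(1,0) &= i(0,0) + A' \\
i(1,1) &= i(1,0) + B'
\end{align*}
```

Each step costs a single addition, so no multiplier is needed once the constants have been loaded.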
Unfortunately the calculated coordinates in the input image are unlikely to be positioned on gridpoints, since the matrix multiplication in equation 3 yields coordinates with a fraction. The final image is an image with pixel values on a regular image matrix and therefore we need some kind of interpolation to find the actual pixel values. Figure 3 illustrates this situation. The points $P_1$ through $P_4$ are known pixel values and point $P_{int}$ is the point where we want to know the pixel value. If a Cartesian coordinate system is used with equal sized directions, the two-dimensional interpolation can be accomplished by a one-dimensional interpolation with respect to each coordinate axis [4]. Therefore we will only discuss the one-dimensional interpolation functions. In equation 6, $f_c = T_c^{-1}$ is the cutoff frequency and $x_s$ is the sampling distance.
Resampling results in a repeated frequency spectrum. If the interpolation function is not an ideal filter, two effects might appear which can cause artifacts in the final image. First an attenuation of the higher frequencies in the original spectrum will occur. Secondly the repeated spectra will alias back into the original spectrum. Aliasing will only occur if the spectrum of the interpolation filter is not zero at frequencies above the cutoff frequency.
From equation 6 it follows that an infinite number of pixels should be used to calculate the new pixel value, which is of course impossible to realize. Therefore other interpolation techniques will have to be considered.
From a computational point of view, the easiest interpolation algorithm is the nearest neighbor interpolation, where each new pixel is given the value of the original pixel which is nearest to it. In the spatial domain this is equivalent to a convolution with a rectangle function. The frequency response is similar to a sinc function, so higher frequencies are blurred and aliasing also occurs. In certain situations, such as diagonal straight lines, this causes some serious artifacts, like intensity steps instead of a continuous line. In medical applications this technique is not accurate enough.
Because of the disadvantages of nearest neighbor interpolation, a bilinear interpolation technique is often used. This technique calculates a new pixel value by a linear interpolation of the four surrounding pixel values. In the case shown in Figure 3, $P_{int}$ is given by equation 7.
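A common way to write the bilinear interpolation is sketched below. It assumes that $P_1$ and $P_2$ are the upper two pixels, $P_3$ and $P_4$ the lower two, and that a and b are the horizontal and vertical fractional distances of $P_{int}$ from $P_1$ (both between 0 and 1). The labeling in Figure 3 may differ.

```latex
\begin{equation*}
P_{int} = (1-a)(1-b)\,P_1 + a\,(1-b)\,P_2 + (1-a)\,b\,P_3 + a\,b\,P_4
\end{equation*}
```

Written this way, the expression has four terms with two multiplications each and three additions in total.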
Linear interpolation amounts to convolution of the sampled data with a triangle function. This triangle function corresponds to a modest low-pass filter in the frequency domain. Therefore it attenuates frequencies near the cut-off frequency resulting in smoothing of the image and it passes a significant amount of energy above the cut-off frequency.
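The difference between the two kernels can be illustrated with a small one-dimensional sketch. The function names and the test ramp are hypothetical and serve only to show the intensity steps of nearest neighbor interpolation next to the smooth result of linear interpolation.

```python
import math


def nearest_neighbor(samples, x):
    """Return the sample closest to position x (rectangle kernel)."""
    k = int(math.floor(x + 0.5))
    k = max(0, min(len(samples) - 1, k))
    return samples[k]


def linear(samples, x):
    """Interpolate linearly between the two samples around x (triangle kernel)."""
    k = int(math.floor(x))
    k = max(0, min(len(samples) - 2, k))
    f = x - k  # fractional distance to the left sample
    return (1 - f) * samples[k] + f * samples[k + 1]


ramp = [0, 10, 20, 30]
positions = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]
nn = [nearest_neighbor(ramp, x) for x in positions]
lin = [linear(ramp, x) for x in positions]
print(nn)
print(lin)
```

On the ramp, the nearest neighbor result jumps from one sample value to the next, while the linear result follows the ramp between the samples.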
In principle it is possible to use a higher order interpolation technique, where new points are calculated using at least four points in each direction. More information can be found in [3], [4] and [5]. Although higher order interpolation techniques yield a higher image quality, the computational effort is significantly greater. For this reason we have chosen to implement the bilinear interpolation technique.
Technical Approach
The problem of translating, rotating and scaling an image in real-time can be solved by dividing the system into two parts, a transformer and an interpolator. The image has a size of 1024 by 1024 pixels and the frame rate is at most 30 images per second (at 25 images per second this amounts to about $26 \times 10^6$ pixel operations per second). Furthermore, a total of 19 multiplications and additions (4 multiplications and 4 additions for the transformer and 8 multiplications and 3 additions for the bilinear interpolation) per pixel are required, which results in a computing requirement of 494 MOPS (Mega Operations Per Second). The use of lookup tables could possibly lower this number substantially, but it is still clear that general purpose computers do not have enough computing power to perform this task. Developing one or more ASICs is one solution. The idea is to make a printed circuit board (PCB) containing the ASICs and to put this board into a computer. A user interface is developed so the user can select the six constants A' through F' using mouse and keyboard. These constants are fed to the board for each new frame (every 33 ms at the maximum frame rate). This task can easily be performed by a general purpose computer and does not require the development of dedicated hardware.
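The operation count can be checked with a few lines of arithmetic. This is only a sketch of the estimate in the text; the constant and function names are hypothetical.

```python
WIDTH = HEIGHT = 1024

TRANSFORMER_OPS = 4 + 4    # multiplications + additions for both coordinates
INTERPOLATOR_OPS = 8 + 3   # multiplications + additions for bilinear interpolation
OPS_PER_PIXEL = TRANSFORMER_OPS + INTERPOLATOR_OPS


def pixel_rate(frames_per_second):
    """Number of output pixels that must be produced per second."""
    return WIDTH * HEIGHT * frames_per_second


def mops(frames_per_second):
    """Required operations per second, in millions."""
    return pixel_rate(frames_per_second) * OPS_PER_PIXEL / 1e6


for fps in (25, 30):
    print(fps, pixel_rate(fps), round(mops(fps)))
```

At 25 frames per second this gives about 26 × 10^6 pixels per second and roughly 498 MOPS, in line with the rounded figure of 494 MOPS (19 × 26 × 10^6) quoted above. At 30 frames per second the same count gives about 598 MOPS.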
Because the transformer and the interpolator are two different designs, we will discuss them separately in the next two subsections. The third subsection will address the total design.
Design of the Transformer
If we look at equation 3 we see that there are two similar equations for i and j. Therefore only one of the equations has to be implemented, here written for the i coordinate:
$$ i = A' x + B' y + E' $$
The screen scanning method, discussed in section 2, will be implemented in VLSI. In Figure 4 a block diagram of this idea is given. The necessary multiplications are realized using one adder, saving silicon area (see Figure 4 for the operation).
This addressing scheme is implemented in VLSI (actually two identical chips, one for the i and one for the j coordinate, resulting in a smaller chip and therefore a higher yield and lower cost). The A' and B' constants are products of zoom factors and the cosine or sine of the angle (see equation 5). Therefore the zoom factor determines the maximum value of these constants. Defining a minimum zoom factor $S_x = S_y = 1/4$ yields a maximum value of 4 for the two constants A' and B', which can be represented by a 3-bit integer part (with the MSB being the sign bit) and an 8-bit fractional part (11 bits in total). The constant E' represents the translation of the output image with respect to the input image. Defining a maximal translation of a quarter of an image yields a 9-bit integer part (one of which is a sign bit) and an 8-bit fractional part, for a total of 17 bits for the E' constant. Intermediate calculations are done with 21 bits. The output address contains 10 bits for the integer part (to address between 0 and 1023) and 2 fractional bits.
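The adder-only addressing scheme can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the chip's actual datapath: the function names are hypothetical, the i coordinate is assumed to be i = A'x + B'y + E', and Python integers are unbounded, so the 11-, 17- and 21-bit word lengths are not enforced. Only the 8 fractional bits of the constants and the 10.2 output format are taken from the text.

```python
FRAC_BITS = 8       # fractional bits of the constants (from the text)
OUT_FRAC_BITS = 2   # fractional bits of the output address (from the text)


def to_fixed(value):
    """Quantize a real constant to a signed integer with 8 fractional bits."""
    return int(round(value * (1 << FRAC_BITS)))


def transform_addresses(a, b, e, n_lines, pixels_per_line):
    """Yield (x, y, address) in raster order using additions only.

    The address is in units of 1/4 pixel (10 integer bits, 2 fractional bits).
    None marks a coordinate outside the range 0..1023.
    """
    a_fx, b_fx, e_fx = to_fixed(a), to_fixed(b), to_fixed(e)
    line_start = e_fx                       # value of i at (x, 0)
    for x in range(n_lines):
        acc = line_start
        for y in range(pixels_per_line):
            addr = acc >> (FRAC_BITS - OUT_FRAC_BITS)  # keep 2 fractional bits
            in_range = 0 <= addr < (1024 << OUT_FRAC_BITS)
            yield x, y, (addr if in_range else None)
            acc += b_fx                     # next point on the line: add B'
        line_start += a_fx                  # next line: add A'


for x, y, addr in transform_addresses(0.75, -0.5, 10.25, 2, 3):
    print(x, y, addr / 4)
```

The second coordinate would be produced by an identical routine loaded with the other three constants, which mirrors the use of two identical chips.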
Design of the interpolator
As discussed before, it is very unlikely that the calculated coordinates in the input image fall onto gridpoints. However, the pixel values are only known on gridpoints. Therefore an interpolation technique is needed to obtain the corresponding pixel value. As discussed in section 2 we use the bilinear interpolation technique.
A direct implementation of the bilinear interpolation technique (equation 7) requires 8 multipliers and 3 adders (if we do not count the (1 − a) and (1 − b) calculations). Directly implementing these calculations in hardware will be difficult because of the speed constraint, similar to the problems with the transformer. Therefore other solutions were considered.
The accuracy of the transformer determines the length of the fractional part of the coordinates (fixed point numbers are used for the coordinates instead of floating point numbers). Therefore only a limited number of coordinate values are possible. When two fraction bits are used, only the points shown in Figure 5 are addressed, leading to a solution where multipliers can be avoided. As stated in the introduction, the system described in this paper can be used in various applications and the accuracy demands of these applications differ. A medical application, for instance, requires much more accuracy than a consumer application like the digital video camera. An address accuracy of two fraction bits is typically enough for such consumer applications; for medical applications (e.g. dual energy image processing [3]), however, more than 5 fraction bits are sometimes required. Therefore flexibility in accuracy is required. We have designed a modular setup, so a flexible accuracy is easily achieved. A simple block is designed for every fraction bit. So, if four fraction bits are used, four building blocks will do the task. In Figure 6 this idea is shown. In Figure 7 a block diagram of a one-fraction-bit interpolation is given.
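One way in which a multiplier-free interpolator with one block per fraction bit could work is successive halving: each stage replaces the four corner values by the corner values of the quadrant selected by the next fraction bit of each coordinate, using only additions and divisions by two (shifts). This is an illustrative assumption; the block in Figure 7 may be organized differently. The function names and the corner labeling (p1 upper left, p2 upper right, p3 lower left, p4 lower right) are hypothetical.

```python
def bilinear_direct(p1, p2, p3, p4, a, b):
    """Reference formula with 8 multiplications and 3 additions."""
    return ((1 - a) * (1 - b) * p1 + a * (1 - b) * p2
            + (1 - a) * b * p3 + a * b * p4)


def one_bit_stage(corners, a_bit, b_bit):
    """Select one quadrant of the current cell; uses adds and halvings only."""
    p1, p2, p3, p4 = corners
    top = (p1 + p2) / 2
    bottom = (p3 + p4) / 2
    left = (p1 + p3) / 2
    right = (p2 + p4) / 2
    center = (top + bottom) / 2
    if not a_bit and not b_bit:
        return (p1, top, left, center)       # upper-left quadrant
    if a_bit and not b_bit:
        return (top, p2, center, right)      # upper-right quadrant
    if not a_bit and b_bit:
        return (left, center, p3, bottom)    # lower-left quadrant
    return (center, right, bottom, p4)       # lower-right quadrant


def modular_bilinear(p1, p2, p3, p4, a_bits, b_bits):
    """Chain one stage per fraction bit (most significant bit first)."""
    corners = (p1, p2, p3, p4)
    for a_bit, b_bit in zip(a_bits, b_bits):
        corners = one_bit_stage(corners, a_bit, b_bit)
    return corners[0]


# a = 0.75 (bits 1, 1) and b = 0.25 (bits 0, 1) with two fraction bits
print(modular_bilinear(10, 20, 30, 50, [1, 1], [0, 1]))
print(bilinear_direct(10, 20, 30, 50, 0.75, 0.25))
```

Adding a stage adds one fraction bit of accuracy, which is the kind of scalability the modular setup aims for.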
Design of the total system
Now that the transformer and the interpolator have been discussed, the total system will be addressed. Two transformers are used, which calculate coordinates in the input image. If the calculated coordinates are not in the address range of the input image, the interpolation is skipped and the output pixel becomes zero (black when using a standard colormap). Using the four surrounding pixel values, the final value in the output image is calculated using the modular bilinear interpolator. Every 25 ns a pixel value is calculated in the output image. With a conventional memory architecture, in which the four surrounding pixels are read one after another from a single memory, this requires a memory access time of 25/4 ≈ 6 ns per pixel. Because memory this fast is extremely expensive (if it is even available), four memory banks are used instead of one. The pixels are stored in such a fashion that only one value from each RAM is needed for every pixel in the output image. This can be achieved by making the following division:
RAM1: Pixels from even lines and even columns
RAM2: Pixels from even lines and odd columns
RAM3: Pixels from odd lines and even columns
RAM4: Pixels from odd lines and odd columns
Note that the total amount of memory is the same as in the case where one single memory is used, but that the memory bandwidth is four times higher. The total amount of RAM is 1024 × 1024 pixels × 8 bits × 2 = 2 MByte. The factor of 2 arises because we use a so-called ping-pong memory: one memory is used to write new data in, while another memory is used for display.
The disadvantage of this memory distribution scheme is that some RAM selection logic must be added. Figure 9 illustrates the need for the extra RAM selection logic. In this figure a part of the source image can be seen. The distribution of the pixel values over the RAMs is also depicted. For $P_{int1}$ the upper left corner is a value out of RAM1. The values out of the other RAMs can be selected by the same coordinate pair. For $P_{int2}$, however, the situation is different. Rounding the coordinate of $P_{int2}$ yields the correct address for RAM2 and RAM4. For the correct address of RAM1 and RAM3, a coordinate pair which is one position higher than the coordinate pair for RAM2 and RAM4 is needed. The same discussion holds for situations in the other direction. Therefore some logic is needed to control the correct addresses for the RAMs. Since the situation depends only on whether the computed address is odd or even, the control logic is very easy to design.
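The bank division and the address selection described above can be sketched as follows. The function names and the address layout inside each bank (line and column divided by two) are assumptions made for illustration; the text only fixes which pixels go into which RAM.

```python
def bank_of(line, column):
    """RAM number (1..4) that holds the pixel at (line, column)."""
    return 1 + 2 * (line % 2) + (column % 2)


def neighbor_accesses(line, column):
    """Map the four pixels around a coordinate to their RAM and bank address.

    (line, column) is the integer part of the computed coordinate, i.e. the
    upper-left pixel of the 2 x 2 neighborhood. The result maps each RAM
    number to the (bank_line, bank_column) that must be read from it.
    """
    accesses = {}
    for d_line in (0, 1):
        for d_col in (0, 1):
            l, c = line + d_line, column + d_col
            accesses[bank_of(l, c)] = (l // 2, c // 2)
    return accesses


print(neighbor_accesses(0, 0))
print(neighbor_accesses(0, 1))
```

For an upper-left pixel at line 0, column 1 (stored in RAM2), RAM2 and RAM4 are read at bank column 0 while RAM1 and RAM3 are read at bank column 1, which is the "one position higher" case described above. In every case each RAM is accessed exactly once per output pixel.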
In Figure 10 the setup of the total system is shown. A Personal Computer with a user interface generates a few control signals for the system as well as the 
Design Methodology
When developing the algorithm and the organization of the system all possible configurations should be taken into account in order to take full advantage of the use of ASICs. A high-level description of the ASICs and the total system clarifies the interfaces and the functionality of the different parts and has often proved to be necessary to guarantee a first-time-correct design. In our case, the high-level description was made using MoDL [6], which offers facilities similar to those of VHDL (and more). With this language the total image processing system was described: an image was taken as input, the memory organization was tested, the developed algorithm (divided over several ASICs) was executed and a bitmap was obtained as output. Using this bitmap output, errors could be visualized much more easily than with general test vectors.
After the high-level description was finished and proven to be correct (with the help of the algebraic simulation capability of MoDL) the design team split into different groups [7]. One group designed the interpolator ASIC while another group developed the transformer ASIC. A third group designed the PCB. Using the pertinent high-level description as a reference, each group could develop their design towards gate level. A major aspect during this detailing/refinement process was the efficient mapping of the algorithm onto the hardware resources.
If an alteration of the algorithm would give a better use of the hardware resources, a modification and test of the system was performed. The high-level description served as a reference to which the detailed designs could be compared. See also Figure 11.
Figure 11: Design flow. Steps shown in the figure: (1) start with the specification of the system; (2) after several ideas a concept idea is born; (3) description of the complete system and simulation with a test image, at which level the division into subtasks is also done; (4) sequential (and later concurrent) description of the ASIC and simulation with a test image; (5) realization of the ASIC with a design tool (i.e. Mentor Graphics Cell Station); (6) submission of the tape of the layout to the foundry. Labels in the figure include "Physical Design" and "ASIC"; a separate symbol marks a decision.
The designs were realized in a 1.5 µm double-metal single-poly CMOS standard cell technology. Schematic entry of the designs was done with Mentor Graphics' IDEA Station [8]. The designs were simulated with Quicksim [9] and results were compared with the high-level description to ensure a functionally correct design.
In the design process, the algorithms were mapped on the available standard cells. Here the designer tried to make this mapping as efficient as possible, taking into account the delay of the different standard cell components, which led to an iterative process of design changes until the timing requirements were met. When the schematic design was correct the layout generation was done using Mentor Graphics' CELL Station [10]. After the generation of the layout the two ASICs were processed at the foundry. The last step was the design of the printed circuit board for the total system. This PCB contains three of the designed ASICs: one transformer ASIC each for the i- and j-address and an interpolator ASIC performing a two-step interpolation. Additionally some control logic with the frame and line synchronization data and the system clock were required.
Results
The total design resulted in a system which performs the translation, rotation and scaling operation at a speed of 25-30 frames per second. Each computed image is interpolated in two steps using a bilinear interpolation technique. By dividing the original image over four memories, the bandwidth required of each memory is limited, so expensive high-speed memory is not needed. With the computed address the four surrounding pixel values are selected, one from each of the four RAMs. The resulting pixel value after interpolation is stored in a (video) memory. We specified and simulated the total system using a high-level design language (in our case MoDL [6]). From this description we derived a hardware description of the separate ASICs from which the ASICs could be realized. At this moment the ASICs have been processed and the PCB design is ready. The ASICs as well as the memory and the controller (implemented in FPGA) fit on a PC-AT board. In Table 1 the most important parameters of the ASICs are presented. The clock speed in this table is the clock speed needed for manipulating a 1024 × 1024 image with a 40 ms latency (25 images per second). In Figure 12 the floorplans of the two ASICs are shown and in Figure 13 a photograph of the PCB is shown. The socket of the interpolation chip did not fit on the PCB, so an alternative home-made socket has been used, as can be seen on the photograph. The system is completely functional and meets the specifications.
Figure 12: Floorplans of the ASICs.
Discussion
Starting the design process from a system point of view, we were able to organize the system in such a way that expensive hardware was avoided. The use of adders instead of multipliers proved to be sufficient here. A thorough survey of the system partitioning was made beforehand.
According to the simulations the bilinear interpolation resulted in a considerable improvement in the image quality, in comparison with the nearest neighbor interpolation technique.
The high-level description used was found to be of great help in the development of the algorithm and the specification of the different interfaces. In this way an efficient solution to the problem and a first-time-correct design could be achieved.
The described architecture is not only interesting for image processing applications, but also for (real-time) computer graphics systems. Two algorithms heavily used in this field, color interpolation and Gouraud shading, can make use of the developed ASICs. Of course special purpose hardware has been designed especially for computer graphics systems, as in the pixel-planes system [11]. The main difference between the interpolator we discussed in this paper and other implementations is that our solution does not make use of multipliers. This results in a small chip area. Another application field is the visualization of three-dimensional data. In this visualization step we want to map a three-dimensional scene onto a two-dimensional screen [12]. One step in this algorithm is the calculation of sample values by means of interpolation. The developed interpolation ASIC can be used for this purpose.
Further investigations have been done in the field of using the chips for visualizing three-dimensional medical data in real-time [12]. Furthermore we have looked at the possibility of extending the idea of scalable interpolation to a higher level of interpolation: cubic spline interpolation. Since cubic spline interpolation coefficients are defined by a third order polynomial, it is not trivial to implement the interpolation technique in a scalable fashion. However, we have implemented the cubic spline interpolation method in VLSI [13]. With this ASIC it is possible to do real-time cubic spline interpolation on a 512 × 512 image. A prototype system, using this ASIC, is currently under development.
