I. INTRODUCTION
VIDEO motion compensation methods based on phase correlation algorithms are now being used successfully in equipment for professional television applications, including video slow-motion systems and digital TV standards converters. They differ from more conventional motion compensation methods in that motion estimation is performed in the frequency domain, rather than on blocks of pixels in the spatial domain. These methods produce high quality images suitable for studio broadcasting systems [1]-[4]. A key requirement in TV phase correlation systems is the need to perform two-dimensional (2-D) fast Fourier transform (FFT) computations in real time. Typical computation rates required are in the region of several billion multiply/accumulates per second (MAC's) in standard digital TV (DTV) applications and over 10 billion MAC's in digital HDTV systems. This paper describes a novel 64-point FFT chip which has been developed as part of a larger research program to develop new digital TV standards converters and slow-motion systems based on the use of phase correlation algorithms.
The chip, which has been successfully fabricated and tested, performs a forward or inverse 64-point FFT on complex two's complement video data supplied at a rate of 13.5 MHz (one real 16 b word and one imaginary 16 b word, at a clock rate of 27 MHz) and can operate at 18 Megasamples per second (36 MHz clock rate), making it suitable for wide screen TV. It has been fabricated using VLSI Technology's 0.6-µm, double-layer metal, standard cell CMOS technology, contains 535 000 transistors, and uses an internal 3.3 V power supply. It has an area of 7.8 × 8 mm², dissipates 1 W, has 48 I/O pins, and is housed in an 84-pin PLCC package, leading to a low power, cost-effective silicon solution.
Manuscript received April 8, 1996 ; revised July 9, 1996 . This work was supported in part by the U.K. Science and Engineering Research Council under grant GR/G64909.
The authors are with the Institute of Advanced Microelectronics, The Ashby Building, Queen's University of Belfast, Belfast BT9 5AH, N. Ireland.
Publisher Item Identifier S 0018-9200(96)07940-1. 
II. PHASE CORRELATION: A BRIEF OVERVIEW
Conventionally, in TV standards conversion, picture resolution conversion is performed by an averaging (filtering) process, whereas frame rate conversion is achieved by repeating fields on the time axis. This type of frame rate conversion works well with still images but can cause significant jerkiness if the program material contains motion. With motion compensation, however, the motion of an object in a scene is tracked so that a new frame is interpolated along the motion trajectory, rather than along the time axis, as illustrated in Fig. 1. A block diagram of a phase correlated motion estimator is shown in Fig. 2. An image is first partitioned into blocks, which are typically 64 pixels by 64 lines. Each image block, in two successive fields, is then phase correlated. This is achieved by first applying a 2-D FFT to each block. The phase of each transformed block is then subtracted from that of the corresponding block of the previous field. Meanwhile, the amplitudes are normalized to eliminate any variations in illumination which may confuse the motion measurement. The phase differences and the normalized amplitudes are then subjected to a 2-D inverse FFT to produce a phase correlation surface. The surface obtained contains peaks whose coordinates, measured from the center, correspond to the vertical and horizontal velocity components of the dominant motions in the scene. A list of trial motion vectors is therefore compiled by locating the peaks. Fig. 3 shows some examples of the correlation surface.
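The phase correlation steps described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the system's implementation: a direct O(N^2) DFT stands in for the 64-point FFT hardware, the block is 8 by 8 rather than 64 by 64 to keep the run time short, shifts are circular, and only the single strongest peak is returned. The helper names (`dft`, `dft2`, `make_block`, `shift_block`, `phase_correlate`) are hypothetical.

```python
import cmath

def dft(seq, inverse=False):
    """Direct 1-D DFT, used here as a stand-in for a hardware FFT."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [sum(seq[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def dft2(block, inverse=False):
    """2-D transform: 1-D transforms along rows, then along columns."""
    rows = [dft(r, inverse) for r in block]
    cols = [dft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [v][u]

def make_block(n, seed=1):
    """Deterministic pseudo-random luminance block (assumed test data)."""
    state, block = seed, []
    for _ in range(n):
        row = []
        for _ in range(n):
            state = (state * 1103515245 + 12345) % 2**31
            row.append((state >> 16) & 0xFF)
        block.append(row)
    return block

def shift_block(block, dy, dx):
    """Circularly shift a block by dy lines and dx pixels."""
    n = len(block)
    return [[block[(y - dy) % n][(x - dx) % n] for x in range(n)]
            for y in range(n)]

def phase_correlate(prev, curr):
    """Return (dy, dx) of the strongest peak on the correlation surface."""
    n = len(prev)
    f_prev, f_curr = dft2(prev), dft2(curr)
    cross = [[0j] * n for _ in range(n)]
    for v in range(n):
        for u in range(n):
            c = f_curr[v][u] * f_prev[v][u].conjugate()  # phase difference
            mag = abs(c)
            cross[v][u] = c / mag if mag > 1e-12 else 0j  # unit amplitude
    surface = dft2(cross, inverse=True)
    _, y, x = max((surface[j][i].real, j, i)
                  for j in range(n) for i in range(n))
    # Map peak coordinates to signed displacements.
    dy = y - n if y > n // 2 else y
    dx = x - n if x > n // 2 else x
    return dy, dx

prev_field = make_block(8)
curr_field = shift_block(prev_field, 2, -3)
print(phase_correlate(prev_field, curr_field))
```

Normalizing the cross spectrum to unit amplitude is equivalent to keeping only the phase difference between the two fields, which is why the peak stays sharp regardless of overall brightness.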
The trial motion vectors are then passed to a vector assignment unit as candidate vectors and are used to derive a valid motion vector for each pixel in the scene. The two image fields are spatially shifted by each candidate vector in turn and correlated by an image correlator. During the spatial correlation, the modulus of the luminance difference is calculated and analyzed. The vector which gives the lowest value is regarded as the valid vector.
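The vector assignment step can be sketched in the same spirit. The fragment below is a simplified illustration, not the image correlator used in the system: the 3-by-3 comparison window, the wrap-around at block edges, and the function name `assign_vector` are all assumptions made for the example.

```python
def assign_vector(prev, curr, y, x, candidates, radius=1):
    """Choose the candidate (dy, dx) with the lowest summed absolute
    luminance difference in a small window centred on pixel (y, x)."""
    h, w = len(prev), len(prev[0])

    def cost(vec):
        dy, dx = vec
        total = 0
        for j in range(y - radius, y + radius + 1):
            for i in range(x - radius, x + radius + 1):
                # Compare the current field with the shifted previous field.
                total += abs(curr[j % h][i % w]
                             - prev[(j - dy) % h][(i - dx) % w])
        return total

    return min(candidates, key=cost)

# Assumed test data: every pixel has a distinct luminance value.
prev_field = [[16 * y + x for x in range(8)] for y in range(8)]
# Current field is the previous field moved down 1 line and right 2 pixels.
curr_field = [[prev_field[(y - 1) % 8][(x - 2) % 8] for x in range(8)]
              for y in range(8)]
trial_vectors = [(0, 0), (1, 2), (-1, 0)]
print(assign_vector(prev_field, curr_field, 4, 4, trial_vectors))
```

Because only the short list of trial vectors from the correlation surface is tested, the per-pixel search stays small compared with an exhaustive block-matching search.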
Other applications of phase correlation include slow motion generation, noise reduction, correction of film unsteadiness, and HDTV bandwidth reduction. The availability of cost-effective real-time FFT VLSI chips is therefore a prerequisite for the hardware implementation of such systems.
III. FFT ARCHITECTURE
Prior work on Fourier transform systems typically falls into one of two categories: a) methods based on direct discrete Fourier transform implementations [5]-[7] and b) methods based on direct hardware implementations of established FFT signal flow graphs [8]-[18]. The problem with such solutions is that the approach adopted at the algorithmic level typically takes little account of the implications at the architecture, data flow, or chip design levels. Consequently, many such designs may be irregular, dominated by wiring, and may have heavy overheads in terms of data storage [12]-[15]. The approach adopted here has been to develop algorithm-to-architecture mappings directly from first principles with the aim of producing an efficient silicon solution. In particular, the inherent structure within the DFT matrix has been exploited to tailor the organization and flow of data to that best suited to candidate architectural solutions. In the case of a radix-4 matrix factorization, this involves partitioning the DFT matrix into regular blocks of smaller matrices, with these then being mapped onto a regular silicon structure. The merits of using a radix-4 decomposition (rather than other radices) are well documented [19]. These include regularity, simplicity, and a reduction in the number of complex multiplications required. Conventional Cooley-Tukey radix-4 FFT algorithms [20] are based on the decimation of either the time or the frequency samples, leaving the other in natural order. However, investigations undertaken in the course of this research suggested that greater architectural advantage can be achieved by using a base-4 decomposition in which both the time and frequency transform data are reversed in order prior to the matrix factorization [21].
This decimation-in-time-and-frequency (DITF) algorithm leads not only to computational efficiencies similar to those of more conventional Cooley-Tukey algorithms but also to regular architectures in which data flow is tailored to that required in typical real-time image processing applications. As will be discussed, it also leads to important reductions in data storage requirements.
The approach adopted here can be explained by considering the example of a 16-point DFT matrix, as seen in (1) at the bottom of the page.
In order to exploit the inherent structure within the matrix, both the time and frequency sequences can be permuted into a base-4 digit-reversed order. Equation (1) can hence be rewritten as (2), shown at the bottom of the page. Since the rows within each of the individual submatrices in (2) are the same, the multiplication of these submatrices with the input sequence can effectively be replaced by a 4-by-4 matrix multiplication, as shown in (3), where the entries are the elements within each row of the new matrices that result from multiplying the respective submatrices by the appropriate four elements of the input vector. From (3), it follows that the computation in (2) can be written as (4).

A similar factorization and decimation of a 64-point DFT matrix can be achieved as outlined in the Appendix. This has been used directly to derive the architecture shown in Fig. 4. This architecture consists of a repetitive cascade of three building blocks: a) a radix-4 computation array (R4CA); b) twiddle multipliers; and c) data commutator circuits. It also contains appropriate input and output formatters to allow a direct interface with DTV signals input in natural order. From an implementation point of view, it was decided that real and imaginary data should be multiplexed onto the same data bus in order to reduce the number of I/O pins by half. The R4CA blocks perform the radix-4 operations described and involve only additions and subtractions (see Fig. 5). As data streams are clocked through the circuit, they are reordered using different commutator circuits and then multiplied by twiddle factors stored in ROM. The computed results are then clocked through the system and emerge in natural order at the output formatter.
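The structure exploited in the 16-point example can be checked numerically. The sketch below uses the textbook radix-4 index mapping (n = 4·n1 + n0, k = 4·k1 + k0) as an assumed reading of the factorization; it is not a reproduction of equations (2)-(4), and all function names are hypothetical. It shows that, once rows and columns are taken in base-4 digit-reversed order, every 4-by-4 block of the DFT matrix is a product of a twiddle factor and powers of W_4, and that the transform therefore reduces to a radix-4 stage, a twiddle multiplication, and a second radix-4 stage.

```python
import cmath

def w(n, e):
    """Twiddle factor W_n^e = exp(-2*pi*i*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def digit_reverse4(i):
    """Swap the two base-4 digits of an index in the range 0..15."""
    return 4 * (i % 4) + i // 4

def permuted_block(k0, n0):
    """4x4 block (k0, n0) of the 16-point DFT matrix after both the rows
    and the columns have been put into base-4 digit-reversed order."""
    return [[w(16, digit_reverse4(4 * k0 + k1) * digit_reverse4(4 * n0 + n1))
             for n1 in range(4)] for k1 in range(4)]

def dft16_direct(x):
    """Reference: direct evaluation of the 16-point DFT."""
    return [sum(x[n] * w(16, n * k) for n in range(16)) for k in range(16)]

def dft16_radix4(x):
    """Radix-4 stage -> twiddle multiply -> radix-4 stage."""
    # Stage 1: 4-point transforms over n1 (only +/-1 and +/-j weights).
    stage1 = [[sum(x[4 * n1 + n0] * w(4, n1 * k0) for n1 in range(4))
               for k0 in range(4)] for n0 in range(4)]
    # Twiddle multiplication by W_16^(n0*k0).
    tw = [[stage1[n0][k0] * w(16, n0 * k0) for k0 in range(4)]
          for n0 in range(4)]
    # Stage 2: 4-point transforms over n0.
    out = [0j] * 16
    for k0 in range(4):
        for k1 in range(4):
            out[4 * k1 + k0] = sum(tw[n0][k0] * w(4, n0 * k1)
                                   for n0 in range(4))
    return out

samples = [complex(n, (3 * n) % 5) for n in range(16)]
error = max(abs(a - b) for a, b in zip(dft16_direct(samples),
                                       dft16_radix4(samples)))
print(error < 1e-9)
```

The two radix-4 stages use only the weights 1, -1, j, and -j, which is consistent with the statement that the R4CA blocks involve only additions and subtractions; the only true multiplications are the twiddle factors between the stages.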
Detailed investigations were undertaken in order to ascertain the best tradeoffs between a) computational requirements; b) silicon area; and c) data organization/data flow. This led to an internal circuit organization in which data is processed in a word-parallel, digit-serial manner. In this case, data input in a conventional word-serial, bit-parallel format is reorganized into eight parallel (four real and four imaginary) streams of 3-b wide digits. Use of this data organization means that all computational blocks are 100% utilized. This contrasts with the Bi and Jones [22] architecture, in which the multiplier blocks exhibit 75% efficiency. The Gold-Bially [14] architecture can achieve 100% efficiency, but only if all four data words to be processed are available in parallel. For the application considered (as in many other FFT systems), where the FFT processor must be interfaced to a continuous word-serial stream, that architecture can achieve only 25% computational efficiency, as there is a 4:1 mismatch between the bandwidth of the input data and that of the processor.
The bit-sliced nature of the architecture facilitates tradeoffs between digit size and throughput rates. In this case, a 3-b wide digit stream was found to be sufficient to allow the required chip sampling rates to be achieved in the fabrication technology used. Fig. 6 describes the data conversion and reordering performed by the input formatter. This consists of a converter and a delay commutator. For the purposes of illustration, each digit is numbered in Fig. 6(a) . As illustrated, the data converter distributes the block of 64 bit-parallel samples into four subblocks of 16 digit-serial samples, in effect performing a matrix transposition. The output formatter has the same basic function as the input formatter and is implemented using a similar circuit.
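The digit-serial data organization can be illustrated with a short sketch. It assumes a 24-b word (the internal precision quoted later in the paper) split into eight 3-b digits and transmitted least significant digit first; the digit ordering and the function names are assumptions made for the example, not details taken from the chip.

```python
DIGIT_BITS = 3  # digit width quoted in the text

def to_digits(value, word_bits=24, digit_bits=DIGIT_BITS):
    """Split a two's complement word into digits, least significant first."""
    assert word_bits % digit_bits == 0
    raw = value & ((1 << word_bits) - 1)  # two's complement bit pattern
    mask = (1 << digit_bits) - 1
    return [(raw >> s) & mask for s in range(0, word_bits, digit_bits)]

def from_digits(digits, digit_bits=DIGIT_BITS):
    """Reassemble a signed word from its digit stream."""
    word_bits = len(digits) * digit_bits
    raw = sum(d << (i * digit_bits) for i, d in enumerate(digits))
    if raw >= 1 << (word_bits - 1):  # sign bit set
        raw -= 1 << word_bits
    return raw

def digit_serial_add(a_digits, b_digits, digit_bits=DIGIT_BITS):
    """Add two words one digit per clock, carrying between digits."""
    mask = (1 << digit_bits) - 1
    carry, out = 0, []
    for a, b in zip(a_digits, b_digits):
        s = a + b + carry
        out.append(s & mask)
        carry = s >> digit_bits
    return out  # final carry is dropped: result wraps modulo 2**word_bits

total = digit_serial_add(to_digits(1000), to_digits(-2345))
print(from_digits(total))
```

Widening the digit reduces the number of clock cycles per word at the cost of a wider adder, which is the digit size versus throughput tradeoff mentioned above.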
It should be noted that the storage requirement of the architecture presented, measured in complex word registers, is lower than that of both the Gold and Bially [14] and the Bi and Jones [22] architectures, which represents an important saving. A more detailed comparison of the relative complexity of the three architectures is given in Table I. In this it is assumed that all architectures are interfaced with word-serial, bit-parallel I/O samples in a natural order. The entries in Table I are expressed in terms of the number of bits per input sample and the digit size. The architecture described internally employs digit-serial adders/multipliers. This contrasts with the Gold and Bially and Bi and Jones circuits, which use bit-parallel multiplier/adders.
IV. FFT CHIP DESIGN

A. Overall Architecture
A block diagram of the overall chip, showing its various input and output interfaces, is given in Fig. 7; the chip layout is shown in Fig. 8 and a chip photograph in Fig. 9. These interfaces were developed to facilitate the integration of the chip with existing video processing systems and to ensure that no off-chip data permutation or format conversion circuitry is required. A block diagram of the device (including test circuitry) is shown in Fig. 10. The chip is synchronized to the start of the image frame (Dsync) and the image block (Bsync), which means that it can be directly interfaced to a raster scan video stream. This also allows two chips to be cascaded directly to perform a 2-D transform. The reliability and testability of the chip are also enhanced using built-in overflow control and test circuitry. Four optional output modes are available. One of these corresponds to the normal operation of the processor, whereas the others are provided for the purposes of testing.
B. Signal-to-Noise Ratio Optimization
Although a block-floating point format can provide a large dynamic range [19], it is not suited to 2-D transform applications because of the undesirable intermediate normalization, which tends to degrade processor accuracy. A fixed-point format has therefore been chosen for its simplicity. The effect of finite arithmetic on the FFT output signal-to-noise ratio (SNR) was investigated [21]. Mathematical error models for various radix-4 FFT configurations were developed, and the error analysis was verified through extensive simulations in which real image data was processed using a detailed HDL description of the chip. A dynamic range of 90-96 dB at the FFT output, as required in studio DTV applications, means that an output word accuracy of 15-16 b (fixed point) is necessary. This requirement was achieved using 24-b internal precision. Detailed investigations showed that the scaling-after-multiplication, stage-by-stage truncated shifting scheme used produces a better SNR than the other alternatives examined. These included overall scaling schemes and stage-by-stage shifting schemes (such as those used in decimation-in-frequency algorithms, in which scaling is applied before multiplication).
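The link between the 90-96 dB dynamic range and the 15-16 b word accuracy follows from the usual rule of thumb of roughly 6 dB per bit. As a short worked check, taking the dynamic range DR of a B-bit fixed-point word as the ratio of full scale to one least significant bit (a standard approximation, not a result taken from the error analysis in [21]):

```latex
\[
  \mathrm{DR}(B) = 20\log_{10}\!\left(2^{B}\right) \approx 6.02\,B\ \text{dB},
  \qquad
  \mathrm{DR}(15) \approx 90.3\ \text{dB},
  \qquad
  \mathrm{DR}(16) \approx 96.3\ \text{dB}.
\]
```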
C. Overflow Control
To avoid overflow in the computation, each input data value is right shifted by 2 b before entering each R4CA. There are three R4CA's in the processor. The first stage shifting is simply incorporated at the input interface connection without additional hardware, whereas the second and third stage shifting are implemented by inserting a digit-serial shifter at the output of the twiddle multipliers. In addition, overflow may result when 24-b internal data is rounded to 16 b at the output interface. For this reason, an overflow detector and saturation circuit are incorporated into the chip. Any value below the most negative representable output value is set to that minimum, whereas any value above the most positive representable output value is set to that maximum.
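The overflow control described above can be sketched as follows. The 2-b pre-shift reflects the fact that a radix-4 stage sums four values and can therefore grow the magnitude by up to a factor of four. The rounding rule (add half an output LSB, then shift) and the 16-b two's complement limits of -32768 and 32767 are assumptions made for this example, as are the function names.

```python
def scale_stage_input(value, shift=2):
    """Arithmetic right shift applied before each radix-4 stage.
    Python's >> on ints is an arithmetic (sign-preserving) shift."""
    return value >> shift

def round_and_saturate(value, in_bits=24, out_bits=16):
    """Round a 24-b internal value to 16 b and clip on overflow.
    Rounding scheme assumed: add half an output LSB, then shift."""
    drop = in_bits - out_bits
    rounded = (value + (1 << (drop - 1))) >> drop
    lowest = -(1 << (out_bits - 1))       # assumed minimum: -32768
    highest = (1 << (out_bits - 1)) - 1   # assumed maximum: 32767
    return max(lowest, min(highest, rounded))

MAX24 = (1 << 23) - 1
MIN24 = -(1 << 23)

# Four pre-shifted worst-case inputs still fit in 24 b after summation.
print(MIN24 <= 4 * scale_stage_input(MIN24) and 4 * scale_stage_input(MAX24) <= MAX24)
# Rounding the largest internal value would give 32768, so it is clipped.
print(round_and_saturate(MAX24))
```

Without the saturation step, the rounded value 32768 would wrap to the most negative output code, turning a near full-scale positive result into a large negative one.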
D. Test
The inherently linear structure of the architecture used greatly facilitates chip testing. A combination of a bus-oriented ad hoc technique and built-in self test (BIST) [24] was used to test the device. This was done by partitioning the design into blocks, as shown in Fig. 10. Around 80 K test vectors were sufficient to provide a comprehensive test for the chip. The vectors used include data for setting up the initial test circuitry, as well as actual image data. The latter was also used to test the full dynamic range of the device. The entire testing procedure is controlled by five test control pins, Test[4:0]. Multiplexers have been inserted to direct the test outputs to the output pins. Modules such as the I/O formatters, delay commutators, the overflow detector, and the saturation (clipper) circuitry involve little or no computation, and therefore test patterns for these modules can be applied externally. However, for the computational modules (i.e., the twiddle multipliers and the R4CA's), a BIST technique was used. Both the test pattern generators (TPG's) and the signature analyzers use 25-stage linear feedback shift registers. In addition, the chip has been extensively and successfully tested in a full video phase correlation test system using 200 K test vectors of demanding video sequences. This has been done over a wide range of temperatures (20 °C to 120 °C at 5 V (5%)), as well as at the extremes of the timing requirements detailed in Table II.

A number of FFT chip designs have been developed to date. These include the GEC Plessey Semiconductors PS-DSP16510 FFT processor [24] and a recent 8 K point FFT chip described by Bidet et al. [25]. A detailed comparison with these devices is complicated by differences in a) transform size; b) wordlengths; c) technology; and d) implementation methods used (e.g., standard cell versus custom design).
However, work undertaken by Hui [21], which attempts to measure the effective number of normalized FFT computations per unit silicon area per watt, indicates that the performance of the new chip is significantly greater than that of the GEC Plessey device. It also compares very favorably with the Bidet chip, despite the fact that this is a 0.5-µm CMOS custom design, while that presented is based on a 0.6-µm standard cell CMOS technology. This performance is a direct result of the novel architecture used.
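Returning briefly to the BIST scheme used for the computational modules, the principle of LFSR-based pattern generation and signature analysis can be sketched as follows. The paper states only that 25-stage linear feedback shift registers are used; the feedback taps (stages 25 and 22, a commonly tabulated maximal-length choice), the serial signature register, and the stand-in module under test are all assumptions made for illustration.

```python
STAGES = 25                  # register length quoted in the text
MASK = (1 << STAGES) - 1

def lfsr_step(state, data_bit=0):
    """Advance the register one clock. Taps at stages 25 and 22 are an
    assumed choice; the chip's actual polynomial is not given."""
    feedback = ((state >> 24) ^ (state >> 21) ^ data_bit) & 1
    return ((state << 1) | feedback) & MASK

def test_patterns(seed, count):
    """Test pattern generator: successive LFSR states."""
    state = seed & MASK
    if state == 0:
        raise ValueError("an all-zero seed locks up the LFSR")
    for _ in range(count):
        yield state
        state = lfsr_step(state)

def signature(bits):
    """Serial signature analyzer: compress a response bit stream."""
    state = 0
    for b in bits:
        state = lfsr_step(state, b)
    return state

def run_bist(module, seed=1, count=1000):
    """Drive a module with patterns and return its response signature."""
    return signature(module(p) for p in test_patterns(seed, count))

def good_module(pattern):
    """Hypothetical stand-in for a block under test: parity of some bits."""
    return bin(pattern & 0x1555).count("1") & 1

faulty_pattern = list(test_patterns(1, 10))[-1]

def faulty_module(pattern):
    """Same block with its output inverted for one input pattern."""
    return good_module(pattern) ^ int(pattern == faulty_pattern)

print(run_bist(good_module) != run_bist(faulty_module))
```

Because the signature register is linear with an invertible state update, a single erroneous response bit always changes the final signature, so only one 25-b value per module has to be compared against its expected result.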
V. CONCLUSIONS
A novel radix-4 FFT architecture suitable for real-time video processing applications has been presented. This architecture is based on a direct factorization of the DFT matrix so that the resulting algorithm-to-architecture mapping is well suited to silicon implementation. It exhibits numerous attractive features from a VLSI point of view. These include regularity, modularity, simple wiring, and high data computation rates. The data flow is ideally suited to real-time digital TV/video processing applications.
A new, low-power, high-performance FFT chip for digital television applications has been designed and fabricated based on the architecture described. Its functionality has been verified across a wide range of temperatures and voltages, as well as at the extremes of the timing requirements. It has also been used in the implementation of prototype systems and shown to be fully operational. The chip includes all necessary data and coefficient storage elements to interface directly with video signals, so that no off-chip permutation of data is required. The output accuracy of the chip results in a signal-to-noise ratio in excess of 90 dB. If measured in terms of direct DFT computational requirements, the device performs the equivalent of 3.5 billion multiplications and additions per second, yet dissipates only 1 W when clocked at 36 MHz. This, coupled with the low I/O requirements of the architecture used, allows it to be housed in an inexpensive 84-pin PLCC plastic package. The new chip has resulted in a considerable reduction in the cost, size, and power consumption of existing phase correlation hardware, in which the real-time computation of 2-D Fourier transforms represents the main computational bottleneck. The work undertaken demonstrates how combining higher level systems requirements with algorithm-to-architecture methods and advanced DSP chip design can produce efficient silicon solutions for very high quality video processing systems.
VI. APPENDIX
In calculating a 64-point DFT using the decimation-in-time-and-frequency algorithm, the time and frequency sequences of the transform, and hence the transform matrix, are first permuted into a base-4, digit-reversed order using (5). From (6), shown at the bottom of the page, the permuted 64-by-64 matrix, which is too large to be shown explicitly here, can be partitioned into four 16-by-16 matrices as illustrated in (7).

Again, since the rows within each of the individual submatrices are the same, the multiplications of these submatrices with the appropriate four elements from the intermediate vectors in (7)-(10) can effectively be calculated using (11). Substituting (11) into (7)-(10), the 64-point Fourier transform is hence given as (12), shown at the top of the page.
