Real-time video compression is a challenging subject for FPGA implementation because it typically has high computational complexity and requires high data throughput. Previous implementations have used parallel banks of FPGAs or DSPs [1, 2, 3] to meet these requirements. Using design techniques that maximize FPGA utilization, we have implemented two video compression systems, each of which uses a single FPGA. In the first system, algorithmic optimizations are made to create a low-complexity implementation that exploits the in-system programmability of the FPGA. This low-complexity implementation performs well, but is limited to a single compression algorithm. In the second system, the FPGA is augmented with an external, low-complexity video signal processor (VSP) [4, 5, 6]. This combination of ASIC and FPGA is flexible enough to implement four common compression algorithms, and powerful enough to execute them in real time.
Introduction
Video compression is used in many applications to reduce the amount of information required to represent a sequence of images. Applications range from high-definition television, with a compressed data rate of several Mbits/sec, to low-power wireless video transmission at several tens of kbits/sec. Many compression algorithms are available, and the choice depends on application requirements such as the amount of compression required, the required resiliency to errors, and the type of video source. This wide range of compression techniques makes programmable implementations especially attractive.
Video processing typically involves high data throughput and high computational complexity. For example, the discrete cosine transform (DCT) [7] is the basis of many compression systems. An efficient algorithm for computing the DCT on 8 x 8 blocks of a 15 frames/sec video sequence with 256 x 256 byte frames requires a 24 MHz multiplier and a 55 MHz adder. Since the DCT is usually followed by other algorithms such as run-length encoding (RLE) and Huffman coding, the DCT may require multiplications faster than 50 MHz. These data rates are beyond those achievable by most DSP chips, and are challenging for FPGAs.
Parallel banks of FPGAs and DSPs have been used to prototype some video processing routines successfully [1, 2, 3] , but implementation on a single FPGA might lead to a more cost-effective system. In this paper, we describe two implementations utilizing a single FPGA. In the first system, the video compression algorithm is designed for a low-complexity implementation using a single in-system reprogrammable FPGA. Optimizing the algorithm to fit the system results in an efficient implementation, but the system is limited to the single algorithm. In a second implementation, the FPGA is augmented with an external processor. While the first system demonstrates techniques to reduce the algorithm complexity, the second system describes techniques to increase the computational power of the system.
FPGA Implementation Using Low-Complexity Video Algorithms
We have identified low-complexity video compression algorithms based upon wavelet transforms for use in low-power wireless communications [8]. A search of wavelet transform filters has identified filters that have integer coefficients and can be implemented with shift-and-add operations rather than multiplications [9]. An adaptive scalar quantization algorithm has also been developed for implementation without multipliers. In some of the steps of the video coding algorithm, we choose approaches that are slightly suboptimal for compression performance, but which allow a large (at least a factor of two) reduction in complexity over a theoretically optimal implementation. For example, the requirement of integer coefficient filters for the wavelet transform and the use of scalar quantization instead of more efficient (and more complex) vector quantization result in lower image quality at a given bit rate. Comparisons with other compression algorithms [8] show the low-complexity system performing within 1 dB PSNR of high-complexity systems. A block diagram of the low-complexity algorithm we implemented is shown in figure 1.
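The shift-and-add idea can be illustrated with a minimal sketch. The filter coefficients below are hypothetical and chosen only for illustration; they are not the filters identified in [9]. The function names and the power-of-two normalization are likewise assumptions. Each constant multiplication is decomposed into shifts and additions, which is the property that removes the need for a hardware multiplier.

```python
def shift_add_multiply(x, coeff):
    """Multiply integer x by integer coeff using only shifts, adds, and negation."""
    negative = coeff < 0
    c = -coeff if negative else coeff
    acc = 0
    shift = 0
    while c:
        if c & 1:              # this bit of the coefficient contributes x * 2**shift
            acc += x << shift
        c >>= 1
        shift += 1
    return -acc if negative else acc


def fir_shift_add(samples, coeffs, norm_shift):
    """Integer FIR filter; each output is scaled down by 2**norm_shift with a shift."""
    taps = len(coeffs)
    out = []
    for i in range(len(samples) - taps + 1):
        acc = 0
        for k, c in enumerate(coeffs):
            acc += shift_add_multiply(samples[i + k], c)
        out.append(acc >> norm_shift)
    return out


# Hypothetical integer-coefficient filter pair (illustrative only, not from [9]).
LOW_PASS = (1, 2, 1)      # normalized by 1/4, i.e. norm_shift = 2
HIGH_PASS = (-1, 2, -1)   # normalized by 1/4, i.e. norm_shift = 2

if __name__ == "__main__":
    row = [10, 12, 14, 40, 42, 44]
    print(fir_shift_add(row, LOW_PASS, 2))
    print(fir_shift_add(row, HIGH_PASS, 2))
```

In hardware, the loop over coefficient bits would be unrolled into a fixed wiring of shifted inputs feeding an adder tree, so small coefficients with few set bits translate directly into fewer adders.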
Commonality and Complexity
The complete compression algorithm requires 15000 gates if placed on a single FPGA. However, because the compression routines run sequentially, it is possible to reprogram a smaller FPGA in real time with each individual routine in turn. Assuming the smaller FPGA can be reprogrammed in 1ms [10] Commonality is exploited wherever possible to reduce the amount of hardware. For example, the high- and low-pass filters are merged into a single, combined filter. The wavelet transform, adaptive quantization, and arithmetic coding designs each require addressing logic to read and write data by subbands. An FPGA that allows partial reconfigurability could exploit this commonality by only configuring the portions of the FPGA that differ.
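A rough way to reason about sequential reprogramming is as a per-frame time budget. The sketch below takes the 1 ms reprogramming time assumed above and treats the frame rate and the number of routines loaded per frame as parameters. The choice of three routines per frame (wavelet transform, quantization, coding) and the function name are assumptions made for illustration, not figures from the design.

```python
def reconfig_overhead(fps, routines_per_frame, reconfig_ms=1.0):
    """Estimate how much of each frame period is spent reprogramming the FPGA.

    Assumes every routine is loaded once per frame and that each load
    takes reconfig_ms milliseconds (1 ms is the figure assumed in the text).
    """
    frame_period_ms = 1000.0 / fps
    reconfig_total_ms = routines_per_frame * reconfig_ms
    return {
        "frame_period_ms": frame_period_ms,
        "reconfig_total_ms": reconfig_total_ms,
        "compute_budget_ms": frame_period_ms - reconfig_total_ms,
        "overhead_fraction": reconfig_total_ms / frame_period_ms,
    }


if __name__ == "__main__":
    # Hypothetical operating point: 20 frames/sec, three routines per frame.
    for key, value in reconfig_overhead(fps=20, routines_per_frame=3).items():
        print(key, value)
```

Under these assumptions the reprogramming time is a small fraction of the frame period, which is what makes time-sharing a smaller device plausible. Partial reconfiguration would reduce the per-routine cost further by reloading only the logic that differs.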
The techniques to reduce algorithm complexity can (and will) be used to develop low-complexity ASICs. By comparing implementations on ASICs and reconfigurable FPGAs, we hope to identify areas in which FPGAs can and cannot compete with ASICs.
Architecture and Results
The current design contains the hardware required for the wavelet transform, a simplified quantizer, and a run-length encoder. This design runs at 20 frames per second with a frame size of 256 x 256 x 8 bits and consumes less than 1 watt of power. It has been used successfully as a prototype within a UCLA wireless computing demonstrator. Figure 2 shows an original image, and the same image after being compressed at a ratio of 15:1 and decompressed with the single FPGA video compression system. Currently, reconfiguration is only used to switch between the compression and decompression circuits. Each of these designs fits on a Xilinx XC4008 using only automatic place and route utilities. By the end of 1995, we hope to have the adaptive quantizer and arithmetic coder modules working as well. We are also working to implement a version on the National Semiconductor CLAy FPGA [10] architecture, which supports partial reconfiguration.
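The data rates implied by these figures can be worked out directly. The sketch below computes the uncompressed bit rate for 256 x 256 x 8 bit frames at 20 frames per second and the rate after 15:1 compression. Applying the 15:1 ratio quoted for the still image in figure 2 to the whole video stream is an assumption made here for illustration, and the function names are hypothetical.

```python
def raw_bit_rate(width, height, bits_per_pixel, fps):
    """Uncompressed video bit rate in bits per second."""
    return width * height * bits_per_pixel * fps


def compressed_bit_rate(raw_bps, ratio):
    """Bit rate after compression by the given ratio (e.g. 15 for 15:1)."""
    return raw_bps / ratio


if __name__ == "__main__":
    raw = raw_bit_rate(256, 256, 8, 20)
    print(raw)                           # 10485760 bits/sec uncompressed
    print(compressed_bit_rate(raw, 15))  # roughly 699 kbits/sec (assumed 15:1 throughout)
```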
FPGA With External Processor (VSP)
The single FPGA implementation of a low-complexity algorithm works, but it is not a general solution to the implementation problems of complex video compression algorithms. To overcome the high computational complexity and high data throughput of video compression algorithms, we have identified the hardware that is common to them and moved it to an external ASIC called the video signal processor (VSP) [4, 5, 6]. The VSP consists of a parallel multiply and accumulate data path and small local memories. The combination of a reprogrammable FPGA with the VSP ASIC has both the flexibility to implement four common video compression algorithms and the computational power to execute them in real time.
Algorithms
The system is designed to implement the discrete cosine transform (DCT), two-dimensional filtering (2DFIR), vector quantization (VQ), and the wavelet transform. The computations for all four algorithms can be derived from the general equation:
For example, the DCT can be written in matrix form as:
Where A is the N x N basis matrix defined as: n = (0...N-1), m = (0...N-1), a(n) = sqrt(1/2) for n = 0 and 1 otherwise.
To implement the DCT in the form of equation 1, X represents a matrix row element, Y represents a matrix column element, and W is set to zero. Similarly, the algorithms for 2DFIR, VQ, and the wavelet transform can be performed in a manner consistent with equation 1. The VSP is designed to perform the parallel multiply and accumulate operation that is central to equation 1 and common to all four algorithms. The tasks of updating coefficients, organizing the data, and performing any additional processing are done by the FPGA.
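The row-by-column structure described above can be sketched in software. The example assumes the standard orthonormal DCT-II basis, which is consistent with the stated a(n) = sqrt(1/2) for n = 0 and 1 otherwise, and assumes the two-dimensional transform is computed as A X A^T. Neither the exact basis formula nor the matrix form is reproduced in the text here, so both are assumptions. The W term is omitted because it is zero for the DCT. All function names are hypothetical.

```python
import math


def dct_basis(n_points):
    """Orthonormal DCT-II basis matrix (assumed form), a(0) = sqrt(1/2), a(n) = 1 otherwise."""
    rows = []
    for n in range(n_points):
        a_n = math.sqrt(0.5) if n == 0 else 1.0
        rows.append([
            math.sqrt(2.0 / n_points) * a_n
            * math.cos((2 * m + 1) * n * math.pi / (2 * n_points))
            for m in range(n_points)
        ])
    return rows


def mac(xs, ys):
    """Multiply-accumulate of one matrix row (X) against one matrix column (Y)."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def matmul(p, q):
    """Matrix product built only from row-by-column mac() calls."""
    q_cols = list(zip(*q))
    return [[mac(row, col) for col in q_cols] for row in p]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2d(block):
    """Two-dimensional DCT of a square block, computed as A * block * A^T."""
    a = dct_basis(len(block))
    return matmul(matmul(a, block), transpose(a))


if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    z = dct2d(flat)
    print(round(z[0][0], 6))  # all energy of a constant block lands in the DC term
```

In this sketch every arithmetic operation of the transform goes through mac(), which mirrors the division of labour described in the text: the multiply-accumulate is the fixed, shared kernel, while the choice of which row and column to feed it is a matter of data organization.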
Architecture and Results
The division of operations between the FPGA and VSP is shown in figure 3. Operations that require flexibility fit naturally onto the reconfigurable FPGA. Placing the operations of addressing and pixel processing on the FPGA also allows for easy adaptation to different frame sizes and pixel formats (e.g. 2's complement and grey-scale). Since the algorithm state machine is on the FPGA, it is also easy to make algorithm modifications (e.g. change the depth of the VQ search, or the size of the filter mask). The parallel multiply, accumulate, and memory operations performed by the VSP are common to all algorithms, and are efficiently implemented as an ASIC.
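As an illustration of why keeping the addressing logic reconfigurable helps with different frame sizes, the sketch below generates pixel addresses block by block for an arbitrary frame size. It assumes row-major frame memory and frame dimensions that are multiples of the block size. The function name and memory layout are hypothetical and are not taken from the actual FPGA design.

```python
def block_addresses(frame_width, frame_height, block_size=8):
    """Yield, for each block in raster order, the memory addresses of its pixels.

    Assumes row-major storage (address = row * frame_width + column) and
    frame dimensions that are exact multiples of block_size.
    """
    for by in range(0, frame_height, block_size):
        for bx in range(0, frame_width, block_size):
            yield [
                (by + dy) * frame_width + (bx + dx)
                for dy in range(block_size)
                for dx in range(block_size)
            ]


if __name__ == "__main__":
    # Changing the frame size only changes the parameters, not the structure.
    blocks = list(block_addresses(256, 256))
    print(len(blocks))  # 1024 blocks of 8 x 8 pixels
```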
Implementing the operations of both the VSP and the FPGA in a single FPGA would have required approximately 20k gates. Alternatively, there are comparable commercial video signal processors that use no FPGA but require more than 900k transistors [11, 12]. The combination of the 80k transistor VSP and a 3000 gate FPGA seems to be the most efficient method to implement a processor that requires both flexible and static operations. The VSP and FPGA are the basis for a prototyping system operating on a Sun workstation, shown in figure 4. The FPGA used is a Xilinx XC4008 [13] (8000 gate equivalent). A 3000 gate FPGA could have been used, but there is a problem with the minimizer on the VSP used for VQ. Consequently, the VQ minimizer and VQ offset table (6k bits of ROM) are placed on the FPGA until the problem can be fixed. Some statistics for the working system are shown in figure 5. For this system, each FPGA design is entered into ViewLogic's ViewDraw, where it can be simulated with a VHDL description of the VSP. Ideally, we would like to synthesize FPGAs for the system in a more automated fashion, but the problem of creating one chip that uses another chip to perform an operation is challenging. (More precisely, the existing VSP hardware places heavy constraints on the methods the FPGA may use to implement a given algorithm. At this time, we do not know how to automatically map a given algorithm onto the FPGA-VSP system.)
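The roles of a VQ minimizer and an offset table can be illustrated with one common formulation of codebook search. Minimizing the squared distance between an input vector x and a codeword c is equivalent to minimizing (c.c)/2 - x.c, so the search reduces to a multiply-accumulate per codeword plus a precomputed per-codeword offset, followed by a running minimum. Whether the VSP and its offset table use exactly this formulation is an assumption; the sketch and its function names are hypothetical.

```python
def mac(xs, ys):
    """Multiply-accumulate of two equal-length vectors."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def build_offset_table(codebook):
    """Precompute (c . c) / 2 for every codeword (the assumed role of the offset table)."""
    return [0.5 * mac(c, c) for c in codebook]


def vq_search(vector, codebook, offsets):
    """Return the index of the nearest codeword in the squared-error sense.

    Uses score = offset - (vector . codeword); the smallest score wins,
    which is the job of the minimizer.
    """
    best_index, best_score = 0, None
    for i, (codeword, offset) in enumerate(zip(codebook, offsets)):
        score = offset - mac(vector, codeword)
        if best_score is None or score < best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    # Tiny hypothetical codebook of 2-element vectors.
    codebook = [[0, 0], [10, 10], [0, 10]]
    offsets = build_offset_table(codebook)
    print(vq_search([9, 8], codebook, offsets))  # 1
```

Limiting the loop to a subset of the codebook is one way to change the depth of the search, the kind of modification the text notes is easy to make when the state machine lives on the FPGA.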
Clock Division
When designing the VSP, we were attempting to move the common hardware (multipliers, memory) to an ASIC. In retrospect, it is easy to see that some low-level operations (clock division, DRAM refresh) are also common to all algorithms. Placing them on the ASIC as well would have resulted in a simpler and more user-friendly system.
Perhaps the next step for the FPGA-VSP processor would be the creation of a single, general-purpose FPGA-DSP chip. The work on the VSP project suggests that other DSP applications might benefit from the combination of user-programmable logic and dedicated memory and multiplier elements. Of course, memories and multipliers can be synthesized on FPGAs, but dedicated multipliers and SRAMs would consume less silicon area. (As a rough estimate of 1 figure 6 . The FPGA-VSP system is certainly not the first semi-programmable [14] architecture, but it is further testimony to the potential for processors to be designed in multiple technologies.
Conclusions
We have demonstrated two video compression systems that make efficient use of FPGAs. In the first system, the algorithm is low-complexity and uses in-system reprogrammability to fit on a single FPGA. In the second system, operations that do not need in-system reprogrammability are moved to an ASIC where they will not consume valuable FPGA space. The FPGA-VSP system demonstrates how implementations that combine ASIC and FPGA technology can be more efficient than either technology alone. This suggests that future processors can achieve high efficiency by selectively applying reconfigurable and dedicated hardware to the tasks for which they are best suited.
