I. INTRODUCTION
DIGITAL representation of signals has created the need for efficient coding or data compression techniques that reduce the storage and channel bandwidth associated with such signals. The efficiency of a data compression algorithm is measured by its data compression ability, the resulting distortion, and the implementation complexity, in particular that of hardware implementations.
The implementation complexity of the coder becomes a major factor when coding data at low bit rates with an acceptable level of distortion. Careful consideration of the coding algorithms becomes essential when one is faced with real-time implementation of compression techniques for large-bandwidth signals such as video signals. Vector quantization (VQ) has emerged in recent years as a viable approach for coding speech and image data [1], [2]. Various review articles have been written on the application of VQ to speech and image coding [1]-[4]. Low bit rate coding implies bit rates of the order of 1 bit/sample and 0.5 bits/pixel for speech and image data, respectively. Methods other than VQ, such as sub-band coding and transform coding, are capable of producing acceptable rates but have much higher complexity [2]. Though the VQ algorithm is equally applicable to speech and image coding, we concentrate on VQ, and in particular multi-stage VQ, and its implementation for low bit rate real-time image coding. Low bit rate image coding finds application in image transmission, such as broadcast television, remote sensing via satellite, aircraft, radar, sonar, teleconferencing, computer communications, and facsimile transmission, and in image storage, such as educational and business documents and medical images for patient monitoring systems.

Manuscript received November 4, 1988; revised February 17, 1989. This work was supported in part by NASA Lewis Research Center, Cleveland, OH, under Grant NAG-582.
The authors are with the Signal Processing and Communications Group, Department of Electrical and Computer Engineering, University of Cincinnati, Cincinnati, OH 45221.
IEEE Log Number 8929990.
The VQ encoder algorithm for real-time image transmission and storage applications can be implemented with present-day VLSI technology using high-throughput systolic architectures [5]. Most implementations of VQ systems, which are reviewed in detail in Section IV, are for real-time speech coding. In general, they have a throughput rate that is inversely proportional to the codebook size [7]-[15]. This results in a substantial decrease in throughput rate for a large codebook. Moreover, the sampling frequency for image data is approximately three orders of magnitude higher than for real-time speech coding. Real-time image encoding systems therefore require an improvement in throughput rate of a similar order of magnitude.
We have reported the mapping of the VQ algorithm for real-time image encoding of TV quality images into a […]

The remainder of the paper is organized as follows. The process of VQ for image encoding, the distortion measures used, and the conversion of mean-squared error to an inner product for simplified VQ implementations are described in Section II. Multi-stage vector quantization and its advantages over single-stage VQ are presented in Section III. We have also presented results of computer simulations with TV images. Previous implementations of VQ encoders for speech and image signals are described in Section IV. Section V describes the issues involved in real-time encoding of TV images using VQ, the mapping of the algorithm to a systolic architecture, the functions of the cells required, the implementation of the cells for VQ, and the operation of the VQ system using timing diagrams. The salient features of the implementation and directions for future work are given in the conclusions in Section VI.

0098-4094/89/1000-1281$01.00 © 1989 IEEE
II. VECTOR QUANTIZATION (VQ)
VQ refers to a family of source coding methods that quantize the signal source in blocks or vectors. Various design techniques for VQ have emerged in the last decade. VQ can be memoryless or with memory, as in vector predictive quantizers and finite state vector quantizers. Several of these vector quantizers have been designed and their performance has been studied during the last few years.
VQ involves decomposing the input signal into vectors or blocks and quantizing each vector to the nearest neighbor vector in a pre-designed optimal codebook. The nearest neighbor vector for a given input vector is the codebook vector, or codevector, that best matches the input vector according to a given fidelity measure. Thereafter, the codevector's index or address in the codebook is used to identify the input vector. The process of codebook formation and the coding and decoding of a particular vector using VQ are illustrated in Fig. 1.

[…] It can be observed that the first term of equation (3) is a constant, independent of the codevectors, and hence will not have any effect on the selection of the codevector that gives the minimum distortion. Thus we can modify (3) to obtain an inner product representation for MSE as […]

[…] (Fig. 2), the original image is encoded by a first-stage VQ, and the difference between the original and the reconstructed image is then encoded again by a second-stage VQ. The process is repeated until the desired picture quality is obtained. The decoder consists of look-up tables (LUT) and adders to reconstruct the signal. Real-time implementation of MSVQ requires much less hardware than single-stage VQ because smaller codebooks can be used at each stage.
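The multi-stage scheme described above can be sketched in a few lines. This is a minimal illustration rather than the authors' simulation code: the function names and the two tiny codebooks are hypothetical, and a plain squared-error full search is assumed at every stage.

```python
# Minimal multi-stage VQ (MSVQ) sketch. The codebooks below are tiny,
# hand-picked illustrations, not codebooks from the paper.

def nearest_index(vector, codebook):
    """Return the index of the codevector with minimum squared error."""
    best_index, best_dist = 0, float("inf")
    for index, codevector in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(vector, codevector))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def msvq_encode(vector, stage_codebooks):
    """Quantize the input, then quantize the residual left by each stage."""
    residual = list(vector)
    indices = []
    for codebook in stage_codebooks:
        index = nearest_index(residual, codebook)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices

def msvq_decode(indices, stage_codebooks):
    """Reconstruct with table look-ups and additions only."""
    output = [0] * len(stage_codebooks[0][0])
    for index, codebook in zip(indices, stage_codebooks):
        output = [o + c for o, c in zip(output, codebook[index])]
    return output

stage_codebooks = [
    [[0, 0], [100, 100], [200, 200]],            # stage 1: coarse approximation
    [[-10, -10], [0, 0], [10, 10], [10, -10]],   # stage 2: residual correction
]
indices = msvq_encode([112, 88], stage_codebooks)
reconstruction = msvq_decode(indices, stage_codebooks)
```

Each stage searches only its own small codebook, which is the property the text relies on when it argues that MSVQ needs less hardware than a single large codebook.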
We include some simulation results to show that good quality composite color image encoding can be achieved using MSVQ. The signal chosen is PCM samples of com[…] In general, the codebook generation process is terminated when the decrease in the MSE becomes very small. Depending on the complexity of the picture, 2 to 5 stages are used in MSVQ, resulting in bit rates of 0.4 to 0.5 bpp. The decrease in MSE/pixel with the increase in the number of stages is illustrated in Fig. 3. Also, even with increased codevector size, MSVQ does not show the blocking effects evident in other block coding methods such as the discrete cosine transform. It should be noted that a bit rate of 0.5 bpp has not been achieved so far for encoding TV quality composite color images. MSVQ is also applicable in situations like the low bit rate picture phone, where we can use a few stages and update the bits corresponding to certain stages at a low rate. Examples of the images used for MSVQ and the reconstructed images are shown in Figs. 4 and 5.²

The results obtained so far indicate that MSVQ has excellent potential for real-time encoding of TV quality images at very low bit rates. Besides employing smaller codebook sizes at each stage, MSVQ also allows us to encode larger vectors,³ resulting in very low bit rates without significantly increasing the overall codebook size.

¹ bit rate = (log₂ 128 + log₂ 64)/(16 pixels) = 0.8125 bits/pixel.
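The bit-rate arithmetic for MSVQ is simple enough to check directly: each stage contributes log₂ of its codebook size, and the total is divided by the number of pixels per vector. The sketch below reproduces the 0.8125 bits/pixel figure; the helper name `msvq_bit_rate` and the second, four-stage configuration are illustrative assumptions, not values taken from the paper.

```python
import math

def msvq_bit_rate(codebook_sizes, pixels_per_vector):
    """Bits per pixel when each stage sends one index per vector."""
    index_bits = sum(math.log2(size) for size in codebook_sizes)
    return index_bits / pixels_per_vector

# Two stages with 128 and 64 codevectors on 16-pixel vectors.
rate_two_stage = msvq_bit_rate([128, 64], 16)

# Hypothetical configuration: four 256-entry stages on 64-pixel (8 x 8) vectors.
rate_large_vector = msvq_bit_rate([256, 256, 256, 256], 64)
```

The second call shows why larger vectors matter: quadrupling the vector size lets four stages fit within 0.5 bits/pixel.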
IV. REVIEW OF PREVIOUS IMPLEMENTATIONS
In this section, we describe various hardware realizations of VQ encoding systems reported for speech and image coding applications. The early implementations were designed with "off-the-shelf" components. Pipelined architectures using VLSI technology have emerged in recent years. Wherever possible, the codebook sizes, the bit rate, and the throughput rate are given for the implementations considered.

A hardware realization of a real-time full-search vector quantizer for speech waveform coding was first reported in [7]. The total system is implemented with "off-the-shelf" LSTTL, CMOS, and nMOS components and is interfaced with a microprocessor. The encoding system is pipelined and involves preprocessing of the analog speech waveform and generation of the index of the best matched codevector using VQ. The single codebook implementation uses 8-bit codebooks with an encoding rate of either 1 bit/sample at dimension 8 or 2 bits/sample at dimension 4. Calculation of the distortion measure (MSE) is carried out in 500 ns and a full pass of the codebook is done in 1.024 ms. A design of the MSVQ system with LSTTL devices for speech waveform coding using two stages is reported in [13]. The encoding delay is 4 ms for vectors of dimensions 16 and 8 with compression of 1 and 2 bits/sample, respectively.

² Figs. 4 and 5 are shown in monochrome due to a time restriction in printing color.

³ The CPU time needed definitely increases as the size of the vectors is increased. However, it is a one-time process, performed off-line, and the steps outlined above can be employed to reduce CPU time.
An LSI architecture for VQ using two different types of modules is reported in [8]. A full-search or tree-search encoding system can be implemented with these modules, namely, the distortion processing module (DPM) and the array processing controller (APC). A VQ system with vector dimension m requires m DPM's and one APC. The DPM is capable of computing MSE distortion measures at a rate of 10 MHz. Computer simulations and a prototype implementation using "off-the-shelf" components have been carried out for an image coding application. A rate of 30,720 vectors/s with compression of 0.5 bpp is achieved with a picture resolution of 256 × 256 and 7.5 frames/s.
The first VLSI chip for real-time implementation of the VQ algorithm for speech coding is reported in [9], [10]. The heart of the system is an nMOS VLSI pattern matching chip (PMC). Various functions are pipelined in the PMC, and it computes the best matching index of the codevector sequentially. The throughput rate of the implementation for an exhaustive search is inversely proportional to the product of the codebook size and the codevector dimension. The time required for computing the squaring and comparison operations is 0.33 μs/sample. Various applications such as vector pulse coded modulation (VPCM), adaptive vector predictive coding (AVPC), and rapid codebook design (RCDP) have been cited for the PMC. A second generation of VQ processors for real-time speech coding has been implemented using bit-level systolic arrays [32] in 2-μm nMOS technology [11]. A VQ system with m dimensions can be built with m inner product processors, which are bit-level systolic arrays having 234 full-adders each, and a bit-serial comparator processor. The encoding delay is calculated to be around 100N ns, where N is the codebook size. Recently, the basic linear systolic architecture implementation described in [11] has been extended to two-dimensional architectures [12] using the tradeoff between hardware and throughput rate of the system. A systolic architecture for pattern clustering using the MSE distortion measure is also described in [14].
A bit-serial VLSI vector quantizer has been designed to test a new methodology of structured tiling [15]. A mean residual reflected vector quantizer can be implemented with a set of five chips: a front-end chip performing mean extraction and vector orientation, two codebook search chips for tree search, and two ROM chips for codebooks. The codebook search chip contains a bit-serial sequential pipeline of an adder, a squarer, an adder, and a comparator for a single pass with a codevector. Most of the area of the codebook search chip is occupied by squarer circuits.
V. IMPLEMENTATION OF VQ ENCODER
The computational complexity of VQ is of the order of O(mN) on a SISD machine, as mN distortion measures have to be computed for a codebook size of N and a vector size of m. Real-time VQ encoding requires a much higher and constant throughput rate that is not achievable by a SISD machine or with general-purpose digital signal processor (DSP) chips. This has inhibited hardware realizations of the VQ algorithm for some time. However, as the VQ algorithm involves a repetitive process of subtraction, squaring, and addition for each codevector and data vector, it can be mapped onto a systolic architecture that can be implemented with considerable ease using VLSI technology.
A direct mapping of the VQ encoder system onto a word-level systolic architecture is illustrated in Fig. 6 [16]. Referring to Fig. 6, distortion measures are computed by the two-dimensional systolic array, which generates a distortion measure every clock cycle. The upper two-dimensional array takes elements of the input vectors, $x_i$, and partially computed distortion measures (or null) as inputs, and generates elements of the delayed input vector and distortion measures $D_j$ as outputs. The index or address of the best matched codevector is generated by the lower linear systolic array at the rate of one codevector index per cycle. It takes the distortion measures $D_j$ from the upper array, the maximum distortion measure that can be represented, $D_{\max}$, and a corresponding dummy index, $k_{\max}$, as inputs, and generates the minimum distortion measure obtained in the codebook search, $D_{\min}$, and its corresponding index, $k_{\min}$, as outputs. The architecture is modular and can be cascaded in both horizontal and vertical directions to accommodate larger codebooks and codevectors.
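The data flow of the word-level array can be emulated sequentially. The sketch below is a functional model only, not a cycle-accurate one: `cell_a` and `cell_b` are hypothetical names for the two cell types, and the values chosen for the maximum distortion and the dummy index are assumptions made for illustration.

```python
D_MAX = (1 << 23) - 1   # assumed largest representable distortion
K_DUMMY = -1            # assumed dummy index that accompanies D_MAX

def cell_a(partial, x, c):
    """Upper-array cell: add one squared difference to the partial distortion."""
    return partial + (x - c) ** 2

def cell_b(d_min, k_min, d_j, j):
    """Lower-array cell: keep the smaller distortion and its index."""
    if d_j < d_min:
        return d_j, j
    return d_min, k_min

def encode(vector, codebook):
    """Emulate one codebook search: columns are codevectors, rows are elements."""
    d_min, k_min = D_MAX, K_DUMMY
    for j, codevector in enumerate(codebook):
        partial = 0  # the "null" input at the top of each column
        for x, c in zip(vector, codevector):
            partial = cell_a(partial, x, c)
        d_min, k_min = cell_b(d_min, k_min, partial, j)
    return k_min, d_min

best_index, best_distortion = encode([3, 4], [[0, 0], [3, 5], [10, 10]])
```

In hardware all columns operate concurrently; the nested loops here only reproduce the values that flow between cells, not their timing.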
The processing element (PE) or cell A of the two-dimensional array is illustrated in Fig. 7(a). An element of the codevector, $c_{ij}$, is pre-loaded in each cell A. Each cell A receives a partially computed MSE distortion from the upper cell and one sample of the input, […]

The time available to the VQ encoder for real-time image coding depends on the sampling frequency of the signal. Some of the implementation considerations for coding TV signals using VQ for various codebook sizes and codevector dimensions are listed in Table I. With a sampling frequency of 13.5 MHz, the time available for encoding an input vector of dimension 64 (8 × 8) is 4.736 μs, and it increases to 18.944 μs with a vector of dimension 256 (16 × 16). However, the number of cells needed increases as the vector dimension is increased. It can be noted that the time available for VQ encoding increases with the codevector size but is independent of the codebook size. It is therefore preferable that the architecture support a throughput rate that is independent of the codebook size.
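The time budget quoted above follows from the vector dimension and the sample period. The sketch below assumes a sample period of 74 ns, that is, 1/13.5 MHz (about 74.07 ns) rounded down; this rounding is an inference from the quoted figures, not something the text states.

```python
SAMPLE_PERIOD_NS = 74  # assumption: 1 / 13.5 MHz, rounded down to whole ns

def time_available_ns(vector_dimension, sample_period_ns=SAMPLE_PERIOD_NS):
    """Time in which one input vector arrives, and so must be encoded."""
    return vector_dimension * sample_period_ns

t_8x8 = time_available_ns(64)      # 8 x 8 vector
t_16x16 = time_available_ns(256)   # 16 x 16 vector
```

The codebook size does not appear anywhere in this calculation, which is the point made in the text.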
The two-dimensional systolic architecture consisting of A cells cascaded horizontally as well as vertically is subsequently referred to as the inner product processor (IPP), and the linear systolic array consisting of B cells cascaded horizontally is referred to as the comparator and address generator (CAG) processor. The functions of cells A and B are also modified to obtain much simpler implementations. As shown earlier in (4), cell A needs to compute only an inner product instead of subtraction, squaring, and addition operations. The issues involved in choosing particular IPP and CAG implementations are described in the subsequent sections.
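The claim that an inner product suffices can be checked numerically. The sketch below assumes the standard expansion ‖x − c‖² = ‖x‖² − 2⟨x, c⟩ + ‖c‖², in which the first term is the same for every codevector; the score used here, ‖c‖² − 2⟨x, c⟩, is one possible inner product form and is not necessarily the exact form of (4). The function names are hypothetical.

```python
def mse_search(x, codebook):
    """Full search using subtraction, squaring, and addition."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    return dists.index(min(dists))

def inner_product_search(x, codebook):
    """Full search using only an inner product per codevector.

    The energy term depends on the codevector alone, so it could be
    computed off-line and folded into the stored codebook data.
    """
    scores = []
    for c in codebook:
        energy = sum(ci * ci for ci in c)
        inner = sum(xi * ci for xi, ci in zip(x, c))
        scores.append(energy - 2 * inner)
    return scores.index(min(scores))

codebook = [[0, 0, 0], [10, -5, 3], [-7, 8, 2], [100, 100, 100]]
same_choice = all(
    mse_search(x, codebook) == inner_product_search(x, codebook)
    for x in ([1, 2, 3], [9, -4, 4], [-6, 7, 0], [90, 95, 110])
)
```

Because the two scores differ only by the constant ‖x‖², they always rank the codevectors identically, including on ties.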
A. Inner Product Processor (IPP)
A word-parallel two-dimensional systolic architecture is an ideal choice for IPP operating at a very high throughput rate computing distortion measures every clock cycle. However, parallel architectures have limitations such as the following.
Only a small number of PE's can be integrated on a chip. They require a large number of input/output pins. They require a large amount of interconnection area.
Considering the above limitations, a parallel systolic architecture is not an appropriate choice for our purpose. The second alternative, a bit-level systolic architecture, can also compute distortion measures every clock cycle, but the number of input/output pins and the hardware complexity of a single processing element or cell remain comparable to those of the word-parallel architecture. A third alternative is to use a two-dimensional (i.e., for the input as well as the distortion measure) bit-serial architecture. In MSVQ for image coding, a codebook with a large codevector size (m ≥ 64) is preferable since it gives a good compression ratio. The large vector size also provides ample time for encoding. For example, a codevector of size 64 (8 × 8) gives us 4.736 μs (at 13.5-MHz sampling), which is sufficient to encode one vector using a two-dimensional bit-serial systolic architecture having a number of PE's. With a bit-serial architecture, the hardware complexity of the PE is very low, with few input/output pins and a small interconnection area. It is also possible to pipeline down to the lowest level with bit-serial architectures. As only a few bit operations are involved per clock cycle, the circuit can operate at a much higher clock rate.
A block diagram of four PE's operating in a bit-serial architecture is illustrated, with inputs and outputs at various time instants, in Fig. 9(a). For simplicity, a codebook having only two two-dimensional vectors is illustrated. A detailed schematic of a 4-bit PE is given in Fig. 9(b) to illustrate the operation of each PE. The input and codevector elements are represented in 9-bit two's complement form in the actual implementation, and 24 bits are used for the distortion measure.⁴ The elements of each codevector (or scaled codevectors) are pre-stored inside each PE before the operation of VQ. The input vector, $x_i$, enters the front end of the IPP bit-serially with the least significant bit first. An rth pulse, which appears every 24 clock cycles, clears the latches used for pipelining. Each PE computes the inner product bit-serially and passes it to the adjacent vertical PE. Also, the delayed input bit is passed on to the adjacent horizontal PE in a systolic rhythm. Input bits at the front end are delayed in a triangular fashion to offset the data delay from one PE to the next vertical PE. An IPP chip consisting of 32 PE's with a provision for horizontal and vertical cascadability has been designed. This will enable us to use the chips for VQ implementations of arbitrary input vector length and codebook size.⁵
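The arithmetic performed by a bit-serial PE can be modelled at the integer level. The sketch below is not a gate-level description of the PE in Fig. 9(b): it only shows that consuming a 9-bit two's complement input least significant bit first, adding a shifted copy of the codevector element for every 1 bit and subtracting it at the sign bit, yields the ordinary product. All function names are hypothetical.

```python
X_BITS = 9  # input word length stated in the text

def lsb_first_bits(value, width=X_BITS):
    """Two's complement bits of `value`, least significant bit first."""
    unsigned = value & ((1 << width) - 1)
    return [(unsigned >> i) & 1 for i in range(width)]

def bit_serial_multiply(x, c, width=X_BITS):
    """Accumulate x * c one input bit per step, as a bit-serial PE would."""
    acc = 0
    for position, bit in enumerate(lsb_first_bits(x, width)):
        if not bit:
            continue
        term = c << position
        # The sign bit carries weight -2**(width - 1) in two's complement,
        # so the codevector element is subtracted rather than added there.
        acc += -term if position == width - 1 else term
    return acc

def bit_serial_inner_product(xs, cs, width=X_BITS):
    """Running sum passed down a column of vertically cascaded PEs."""
    total = 0
    for x, c in zip(xs, cs):
        total += bit_serial_multiply(x, c, width)
    return total

example = bit_serial_inner_product([3, -4], [5, 6])
```

The special handling of the last bit corresponds to the role the text assigns to the sign-bit clock.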
Referring to Fig. 9(b), each PE outputs an inner product bit serially after a delay of one clock cycle. This serial mode of operation introduces a delay of 24 clock cycles to output a distortion measure of length 24 bits, without considering the vertical latency due to the PE chain. The clock rate is limited by the three carry-save adder chain (or (5,3) counter) in the last stage of the pipeline. A typical delay of 24 ns is estimated for the chain using a 3-μm scalable CMOS process from MOSIS. The pipeline functions the same way for positive and negative numbers until it encounters the sign bit of the input. At this stage, the sign bit is latched in by a separate clock. If the input is negative, the sign bit is used for inverting the codevector element bits in the PE before forming the inner product bit. The feedback in the first stage is required for sign extension. Each PE requires 24 clock cycles to form the inner product before a new input sample can be processed.

⁴ It is assumed that the original image is digitized at 8 bits per pixel. One more bit is needed to represent the inputs since the error signals can have negative values. A wordlength of 24 bits for the distortion measure is arrived at with the assumption that the maximum size of the codebook would be 256.

⁵ We are in the process of building a proof-of-concept VQ encoder using the chips fabricated.
As the bit and sign-bit clocks are applied externally, the architecture is independent of the bit lengths of the input and codevector elements. Thus the IPP can be operated at lower bit lengths with a higher throughput rate, if required. A maximum input vector length of 256 (16 × 16) can be used with cascaded PE's producing a distortion measure. Higher vector dimensions can be permitted by scaling the codevectors or by truncating the distortion measures coming out at the interface of cascaded IPP chips. The IPP or cascaded IPP's output the accumulated inner product, or distortion measure, at the input port of the CAG.
B. Comparator and Address Generator
The considerations for implementing the architecture of the CAG are different from those for the IPP. The IPP presents a distortion measure per cycle after a latency $t_l$ given by

$$t_l = T_v + T_{24}$$

where $T_v$ is the delay due to performing a running sum on the distortion measures using vertically cascaded PE's of the IPP, and $T_{24}$ is the delay due to accumulating 24 bits of the distortion measure. This necessitates that the comparison of two distortion measures to select the minimum of the two, and the generation of the corresponding codevector index, be completed in a clock cycle. The critical step that decides the speed of the CAG is the comparison of two 24-bit distortion measures in two's complement form. To perform a high-speed comparison, an internal parallel architecture is necessary for the CAG. However, the delay due to the bit-serial operation of the IPP can be used effectively to reduce the hardware complexity of the CAG. This CAG architecture requires only one comparator and one address counter for every 24 clock cycles of operation, whereas the word-level systolic architecture in Fig. 6 requires an individual cell with a comparator and an address counter for each codevector.
The cascadable CAG architecture integrating 24 cells is illustrated in Fig. 10. It has bit-serial inputs for distortion measures from the IPP and parallel inputs for loading a distortion measure and the corresponding index from the previous CAG. A start-up count, corresponding to the index of the first codevector to be searched, is also loaded through the parallel inputs. The outputs, delivered in parallel, are the minimum distortion measure for the search involved and the corresponding codevector index. The first step in the CAG is to assemble the 24 bits of the distortion measure presented by the IPP into a parallel format. This is accomplished by serial-to-parallel converters, or shift registers, at the front end of the CAG. A sync pulse, which is a modified version of the rth pulse from the IPP, signals the accumulation of the last bit of a distortion measure. It also works as a load signal for loading a distortion measure and an index from the previous CAG, and for loading the start-up count value.
The components of the CAG are a 24 × 24 register bit-array, a high-speed 24-bit two's complement comparator, a minimum index register (MIR), and a counter for address generation.
The comparator is implemented with the carry-look-ahead technique. The newly acquired 24-bit distortion measure is divided into 4-bit groups, giving rise to six groups. The last group is of 3 bits only, as the sign bit (msb) is not used in the carry-look-ahead chain. The minimum distortion measure of the previous comparisons, or from the previous CAG, is available in register B. The six groups are used only to generate group carry-generate and propagate bits, as it is not necessary to compute carry and sum bits. Block carry-generate and propagate signals are used to compute the carry to the 24th bit. Based on the sign bits of register A and register B and the computed carry bit, the minimum of the two values is decided. If register A has the minimum value, it is transferred to register B and the counter passes its value to the MIR, overwriting the previous value. The parallel load counter will have the correct index of the codevector, as it keeps counting after being loaded with the start-up count. The index from the MIR is always available at the output of the CAG. It is the index of the best matched codevector whenever a codebook search is completed. The high-speed comparison can be performed in less than 32 ns in 3-μm CMOS. The total delay for address generation is 25 clock cycles after an input vector is presented. In other words, the encoding delay is 25 clock cycles for a codebook search and it is independent of the codebook size.
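The comparison rule described above can be modelled bit by bit. The sketch below is one plausible reading, not the authors' circuit: it forms the carry out of A + NOT(B) + 1 over the 23 magnitude bits using group generate and propagate signals (five 4-bit groups and one 3-bit group), then combines that carry with the two sign bits. The function names, the use of an inclusive-OR propagate, and the initial content of register B are assumptions.

```python
WORD_BITS = 24
MAG_BITS = WORD_BITS - 1   # the sign bit stays out of the carry chain
GROUP_BITS = 4

def carry_into_sign(a_word, b_word):
    """Carry out of A_low + ~B_low + 1 over 23 bits; 1 iff A_low >= B_low."""
    carry = 1  # the +1 that completes the two's complement of B
    for start in range(0, MAG_BITS, GROUP_BITS):
        group_g, group_p = 0, 1
        for i in range(start, min(start + GROUP_BITS, MAG_BITS)):
            a_bit = (a_word >> i) & 1
            nb_bit = 1 - ((b_word >> i) & 1)
            g, p = a_bit & nb_bit, a_bit | nb_bit
            group_g = g | (p & group_g)   # group carry-generate
            group_p &= p                  # group carry-propagate
        carry = group_g | (group_p & carry)
    return carry

def a_is_smaller(a, b):
    """True if A < B for 24-bit two's complement values."""
    mask = (1 << WORD_BITS) - 1
    a_word, b_word = a & mask, b & mask
    sign_a, sign_b = a_word >> MAG_BITS, b_word >> MAG_BITS
    if sign_a != sign_b:
        return sign_a == 1            # the negative value is the smaller one
    return carry_into_sign(a_word, b_word) == 0

def cag_search(distortions, start_count=0):
    """Running minimum with register B, the MIR, and a free-running counter."""
    reg_b, mir = (1 << MAG_BITS) - 1, None   # assumed initial register contents
    counter = start_count
    for reg_a in distortions:
        if a_is_smaller(reg_a, reg_b):
            reg_b, mir = reg_a, counter
        counter += 1
    return mir, reg_b

best = cag_search([50, -7, 3, -7])
```

No sum bits are ever formed, which matches the remark that only the generate and propagate signals are needed.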
C. System Timing
The VQ encoder system timing with the main signals is illustrated in Fig. 11. The system requires two clocks, a bit-clock and a sign-bit clock. The sign-bit clock is the ninth bit-clock, as we have 8-bit positive image pixel data for the first stage of VQ. It can also be defined externally, depending on the dynamic range of the data, without modifying the system. The rth signal, which appears with the first bit of the input vector, clears all the pipelined latches of the PE's except the latches between PE's. The input latches of the PE take data when the bit-clock is high, and the pipelined latches take data when the clock is low. The vertical latency due to propagation of the signals through the PE's, which is equal to $m \cdot T_{clk}$, is assumed to be unity in Fig. 11 for easier illustration. The bits of the first distortion measure, $D_0$, appear after the vertical latency (a delay of one clock cycle in the illustration). After the initial latency, the bit-streams of all distortion measures appear continuously every clock cycle. The sync signal loads the distortion measure into register B and the corresponding index into the MIR from the input. It also loads the parallel counter with the start-up count. The signals en i, with i varying from 0 to 23, are generated from sync by appropriately delaying it. During the high state of en i, the distortion measures are compared and, depending on the result, the contents of register B and the MIR are modified. The current lowest distortion measure and the corresponding index are always available in register B and the MIR, respectively. After a delay of 24 clock cycles, all the signals reappear in the same sequence.
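A rough feasibility check can be built from the timing figures in this section. The sketch below is an approximation under stated assumptions: it takes one 24-cycle bit-serial pass per input vector, ignores the vertical latency, and uses a 74 ns sample period (1/13.5 MHz, rounded down). The function names are hypothetical.

```python
BITS_PER_DISTORTION = 24     # clock cycles per bit-serial pass
ENCODING_DELAY_CYCLES = 25   # encoding delay quoted for a codebook search

def pass_time_ns(clock_mhz):
    """Time for the 24-cycle signal sequence to repeat."""
    return BITS_PER_DISTORTION * 1000.0 / clock_mhz

def encoding_delay_ns(clock_mhz):
    """Delay from presenting an input vector to a valid index."""
    return ENCODING_DELAY_CYCLES * 1000.0 / clock_mhz

def keeps_up(clock_mhz, vector_dimension, sample_period_ns=74):
    """True if one pass fits in the time one input vector takes to arrive."""
    return pass_time_ns(clock_mhz) <= vector_dimension * sample_period_ns

ok_at_8mhz = keeps_up(8, 64)      # 3000 ns against 4736 ns
ok_small = keeps_up(10, 16)       # 2400 ns against 1184 ns
```

Under these assumptions an 8 × 8 vector is comfortably within reach at 8 MHz, while a 16-pixel vector would need a considerably faster clock.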
D. Fabrication
The VQ encoder system was partitioned into small functional blocks, such as the 9-bit PE, high-speed comparator, register array, parallel load counter, and control for the CAG. Since we are only interested in demonstrating the basic concepts, and due to our limited resources, only the functional blocks have been fabricated, through a 3-μm double-metal, twin-tub scalable CMOS process from MOSIS. Two critical blocks that decide the throughput of the system, namely the 9-bit PE and the high-speed comparator, are presently being evaluated. The 9-bit PE has approximately 1000 devices and the comparator has 420 devices integrated. Both use the smallest standard die of 90 × 133 mil from MOSIS. The die micrographs of the 9-bit PE and the comparator are given in Fig. 12. […] IPP's having 32 PE's each and 6 CAG's. Scaling down the design to 1.5 μm allows integration of a much larger number of PE's and reduced delays, making it possible to encode even smaller codevector sizes (< 64) using VQ and MSVQ implementations. The VQ encoder system with the described architecture is also well suited for wafer-scale interconnection and packaging technology [33].
VI. CONCLUSIONS
Vector quantization seems to be a suitable candidate for real-time image encoding at low bit rates. However, VQ requires a fairly large codebook to encode images with an acceptable level of distortion. Through simulations, we have shown that MSVQ requires only a moderate codebook size in each stage to encode images and performs comparably to a single-stage VQ with a large number of codevectors. Some computer simulations of MSVQ-coded NTSC composite images are presented to demonstrate that good images with an acceptable level of distortion can be obtained. A practical implementation of high-throughput architectures for real-time image coding of TV signals using MSVQ is attempted.
The implementation is based on systolic architectural concepts, and a two-dimensional bit-serial architecture is employed to allow as many PE's as possible on a single chip. The architecture is specifically directed towards MSVQ and features cascadability in both horizontal and vertical directions, as it will not be feasible to pack the needed PE's on a single chip. The architecture consists of two functional blocks, the IPP and the CAG. The IPP is designed with a two-dimensional bit-serial systolic architecture and the CAG is designed with an internal parallel architecture. The IPP outputs one bit of the distortion measure corresponding to each codevector in the codebook every clock cycle. Thus, it requires 24 clock cycles to produce a distortion measure. The delay from one distortion measure (24 bits) to the next one is only one clock cycle. The CAG has both bit-serial and bit-parallel inputs. Both processors are cascadable for various sizes of codebook and codevector. The total encoding delay when both processors are cascaded is 25 clock cycles and is independent of the codebook size.
All the functional blocks for the VQ encoder system are implemented in a 3-μm scalable CMOS process from MOSIS, with a view to scaling down to 1.5 μm and lower in the future. With a conservative clock rate of 8-10 MHz, a real-time image encoder using VQ/MSVQ is possible with codevector sizes of 64 and higher. Scaling down the design allows us to integrate a larger number of processing elements, giving reduced delays to encode codevectors of even smaller size. Hardware simulations of the architecture have shown the feasibility of the VQ encoder system for TV quality image coding and the viability of an integrated approach using VLSI technology. The architecture is also well suited to wafer-scale interconnection and packaging technology for building a complete VQ encoder system.
