Abstract: The architecture and implementation of a programmable video signal processor, designed as a building block of a MIMD-based bus-connected multiprocessor system, is presented. This system can either be constructed from several single processor chips, or it can be integrated on a large area integrated circuit containing several processors. The processor allows an efficient implementation of different video coding standards like H.261, H.263, MPEG-1 and MPEG-2. It consists of a RISC processor supplemented by a coprocessor for computation intensive convolution-like tasks, which provides a peak performance of more than 1 giga arithmetic operations per second (GOPS). A large area integrated circuit integrating 9 processor elements (PEs) on an area of 16.6 cm² has been designed. Due to yield considerations, redundancy concepts have been implemented that, even in the presence of production defects, result in working chips utilizing a lower number of PEs. Each PE has built-in self-test (BIST) capabilities, which allow it to be tested independently under the control of its integrated fault-tolerant BIST controller. Defective PEs are switched off. Only the PEs passing the BIST are used for video processing tasks. Prototypes have been fabricated in a 0.8 µm CMOS process structured by masks using wafer stepping with overlapping exposures. Employing redundancy, up to 6 PEs per chip were functional at 66 MHz, thus providing a peak arithmetic performance of up to 6 GOPS.
I. Introduction
For real-time coding and transmission of video data, several international standards have been developed. These include ITU H.261/H.263 [1], [2] for video telephony, ISO MPEG-1 [3] for multimedia and ISO MPEG-2 [4] for digital TV. All these standards utilize hybrid coding techniques [5]. They combine computation intensive tasks, like motion estimation or discrete cosine transformation, with data dependent tasks like variable length coding, variable threshold or quantization. The performance requirements of all standards are determined by the complexity of the algorithms and by the image size and the frame rate. For real-time coding the performance requirements range from a few hundred million operations per second (MOPS) for a video telephone decoder, up to several giga arithmetic operations per second (GOPS) for a digital TV encoder. Due to these performance requirements, several VLSI circuits are needed for the implementation of a real-time MPEG-2 codec. To achieve a higher degree of integration, two strategies can be followed. First, an implementation of the codec's architecture dedicated to the envisaged application field can be used to reduce the required implementation area [6]. The drawback of this approach is that the implementation of a modified coding algorithm often necessitates a redesign of the involved circuits.
On the other hand, a large area integrated circuit consisting of several programmable processors can be used. Then, modifications in the coding standards may be implemented by software updates. Nevertheless, a large area integration of multiprocessor systems puts severe constraints on the architecture of the involved processors. The main problem for the implementation of large area integrated multiprocessor systems is the achievable production yield. In particular, the very low yield for chips with several square centimeters die size enforces the implementation of redundancy concepts and reconfiguration strategies. Both redundancy concepts and reconfiguration strategies are supported by fine-grained multiprocessor architectures consisting of a large number of small processors connected by a simple interconnection network [7], [8]. Unfortunately, the architecture of this type of multiprocessor system does not match very well with the requirements of sophisticated video coding algorithms, like the standard ISO MPEG-2. From the architectural point of view, a coarse-grained large area integrated multiprocessor system consisting of relatively complex processor elements offers a more efficient solution [9].
In this paper we present a coarse-grained multiprocessor architecture for implementation on a large area integrated circuit of about 16 cm² die size. The implementation of redundancy techniques and the identification and isolation of defective processors allow the use of chips even in the presence of defects, which enhances the production yield. Depending on the number of working processors, a different overall performance is delivered. Devices containing a high number of working processors can be used for complex tasks, for example processing of MPEG-2. A device with a lower number of working processors need not be discarded, but can be used for real-time processing of less computation intensive standards like H.263/MPEG-1.
In the following section an introduction to hybrid video coding according to the above mentioned standards is presented. Due to the similarities in all source coders, the development of a multiprocessor architecture suitable for all standards is possible. In section III the processor architecture is presented, which has been verified by a VLSI implementation described in section IV. In section V the performance requirements of the coder standards are analyzed, listing the numbers of processors necessary for real-time processing in a multiprocessor system. In order to achieve a compact implementation of a complete multiprocessor system on a chip, a large area integrated circuit has been designed. Section VI describes the architecture, and section VII the floorplan and layout of this 16.6 cm² chip for fabrication in a mask process. The employed test concepts, including built-in self-test of the processors, and the test procedure are described in sections VIII and IX. Experimental results for the fabricated prototype chips are given in section X.
II. Video Coding Algorithms
For storage and transmission of motion pictures, several video coding standards [1], [2], [3], [4] have been developed that are based on a common coding scheme. In a first preprocessing step the PAL or NTSC camera signal is converted into a digital image format. For some coders subsampling is required to achieve smaller image formats or frame rates. Common subsampling formats are for example SIF/CIF (352 × 288 pixels) and QCIF (176 × 144 pixels) for multimedia and video telephone applications. Frame rates range from 10 Hz up to 30 Hz. With this format conversion, appropriate source rates can be provided, as defined in the coding standards.
A digital video coder is used in order to decrease the number of bits per second for transmission or storage. The coder reduces the temporal and spatial redundancy in a sequence of frames. This leads to a compression without any quality degradation. To achieve a higher data compression, the image quality additionally has to be degraded, i.e. the coding process then becomes irreversible.
The coding and decoding process can further be divided into several steps (Fig. 1). In the first step, the images, which are segmented into macro blocks of 16 × 16 picture elements (pels), are coded nearly independently, one by one, in the source coder. In the second step, called video multiplex coding, the stream of macro block data is arranged into a data stream of fixed bit rate for transmission. For rate control, the multiplex coder uses a buffer, whose filling level is used for compression rate regulation in the source coder. At the receiver, the bit stream is decoded and the video information is reconstructed.
Since the complexity of the source coder and decoder algorithms is much higher than the complexity of the multiplex coder, the overall computational performance required for real-time processing is mainly determined by the source coding algorithms. Therefore, this article is focused on source coding.
The source coder is based on hybrid coding techniques, which are common to all of the above mentioned coding standards. Fig. 2 gives an overview of a hybrid coder. It consists of a motion compensated feedback loop that exploits the temporal correlations in a sequence of images. The macro blocks of the input image are transformed using the discrete cosine transform (DCT) and quantization (Q). The resulting values are transmitted for decoding. In the feedback loop, a decoding step consisting of inverse quantization and inverse DCT is performed, and the reconstructed image is stored for motion estimation (ME). From the information of one or two previously transmitted images, the motion estimation predicts motion vectors on a macro block basis, which are used for motion compensation. The computed vectors have to be transmitted to the decoder, too. The tasks used in the coder and decoder can be divided into low-level and medium-level tasks. Low-level tasks like DCT, IDCT and ME are pixel-oriented and deterministic, with high computation requirements. Over 70 percent of the overall performance is consumed by the processing of low-level tasks. The remaining medium-level tasks are characterized by an input data dependent program flow and contain a high number of if-then branches, but are less computation intensive than low-level tasks.
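The feedback loop described above can be illustrated with a minimal sketch in Python. It is not the implementation used on the processor: the block size of 8, the uniform quantizer with a single step size `qstep`, and all function names are assumptions made for illustration, and the motion compensated prediction `pred` is simply passed in instead of being computed by motion estimation.

```python
import math

N = 8  # assumed DCT block edge length for this sketch

def dct_matrix(n=N):
    """Orthonormal DCT-II basis matrix C, so that Y = C * X * C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

C = dct_matrix()
CT = [list(col) for col in zip(*C)]  # transpose of C

def dct2(block):
    return matmul(matmul(C, block), CT)

def idct2(coef):
    return matmul(matmul(CT, coef), C)

def encode_block(cur, pred, qstep):
    """Forward path: prediction error -> DCT -> quantization (Q)."""
    residual = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(cur, pred)]
    return [[round(v / qstep) for v in row] for row in dct2(residual)]

def reconstruct_block(levels, pred, qstep):
    """Feedback path: inverse quantization -> IDCT -> add prediction."""
    coef = [[lv * qstep for lv in row] for row in levels]
    residual = idct2(coef)
    return [[p + r for p, r in zip(pr, rr)] for pr, rr in zip(pred, residual)]
```

The decoder runs the same `reconstruct_block` step as the feedback loop of the coder, which is why both sides stay synchronized even though quantization makes the coding irreversible.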
III. Video Signal Processor Architecture
An architecture for hybrid video coding has to support the processing of two different classes of algorithms, which have different computation requirements for real-time processing. Low-level tasks are well suited for parallel processing. Since they have a convolution-like structure and a deterministic program flow, their control is very simple. Medium-level tasks require a high degree of control, since they have a data dependent program flow. However, they have lower computation requirements, which eases their implementation.
An efficient implementation of programmable processors for video coding systems can be achieved by adapting the processor architecture to the coding algorithms. One possible approach is the coprocessor concept. Computation intensive operations, which are common to all coding schemes, are mapped onto a specialized coprocessor. A programmable processor controls the coprocessor and executes all other parts of the algorithms.
Based on this approach, a video signal processor architecture has been developed [9]. The processor consists of two main modules: a RISC processor and a low-level coprocessor. Both modules are adapted to a subclass of coding tasks. The RISC [10] processor is used for the computation of medium-level tasks, like quantization or Huffman coding, and performs all control tasks. For fast processing of computation intensive, convolution-type low-level coding tasks, the microprogrammable coprocessor is used. Fig. 3 gives an overview of the architecture.
A. Coprocessor
The main modules of the coprocessor are a local memory, a parallel arithmetic processing unit and a microprogrammable control unit. The local memory supports local storage of input data and intermediate results of the arithmetic processing unit. Furthermore, the local memory serves as a 32-bit memory mapped interface between the coprocessor and the RISC processor core.
To support fast processing of computation intensive convolution-like tasks, the arithmetic processing unit features a fourfold parallel ALU/multiply pipeline in combination with a common multioperand accumulator and a shifter/limiter. In order to achieve a small and efficient implementation and to support both 8-bit and 16-bit data formats, the arithmetic processing unit provides 8-bit and 16-bit processing on the basis of parallel 8-bit processing units. In the fast mode 8-bit operands are processed within one clock cycle.
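As a rough behavioral illustration of this data path, the following Python sketch processes a convolution-like sum four taps at a time, adds the four products in a common accumulator, and finally applies a shift and a limiter. The function names, the signed 8-bit output range and the counting of one group of four taps as one step are assumptions for illustration; the sketch does not model the pipeline stages or the 16-bit mode.

```python
def saturate(value, bits=8):
    """Limiter: clamp a value to the signed range of the given word length."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def mac4(samples, coeffs, shift=0, bits=8):
    """Multiply-accumulate with four parallel multipliers (behavioral model).

    Returns the shifted and limited result together with the number of
    four-tap groups that were processed.
    """
    if len(samples) != len(coeffs):
        raise ValueError("samples and coeffs must have the same length")
    acc = 0
    groups = 0
    for i in range(0, len(samples), 4):
        # Up to four products are formed side by side ...
        products = [s * c for s, c in zip(samples[i:i + 4], coeffs[i:i + 4])]
        # ... and summed in the common multioperand accumulator.
        acc += sum(products)
        groups += 1
    # Shifter followed by limiter, as in the shifter/limiter stage.
    return saturate(acc >> shift, bits), groups
```

For example, `mac4([100] * 8, [2] * 8)` accumulates 1600 in two groups and is limited to 127, whereas the same call with `shift=4` yields 100 without saturation.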
B. RISC Processor Core
The Harvard architecture of the RISC core is specially extended for the processing of medium-level tasks. The RISC core processes one 32-bit scalar instruction per clock cycle. Instructions are fetched from a 1024 × 32-bit on-chip program memory and executed in a four-stage pipeline.
For efficient loop processing, the PC&Loop-Control unit supports hardware controlled instruction loops. This feature significantly reduces the number of control instructions in medium-level tasks, which are typically loop-based.
Operands are fetched from a 256 × 16-bit register file and are processed either by a multiply/shift/limit pipeline or by the ALU.
The multiply/shift/limit unit integrates a fast hardware multiplier for efficient processing of quantization and inverse quantization tasks. These tasks represent the most computation intensive parts of medium-level processing and mainly consist of multiply and division operations.
IV. Processor Implementation
To achieve a compact realization of the processor [11], standard cell logic has been combined with full-custom modules for memories and regular arithmetic units like ALU, multiplier, multioperand accumulator and barrel shifter. This approach reduces the required silicon area for the complete arithmetic processing unit to 10 mm² in a 0.8 µm CMOS technology.
The clock rate of the chip is determined by the implemented static RAM modules for program and data memories. All memories, including a 256 × 16-bit quad-port RAM that realizes the RISC processor register file, provide an access time of approximately 12 ns. Due to the additional propagation delay and setup time of the registers, the clock rate of the chip is 66 MHz.
The main characteristics of the chip are given in Table I. Fig. 4 shows a die microphotograph of the circuit.
V. MIMD-based Multiprocessor System
The performance of one processor is sufficient for simple coding schemes at lower source rates. When several processors are connected to form a MIMD (multiple instruction multiple data) multiprocessor system (Fig. 5), the real-time processing of more complex applications is supported. The image data is distributed sequentially to the processors in the multiprocessor system, and all processors perform the hybrid coding scheme in parallel on the macro block level. Because each processor features on-chip program and data memories, the overall structure of the multiprocessor system is MIMD based. When using an SPMD (single program multiple data) scheme, an efficient bus arbitration can be applied. At any given time, one processor loads data from or stores data to the external data memory, while all other processors compute on locally stored data. This allows a high utilization of the buses, and stall cycles can be eliminated. For applications like MPEG-2, where the communication requirements between the processors can be neglected, a linear speedup of the multiprocessor system compared to a single processor is achieved [12]. This speedup is limited only by the bandwidth of the input and output bus. Table II gives an overview of the number of processors required for the above mentioned hybrid coding applications.
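The sequential distribution of macro blocks and the bus-limited speedup can be sketched with a simple model in Python. The model is an assumption made for illustration, not the scheduling used in [12]: macro blocks are dealt out round-robin, every macro block needs `t_compute` time units on a PE and `t_transfer` time units on the shared bus, bus transfers are strictly serialized, and computation on different PEs overlaps completely.

```python
def distribute_macroblocks(num_macroblocks, num_pes):
    """Deal macro block indices out to the PEs one after another."""
    assignment = [[] for _ in range(num_pes)]
    for mb in range(num_macroblocks):
        assignment[mb % num_pes].append(mb)
    return assignment

def frame_time(num_macroblocks, num_pes, t_compute, t_transfer):
    """Frame time under the simplified model stated above.

    The frame is finished when both the busiest PE and the serialized
    bus traffic are done, so the larger of the two terms dominates.
    """
    blocks_on_busiest_pe = -(-num_macroblocks // num_pes)  # ceiling division
    compute_time = blocks_on_busiest_pe * t_compute
    bus_time = num_macroblocks * t_transfer
    return max(compute_time, bus_time)

def speedup(num_macroblocks, num_pes, t_compute, t_transfer):
    """Speedup of num_pes processors relative to a single processor."""
    single = frame_time(num_macroblocks, 1, t_compute, t_transfer)
    return single / frame_time(num_macroblocks, num_pes, t_compute, t_transfer)
```

A CIF frame of 352 × 288 pixels contains 22 × 18 = 396 macro blocks. With the hypothetical values `t_compute=10` and `t_transfer=1`, the model gives a speedup of 4 for four PEs, and the speedup saturates at 10 once the bus becomes the bottleneck.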
VI. Architecture of the Large Area Integrated Circuit
A coarse-grained multiprocessor system implementing nine processor elements (PEs) of the type described above has been integrated on a large area integrated circuit of 16.6 cm² die size [13], [14] (Fig. 6). Like the single chip, it is fabricated in a 0.8 µm CMOS standard cell technology with two-layer metallization. The redundancy concepts implemented on this large chip have previously been tested on a 16 cm² chip [15].
Each PE is connected to the common input bus and to the common output bus. Both bus systems contain redundant (i.e. spare) lines, which allow for a reconfiguration of defective lines using arrays of laser fuses and laser links inserted in the bus systems.
All PEs are provided with data from the input bus in parallel, with the exception of the scan paths, which are connected in series, and a single individual input line per PE. The input bus is a unidirectional bus containing repeaters that ensure signal integrity, which would otherwise suffer due to the long lines.
The output bus is a unidirectional bus, too, with the PEs connected in parallel. The repeaters are built as tristate buffers, which are also used to prevent bus conflicts when more than one PE tries to write to the bus. In addition to this unidirectional output bus, each PE has four individual output lines, which are separately routed to the chip's pins.
Since a working clock system is required for the test of the buses and for the localization of defective bus lines, the clock signal is distributed within the input bus using a triple modular redundancy scheme. In each PE a voter derives the internal clock from three global clock lines and passes its result to a clock buffer. Hence, a single defective clock line can be tolerated. Each PE contains a Scan path for Bus Tests (ScBT), highlighted in Fig. 7, that surrounds each core similar to a boundary scan path and is utilized for the bus tests. The ScBT is used to read in data from the input bus in parallel and then to scan out the resulting patterns, as well as to scan in patterns which can then be observed at the output bus. To localize bus faults for laser repair, a working ScBT is required to sample the bus content; the ScBT is therefore tested first and defective sections are automatically bypassed, thus rendering the global ScBT tolerant of multiple defects.
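The voter of the triple modular redundancy scheme can be modeled as a bitwise majority function. The following Python sketch is a behavioral illustration only; the function names and the stuck-at fault model are assumptions, and the sketch says nothing about the timing of the actual voter circuit.

```python
def majority(a, b, c):
    """Two-out-of-three majority vote on single-bit values (0 or 1)."""
    return (a & b) | (a & c) | (b & c)

def voted_clock(clock, faulty_line=None, stuck_value=0):
    """Internal clock of a PE when one global clock line may be stuck.

    clock       : value of the fault-free clock signal (0 or 1)
    faulty_line : index 0..2 of the defective line, or None if all work
    stuck_value : value the defective line is stuck at
    """
    lines = [clock, clock, clock]
    if faulty_line is not None:
        lines[faulty_line] = stuck_value
    return majority(*lines)
```

With any single line stuck at 0 or 1, `voted_clock` still reproduces the fault-free clock value, which is the property exploited above. Two defective lines stuck at the same value would override the remaining correct line.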
Once clock and ScBT work correctly, defective bus lines can be identified by alternately switching into serial and parallel ScBT mode. According to the results, defective line sections are replaced by spare lines using laser links, which are described in detail in [16] and [17]. Since all the spare lines are equipped with input or output pins, defective pads can be circumvented, too. The clock as well as the scan path lines are routed through the arrays and can thus be repaired like all the other signal lines.
VII. Floorplan and Layout for the Fabrication in a Mask Process
On the large area integrated circuit nine PEs are implemented in three rows of three PEs each. This requires tripled input and output buses, as can be seen in Fig. 8a. Most of the lines of the input bus rows are fed in parallel by the same input pins. The lines of the output buses are merged with the help of multiplexers to form a single bus. The individual lines of the PEs are the only exceptions, since they are all connected separately to the outside.
The chip is fabricated in a 0.8 µm CMOS process structured by mask exposure. Due to size restrictions, a 16.6 cm² chip cannot be fabricated using a single mask exposure only, so wafer stepping is required. The large area integrated circuit is fabricated using three different slices on a single set of masks (Fig. 8b). These are repeated an appropriate number of times to form a complete chip. The exposures of the slices have to overlap to ensure the connection of the structures on the wafer.
The largest mask slice, called the PE slice, contains a single PE plus the accompanying parts of the input and output buses including repeaters, together with a single array of laser fuses and laser links each. The input and the output slices provide the chip with the input and output pads and with structures for distributing the input signals to the rows and merging the output signals. These border slices also contain repeaters to ensure signal integrity, since the longest path a signal can take in the borders is about 4 cm long.
During exposure all but one of the slices are masked out. The PE slice is exposed nine times, whereas the input and output slices are exposed three times each. Since all the PEs are exposed from a single mask, they are completely identical. Because each PE has to be addressed separately, it must be possible to identify the various PEs without changing the mask. This is done with the help of the individual input lines mentioned above. Within the input bus associated with each PE, these individual lines are cyclically interchanged, thus providing every PE in a row with an individual line. Similar approaches have been used to distinguish between the three identical input slices and the three identical output slices.
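The effect of the cyclic interchange can be shown with a small Python sketch. It assumes, for illustration only, that every identical PE slice taps the line on the same track (track 0 here) and then rotates all individual lines by one track before handing them to the next slice; the actual track assignment and rotation direction on the chip are not given in the text.

```python
def lines_seen_by_pes(external_lines, num_pes):
    """Return which external line each PE in a row ends up connected to.

    external_lines : names of the individual lines entering the row
    num_pes        : number of identical PE slices in the row
    """
    seen = []
    tracks = list(external_lines)
    for _ in range(num_pes):
        # Every slice is identical, so every PE taps the same track.
        seen.append(tracks[0])
        # The identical wiring in each slice shifts all lines by one track.
        tracks = tracks[1:] + tracks[:1]
    return seen
```

For a row of three PEs, `lines_seen_by_pes(["sel0", "sel1", "sel2"], 3)` returns `["sel0", "sel1", "sel2"]`: although the layout of every slice is the same, each PE is driven by a different external line and can therefore be addressed individually.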
VIII. Test Concept and BIST Controller
The PEs on large area integrated circuits have to be provided with a built-in self-test (BIST), since access to the modules from the outside is very limited and is also subject to defects. The BIST has to be performed and controlled locally, in order to prevent defects in the buses from hindering the test of the modules during the bring-up of the chip. Of course, critical areas and signals will always remain [18], but these have to be reduced as far as possible.
The above requirements have led to the design of a flexibly programmable fault-tolerant BIST controller, which is integrated within each PE of the large area integrated circuit. Therefore, the test of each PE is controlled and performed locally. A scan path plus four signals secured by a parity bit are sufficient for programming and controlling an arbitrary number of BIST controllers. If the BIST controller detects an error in itself or in its control signals, or if the test of the PE fails, it enters an isolation state. In this state the PE is prevented from accessing the output bus, the scan paths of this module are bypassed and the local clock is frozen, preventing dynamic power dissipation within the disabled PEs. The isolation state can only be left by applying a reset instruction to the BIST controller for at least two clock cycles. Together these measures ensure that only a sufficiently small number of potential defects that can affect the complete chip remains within a PE, thus making it possible to use nearly all large area integrated circuits even in the presence of defective PEs.
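The isolation behavior can be summarized as a small state machine. The Python sketch below is a behavioral model under several assumptions that are not taken from the text: even parity over the four control signals and the parity bit, a hypothetical encoding `RESET` for the reset instruction, and one call of `step` per clock cycle. Self-checking of the controller's own logic is reduced to the `test_failed` flag.

```python
RESET = (1, 1, 1, 1)  # hypothetical encoding of the reset instruction

def parity_ok(signals, parity):
    """Assumed even parity over four control signals plus the parity bit."""
    return (sum(signals) + parity) % 2 == 0

class BistController:
    """Behavioral model of the isolation state of one PE."""

    def __init__(self):
        self.isolated = False
        self.reset_cycles = 0

    def step(self, signals, parity, test_failed=False):
        """Advance one clock cycle and return True if the PE is isolated."""
        valid = parity_ok(signals, parity)
        # Count consecutive cycles in which a valid reset is applied.
        if valid and tuple(signals) == RESET:
            self.reset_cycles += 1
        else:
            self.reset_cycles = 0
        if self.isolated:
            # Isolation is left only after reset has been held for
            # at least two clock cycles.
            if self.reset_cycles >= 2:
                self.isolated = False
        elif not valid or test_failed:
            self.isolated = True
        return self.isolated

    def pe_status(self):
        """Consequences of the isolation state for the PE."""
        return {
            "output_bus_enabled": not self.isolated,
            "scan_paths_bypassed": self.isolated,
            "local_clock_running": not self.isolated,
        }
```

A single corrupted control word is enough to isolate the PE in this model, while a single reset cycle is not enough to release it again, which mirrors the asymmetry described above.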
For the test of the PEs, all the logic parts of the video signal processor have been analyzed with respect to the testability that can be achieved by pseudorandom test patterns generated by BILBOs (Built-In Logic Block Observers) [19]. Selected system registers have been replaced by modified BILBOs, which besides their normal function as registers can act as a TPG (Test Pattern Generator), as a TAE (Test Answer Evaluator) or as a stage of a scan chain.
BILBOs running in the TPG mode generate pseudorandom test patterns. The analysis of the PE has shown that not all of the logic blocks could be tested effectively using these patterns. There are two different methods to solve this problem within the scope of a self-test: either special test vectors have to be provided by special pattern generators, or during parts of the test the logic blocks which cause the difficulties have to be converted to a better testable logic. The second approach was chosen for the PEs on the large area integrated circuit, since it led to a much smaller overhead. It called neither for an in-depth analysis of the required test patterns, nor for the design of special pattern generators for each logic block. For this approach the critical parts of the logic that prevented good testability had to be identified. For example, the shifter/limiters within the RISC processor and the arithmetic processing unit inherently caused testing difficulties: most pseudorandom data is truncated or mapped to the maximum or minimum value possible, thus masking the effects of many faults most of the time. Therefore, these faults could not be detected at the outputs of the limiter within a reasonable number of clock cycles. In the test mode this behavior had to be changed. By inserting a small number of exclusive-or gates at selected points of the logic, which allow the inversion of signals within the logic under the control of an additional test signal, the limiters become easily testable with pseudorandom patterns generated by BILBOs. Since the insertion of the exclusive-or functions was done before circuit synthesis, the additionally introduced delay could be accounted for. BILBOs acting as TAEs are employed to collect and compress the test answers of the tested logic.

Table III. Overview of the test phases.
Phase   Tested module       Test controlled by
I       BIST controller     BIST controller (self-test)
II      RISC                BIST controller
III     I/O buses           BIST controller and ScBT
IV      arith. proc. unit   RISC
V       control unit        RISC
VI      local memory        RISC
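The masking effect of a limiter under pseudorandom stimulation can be made plausible with a short Python experiment. The sketch uses a standard 16-bit maximal-length Fibonacci LFSR (feedback polynomial x^16 + x^14 + x^13 + x^11 + 1) as a stand-in for a BILBO in TPG mode and a limiter from 16-bit to signed 8-bit values. Both word lengths and the polynomial are assumptions for illustration; the BILBOs and limiters on the chip may differ.

```python
from itertools import islice

def lfsr16(seed=0xACE1):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (period 65535)."""
    state = seed
    while True:
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)

def to_signed16(word):
    """Interpret a 16-bit pattern as a two's complement number."""
    return word - 0x10000 if word & 0x8000 else word

def limit8(value):
    """Limiter: clamp to the signed 8-bit range."""
    return max(-128, min(127, value))

def saturated_fraction(num_patterns, seed=0xACE1):
    """Fraction of pseudorandom patterns that the limiter clamps."""
    clamped = 0
    for word in islice(lfsr16(seed), num_patterns):
        value = to_signed16(word)
        if limit8(value) != value:
            clamped += 1
    return clamped / num_patterns
```

Over one full LFSR period, more than 99 percent of the patterns are clamped to -128 or 127 in this model, so the limiter output carries almost no information about faults in the logic in front of it. This is the kind of behavior that a test mode with controllable signal inversion is meant to avoid.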
The control signals of all BILBOs in the RISC processor are generated by the BIST controller. All control signals of the BILBOs in the coprocessor as well as some additional coprocessor test control signals are set by the RISC processor.
The PE including its BIST features has an area overhead of about 3.5 % compared with a PE containing a scan path as its only test feature.
IX. Test Procedure
The test of the large area integrated circuit and its PEs is performed hierarchically. In phase I of the test, all nine BIST controllers perform a self-test in parallel. Due to the BIST controller's fault detection and fault tolerance properties, it is nearly guaranteed that defective BIST controllers enter the isolation state (see section VIII).
In phase II, each RISC processor is tested by its associated BIST controller, which has been previously programmed to execute the required test sequence. The test responses (signatures) collected by the BILBOs are evaluated on-chip by transferring the signatures into the test register of the BIST controller and checking their inverting distances, i.e. checking the number of LFSR cycles required to invert the signature. Again, if the test fails, the BIST controller enters the isolation state and the defective PE is disabled and separated from the outside.
In phase III, which can also be performed prior to phase II, the input and output buses are tested as described in section VI and the necessary reconfiguration is performed by appropriately processing laser fuses and laser links.
After phase III the RISC processors can be loaded with test programs using the full input bus. During phases IV to VI the remaining modules (arithmetic processing unit, control unit, local memory) are tested with the help of the inserted BILBOs, while the RISC processor controls the tests and conducts the signature evaluation. After each test, the RISC processor transfers the result to the BIST controller, which on failure enters the isolation state and disables the associated PE. Table III contains an overview of the test phases.
After completion of all six test phases, the defective PEs remain disabled and all other PEs can be used for video signal processing. Since the BIST controller and the signature evaluation are programmable, and since the test programs loaded into the RISC processor can be changed completely, the entire test procedure can, when necessary, be rearranged or modified without the need to change the test hardware. This makes it possible to react flexibly to test problems that arise even after chip fabrication.
Since the interaction between the chip and the outside during all test phases is sparse and, with the exception of the bus test in phase III, is done asynchronously, the built-in self-test can be carried out at system speed without requiring sophisticated external test hardware. The self-test can be executed at each power-on, whereas the reconfiguration of bus lines is done only once after fabrication.
X. Experimental Results
The large area integrated circuit is mounted in a customized PGA 174 package with an edge length of 5.3 cm. In this package all I/O signals are provided at the left and the right hand side. All connections of the chip core to VDD and GND are placed on the other two sides of the package (Fig. 6 and Fig. 8a).
A set of 15 prototype circuits has been fabricated and tested. The test followed the procedure described in section IX. After the self-test of the BIST controllers a sequence of bus tests was performed. On one of the fabricated chips two of the data input lines seemed to be shorted at input pin level. These lines, including their pins, were replaced by two redundant lines, which improved the number of working PEs on this chip from 0 to 4. On a second chip one defective output bus line of the first PE row was repaired, but no additional working PEs were gained. On a third chip the test control scan path of the first PE row (one of the critical signals remaining in the system, as mentioned in section VIII) had been hit by a defect. Employing one of the redundant lines within the laser array of the affected PE, the PE's test control scan path could be bypassed, so that the other two PEs in this row were made accessible. The number of working PEs was increased by 2, giving a total of 4.
The maximum operating frequency of the PEs on the large area integrated circuit exceeded 64 MHz in all cases; thus the PEs operated in the same range as measured for a single PE chip. Each working PE consumed about 3.5 W at a clock rate of 66 MHz. Fig. 9 shows the statistics of the test results before and after the bus repair. After the bus repair, all 15 chips had at least one working PE, 7 chips had 4 or more working PEs, and 2 chips even yielded 6 working PEs, providing a peak arithmetic performance of more than 6 GOPS per single chip. Table IV summarizes the main characteristics of the large area integrated circuit.
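The relation between the number of working PEs and the figures quoted above can be written down as a short calculation. The Python sketch below assumes that peak performance and power simply scale with the number of working PEs, takes 1 GOPS per PE as a lower bound (the text states "more than 1 GOPS"), and uses the measured value of about 3.5 W per working PE; the function name is hypothetical.

```python
GOPS_PER_PE_MIN = 1.0   # lower bound on peak performance of one PE at 66 MHz
WATTS_PER_PE = 3.5      # approximate power of one working PE at 66 MHz
TOTAL_PES = 9           # PEs integrated on the large area circuit

def chip_estimate(working_pes):
    """Return (minimum peak GOPS, approximate power in W) of one chip."""
    if not 0 <= working_pes <= TOTAL_PES:
        raise ValueError("working_pes must be between 0 and 9")
    return working_pes * GOPS_PER_PE_MIN, working_pes * WATTS_PER_PE
```

For the best prototypes, `chip_estimate(6)` gives at least 6 GOPS at roughly 21 W, while a chip with four working PEs delivers at least 4 GOPS at roughly 14 W.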
A comparison between a system realized as a large area integrated circuit and a solution consisting of multiple single chips shows that large area integrated circuits can become more and more competitive in the future, if cost factors like mounting size, board design, assembly cost, number of bond wires, and reliability enhancement are taken into account. Compared to MCMs, large area integrated circuits are advantageous in terms of reliability, bonding costs, and test effort, while MCMs benefit from their lower overhead for redundancy and from the possibility to integrate various chip types, even when processed in different technologies.
XI. Summary
The architecture of a complex processor element has been developed, which is specially adapted to video coding standards like H.261, H.263, MPEG-1 and MPEG-2. The processors can be integrated into a MIMD-based multiprocessor system, consisting either of several single processor chips or of a large area integrated circuit containing several processors on a single die. The processor element consists of a RISC processor core for data dependent tasks and a coprocessor for computation intensive convolution-like tasks. The coprocessor, which is controlled by the RISC core, provides a peak arithmetic performance of more than 1 GOPS at a clock rate of 66 MHz.
The design, the fabrication and the self-test of a 0.8 µm CMOS large area integrated circuit with an area of 16.6 cm² have been carried out. Defective elements are deactivated automatically during self-test, allowing the use of chips even in the presence of defects within some processor elements. Up to 6 of the 9 integrated video signal processor elements on the prototype chips could be used for video signal processing tasks, giving a peak arithmetic performance of up to 6 GOPS per chip. This was achieved by the implementation of redundancy concepts, since the economic production of defect-free chips with an area of 16.6 cm² is not possible, due to the otherwise very low production yield. The results of the prototype chips show that large area integrated circuits with a coarse-grained architecture can offer an economical alternative to conventional ways of increasing system performance for multiprocessor systems, if the appropriate redundancy concepts are utilized.
