Abstract-Multimedia processing is becoming increasingly important, with a wide variety of applications ranging from multimedia cell phones to high-definition interactive television. Media processing involves the capture, storage, manipulation and transmission of multimedia objects such as text, handwritten data, audio objects, still images, 2-D/3-D graphics, animation, and full-motion video. A number of implementation strategies have been proposed for processing multimedia data. These approaches can be broadly classified based on the evolution of processing architectures and the functionality of the processors. In order to provide media processing solutions to different consumer markets, designers have combined some of the classical features from both the functional and evolution-based classifications, resulting in many hybrid solutions. We propose a categorization of existing microprocessors based on a combination of both architectural and functional flavors, with examples of each approach from the latest multimedia processing families. The varying processing requirements in multimedia computing point to the potential for reconfigurable multimedia processing.
I. INTRODUCTION
A VARIETY of media processing techniques are typically used in multimedia processing environments to capture, store, manipulate, and transmit multimedia objects such as text, handwritten data, audio objects, still images, 2-D/3-D graphics, animation, and full-motion video. Examples of such techniques include speech analysis and synthesis, character recognition, audio compression, graphics animation, 3-D rendering, image enhancement and restoration, image/video analysis and editing, and video transmission. Multimedia computing presents challenges from the perspectives of both hardware and software. For example, multimedia standards such as MPEG-1, MPEG-2, MPEG-4, MPEG-7, H.263, and JPEG 2000 involve the execution of complex media processing tasks in real time. The need for real-time processing of complex algorithms is further accentuated by the increasing interest in 3-D image and stereoscopic video processing. Each medium in a multimedia environment requires different processes, techniques, algorithms, and hardware. The complexity and variety of techniques and tools, and the high computation, storage, and input/output (I/O) bandwidths associated with multimedia processing, present opportunities for reconfigurable processing to enable features such as scalability, maximal resource utilization, and real-time implementation.
A number of implementation strategies have been proposed for processing multimedia data. These approaches can be broadly classified based on the evolution of media processing architectures and on functionality. In order to provide media processing solutions to different consumer markets, designers have combined some of the classical features from both the functional and evolution-based classifications, resulting in many hybrid solutions. In this paper, we propose a categorization through a judicious mixture of features from both classifications. It is shown that the complexity, real-time constraints, and the need for low-power, area-efficient, and cost-efficient implementations cannot all be satisfied by the existing solution strategies. Therefore, there is a need to explore the paradigm of reconfigurable computing. To demonstrate the potential for reconfiguration in multimedia computations, we have performed a detailed complexity analysis of the recent multimedia standard MPEG-4 [54], which we believe involves multiple media and encompasses a wide range of operations typically found in media processing. The results of our analysis show that there are significant variations in computational complexity among the various modes/operations of MPEG-4. This points to extensive opportunities for exploiting reconfigurable implementations of multimedia algorithms. This paper is organized as follows. Section II gives an overview of the classification of existing media processing approaches. Section III provides details on general-purpose programmable processors. Section IV discusses state-of-the-art media processor architectures, Section V discusses dedicated hardware implementations, and Section VI justifies the need for a reconfigurable system.
II. MEDIA PROCESSING APPROACHES
The factors that influence the design of cost-effective media processing solutions are as follows.
• Detailed analysis of the computational complexity, variety of techniques, and tools associated with multimedia processing.
• Evaluation of real time constraints of typical media processing applications, including task switching between the various media components.
• Investigation of the parallelism and redundancies associated with media algorithms, stemming from their repetitive and compute intensive nature.
• Understanding of the inter-processor communication patterns for system on chip (SOC) and chipset-based implementations.
• Exploration of memory issues ranging from fast high density and expensive on-chip RAMs to low-cost, high-speed, off-chip RAMs.
• Evaluation of the cost/time tradeoff involved in providing augmentation to existing solutions in order to support media processing.
• Investigation of the tradeoffs between processing power and processor-memory bandwidth for a restricted area and low-power implementation.
Existing media processing strategies and solutions can be classified based on either the evolution of processing architectures or the functionality of the processors. Section II-A discusses the classification based on the former approach, while Section II-B categorizes the processors based on the latter approach.
A. Classification Based on Evolution of Media Processing Architectures
Based on evolution of the processing architectures, existing solutions for media processing can be broadly classified as programmable processors, dedicated implementations, and reconfigurable processors (shown in Fig. 1 [3] - [4] . RISC processors with such support include UltraSPARC-I with VIS [5] and PowerPC with AltiVec [6]. Details of these processors are given in Section III.
b) Special-purpose programmable processors: To cater to the need for multimedia-specific applications, special-purpose programmable processors have evolved, starting with the early DSP processors (meant only for audio processing) and leading to the state-of-the-art video/graphics processors. The most significant feature of the DSP processors is hardware support for the multiply-accumulate (MAC) instruction, whilst the latest media/video processors go beyond the MAC instruction and provide hardware support for complex graphics processing tasks such as warping and perspective transformations. Texas Instruments' TMS320C series is an example of the former, while Philips's Trimedia TM32 and M-PIRE [7] are examples of the latter. Details of these processors are given in Section IV.
2) Dedicated implementations: Concurrent to the programmable model, there is also a category of dedicated hardware implementations. This class of processors can be classified into single-chip (monolithic) and distributed (chip-set) implementations. Examples of the monolithic or function-specific processors are dedicated discrete cosine transform (DCT) chips [8]-[10]. LSI Logic's L64735 DCT processor chip and Mitsubishi's Advanced Television (ATV) decoder are good examples of the distributed implementation. Details of processors belonging to this class are presented in Section V.
3) Reconfigurable processors: Efforts toward optimizing both the hardware and software approaches for media processing point to the possibility of applying a new paradigm of computing architecture, namely "reconfigurable computing," to media-specific applications.
The concept of reconfiguration of the processor resources during run time has proliferated due to the extensive use of field-programmable gate arrays (FPGAs) [11] for testing and validation of circuits. FPGAs consist of arrays of configurable logic blocks (CLBs) that implement various logical functions. The current set of architectures in this domain falls broadly into two categories: 1) monolithic and 2) modular.
a) Monolithic: The MATRIX [57] is a monolithic processor that consists of computing elements that are 8 bits wide, with a memory, ALU, and control unit.
The latest FPGAs from vendors like Xilinx and Altera consist of single-bit computing units that can be partially configured in less than a millisecond. GARP [68] consists of a set of parallel reconfigurable data paths that are connected to a main MIPS processor that loads the configurations. PipeRench is a coprocessor for streaming multimedia acceleration consisting of processing elements interconnected through a network. The MorphoSys reconfigurable system combines a reconfigurable array of processor cells with a RISC processor core and a high-bandwidth memory interface unit.
b) Modular: The Sonic [69] architecture is a modular variant, with a set of plug-in processing elements (PIPEs), which can be configured individually and selectively chosen to implement or accelerate a multimedia function. The drawback of these systems is the lack of programmability within the processing elements. The large amounts of data that need to be loaded from the main processor onto the reconfigurable unit place significant restrictions on the time for these operations. Ultimately, computing devices should be capable of adapting the underlying hardware dynamically in response to changes in the input data or processing environment. Research efforts currently underway in the direction of reconfigurable media processing include [12] and [13].
B. Classification Based on Functionality
Features such as parallelism, flexibility, and memory-intensive data processing are specific to media processing algorithms and distinguish them from general-purpose applications. These features enable the classification of media processing architectures based on the approach toward extracting the parallelism, exploiting the iterative nature of operations, and reducing off-chip memory transactions, as shown in Fig. 2.
2) The iterative nature of operations in media processing involves repetitive use of certain patterns of operations such as MAC, block search, sum of absolute differences (SAD), perspective transformation, etc., in algorithms such as DCT, motion estimation and compensation, and warping and rendering. These repetitions are exploited through specialized hardware support as in the MAP1000A [18] by Equator Technologies and the TANGRAM processor [19]. This approach is very similar to providing macros for a group of successive instructions in general-purpose processors.
3) Memory transactions between the processor and memory units are a dominating feature of video and graphics applications. These applications involve operations over a very large number of pixels or points, which cannot be stored on the chip. This necessitates frequent data transfers from off-chip memory units. This approach has two drawbacks: 1) it is a slow process and 2) it results in large power ratings for the chip. Approaches such as the IRAM [20] attempt to reduce the need for such transfers by embedding a significant portion of the memory onto the chip. The Pentium 4 based 850 chipset addresses the first problem by providing a high-speed RDRAM-based link between the CPU and the off-chip memory.
C. A Hybrid Classification
In order to provide media processing solutions to different consumer markets, designers have combined some of the classical features from both the functional and evolution-based classifications resulting in many hybrid solutions. We recall from Section II that a number of factors affect the design of media processing architectures. We propose a hybrid or a combinational approach to classifying existing media processing solutions, based on the dominating processing flavors.
From an evolutional standpoint, special-purpose programmable processors assimilate the features of DSP and RISC based architectures. But from a functional perspective, they incorporate units that exploit parallelism at many levels, thus adding the flavors of VLIW and SIMD onto the processing core. Therefore, based on the dominating factors, we propose a hybrid classification of the multimedia processing solutions (Fig. 3) . The details of processors that are categorized based on the above classification are presented in Section IV.
III. GENERAL-PURPOSE PROGRAMMABLE PROCESSORS
Details of the CISC- and RISC-based architectures follow.
A. RISC Architecture
The instruction set of the RISC class of processors is characterized by the most frequently used instructions for general-purpose computing. More complex or less frequently used instructions are implemented as sequences of the reduced instruction set. The new generation of processors such as the Sun UltraSPARC (termed superscalar) [21]-[23] issue up to four instructions simultaneously per cycle. This feature exploits instruction-level parallelism (ILP). To cater to the specific needs for processing multimedia data, the SPARC family of microprocessors implements the SPARC ISA version 9, a 64-bit ISA with a multimedia extension called VIS [24]. The VIS instructions are used for specialized pixel operations that can operate in parallel on 8-, 16-, or 32-bit integer values packed in a 64-bit floating point (FP) register. VIS also includes instructions for certain 3-D to 2-D conversions, edge processing, data alignment, pixel distance, packing, etc. Only 3% of the actual chip real estate is assigned to the graphics instructions. These instructions are executed by the FP/graphics unit (FGU) (Fig. 4).
The FGU is composed of the following five functional units (FUs):
1) FP divide/square root;
2) FP addition/subtraction/absolute value/negative;
3) FP multiplication;
4) graphics addition, align, merge, expand, and logical;
5) graphics multiply, pack, compare, and pdist (pixel distance).
Several other RISC-based designs have exploited the concept of subword parallelism to enhance media processing [25]-[27].
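As a rough illustration of the subword parallelism described above, the following Python sketch packs four 16-bit values into one 64-bit word and adds two such words lane by lane using ordinary integer operations. The function names (`pack16`, `unpack16`, `partitioned_add16`) are hypothetical and are not VIS mnemonics, and wraparound (modular) arithmetic per lane is an assumption made for simplicity.

```python
LANES = 4                      # four 16-bit lanes in one 64-bit word
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x8000_8000_8000_8000  # most significant bit of every lane
LOW_BITS = 0x7FFF_7FFF_7FFF_7FFF   # remaining 15 bits of every lane


def pack16(values):
    """Pack four 16-bit integers into one 64-bit word (lane 0 is lowest)."""
    if len(values) != LANES:
        raise ValueError("expected exactly four values")
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack16(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def partitioned_add16(a, b):
    """Add two packed words lane by lane, with wraparound in each lane.

    The low 15 bits of each lane are added first, so a carry can never
    spill into the neighbouring lane; the top bit of each lane is then
    restored with an XOR.
    """
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    return low_sum ^ ((a ^ b) & HIGH_BITS)


x = pack16([1, 0xFFFF, 300, 40000])
y = pack16([2, 1, 500, 30000])
result = unpack16(partitioned_add16(x, y))
```

A partitioned hardware adder achieves the same effect by cutting the carry chain at lane boundaries, which is why one instruction can process several pixels at once.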
Other leading vendors and designers of RISC processors with support for media processing include ARM processors, PowerPC with AltiVec, and the PA-RISC 2.0 architecture [28]. Motorola's AltiVec technology enhances the performance of the PowerPC architecture through the addition of a 128-bit vector execution unit, which operates concurrently with the existing integer and FP units. This unit allows simultaneous execution of up to 16 operations in a single clock cycle. It includes a separate register file containing 32 entries, each 128 bits wide. These 128-bit-wide registers hold the data sources for the execution units. Registers are loaded and unloaded through vector load and vector store instructions that transfer the contents of a single 128-bit register from and to memory. Vectors can be 4, 8, or 16 elements long. The unit uses the SIMD feature for exploiting data-level parallelism. AltiVec technology supports:
• 16-way parallelism for 8-bit signed and unsigned integers and characters;
• 8-way parallelism for 16-bit signed and unsigned integers;
• 4-way parallelism for 32-bit signed and unsigned integers and IEEE FP numbers.
Each AltiVec instruction specifies up to three source operands and a single destination operand. All operands are vector registers, with the exception of the load and store instructions and a few instruction types that provide operands from immediate fields within the instruction. These instructions can be classified into the following major classes.
• Intra-element arithmetic operations.
• Intra-element nonarithmetic operations.
• Inter-element arithmetic operations.
• Inter-element nonarithmetic operations.
A combination of these features enables the AltiVec-enabled PowerPC processors to accelerate multimedia applications.
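The degrees of parallelism listed above follow directly from dividing the 128-bit register width by the element width. The sketch below works through that arithmetic and models, with plain Python lists, one lane-wise operation and one operation across lanes. Reading "intra-element" as lane-wise and "inter-element" as across lanes is an interpretation made here, and the function names are hypothetical, not AltiVec instructions.

```python
REGISTER_BITS = 128  # width of one AltiVec vector register, as stated above


def lane_count(element_bits, register_bits=REGISTER_BITS):
    """Number of elements of the given width that fit in one register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def intra_element_add(a, b, element_bits):
    """Lane-wise add: lane i of the result depends only on lane i of a and b.

    Wraparound arithmetic is assumed for simplicity.
    """
    lanes = lane_count(element_bits)
    if len(a) != lanes or len(b) != lanes:
        raise ValueError("operands must fill the register exactly")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


def inter_element_sum(a, element_bits):
    """Across-lane reduction: the result combines all lanes of one operand."""
    if len(a) != lane_count(element_bits):
        raise ValueError("operand must fill the register exactly")
    return sum(a)


ways = {bits: lane_count(bits) for bits in (8, 16, 32)}
bytes_sum = intra_element_add([250] * 16, [10] * 16, 8)
total = inter_element_sum([1, 2, 3, 4], 32)
```

Here `ways` reproduces the 16-, 8-, and 4-way figures quoted in the text.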
The ARM7500FE, a 32-bit RISC network computing multimedia processor based on a cached ARM7 32-bit core, has memory and I/O controllers, three DMA channels, and stereo audio ports. The on-chip video controller, which includes a color palette, can directly drive either a CRT or LCD display. The evolution from still-image to motion-based video has resulted in the design of several processors [29], [30] with instruction sets specifically tailored to accelerate MPEG-based video sequences.
B. CISC Architectures
In contrast to the reduced instruction set architectures, the CISC trend has always been toward more complicated and feature-rich ISAs. This has been mainly due to the introduction of high-level languages (HLLs) and the subsequent effort toward minimizing the semantic gap between the HLL constructs and the machine instruction set. Intel, AMD, and VIA are examples of manufacturers of CISC processors. Intel's Pentium family enhances the execution of media-rich applications through support for media-specific operations using the MMX and SSE [31], [32] technologies.
Since FP computation is at the heart of multimedia-rich operations such as 3-D geometry, speeding up FP computation is vital to enhancing overall 3-D performance. To provide enhanced performance in graphics applications, Intel's 32-bit processors, based on the IA-32 architecture, required an increase of 1.5-2 times the native FP performance. It is observed that 3-D applications can execute faster by differentiating between data used repeatedly and streaming data (data used only once and then discarded). The Pentium III's new FP extension lets programmers designate data as streaming and provides instructions that handle this data efficiently.
Intel's Pentium 4 processor [33] has an FP execution cluster which executes the FP, MMX, SSE, and SSE2 instructions and utilizes the following features.
• These instructions have operands ranging from 64 to 128 bits in width.
• Many FP/multimedia applications have a fairly balanced set of multiplies and adds. The FP adder can execute one extended-precision (EP) addition, one double-precision (DP) addition, or two single-precision (SP) additions every clock cycle. This gives a peak of 6 GFLOPS for SP or 3 GFLOPS for DP FP at 1.5 GHz.
• Many multimedia applications interleave adds, multiplies, and pack/unpack/shuffle operations. For integer SIMD operations, which are the 64-bit-wide MMX or 128-bit-wide SSE2 instructions, there are three execution units that run in parallel. The SIMD integer ALU execution hardware can process 64 SIMD integer bits per clock cycle.
• A separate shuffle/unpack execution unit can also process 64 SIMD integer bits per clock cycle. MMX/SSE2 SIMD integer multiply instructions use the FP multiply hardware mentioned above to also do a 128-bit packed integer multiply μop every two clock cycles.
• The FP divider executes all divide, square root, and remainder micro-operations (μops). It is based on a double-pumped SRT radix-2 algorithm, producing two bits of quotient (or square root) every clock cycle.
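The peak figures quoted in the list above can be checked with simple arithmetic. The sketch below assumes, beyond what is stated in the text, that the FP multiplier sustains the same per-cycle rate as the adder (two SP or one DP operation per clock), so that adds and multiplies together contribute four SP or two DP operations per cycle. That assumption and the helper name `peak_gflops` are illustrative only.

```python
def peak_gflops(clock_mhz, flops_per_cycle):
    """Peak throughput in GFLOPS for a given clock and per-cycle FP rate."""
    return clock_mhz * flops_per_cycle / 1000


CLOCK_MHZ = 1500  # 1.5 GHz, as quoted in the text

# Assumption: adder and multiplier each deliver 2 SP (or 1 DP) results per clock.
SP_FLOPS_PER_CYCLE = 2 + 2
DP_FLOPS_PER_CYCLE = 1 + 1

sp_peak = peak_gflops(CLOCK_MHZ, SP_FLOPS_PER_CYCLE)
dp_peak = peak_gflops(CLOCK_MHZ, DP_FLOPS_PER_CYCLE)
```

Under these assumptions the sketch reproduces the 6 GFLOPS (SP) and 3 GFLOPS (DP) peaks. Both are upper bounds that presume a perfectly balanced stream of adds and multiplies.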
Achieving significantly higher FP and multimedia performance requires much more than just fast execution units. It requires a balanced set of capabilities that work together. Multimedia algorithms often require access to data in off-chip memory devices, and these frequent data fetches result in long latencies. Deep buffering helps ease this problem by exploiting the reuse of stored data in subsequent instructions. Since the operations are repetitive in nature, certain features have been incorporated to minimize the effects of latency.
• The deep buffering of the Pentium 4 processor (126 μops and 48 loads in flight) allows the machine to examine large sections of the program to determine the dependencies.
• The out-of-order-execution hardware often unrolls the inner execution loop of these programs numerous times in its execution window. This dynamic unrolling allows the Pentium 4 processor to overlap the long-latency FP/SSE and memory instructions.
Another CISC processor vendor, AMD, has recently introduced the 3DNow! technology, which enhances the media performance of the Athlon and Duron families of processors.
The AMD-K6-2 microprocessor is the first implementation of AMD 3DNow!. 3DNow! is a set of instructions designed to ease the traditional processing bottlenecks for FP-intensive and multimedia applications. The following is a brief description of the technology.
The graphics pipeline consists of the following four stages.
1) Physics: The CPU performs FP-intensive physics calculations to create simulations of the real world and the objects in it.
2) Geometry: Next, the CPU transforms mathematical representations of objects into 3-D representations using FP-intensive 3-D geometry.
3) Setup: The CPU starts the process of creating the perspective required for a 3-D view and the graphics accelerator completes it.
4) Rendering: Finally, the graphics accelerator applies realistic textures to computer-generated objects, using per-pixel calculations of color, shadow, and position.
Some of the salient features of the instruction set, followed by those of the microarchitecture, are listed here.
Instruction set features:
• support for SIMD FP and integer operations;
• PREFETCH instruction to eliminate extra data retrieval time;
• fast entry/exit multimedia state (FEMMS) instruction to reduce switching time between MMX and x87 code.
Processor microarchitecture features:
• pipelined dual-execution resources permitting execution of up to two 3DNow! instructions per clock cycle and four FP calculations (add, subtract, multiply) per clock (enables potential peak performance of 1.2 gigaflops at 300 MHz versus potential peak performance of 0.3 gigaflops for 300-MHz processors without 3DNow! technology).
• common FP stack, which eliminates task switching between AMD 3DNow! and MMX operations.
[34] to permit future expansion by providing sufficient architectural capacity, a full 64-bit address space, large directly accessible register files, enough instruction bits to communicate information from the compiler to the hardware and the ability to express arbitrarily large amounts of ILP. IA-64 realizes parallel execution semantics in the form of instruction groups. The compiler creates instruction groups using optimization techniques such as software pipelining and loop unrolling, so that all instructions in an instruction group can be safely executed in parallel.
While instruction groups allow independent computational instructions to be placed together, expressing parallelism in computation related to program control flow requires additional support. Control parallelism is present when a program needs to select one of several possible branch targets, each of which might be controlled by a different conditional expression. Such cases would normally need a sequence of individual conditions and branches. IA-64 provides multiway branches that allow several normal branches to be grouped together and executed in a single instruction group. The use of parallel compares and multiway branches can substantially decrease the critical path related to control flow computation and branching.
While the compiler can handle some activities, hardware better manages many other areas including branch prediction, instruction caching, data caching and prefetching. For these cases, IA-64 improves on standard instruction sets by providing an extensive set of hints that the compiler uses to tell the hardware about likely branch behavior (taken or not taken, amount to prefetch at branch target) and memory operations (in which level of the memory hierarchy to cache data). The hardware can then manage these resources more effectively, using a combination of compiler-provided information and histories of runtime behavior. To help reduce the effect of branch mispredictions, IA-64 provides predication, a feature that allows the compiler to execute instructions from multiple conditional paths at the same time and to eliminate the branches that could have caused mispredictions.
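To make the idea of predication concrete, the following sketch contrasts a conventional branch with an if-converted form in which both conditional paths are computed and a predicate decides which result is kept. It is only a software analogy written for this discussion: the function names are hypothetical, and real IA-64 predication works on predicate registers attached to individual instructions rather than on Python expressions.

```python
def abs_diff_branching(a, b):
    """Conventional form: one conditional branch selects a single path."""
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    """If-converted form: both paths execute, the predicate picks the result.

    No control-flow branch remains, so there is nothing to mispredict.
    """
    p = a > b                 # the compare produces a predicate (True/False)
    on_true = a - b           # computed as if guarded by p
    on_false = b - a          # computed as if guarded by not p
    # Arithmetic select: exactly one of the two terms is kept.
    return p * on_true + (not p) * on_false


pairs = [(5, 3), (3, 5), (7, 7), (-4, 9)]
branching = [abs_diff_branching(a, b) for a, b in pairs]
predicated = [abs_diff_predicated(a, b) for a, b in pairs]
```

The trade-off is that the predicated form always spends work on both paths, which pays off mainly when the paths are short and the branch is hard to predict.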
IA-64 provides a class of load instructions called speculative loads, which can safely be scheduled before one or more prior branches. In the block where the programmer originally placed the load, the compiler schedules a speculation check. In IA-64, this process is referred to as control speculation. IA-64 also has instructions that allow the compiler to schedule a load before one or more prior stores, even when the compiler is not sure if the references overlap. This is called data speculation.
The IA-64 FP architecture is a unique combination of features targeted at graphical and scientific applications. It supports both high computation throughput and high-precision formats. The inclusion of integer and logical operations allows extra flexibility to manipulate FP numbers and use the FP FUs for complex integer operations. The primary computation workhorse of the FP architecture is the FMAC instruction, which computes a multiply accumulate operation with a single rounding. Traditional FP add and subtract operations are variants of this general instruction.
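The effect of computing a multiply-accumulate with a single rounding can be shown numerically. The sketch below emulates a fused operation by evaluating a*b + c exactly with rational arithmetic and rounding once to a double, then compares it with the usual two-step evaluation, which rounds after the multiply and again after the add. The helper names are hypothetical and the emulation is for illustration only; it says nothing about how the hardware unit is built.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a*b + c with one rounding: exact arithmetic, then one float()."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Ordinary evaluation: the product is rounded before the addition."""
    return a * b + c


# Operands chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fused_multiply_add(a, b, c)        # keeps the small residual
separate = separate_multiply_add(a, b, c)  # residual lost in the first rounding

# Add and subtract expressed as variants of the same operation.
fp_add = fused_multiply_add(0.5, 1.0, 0.25)
fp_sub = fused_multiply_add(0.5, 1.0, -0.25)
```

In this example the two-step version returns zero, while the single-rounding version preserves the residual of -2^-60. This kind of accuracy is what makes software sequences for divide and square root practical.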
Divide and square root are supported using a sequence of FMAC instructions that produce correctly rounded results. Using primitives for divide and square root simplifies the hardware and allows overlapping with other operations. For example, a group of divides can be software pipelined to provide much higher throughput than a dedicated nonpipelined divider.
C. Comparison of General-Purpose Processors
Table I compares the most salient features of some significant state-of-the-art general-purpose microprocessors with specialized media-extended instruction sets. The key features are now summarized.
• The processors issue and execute two or more multimedia instructions per cycle through the superscalar (SS) architectural feature. Pipelining is also adopted to increase the efficiency of instruction issue and execution.
• These processors have an out-of-order issue (OOI) control mechanism, which is supported by clock speeds in excess of 700 MHz. The out-of-order control unit requires additional FUs. These components occupy a large portion of the silicon area and contribute to the power dissipation (ratings ranging from 12 to 74 W).
• These processors employ a dynamic branch prediction technique, which implies the need for large primary and secondary level caches. This occupies valuable silicon area and proves disadvantageous during cache misses due to real time constraints that need to be met for media applications.
• The cache mechanism is designed to use 1-D locality of consecutive addresses, but media rich applications require multidimensional locality of accesses. To increase the probability of cache hits, these processors are equipped with high instruction depths (up to 72).
• The word lengths of the next-generation microprocessors, such as the IA-64 based Itanium, have increased from 32 to 64 bits. But the word lengths needed for multimedia processing applications are 8, 16, or 24 bits, much shorter than the data paths of these processors. This drawback can be overcome to some extent by specialized pack instructions, but is not an optimized solution.
• They contain in excess of 15 million transistors to provide for the large area-consuming features mentioned above.
In summary, conventional general-purpose programmable processors lack the judicious combination of sufficient processing muscle, low power consumption, and low real estate realizations to support the ever increasing demand for mobile multimedia processing. Hence, there arose a need for new high-performance processors for multimedia applications. These special-purpose programmable processors evolved from the conventional DSP [35] architectures to the state-of-the-art media processors.
IV. SPECIAL-PURPOSE PROGRAMMABLE PROCESSORS
The special-purpose programmable processors exploit the redundancies involved in media processing algorithms through the use of multiple FP and media-specific execution units. They also extract parallelism that exists at various levels, viz. data, instruction/subword, and thread.
Processors that extract parallelism at the data level are encompassed under the umbrella of SIMD-based architectures.
A. SIMD-Based Architectures
Typical operations such as SAD require the same operation to be performed on multiple pixels or data units with no dependencies involved. Therefore, the same instruction can be used to perform this task simultaneously on all the data units. Processors exploiting this feature include [18] and [36]. The MAP1000A from Equator Technologies has a 64-bit partitioned unit (called a graphics unit for partitioned arithmetic) and a 128-bit partitioned unit (called a media unit for SAD and inner product arithmetic), which exploit data-level parallelism. The MPEG-4 processor proposed by [36] has a video unit containing two DCT units (DCT/Q), two ME units (coarse and fine), and an MC unit. The audio unit exploits this feature through an inverse modified DCT unit.
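A minimal sketch of the SAD computation shows why it maps well onto SIMD hardware: each absolute difference depends only on its own pair of pixels, so a group of them can be produced by one instruction. In the Python below, each slice of `lanes` pixels stands in for one such SIMD operation. The lane width of eight and the function names are assumptions for illustration, not properties of any processor discussed here.

```python
def sad_row(row_a, row_b, lanes=8):
    """Sum of absolute differences of two pixel rows, processed in groups.

    Each group of `lanes` pixels models the work of one SIMD instruction:
    the same operation applied to independent data elements.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    total = 0
    for start in range(0, len(row_a), lanes):
        group_a = row_a[start:start + lanes]
        group_b = row_b[start:start + lanes]
        diffs = [abs(p - q) for p, q in zip(group_a, group_b)]  # lane-wise step
        total += sum(diffs)                                     # reduction step
    return total


def sad_block(block_a, block_b, lanes=8):
    """SAD of two equally sized pixel blocks, accumulated row by row."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of rows")
    return sum(sad_row(ra, rb, lanes) for ra, rb in zip(block_a, block_b))


current = [[10, 20], [30, 40]]
reference = [[12, 18], [30, 45]]
block_sad = sad_block(current, reference)
```

In motion estimation this value is evaluated for many candidate positions of the reference block, and the candidate with the smallest SAD is selected, which is why the operation dominates the workload.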
The single-chip video audio signal processor (VASP) discussed by [37] consists of a video signal processing block. The video signal processing block contains a DCT/Q unit and two ME units designed to implement hardwired solutions for pixel I/O, full-pixel motion estimation, half-pixel motion estimation, discrete cosine transform, and quantization. The D30V/MPEG [14] multimedia processor, which supports real-time MPEG-2 decoding, has a two-way SIMD multimedia core.
The 200-MHz embedded RISC processor for multimedia applications by [38] has a 64-bit SIMD based function unit that contributes a total of five multiply-adders in the processor. The unit is pipelined to six stages. It has a multiply unit, add unit, shift unit, logical function unit, two data type converter units, and a 32-word 64-bit register file. The video multiprocessor (VMP) for image compression and decompression schemes of MPEG-2 proposed by [39] has a signal processor which is SIMD in nature and has eight ALUs (16 bits each) and eight MACs (16 bits each).
Reference [40] discusses the architecture of a multimedia signal processor with SIMD-type parallel-executing features, byte-aligned word-access features, and multi-instruction migrating features. The SH4 [41], the latest SH series microprocessor for multimedia applications from Hitachi, has architectural enhancements based on unique FP vector instructions that are more effective than the conventional SIMD architecture for 3-D graphics processing. Several other SIMD-based processors were proposed by [42], [43].
Parallelism can also be exploited at the instruction level. The most popular approaches are VLIW, SS, and super-pipelined architectures.
B. VLIW Processors
VLIW processors use a long instruction word that usually contains a fixed number of operations that are fetched, decoded, issued, and executed synchronously. All operations specified within a VLIW instruction must be independent of one another. Some key features of a VLIW [44] processor are:
• very long instruction word (128-1024 bits per instruction);
• each instruction consists of multiple independent parallel operations;
• each operation requires a statically known number of cycles to complete;
• a central controller that issues a long instruction word every cycle;
• multiple FUs connected through a global shared register file.
The compiler groups independent instructions that are executable in parallel, using optimization techniques such as software pipelining and loop unrolling, and schedules code blocks. Instruction parallelism and data movement in a VLIW processor are completely specified at compile time. Run-time resource scheduling and synchronization are therefore completely eliminated. Hence, VLIW processors cannot react to dynamic events such as cache misses.
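The compile-time grouping described above can be sketched as a small greedy scheduler that packs operations into fixed-width instruction words so that no operation shares a word with anything it depends on. This is a simplified illustration under stated assumptions (unit latency for every operation, a single issue-width limit, no register or functional-unit constraints), and the function and operation names are hypothetical. It does not represent the algorithm of any particular VLIW compiler.

```python
def schedule_vliw(ops, deps, issue_width):
    """Pack operations into long instruction words (bundles).

    ops: operation names in program order.
    deps: maps an operation to the set of operations it depends on.
    Returns a list of bundles; all operations in one bundle are independent.
    """
    done = set()             # operations issued in earlier bundles
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == issue_width:
                break
            # Ready only if every producer was issued in an EARLIER bundle.
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("unsatisfiable or cyclic dependencies")
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
        bundles.append(bundle)
    return bundles


ops = ["load_a", "load_b", "mul", "load_c", "add", "store"]
deps = {
    "mul": {"load_a", "load_b"},
    "add": {"mul", "load_c"},
    "store": {"add"},
}
wide = schedule_vliw(ops, deps, issue_width=4)
narrow = schedule_vliw(ops, deps, issue_width=2)
```

Because the schedule is fixed before execution, a long instruction word whose load misses in the cache cannot be reordered at run time, which is the limitation noted in the text.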
Media processors exploit ILP by integrating VLIW-based cores into their processing units. The Fujitsu FR500 embedded microprocessor, the first product in the FR-V line (Fujitsu's generic name for its VLIW architecture microprocessors) [15], offers four-way, variable-length VLIW instruction issue. Each instruction has a length of 32 bits. It supports an instruction set consisting of integer, SP FP, and media (fixed-point) instructions. It has two (nonpipelined) integer execution units, two (two-stage pipelined) FP execution units, and four (two-stage pipelined) media execution units.
The MAP1000A [18] VLIW mediaprocessor is a single-chip, programmable media processor that has a VLIW core with a four-way instruction issue. The instruction length is 34 bits (32 bits plus a 2-bit header), supporting an instruction set of integer, SP FP, and media (fixed-point) instructions. It has four (nonpipelined) integer execution units and the following FP execution units: 32-bit MAC-2 (two-stage pipelined), 32-bit divide/square root-2 (two-stage pipelined), a 64-bit partitioned unit (called a graphics unit, for partitioned arithmetic), and a 128-bit partitioned unit (called a media unit, for SAD and inner-product arithmetic).
M-PIRE [8] is a programmable MPEG-4 multimedia codec VLSI for mobile and stationary applications. It integrates a RISC core, two separate DSPs, a 64-bit dual-issue VLIW macroblock engine, and an autonomous I/O processor on a single chip to cope with the high flexibility and processing demands of the MPEG-4 standard. It supports real-time video and audio processing of MPEG-4 simple profile or ITU H.26x standards. The MacroBlock unit (part of the video unit) has a two-way VLIW instruction issue and an instruction length of 64 bits.
The D30V/MPEG multimedia processor [14] also has a two-way VLIW instruction issue. The TANGRAM VLSI coprocessor [19], intended as a building block for system-on-chip (SOC) designs for the versatile MPEG-4 multimedia standard, is yet another example. It is designed to perform the computation-intensive final step of MPEG-4 video decoding: compositing of scenes at the display. This includes warping and alpha blending of multiple full-screen video textures in real time. TANGRAM consists of a 16-bit RISC control processor with a VLIW issue and multiple powerful arithmetic units that perform rendering calculations directly in hardware.
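As a rough illustration of the alpha-blending step in scene compositing, the following sketch blends a foreground row over a background row. It assumes 8-bit samples and a per-pixel 8-bit alpha with rounded integer arithmetic; the function name `alpha_blend` and the data are hypothetical and are not taken from the TANGRAM design.

```python
def alpha_blend(fg, bg, alpha):
    """Blend two rows of 8-bit samples.

    out = (alpha * fg + (255 - alpha) * bg) / 255, rounded to nearest.
    alpha = 255 selects the foreground, alpha = 0 the background.
    """
    return [
        (a * f + (255 - a) * b + 127) // 255   # +127 rounds instead of truncating
        for f, b, a in zip(fg, bg, alpha)
    ]


foreground = [200, 200, 200]
background = [100, 100, 100]
alpha = [0, 255, 128]
blended = alpha_blend(foreground, background, alpha)
print(blended)
```

Each output sample needs two multiplications, an addition, and a division, repeated for every pixel of every texture in every frame, which suggests why such work is attractive to move into dedicated arithmetic units.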
The Philips TM1300, a member of the Trimedia family [45], [46] of processors (Fig. 5), contains a VLIW processor as well as a video and audio I/O subsystem. The processor has an instruction set that is optimized for processing audio, video, and graphics. It also includes SIMD multimedia operators for 8- and 16-bit signal data types, as well as a full complement of 32-bit IEEE-compatible FP operations. TM1300 is intended as a multistandard programmable video, audio, and graphics processor. It can be used either standalone or as an accelerator to a general-purpose processor.
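The idea behind SIMD operators on 8-bit data can be sketched as follows. The example treats a 32-bit word as four unsigned 8-bit lanes and adds them lane by lane with saturation. Saturating behavior and the name `packed_add_u8_sat` are assumptions made for illustration; this is not a description of the actual TM1300 instruction set.

```python
def packed_add_u8_sat(a, b):
    """Add four unsigned 8-bit lanes packed into 32-bit words.

    Each lane saturates at 255 rather than wrapping around, which is the
    usual choice for pixel data (an assumption here, not a TM1300 fact).
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        s = min(x + y, 0xFF)          # clamp instead of overflowing
        result |= s << shift
    return result


pixels = 0x10F08020                   # four packed 8-bit samples
delta = 0x20204040                    # brightness offsets, one per lane
out = packed_add_u8_sat(pixels, delta)
print(hex(out))
```

The loop here is sequential only because it is a software model; a SIMD functional unit would process all four lanes in a single operation.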
The key features of TM1300 are the following.
• A powerful, general-purpose VLIW processor core (the DSPCPU) that coordinates all on-chip activities. In addition to implementing the nontrivial parts of multimedia algorithms, this processor runs a small real-time operating system that is driven by interrupts from the other units.
• DMA-driven multimedia I/O units that operate independently and format data to increase the efficiency of software media processing.
• DMA-driven multimedia coprocessors that operate independently and in parallel with the DSPCPU to perform operations specific to important multimedia algorithms.
• A high-performance bus and memory system that provides communication between the TM1300's processing units.
• A flexible external bus interface.
The TM1 has been used in a prototype version of the Microsoft Talisman processor architecture [47], [48]. The latest member of the Trimedia family is the TM32, which issues five instructions per clock, targeting five of the CPU's 27 FUs. It operates at a clock frequency of 250 MHz.
C. Thread Control
Speculative thread control is yet another level of parallelism that can be exploited. This concept is exploited in Sun's MAJC architecture through space-time computing and vertical multithreading.
1) Space-time computing: Space-time computing in the MAJC [49] architecture is a technique that substantially improves performance and code execution time in many applications using Java technology. The multiprocessor-on-a-chip configuration in the MAJC architecture allows system-level parallelism on a processor cluster to be achieved by having speculative threads (future instruction streams) execute on separate processors. For example, if we have two processors on a chip (Fig. 6), then two threads, head and speculative, execute on separate processors. They operate in a different space (speculative heap) and in a different time (future time).
When programs have loops or method boundaries (as with Java), the MAJC architecture splits the program into instruction groups (threads) that are executed simultaneously on different processors, as shown in Fig. 6. The first instruction group runs on CPU1 and is called the "head thread." The second instruction group, executed on CPU2, is called the "speculative thread." The thread is called speculative because it executes the instruction group ahead of time. Thus, program instructions that would be executed in "future time" have been made "current." In the Java programming language, one can differentiate between stores to heap and stores to stack at the bytecode level. This is a large advantage for space-time computing: it helps compilers issue special instructions that accurately determine possible rollbacks, so fewer stores need to be monitored.
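The head/speculative arrangement described above can be modeled in software as a minimal sketch. It assumes that each thread touches the heap only through read and write callbacks, that speculative stores are buffered in a separate speculative heap, and that a rollback is needed whenever the head thread writes an address the speculative thread has already read. All names (`run_speculative`, `spec_heap`, the task functions) are hypothetical; the actual MAJC mechanism is implemented in hardware and is not specified at this level in the text.

```python
def run_speculative(heap, head_task, spec_task):
    """Run spec_task ahead of time; commit it only if it remains valid.

    Each task is a function f(read, write) that accesses the heap solely
    through the two callbacks. Returns True if the speculative work was
    committed, False if it was rolled back and re-executed.
    """
    spec_heap = {}        # speculative stores, kept apart from the real heap
    spec_reads = set()    # heap addresses the speculative thread depended on

    def spec_read(addr):
        if addr in spec_heap:          # value produced speculatively
            return spec_heap[addr]
        spec_reads.add(addr)           # only heap reads need monitoring
        return heap[addr]

    def spec_write(addr, value):
        spec_heap[addr] = value

    # Modeled sequentially; on the real machine both threads run concurrently.
    spec_task(spec_read, spec_write)

    head_writes = set()

    def head_write(addr, value):
        head_writes.add(addr)
        heap[addr] = value

    head_task(heap.__getitem__, head_write)

    if head_writes & spec_reads:
        # The head thread changed data the speculative thread had read:
        # discard the speculative heap and redo the work non-speculatively.
        spec_task(heap.__getitem__, heap.__setitem__)
        return False
    heap.update(spec_heap)             # commit the speculative heap
    return True


def head(read, write):
    write("c", read("a") + 1)

def independent(read, write):          # does not use the head thread's result
    write("d", read("b") * 10)

def dependent(read, write):            # reads "c", which the head thread writes
    write("d", read("c") * 10)


heap1 = {"a": 1, "b": 2, "c": 0, "d": 0}
committed = run_speculative(heap1, head, independent)

heap2 = {"a": 1, "b": 2, "c": 0, "d": 0}
kept = run_speculative(heap2, head, dependent)
print(committed, heap1, kept, heap2)
```

Local variables inside the task functions play the role of stack stores: they are private to a thread and never enter the conflict check, which mirrors the point that separating heap stores from stack stores reduces the number of stores to be monitored.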
2) Vertical Multithreading: Vertical multithreading is a technique in which a processor unit switches to another thread when the running thread incurs a cache miss, thereby reducing overall CPU idle time. It decreases the total CPU cycles required for program execution and increases throughput. The MAJC architecture substantially increases CPU utilization with vertical multithreading. A thread is a stream of instructions. In the simplest terms, vertical multithreading in the MAJC architecture allows the CPU to switch to a new instruction stream whenever there is a cache miss. This increase in CPU utilization, coupled with the instruction-level parallelism gained from VLIW, significantly improves overall throughput. For vertical multithreading to be effective, the load time from memory to the cache (upon a cache miss) should be much greater than the switch time between instruction streams. To enable fast context switching, the states of the threads need to be stored. The large register file in MAJC allows the architecture to maintain the state of four threads and to switch quickly between instruction streams. The MAJC architecture permits monitoring and trapping any register overlaps that may occur when storing the state of multiple threads. This is important to ensure accuracy in processing thread information.
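The condition that the miss latency should be much greater than the switch time can be checked with a small cycle-count model. The sketch below assumes that every thread computes for a fixed number of cycles and then misses in the cache, and that the CPU switches to a ready thread whenever one exists. The function name `cpu_utilization` and all cycle counts are hypothetical values chosen for illustration, not MAJC measurements.

```python
def cpu_utilization(n_threads, run_cycles, miss_latency, switch_cycles,
                    total_cycles=100_000):
    """Fraction of cycles spent on useful work under switch-on-miss."""
    ready_at = [0] * n_threads    # cycle at which each thread's miss resolves
    now = 0
    busy = 0
    last = None                   # thread that ran most recently
    while now < total_cycles:
        ready = [i for i in range(n_threads) if ready_at[i] <= now]
        if not ready:
            now = min(ready_at)   # every thread waits on memory: CPU idles
            continue
        # Pick the thread that has been ready the longest.
        i = min(ready, key=lambda k: ready_at[k])
        if last is not None and i != last:
            now += switch_cycles  # pay the context-switch cost
        now += run_cycles         # useful work until the next cache miss
        busy += run_cycles
        ready_at[i] = now + miss_latency
        last = i
    return busy / now


# Long miss latency, cheap switch: four threads hide most of the stall time.
single = cpu_utilization(1, run_cycles=20, miss_latency=80, switch_cycles=1)
multi = cpu_utilization(4, run_cycles=20, miss_latency=80, switch_cycles=1)

# Short miss latency, expensive switch: switching costs more than it saves.
single_bad = cpu_utilization(1, run_cycles=20, miss_latency=4, switch_cycles=10)
multi_bad = cpu_utilization(4, run_cycles=20, miss_latency=4, switch_cycles=10)
print(single, multi, single_bad, multi_bad)
```

Under these assumed numbers, four threads raise utilization from about 20% to roughly 80% when a miss costs 80 cycles and a switch costs one. When the switch is more expensive than the miss, the same policy lowers utilization, which is the reason for the requirement stated above.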
D. Comparison of Special-Purpose Programmable Processors
The advantage of realizing an algorithm directly on silicon is the speed of a direct implementation as compared to state-machine-driven execution through a programmable model. In the programmable scenario, a data path that has already been realized on chip can execute only a specific set of operations. Any algorithm that does not consist of those specific operations needs to be expressed in terms of those instructions. This consumes time and power.
• These processors do not employ complex control mechanisms such as out-of-order issue; hence, they consume less power than general-purpose microprocessors. Moreover, they operate at lower frequencies, which further reduces power consumption. Most of the programmable processors that utilize a truly programmable core have power consumptions beyond 4-5 W (as shown in Table II). Only those processors that use a programmable core for the purpose of minimally controlling the functional units on the chip consume less than 2-3 W.
• Similar to conventional DSPs, they use non-cache on-chip memories. Some of them, such as the MAP1000A, have features such as TLBs, which enable virtual memory addressing as in general-purpose microprocessors. This feature increases the addressable memory space but reduces the speed of access.
• Most of these processors have VLIW control rather than SS control. This provides higher performance with less control-circuit complexity.
• To enable higher arithmetic performance for media-specific applications, they employ the SIMD style of instruction execution. Here, the media processors have more function units than issue slots. They also have features to support operations on parallel-packed data, which exploits parallelism to a large extent in media processing. A few of them, such as the FR500 and MAJC, employ speculative thread control to optimize branch prediction.
• Processors like MAJC have registers that are data agnostic. This reduces the control mechanisms needed for separately handling the storage of floating-point and fixed-point data. With this feature, the processor lends itself naturally to scaling.
• Some of the processors offer ease of programming and debugging through high-level language support, but this requires heavy optimization of the VLIW code.
The above-mentioned features offer the advantages of low-power and area-time-optimized implementations. However, the design and debugging phases involved in designing such application-specific programmable processors tend to drive the development cost higher. Moreover, for mobile multimedia applications, it is desirable to have power consumption on the order of 1 W or less.
V. DEDICATED IMPLEMENTATIONS
The dedicated class of implementations both competes with and complements the programmable class of processors. In configurations where a central RISC controller coordinates the movement of data between various dedicated units, they play a complementary role. In configurations where they are used as standalone accelerators, they compete with the programmable class of processors. This class of processors can be classified as monolithic or distributed.
A. Monolithic
The C-cube DVxpress-MX25 codec (footnote 14) is a single-chip, multiformat codec with support for 25 Mbit/s DV formats (including DVCPRO, miniDV, and DVCam) and 4:2:2 MPEG-2. An ASIC for motion estimation and compensation, which can be used in conjunction with a fixed-point DSP chip for MPEG-4 decoding, has been presented in [50]. A 2-D DCT processor is detailed in [8]. The processor has about 50 K transistors and is capable of processing up to 400 Mpixels per second at a clock frequency of 600 MHz. This satisfies the requirements of real-time high-definition moving pictures in the MPEG-2 standard. Another DCT core processor has been proposed in [10].
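For readers unfamiliar with the computation that such a DCT core accelerates, the following sketch computes an 8 × 8 2-D DCT in software. It assumes the orthonormal DCT-II and the common row-column decomposition (1-D transforms on rows, then on columns); it is not the architecture of the cited processors, and the function names are illustrative.

```python
import math


def dct_1d(x):
    """Orthonormal 1-D DCT-II of a sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def dct_2d(block):
    """2-D DCT by row-column decomposition: transform rows, then columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]   # zip(*rows) yields columns
    return [list(r) for r in zip(*cols)]           # transpose back


flat = [[1.0] * 8 for _ in range(8)]               # constant 8 x 8 block
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

A constant block has all of its energy in the DC coefficient, which is a convenient sanity check. For scale, assuming a 1920 × 1080 frame at 30 frames/s (an assumed format, not one stated in the text), the luminance pixel rate is 1920 × 1080 × 30 ≈ 62.2 Mpixels/s.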
B. Distributed
A distributed implementation uses a cluster of dedicated unified functional units that are coordinated by a controller.
Reference [51] presents a chip-set for video display of multimedia information. Key aspects of the chip-set are high flexibility and programmability of multiwindow features with multiple Teletext (TXT) pages, Internet pages, and video processing of up to three live windows. The chip-set contains a micro-controller with peripherals featuring a pixel-based graphics (GFX) and telecommunication interface. The video processor is another chip containing a number of flexible coprocessors for horizontal and vertical scaling, sharpness enhancement, adaptive temporal noise reduction, blending of graphics, mixing of multiple video streams, and 100-Hz up-conversion.
Another example of a multiple-chip-based implementation is discussed in [52], where an array architecture for MPEG-4 image compositing has been presented. The coprocessor architecture works in parallel with an MPEG-4 video and audio decoder and performs computation- and bandwidth-intensive low-level tasks for image compositing. The processor consists of an SIMD array of 16 DSPs to reach the required processing power for real-time image warping, alpha-blending, and 3-D rendering tasks. The flexible architecture allows the adaptation of the processing resources to the specific needs of different tasks and applications. The processor has an object-oriented cache architecture with a 2-D virtual address space that allows concurrent and conflict-free access to shared data objects for all 16 DSPs.
14 [Online]. Available: http://www.c-cube.com/product_display.fcfm?ProdID=36
C. Key Observations
The complexity and variety of techniques and tools and the high computation, storage, and I/O bandwidths associated with multimedia processing pose challenges, particularly from the standpoints of scalability, resource utilization, and real-time implementation. For example, recently proposed compression standards such as MPEG-4 and JPEG 2000 offer high interactivity to the user, which translates to a dynamic change in the computing resources at both the encoder and decoder units. Since these standards target mobile applications, there is a need to exploit parallelism at all possible levels and also to provide sufficient software support for efficient product development and debugging.
Current solutions for media processing do not provide all the features necessary to obtain a low-cost, low-power, and flexible solution that caters to the needs of mobile media applications. Each class of processors exploits parallelism at only certain levels or provides only certain features for low power consumption. The main drawbacks of existing media processing approaches are summarized below.
• Recall from Section III-C and Section IV-D that the dedicated approach offers the advantages of high speed and low power. However, its design and debugging phases take a significant amount of time, which makes it cost prohibitive and unsuitable for low-cost mobile media applications.
• It can be seen from Tables I and II that, although the programmable architectures offer flexibility in implementation, their power dissipation varies from 1 to 74 W, which is expensive for mobile applications with power requirements below 500 mW.
• Most of the processor architectures do not offer the facility to exploit parallelism at the thread level, which is essential to cater to flexibility at higher levels of the application as in the case of MPEG-4.
• Only a few of the special-purpose programmable processors have high-level language programmability (which provides ease of programming and debugging), as most of them offer firmware-programming only.
• Current approaches toward building a reconfigurable processor are targeted toward general-purpose computing or a limited range of media-specific applications and are not specifically tuned for mobile multimedia applications.
The above drawbacks have led to the exploration of reconfigurable computing solutions for media processing.
VI. CONCLUSION
Multimedia processing is becoming increasingly important with a wide variety of applications ranging from multimedia cell phones to high-definition interactive television. Media processing involves the capture, storage, manipulation, and transmission of multimedia objects such as text, handwritten data, audio objects, still images, 2-D/3-D graphics, animation, and full-motion video. A number of implementation strategies have been proposed for processing multimedia data. These approaches can be broadly classified based on the evolution of processing architectures and the functionality of the processors. Based on the evolution of the processing architectures, existing solutions for media processing can be broadly classified as programmable processors, dedicated implementations, and reconfigurable processors. Programmable processors include general-purpose programmable processors, based on a complex instruction set or a reduced instruction set, and specialized programmable processors. General-purpose programmable processors, whose instruction sets were originally meant only for general-purpose applications, have now added media-specific extensions to their ISAs to enhance the performance of media-specific applications. Special-purpose programmable processors have evolved from the early DSP processors (meant only for audio processing) to the state-of-the-art video/graphics processors.
In order to provide media processing solutions to different consumer markets, designers have combined some of the classical features from both the functional and evolution-based classifications, resulting in many hybrid solutions. We have proposed a hybrid approach to classifying existing media processing solutions based on certain influential factors such as real-time constraints, memory subsystems, etc. This classification retains the categorization of general-purpose programmable processing architectures, dedicated functional units, and the class of reconfigurable processors. However, it categorizes the special-purpose programmable processors from the functionality standpoint rather than the evolution standpoint. Based on functionality, they have been classified into SIMD, VLIW/SS, and speculative control architectures. We have also performed a detailed complexity analysis [54], [67] of the recent multimedia standard (MPEG-4), which has shown the potential for reconfigurable computing that adapts the underlying hardware dynamically in response to changes in the input data or processing environment. We, therefore, conclude that there is a need to explore the domain of reconfigurable computing to satisfy the computing needs of mobile media applications.
S. Panchanathan

