ABSTRACT This paper describes the mapping of a two-dimensional inverse discrete cosine transform (2-D IDCT) onto a word-level reconfigurable Montium® processor. It shows that the IDCT can be mapped onto the Montium tile processor (TP) with reasonable effort and presents performance numbers in terms of energy consumption, speed and silicon costs. The Montium results are compared with IDCT implementations on three other architectures: TI DSP, ASIC and ARM.
is the most computationally intensive [1]. For JPEG and MPEG-4 decoding a 2-D 8x8 IDCT is used.
Being able to map existing well-known algorithms to a new architecture is vital. This 2-D IDCT mapping exercise gives valuable information about:
* ease of mapping: how much effort is required to map algorithms to the Montium?
* insight into the architecture: how suitable is the architecture for the mapping of specific algorithms?
* performance of the architecture: how many cycles are required to execute the algorithm and what is the power consumption?
MONTIUM TP CORE
The Montium TP is a 16-bit word-level reconfigurable architecture that obtains significantly lower energy consumption than DSPs for fixed-point digital signal processing algorithms. The Montium TP targets computationally intensive algorithm kernels that are dominant in both power consumption and execution time. In contrast to a conventional DSP, the Montium TP does not have a fixed instruction set, but is configured with the functionality required by the algorithm at hand. In particular, the Montium TP does not have to fetch instructions and, hence, does not suffer from the Von Neumann bottleneck. Once configured, the Montium TP resembles an ASIC more than a DSP. The Montium TP can be reconfigured almost instantly, as the configuration binaries are very small: a typical configuration is less than 1 KB and reconfiguration typically takes less than 5 µs. The Montium TP has a low silicon cost, as the core is very small. For instance, the silicon area of a single Montium TP with 10 KB of embedded SRAM is 2.4 mm² in 0.13 µm CMOS technology. The power consumption in this technology is approximately 500 µW/MHz (including all memory accesses). A Montium TP consists of 5 identical ALUs (see Figure 1) to exploit spatial concurrency in order to enhance performance. See [2] for a detailed architecture description of the Montium TP.
we need another transpose operation (rows to columns) requiring another 16 clock cycles. Therefore, the whole 2-D 8x8 IDCT requires 96 clock cycles on the Montium (= 1.5 cycles per input sample).
Mapping Effort
Studying the IDCT and selecting the most suitable algorithm for implementation required the most effort and took about 2.5 weeks. The mapping of the 2-D 8x8 IDCT to the Montium processor took about 1.5 weeks. The mapping was performed by an MSc student who had no substantial prior knowledge of IDCTs or the Montium architecture. Afterwards we spent another few days on optimization. We expect that further optimization is possible by pipelining the transform operations.
b) An ASIC implementation has extreme characteristics. It is the best choice of all architectures in terms of performance and energy consumption. However, it is the worst choice of all architectures in terms of flexibility, non-recurring costs and time-to-market. A benchmark of the Montium versus an ASIC gives an idea of how close the Montium comes to the best achievable result. In other words, what is the price of the flexibility?
c) Finally, the Montium is benchmarked against a Texas Instruments DSP. This represents one of today's most likely design choices and shows how the reconfigurable approach compares to a conventional DSP solution.
The benchmarks only consider computational power and do not consider communication. Much of the communication latency can be hidden by overlapping communication time and computing time (also referred to as "streaming" communication).
This section explains the results that are presented in Table 2. We benchmark on two criteria: energy and performance. The latter is normalized by chip area to express the silicon costs. These benchmarks are depicted in the last two rows of Table 2.
Xi is a temporary intermediate result, f and F refer to the variables in Eq. 1, and C is a constant for the cosine expression in Eq. 1.
fair comparison, the power figures for each architecture are normalized to 0.13 µm technology, using a nominal voltage of 1.2 V. Finally, the energy consumption per IDCT is computed by multiplying the number of clock cycles required for the execution of a 2-D 8x8 IDCT by the energy consumption per clock cycle.
2) To benchmark the silicon costs, we first determine the number of 2-D 8x8 IDCTs that can be executed per second. However, doubling the chip area will evidently increase the performance. Therefore, we normalize this number to mm² chip area. Thus, the measure is the number of 2-D 8x8 IDCTs that can be executed per second per mm² chip area. There is a strong correlation between the chip area and the chip cost price. Therefore, this is a good measure for the production cost effectiveness of an architecture for the IDCT. Note that non-recurring design costs, which can be substantial, are not included in this measure. Note also that we used a Montium TP with 10 KB memory. This is a lot more memory than required for the 2-D 8x8 IDCT (current memory utilization was below 3%). This means that if the Montium TP memory capacity is tailored to the 2-D 8x8 IDCT, the area will be much smaller.
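The two measures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cycle count, the approximate 500 µW/MHz figure and the 2.4 mm² area are taken from the text, whereas the clock frequency MONTIUM_CLOCK_HZ is a hypothetical placeholder and the function names are ours. The results are therefore not necessarily the values reported in Table 2.

```python
# Sketch of the two benchmark measures: energy per IDCT and
# IDCTs per second per mm2. Function names are illustrative only.

def energy_per_idct_nj(cycles_per_idct, energy_per_cycle_nj):
    """Energy per 2-D 8x8 IDCT = clock cycles per IDCT * energy per cycle."""
    return cycles_per_idct * energy_per_cycle_nj


def idcts_per_second_per_mm2(clock_hz, cycles_per_idct, area_mm2):
    """Throughput (IDCTs per second) normalized to chip area."""
    return clock_hz / cycles_per_idct / area_mm2


MONTIUM_CYCLES = 96          # cycles per 2-D 8x8 IDCT (from the text)
MONTIUM_NJ_PER_CYCLE = 0.5   # approx. 500 uW/MHz equals 500 pJ per cycle
MONTIUM_AREA_MM2 = 2.4       # with 10 KB SRAM in 0.13 um CMOS (from the text)
MONTIUM_CLOCK_HZ = 100e6     # HYPOTHETICAL clock, not given in the text

montium_energy = energy_per_idct_nj(MONTIUM_CYCLES, MONTIUM_NJ_PER_CYCLE)
montium_throughput = idcts_per_second_per_mm2(
    MONTIUM_CLOCK_HZ, MONTIUM_CYCLES, MONTIUM_AREA_MM2
)

print(f"energy per IDCT: {montium_energy:.1f} nJ")
print(f"IDCTs/s/mm2 at the assumed clock: {montium_throughput:.0f}")
```

Because the area appears in the denominator, tailoring the memory to the IDCT (and thus shrinking the area) would raise the second measure proportionally.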
Dedicated 2-D 8x8 IDCT ASIC
To compare the performance of the Montium with a dedicated IDCT application-specific integrated circuit (ASIC), we looked for a state-of-the-art reference implementation in the literature. In [1] an ASIC implementation of an IDCT is presented, which is identical to our IDCT implementation. This ASIC is implemented in TSMC 0.18 µm technology and has a power consumption of 634.5 mW at the maximum frequency of 154 MHz. The Montium power estimates are made for 0.13 µm technology. According to [6] it is possible to estimate the energy consumption for a smaller technology: the dynamic power consumption is linearly related to the total capacitance and the frequency, and quadratically related to the voltage. With reduction from 0.18 µm to 0.13 … [1].
The area of the ASIC is 12.17 mm² in 0.18 µm technology [1]. In TSMC technology, the gate density is 100 and 200 kgates per mm² for 0.18 µm and 0.13 µm technology respectively [7, 8]. After normalization to 0.13 µm technology, the area of the ASIC becomes about 12.17 · 100/200 = 6.09 mm².
According to [10], the power consumption of the TMS320C6454-720 is 1.18 W at 1.2 V with 60% utilization. Note that we only consider the power consumption of the internal logic and not the total power consumption, which includes the I/O (memory access). Assuming a linear relation [5] between power consumption and utilization, we expect 1.18 / 0.6 = 1.97 W at 1.2 V for 100% utilization. The TMS320C6454-720 is produced in 0.09 µm technology [11]. We use the same method as in the previous subsection to normalize to 0.13 µm technology for a fair comparison. This results in an increase of power consumption by a factor of (1.2/1.2)² · 0.13/0.09 ≈ 1.44.
TI provides a library with imaging functions. We expect that these functions are highly optimized. According to [12], the number of clock cycles for n 2-D 8x8 IDCTs is 72n + 63. We assume that it is realistic to execute 6 IDCTs sequentially, as MPEG-4 uses 6 IDCTs per macroblock.
So, on average (72 · 6 + 63) / 6 = 82.5 clock cycles per 2-D 8x8 IDCT are used. The estimated energy consumption for a 2-D 8x8 IDCT in 0.13 µm is 3.95 · 82.5 ≈ 325 nJ.
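The chain of estimates for the DSP can be retraced with a short Python sketch. It assumes that dynamic power scales linearly with feature size (as a proxy for capacitance) and quadratically with voltage, and that the DSP runs at 720 MHz; the clock frequency is our assumption, read from the "-720" suffix of the part name, and the function names are illustrative. All other numbers come from the text.

```python
# Sketch of the DSP energy estimate, normalized to 0.13 um at 1.2 V.

POWER_AT_60_PCT_W = 1.18   # internal logic power at 60% utilization
UTILIZATION = 0.60
CLOCK_HZ = 720e6           # ASSUMPTION: taken from the "-720" part suffix
FEATURE_FROM_UM = 0.09     # technology of the DSP
FEATURE_TO_UM = 0.13       # technology used for the comparison
VOLTAGE_FROM = 1.2
VOLTAGE_TO = 1.2


def full_utilization_power(power_w, utilization):
    """Linear extrapolation of power to 100% utilization."""
    return power_w / utilization


def technology_factor(v_from, v_to, feature_from, feature_to):
    """Quadratic in voltage, linear in feature size (assumed scaling)."""
    return (v_to / v_from) ** 2 * (feature_to / feature_from)


def cycles_per_idct(n):
    """Average cycles per IDCT when n IDCTs cost 72n + 63 cycles."""
    return (72 * n + 63) / n


power_100 = full_utilization_power(POWER_AT_60_PCT_W, UTILIZATION)
factor = technology_factor(VOLTAGE_FROM, VOLTAGE_TO,
                           FEATURE_FROM_UM, FEATURE_TO_UM)
nj_per_cycle = power_100 * factor / CLOCK_HZ * 1e9
cycles = cycles_per_idct(6)
energy_nj = nj_per_cycle * cycles

print(f"power at 100% utilization: {power_100:.2f} W")
print(f"normalization factor: {factor:.2f}")
print(f"energy per cycle: {nj_per_cycle:.2f} nJ")
print(f"cycles per IDCT: {cycles}")
print(f"energy per IDCT: {energy_nj:.1f} nJ")
```

Under these assumptions the sketch arrives at roughly 3.95 nJ per cycle and about 325 nJ per IDCT, in line with the figures quoted in the text.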
TI does not disclose the area of the TMS320C6454-720. Therefore, we have to estimate the area. At the International Electron Devices Meeting in fall 1999, TI presented a roadmap [13], which reveals that a TMS320 DSP core will contain ca. 100M transistors in 2005. As the number of transistors follows Moore's law, this prediction should be quite accurate despite the long forecast period. We assume that the TMS320C6454-720 has about 100M transistors. This is a conservative estimate, as this number was predicted for 2005, while this DSP was launched in 2006.
To know the number of transistors (T) per mm², we investigated 0.13 µm TSMC technology parameters [14]. We distinguish between memory and logic gates, because the density is quite different. For SRAM memory the cell size is 2.43-2.14 µm² (6 transistors). This means a density of 6 / 2.43 µm² ≈ 2.46 MT/mm². TSMC gate density is 219k gates/mm². With 4 transistors per gate, the transistor density is 219k · 4 ≈ 0.88 MT/mm². We compared the densities to UMC technology parameters and the numbers are quite similar.
We now make an area estimation of the TMS320C6454-720 using 0.13 µm TSMC technology parameters. This DSP contains 8 Mbit SRAM cache requiring 48M transistors, which is equivalent to an area of 20 mm². The remaining 52M transistors require 52M / 0.88 MT/mm² ≈ 59 mm².
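The area estimate can be reproduced with the following Python sketch. It uses only the densities and transistor counts given in the text; 8 Mbit is taken as 8e6 bits so that the memory matches the 48M transistors quoted above, and the variable names are ours. The summed total is simply the sum of the two partial estimates and is not a figure quoted from the text.

```python
# Sketch of the DSP area estimate in 0.13 um TSMC technology.

UM2_PER_MM2 = 1e6                 # 1 mm2 = 1e6 um2

SRAM_CELL_AREA_UM2 = 2.43         # area of one 6-transistor SRAM cell
TRANSISTORS_PER_SRAM_CELL = 6
GATES_PER_MM2 = 219e3             # logic gate density
TRANSISTORS_PER_GATE = 4

TOTAL_TRANSISTORS = 100e6         # assumed transistor count of the DSP
SRAM_BITS = 8e6                   # 8 Mbit cache, taken as 8e6 bits

# Transistor densities in transistors per mm2.
sram_density = TRANSISTORS_PER_SRAM_CELL / SRAM_CELL_AREA_UM2 * UM2_PER_MM2
logic_density = GATES_PER_MM2 * TRANSISTORS_PER_GATE

# Split the transistor budget into memory and logic.
sram_transistors = SRAM_BITS * TRANSISTORS_PER_SRAM_CELL
logic_transistors = TOTAL_TRANSISTORS - sram_transistors

sram_area = sram_transistors / sram_density
logic_area = logic_transistors / logic_density
total_area = sram_area + logic_area

print(f"SRAM density:  {sram_density / 1e6:.2f} MT/mm2")
print(f"logic density: {logic_density / 1e6:.2f} MT/mm2")
print(f"SRAM area:  {sram_area:.1f} mm2")
print(f"logic area: {logic_area:.1f} mm2")
print(f"total area: {total_area:.1f} mm2")
```

The unrounded intermediate values differ slightly from the rounded figures in the text (about 19.4 mm² instead of 20 mm² for the memory), which does not change the order of magnitude of the estimate.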
CONCLUSION
The benchmarks in this paper provide important information to make a fair trade-off between different architectures.
The Montium Tile Processor (TP) offers much more flexibility than an ASIC, while being much more energy-efficient than a conventional DSP. The Montium is nearly as energy-efficient as an ASIC but uses more silicon area. However, thanks to its flexibility the Montium area can be reused for different functions by means of time-multiplexing, while an ASIC is restricted to one dedicated algorithm. Therefore, the Montium is an attractive alternative to an ASIC due to the offered flexibility, the fast time to market and the lower costs. The Montium outperforms a conventional DSP solution both in terms of energy efficiency and in silicon costs, expressed in number of 2-D 8x8 IDCTs per second per mm² area. As expected, an ARM is not an efficient solution for the implementation of an IDCT. The mapping of the 2-D 8x8 IDCT did not reveal any shortcomings of the Montium architecture. These kinds of mappings give valuable insight into architectural improvements. The reconfigurable Montium architecture is mature and provides a good balance between efficiency and flexibility.
Considering that the person who did the mapping had no prior knowledge of either the algorithm or the architecture, we can conclude that mapping kernels to the Montium TP can be done with reasonable effort (in this example the coding took about eight days). Use of the Montium TP provides a much faster time to market compared to the use of a dedicated ASIC, while being cheaper because the Montium IP can be reused for different applications.
