The module-wise dynamic voltage and frequency scaling (MDVFS) scheme is applied to a single-chip H.264/MPEG-4 audio/visual codec LSI. The power consumption of the target module, whose supply voltage and frequency are controlled, is reduced by 40% in comparison with operation without voltage or frequency scaling. The chip consumes 63 mW when decoding QVGA H.264 video at 15 fps and MPEG-4 AAC LC audio simultaneously. The LSI keeps operating continuously even during the voltage transition of the target module, thanks to the newly developed dynamic de-skewing system (DDS), which monitors and controls the clock edge of the target module.
key words: dynamic voltage/frequency scaling, system LSI, H.264, A/V codec LSI, multimedia chip
Introduction
Digital convergence in mobile phones has been advancing rapidly. Digital broadcasting services such as ISDB-T and DVB-H require H.264 decoding capability, while video telephony and video camcorder applications require MPEG-4 video encoding and decoding capability, and a wide variety of video sizes must be supported. These applications demand from LSIs not only higher performance but also a wider range of operating performance, while for mobile application LSIs, reducing total energy consumption is always critical for longer battery life. Dynamic Voltage and Frequency Scaling (DVFS), which reduces the total energy of a target application by dynamically supplying the processor with the lowest voltage that still ensures the required performance, has been proposed and implemented [1]-[6]. In these implementations the supply voltage of the whole chip is controlled, that is, they are Chip-wise DVFS. This approach is sufficient for LSIs based on a single-core processor. However, recent multimedia LSIs are composed of multiple processing modules. Introducing Module-wise DVFS (MDVFS), in which the optimal voltage and clock frequency are supplied independently to each module, therefore gives more room to optimize performance and lowers energy consumption compared with Chip-wise DVFS. This paper describes the module-wise dynamic voltage/frequency control (MDVFS) adopted in an H.264/MPEG-4 Audio/Visual Codec LSI [7].
To minimize the overhead of MDVFS, continuous operation during the transition of the supply voltage to the target module is realized by the newly introduced dynamic de-skewing system (DDS) and an on-chip voltage regulator with slew rate control. Section 2 describes the chip architecture and gives an overview of the applied low-power techniques. Section 3 describes module-wise voltage and frequency scaling, followed by its implementation in Sect. 4. Section 5 presents the measurement results.
Architecture of the H.264/MPEG-4 Audio/Visual Codec LSI
The target application of this LSI is H.264 and MPEG-4 audio and video processing [7]. Figure 1 shows module and that of embedded DRAM are separated from other modules. The low-power approaches for this chip are: (1) optimal software and hardware partitioning, enabled by the configurability of the MeP architecture [8] and dedicated hardware accelerators; (2) an event-driven approach with high-performance MeP modules, which avoids wasting power on polling; (3) shared local memory with wide bandwidth, which decreases memory access cycles; (4) three-level hierarchical clock gating; (5) embedded DRAM [9]; (6) module-wise dynamic voltage and frequency scaling. This paper focuses on (4) and (6).
Dedicated hardware acceleration and higher parallelism are the most effective ways of reducing the power consumed by video processing. The H.264 video decoding process is shown in Fig. 2. H.264 coding/decoding is more flexible and more demanding in processing performance than MPEG-4, and is more difficult to run on hardware accelerators. In this LSI, a set of dedicated hardware accelerators with configuration switch parameters was developed to handle this flexibility and to reduce the workload of the RISC cores. H.264 decoding is carried out by two RISC cores in the video front-end and back-end modules and by four dedicated hardware accelerators: inverse quantization (IQ)/inverse discrete cosine transform (IDCT), motion compensation (MC)/intra/inter prediction/6-tap filter, syntax decode, and de-blocking filter (Post Filter). The distribution of the hardware accelerators between the two video modules is shown in Fig. 3. Each hardware accelerator is allocated to a specific operation in the H.264 process; therefore, clock gating for each accelerator is easily controlled, and each accelerator operates only during a small time slot of the whole H.264 process. A similar clock gating policy is applied to the sub-modules within the modules, and furthermore, gate-level clock gating is also adopted. The power reduction effect of this three-level hierarchical clock gating is described in Sect. 5.
MPEG-4 video processing is also implemented under the same concept in this LSI. Most of the operations mapped to hardware accelerators are similar for MPEG-4 and H.264, and the hardware is shared as much as possible. Local memories are also shared between the corresponding MPEG-4 and H.264 accelerators to save area. The saved area budget is allocated to a wide local memory bus, which minimizes memory access cycles and consequently reduces power.
With these techniques, this LSI can decode QVGA (320 × 240) H.264 video at 15 frames/s with a power consumption of only 59.5 mW, and encode VGA (640 × 480) MPEG-4 video at 30 frames/s at 99.4 mW. These values are measured without audio processing. To save unneeded power consumption, the supply voltage and clock frequency of the audio module are controlled independently from the rest of the chip. The modules for video are separated from the module for audio because audio/speech and video processing have different performance requirements. While the nominal clock frequency and supply voltage are 180 MHz and 1.2 V, the audio module can be geared down to 90 MHz and 0.9 V. To accommodate the variable supply voltage and frequency, the module is decoupled from the main bus by a voltage/frequency socket, which absorbs the difference in operating voltage and frequency. The socket is composed of a level shifter, a FIFO and a signal rate translator. The size of the FIFO is 512 bytes, which is the maximum transfer size in a single burst transfer session, i.e. the size of a macro-block (16 × 16 pixels) in H.264/MPEG-4 processing. The signal rate translator delays the control signals to the slower module and adjusts the length of pulse signals crossing between the frequency domains. Furthermore, the chip is designed to operate even during the transition between the low and high supply voltages of the audio module, using the newly adopted dynamic de-skewing system (DDS) and the dedicated on-chip voltage regulator with slew rate control, as explained in the following section. Figure 4 shows the activation diagram for MPEG-4 VGA 30 fps video and MPEG-4 AAC audio encoding. The workloads of the video front-end, video back-end and audio modules were analyzed. The upper three lines of Fig. 4 show the workload of each module when it runs at maximum performance. The average activation ratio of the audio module is about 50%.
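The benefit of gearing the audio module down from 180 MHz at 1.2 V to 90 MHz at 0.9 V can be estimated to first order. The sketch below assumes the textbook CMOS dynamic-power model, in which power is proportional to V² f; this model, the function names, and the neglect of leakage and regulator losses are illustrative assumptions and are not taken from the paper.

```python
# First-order CMOS dynamic power model: P = k * V^2 * f (an assumption, not from the paper).
# Leakage, short-circuit current and regulator losses are ignored.

NOMINAL_V, NOMINAL_F = 1.2, 180e6   # volts, hertz (operating point quoted in the text)
SCALED_V, SCALED_F = 0.9, 90e6      # geared-down audio module (quoted in the text)


def dynamic_power_ratio(v, f, v_ref=NOMINAL_V, f_ref=NOMINAL_F):
    """Power at (v, f) relative to the reference operating point."""
    return (v / v_ref) ** 2 * (f / f_ref)


def energy_per_task_ratio(v, v_ref=NOMINAL_V):
    """Energy for a fixed number of clock cycles scales with V^2 only."""
    return (v / v_ref) ** 2


power_ratio = dynamic_power_ratio(SCALED_V, SCALED_F)
energy_ratio = energy_per_task_ratio(SCALED_V)

print(f"power while running : {power_ratio:.3f} of nominal")
print(f"energy per task     : {energy_ratio:.3f} of nominal")
print(f"energy saving       : {(1 - energy_ratio) * 100:.1f} %")
```

Under this simplified model, the energy needed for a fixed audio workload falls by roughly 44%. That is of the same order as the 40% reduction reported for the audio module, although the measured figure also includes effects that the model ignores.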
There is much room for the audio module to lower its performance and reduce power consumption. However, if chip-wise voltage and frequency control were applied to this LSI, voltage and frequency could be scaled only when none of the three modules is busy, and the power reduction would be small. In contrast, when voltage and frequency are controlled on a module-wise basis as in this LSI, the scaling opportunity for the audio module is determined only by its own workload and is not affected by the other modules. With MDVFS, the power consumption of the audio module in AAC audio processing is estimated by simulation to be at most 13 mW in the busiest case, which is 40% lower than with chip-wise control in this LSI. To change the supply voltage of the audio module independently from the rest of the chip, the voltage domain of the module is separated from the rest of the chip and is driven by the dedicated on-chip regulator. One concern with MDVFS is the performance overhead of the voltage transition. If the module stops operating during the voltage transition, the power saving of MDVFS is degraded, because a transition with dead time cannot be tolerated in a short session. One possible solution is a high-speed voltage transition. However, in a high-speed system it is difficult to shorten the transition below several tens of clock cycles while keeping the system stable, because of dI/dt noise. Continuous operation during the voltage transition is therefore required to suppress power consumption without the performance overhead of MDVFS. Another concern with MDVFS is the skew between the system clocks in the main voltage domain and in the controlled target voltage domain. As the propagation delay of the system clock varies with the supply voltage, a mismatch in clock timing occurs between the end of the clock tree in the audio module, whose voltage is controlled, and the clock trees in the other modules, whose voltage is fixed.
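The difference in scaling opportunity between chip-wise and module-wise control can be illustrated with a toy activity trace. The busy/idle patterns below are hypothetical values chosen only for illustration (they are not read from Fig. 4), and the helper names are likewise invented for this sketch.

```python
# Hypothetical busy/idle traces (1 = busy) per time slot; the values are illustrative only.
video_fe = [1, 1, 0, 1, 1, 1, 0, 1]
video_be = [1, 0, 0, 1, 1, 1, 1, 0]
audio    = [1, 0, 0, 1, 0, 1, 0, 0]


def chipwise_slots(*traces):
    """Slots where every module is idle, so the whole chip may be scaled down."""
    return [i for i, busy in enumerate(zip(*traces)) if not any(busy)]


def modulewise_slots(trace):
    """Slots where this one module is idle, regardless of the other modules."""
    return [i for i, busy in enumerate(trace) if not busy]


chip = chipwise_slots(video_fe, video_be, audio)
module = modulewise_slots(audio)
print("chip-wise scaling slots  :", chip)
print("module-wise scaling slots:", module)
```

With these made-up traces, chip-wise control finds a single slot in which all three modules are idle, whereas module-wise control can scale the audio module in five of the eight slots.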
According to a simulation analysis of the synthesized clock trees, the clock propagation time through the target module varies by 0.7 ns between supply voltages of 0.9 V and 1.2 V. In a conventional synchronous design, this variation is unacceptable as clock skew. As the chip operates at 180 MHz, a skew of 0.7 ns is 12% of the cycle time and is difficult to absorb in the design margin. One solution might be asynchronous inter-module communication using globally asynchronous, locally synchronous (GALS) cores. However, asynchronous communication adds circuit complexity, and consequently the promptness of communication between modules, which is crucial for a multimedia LSI, might be degraded. Another solution is to add an extra PLL or DLL circuit to the audio module to synchronize its clock with the clock in the rest of the chip. However, a PLL has the disadvantage of taking a relatively long time to synchronize a clock. The mirror-type DLL [10] can lock within a few clock cycles but inherently amplifies clock jitter, as it uses two neighboring clock edges.
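As a quick check of the skew figure quoted above, the 0.7 ns variation can be expressed as a fraction of the clock period at 180 MHz. The symbol $T$ for the clock period is introduced here only for illustration.

```latex
T = \frac{1}{180\,\mathrm{MHz}} \approx 5.56\,\mathrm{ns},
\qquad
\frac{0.7\,\mathrm{ns}}{T} \approx 0.126
```

This is roughly 12 to 13% of the cycle time, in line with the value given in the text.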
Module-Wise Dynamic Voltage and Frequency Scaling (MDVFS)

Dynamic De-Skewing System (DDS)
Operation Principle and Circuit Configuration
To realize continuous operation even during the supply voltage transition of the audio module, and to minimize the skew between the clock in the voltage-controlled module and the clock in the rest of the chip, the dynamic de-skewing system (DDS) is newly introduced in the chip. Consequently, synchronous communication over the entire chip is retained. The transition speed of the supply voltage is moderately controlled by the dedicated on-chip voltage regulator with slew rate control, not only to suppress noise on the power source but also to keep the DDS effective, as discussed in the following section. The DDS has the advantages that it operates within one clock cycle and that it is immune to clock jitter. The concept of the DDS is that an optimally controlled variable delay is inserted at the root of the clock tree in the target module, and its delay time is updated in every clock cycle to minimize the clock skew between the target module and the rest of the chip. When the supply voltage to the target module is low, clock propagation through the module takes longer, and thus the inserted delay is made shorter. When the supply voltage is high, the opposite applies. The block diagram of the DDS circuit is shown in Fig. 5. The variable delay unit DA, which delays the clock signal from the PLL by $t_D$, is inserted at the root of the clock tree of the Audio Module (AM).
The optimal value of the delay $t_D$ is determined as follows. A typical clock signal at the end of the clock tree of AM is selected and delivered to the DDS circuit as the loop back signal A. Another loop back signal, M, is selected from the fixed voltage domain, shown as the main bus (MB) part. The clock propagation time of AM, from the output of the DDS to the end of the clock tree, is expressed as $t_A$, while $t_M$ is the clock propagation time in MB, from the PLL to the end of the clock tree. The second delay element DB, whose delay time is set equal to that of DA, delays the loop back signal M. The skew measurement unit (SM) detects the time difference between the loop back signal A ($t_D + t_A$) and the delayed signal M ($t_M + t_D$). The measured skew, $\Delta t_{DA}$, is

$$\Delta t_{DA} = (t_M + t_D) - (t_D + t_A) = t_M - t_A$$

The result is independent of the previous delay time $t_D$.
Figure 6 shows the circuit diagram of the delay unit, which consists of delay elements connected in series.
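The cycle-by-cycle behaviour of the DDS during a supply transition can be sketched with a small simulation. It follows one reading of the description above: the time difference between the delayed signal M and the loop back signal A is measured in one cycle, quantised to the 50 ps resolution, and used as the inserted delay in the next cycle. The absolute clock-tree delays, the linear dependence of delay on time during the ramp, and the floor quantisation are all assumptions made for illustration; only the 50 ps step, the 0.7 ns delay swing, the 300 ns ramp and the 180 MHz clock come from the text.

```python
import math

STEP_PS = 50.0            # delay-element resolution (from the text)
CYCLE_PS = 1e6 / 180.0    # one clock period at 180 MHz, about 5556 ps
RAMP_PS = 300_000.0       # 0.3 V transition in 300 ns (from the text)
T_M_PS = 2000.0           # hypothetical clock-tree delay of the fixed domain
T_A_HIGH_PS = 1000.0      # hypothetical audio-tree delay at 1.2 V
T_A_LOW_PS = 1700.0       # 0.7 ns slower at 0.9 V (swing from the text)


def audio_tree_delay(t_ps):
    """Audio clock-tree delay during a 1.2 V -> 0.9 V ramp (linear model, an assumption)."""
    frac = min(max(t_ps / RAMP_PS, 0.0), 1.0)
    return T_A_HIGH_PS + frac * (T_A_LOW_PS - T_A_HIGH_PS)


def simulate(n_cycles):
    """Return the residual skew (ps) seen in each cycle with the DDS running."""
    t_d = 0.0                                   # delay selected for the current cycle
    skews = []
    for n in range(n_cycles):
        t_a = audio_tree_delay(n * CYCLE_PS)
        skews.append((t_d + t_a) - T_M_PS)      # skew between the two clock-tree ends
        measured = T_M_PS - t_a                 # (t_M + t_D) - (t_D + t_A)
        t_d = math.floor(measured / STEP_PS) * STEP_PS   # quantised, used next cycle
    return skews


skews = simulate(80)
worst = max(abs(s) for s in skews[1:])   # skip the first, uncorrected cycle
print(f"worst residual skew after the first cycle: {worst:.1f} ps")
```

In this model the residual skew stays below one 50 ps step throughout the ramp, because the clock-tree delay drifts by only about 13 ps per cycle while the correction is refreshed every cycle.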
Delay Unit
Each element has an input port that is activated by a selector signal. One of the selector signals in the unit is set to "H" while the others are set to "L." The input clock enters through the selected switch and passes through the corresponding number of delay elements. The resolution of the variable delay is 50 ps, which is the delay of one delay element under typical operating conditions. As shown in Fig. 5, one delay unit, DA, is inserted at the root of the clock tree of the audio module, and the other unit, DB, delays the loop back signal M from the fixed voltage domain. The selector signals of the two units are identical during a clock cycle and are updated by the output of the skew measurement unit in the next cycle. Figure 7 shows the circuit diagram of the skew measurement unit. The circuit consists of a delay unit and latch circuits. The delay unit has the same structure as delay units DA and DB. The input terminal of each latch circuit is connected to the output of one delay element. The loop back signal A is connected to one end of the delay unit. The loop back signal M delayed by DB (M') drives the clock terminals of all latch circuits. The outputs of the delay elements are reset to "H" before the clock is input to the delay line. When the loop back signal A arrives at this circuit, an "L" signal starts to propagate through the delay elements connected in series. When the delayed loop back signal M' is applied as the detect clock, the states of the delay element outputs are captured by the latches at that instant, for example as "LLLHH." Each latch output is then compared with the adjacent one, so that only the one select signal corresponding to the time difference between the input clock and the detect signal is set to "H" while the others are "L." The measurement concludes within one clock cycle, and the result is used as the delay select signal in the next cycle.
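The capture-and-compare step of the skew measurement unit can be modelled behaviourally. The sketch below is an illustration rather than the actual circuit: the line length of eight taps, the function names, and the virtual "L" tap placed ahead of the first delay element (so that a time difference shorter than one step selects zero delay) are all assumptions.

```python
STEP_PS = 50   # per-element delay (from the text)
N_TAPS = 8     # hypothetical line length; the paper does not give it


def capture(arrival_a_ps, arrival_m_ps):
    """Latch the delay-line taps at the moment the delayed signal M' arrives.

    Taps are preset to "H"; the "L" edge launched by loop back signal A
    reaches tap k after (k + 1) * STEP_PS.
    """
    elapsed = arrival_m_ps - arrival_a_ps
    return "".join("L" if elapsed >= (k + 1) * STEP_PS else "H" for k in range(N_TAPS))


def to_one_hot(taps):
    """Compare adjacent taps; the selector bit marks the L-to-H boundary."""
    padded = "L" + taps          # virtual tap ahead of the first element (assumption)
    return [int(padded[k] == "L" and padded[k + 1] == "H") for k in range(len(taps))]


taps = capture(arrival_a_ps=0, arrival_m_ps=170)
select = to_one_hot(taps)
print(taps)      # thermometer code captured by the latches
print(select)    # one-hot delay selector for the next cycle
print("selected delay:", select.index(1) * STEP_PS, "ps")
```

For a time difference of 170 ps, the latches capture "LLLHHHHH" and the selector picks three delay elements, that is, 150 ps. If the difference exceeds the length of the line, no boundary is found and no selector bit is set, which this sketch does not handle.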
As both the delay unit and the skew measurement unit have the same resolution of 50 ps, the clock of the audio module can be aligned to the clock in the other modules with an accuracy of 50 ps. Figure 8 shows the diagram of the test circuit, which includes the DDS circuit, used to confirm its operation. Two digital variable delays, DVA and DVB, whose delay times are controlled in steps of about 200 ps, are used in place of the actual clock paths in the fixed voltage domain and in the audio module. The measured waveforms are shown in Fig. 9. When an intentional skew is set in DVA and the DDS is not activated, the clock output of DVB (the audio module) separates from that of DVA according to the difference between the delay times of DVA and DVB, as shown in Fig. 9(a). When the DDS is activated, the skew between DVA and DVB is absorbed by the DDS, and the output clock of DVB aligns with that of DVA, as shown in Fig. 9(b).
Skew Measurement Unit
Test Circuit Result
One concern about the operation of the DDS circuit is possible metastability of the latch circuits in the skew measurement unit. In a marginal case, the input of a latch could change at the same time as the clock signal. Even in the worst case, only one latch is exposed to possible metastability, because the delay of the serially connected delay elements itself provides timing margin for the next latch. If such a coincidence happens at a particular latch, the output of that latch might be unstable. However, this instability does not degrade the DDS operation, because either "L" or "H" at the latch on the boundary of clock propagation is an acceptable skew measurement result within the intended accuracy of 50 ps, which is the resolution of the delay unit and the skew measurement unit. Another concern is the impact of clock jitter on the accuracy of the DDS operation. Although the operation of the DDS is similar to that of the mirror-type delay-locked loop (DLL) [10], it is immune to clock jitter because the setting of the new delay concludes within one clock cycle. The loop back signals are generated from the same input clock pulse and are delayed in their respective clock trees and delay units. Therefore the DDS does not use any other clock edge, and the jitter of the system clock, that is, the fluctuation of the interval from one clock edge to the next, does not degrade the operation of the DDS.
Voltage Regulator with Slew Rate Control
The propagation time of the system clock in the audio module varies with the supply voltage. On the other hand, the DDS adjusts the delay using the previous system clock edge, which is earlier by 5.3 ns at 180 MHz operation. While the supply voltage to the audio module is in transition between the high and low voltages, the voltage slope relative to the system clock should be small enough to keep the variation of clock propagation through the audio module within a reasonable design margin, as well as to suppress dI/dt noise on the power source. Figure 10 shows the circuit diagram of the dedicated on-chip voltage regulator with slew rate control. The regulator limits the slew rate of the supply voltage to the audio module (VDDA) to 0.3 V per 300 ns, so that the clock skew degrades by no more than one step of the delay units in the DDS within a cycle. The regulator consists of three parts: the driver circuit with push-pull transistors; the slew rate controller, in which the reference level is controlled during a transition between the low and high supply voltages; and the reference generator with a band gap reference (BGR) circuit, which produces the reference voltage for the low supply voltage. In the slew rate controller, comparators activate switches for charging or discharging. A constant current source, connected to VDD or ground, charges or discharges the capacitance connected to the reference node, and consequently a linear voltage slope is realized. After the transition, Vref is directly connected to the appropriate reference voltage, Vref H or Vref L. The operation of the regulator on the chip was successfully confirmed, as shown in Fig. 11, where the voltage transition of the audio module between 1.2 V and 0.9 V is shown.
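The slew-rate budget described above can be checked with a short calculation. The sketch assumes that the clock-tree delay varies linearly with supply voltage over the 0.9 V to 1.2 V range, which the paper does not state, and it takes one clock period as 1/180 MHz, about 5.56 ns; using the 5.3 ns quoted in the text instead changes the result only slightly. All variable names are illustrative.

```python
SLEW_V_PER_NS = 0.3 / 300.0      # regulator slew rate, 0.3 V per 300 ns (from the text)
DELAY_SWING_PS = 700.0           # clock-tree delay change between 0.9 V and 1.2 V (from the text)
VOLTAGE_SWING_V = 1.2 - 0.9
STEP_PS = 50.0                   # DDS delay resolution (from the text)
CYCLE_NS = 1e3 / 180.0           # one period at 180 MHz, about 5.56 ns

# Assumption: delay varies linearly with supply voltage across the swing.
ps_per_volt = DELAY_SWING_PS / VOLTAGE_SWING_V
dv_per_cycle = SLEW_V_PER_NS * CYCLE_NS                  # volts of change per clock cycle
drift_per_cycle_ps = ps_per_volt * dv_per_cycle          # skew added between two updates
max_slew_v_per_ns = STEP_PS / (ps_per_volt * CYCLE_NS)   # slew that would use a full step

print(f"supply change per cycle : {dv_per_cycle * 1e3:.2f} mV")
print(f"delay drift per cycle   : {drift_per_cycle_ps:.1f} ps (step is {STEP_PS:.0f} ps)")
print(f"slew for one full step  : {max_slew_v_per_ns * 1e3:.2f} mV/ns")
```

Under these assumptions the clock-tree delay drifts by about 13 ps per cycle, well inside one 50 ps step, and the slew rate could be a few times higher before a single cycle consumed a whole step.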
Results
Figure 12 shows a micrograph of the chip equipped with four operation specifically configured RISC processors, dedicated hardware accelerators for specific signal processing, 32 Mb embedded DRAM and interfaces for camera, ing. The left two bars show a comparison of this work with the previous one [9]. The power consumption of the audio module was measured separately. In the case of MPEG-4 VGA 30 fps and AAC 48 kHz encoding, activating module-wise voltage and frequency scaling reduced the power from 10.7 mW to 6.4 mW, a reduction of about 40%.
However, this 40% is not a straightforward number. To discuss the effect of power reduction techniques, it is essential to keep the measurement conditions fair. For this purpose we measured the power consumption of the video processing modules rather than the audio module, because the required performance can easily be changed through the video size, on which the required number of clock cycles depends almost linearly. Figure 14 shows the measured power consumption of the logic and SRAM portion of the LSI. The power consumed in the eDRAM and I/O is not included. The horizontal axis is proportional to the estimated number of clock cycles required for each video processing case. The power consumption is about 75 mW for MPEG-4 VGA 30 fps. For lower-performance cases, there is room for power saving. In this LSI the voltage domain of the video-related modules (Mux, video F/E, video B/E) is separated from the others as shown in Fig. 1, and thus their supply voltage can be controlled for further power reduction. In the CIF 30 fps case, the power consumption is 75 mW when no low-power technique is applied. In this case, 34 mW (45%) can be saved by using three-level hierarchical clock gating. A further 13 mW can be saved by scaling the system clock frequency from 180 MHz to 60 MHz, and the most aggressive voltage and frequency scaling, operation at 0.8 V and 60 MHz, demonstrated further 29 mW (70%) power reduction. In the QCIF 15 fps case, the voltage and frequency cannot be scaled further and are the same as those for the CIF 30 fps case, because these are the lowest values that allow stable communication with the host processor. In Fig. 14, the horizontal axis is proportional to the required clock cycles; therefore, the power consumption with clock gating shows a linear dependence. In the ideal case of perfect clock gating, the Y intercept of this line would be 0. The remaining power of about 28 mW is consumed where clock gating is not effective, such as in the clock trees and the bus.
This portion of the power consumption can be reduced by frequency scaling, and the power consumption as a whole can be reduced by voltage and frequency scaling. This result demonstrates the effectiveness of voltage and frequency scaling for power saving.
Conclusion
The module-wise dynamic voltage and frequency scaling (MDVFS) scheme is applied to the audio module of a single-chip H.264/MPEG-4 audio/visual LSI. While the nominal supply voltage of the LSI is 1.2 V and the nominal frequency is 180 MHz, the application program can lower the operating voltage of the module to 0.9 V and halve its frequency to 90 MHz when the module does not need the highest performance. The power consumed by the module is reduced by 40% in comparison with operation without voltage and frequency control. The newly introduced dynamic de-skewing system (DDS) minimizes the skew between the target module and the rest of the LSI in every clock cycle. The clock skew during the transition of the operating voltage is kept within the design margin, thanks to the on-chip voltage regulator with slew rate control in combination with the DDS. The proposed scheme realizes continuous and synchronous operation over the entire chip without performance overhead.
