Abstract-Power saving is becoming one of the major design drivers in electronic systems embedding microcontroller cores. Known microcontrollers typically save power at the expense of reduced computational capability. With reference to an 8051 core, this paper presents a novel clustered clock-gating technique that increases power efficiency at the architectural level without performance loss while preserving the reusability of the macrocell. Unlike known clustered-gating strategies, where the number of clusters is fixed a priori, the optimal cluster organization is derived by considering both the macrocell complexity and its switching activity. When implementing the 8051 core in CMOS technology, the proposed approach leads to a 37% power saving, which is higher than the 29% permitted by automatic clock-gating insertion in commercial computer-aided design tools or the 10% of state-of-the-art clustered-gating strategies. To assess its full functionality, the power-optimized cell has been proven in silicon, embedded in an automotive system for sensor interface/control.
I. INTRODUCTION
The advances in CMOS technology allow the integration in a single chip of complex control systems, which until now have been realized with several components on printed circuit boards. To cut the high costs and time to market of a system-on-chip design, there is a growing demand for digital macrocells that are reusable in different computer-aided design (CAD) tools and for different target technologies. The focus is on well-established microcontrollers, such as the 8051 or SPARC8 families [1]-[4], which have become industry standards as the core of embedded control systems for several applications. The widespread diffusion of battery-powered devices and the reduction of packaging/cooling costs in complex systems-on-chip call for the design of power-optimized macrocells. Power consumption in CMOS digital circuits is expressed as [5]

$$P = P_{\text{switching}} + P_{\text{short-circuit}} + P_{\text{leakage}} = \alpha \cdot C_L \cdot V_{dd}^2 \cdot f_{clk} + I_{sc} \cdot V_{dd} + I_{leakage} \cdot V_{dd} \tag{1}$$
where $f_{clk}$ is the clock frequency, $\alpha$ is the average switching activity, $V_{dd}$ is the supply voltage, and $I_{sc}$ and $I_{leakage}$ are the short-circuit and leakage currents, respectively. The total power consumption is dominated by the switching component due to charging/discharging of the circuit capacitive load $C_L$. The minimization of power consumption involves optimizations at different design levels [5]: from the technology used, to the custom sizing of transistors and clock tree at gate level, up to architectural-level techniques and, at the highest level, the algorithm to be implemented. All these approaches require a tradeoff between the power optimization and the reduction of the macrocell performance in terms of increased area or decreased speed and/or flexibility. Design optimizations at the lower levels (technology and gate) are CAD and/or technology dependent and difficult to reuse. To ease macrocell database management and maintain portability between different technologies and CAD tools, architectural-level optimizations directly implemented in the register transfer level (RTL) description of the macrocell are desirable. At the RTL level, the major achievable saving concerns dynamic power through switching-activity reduction [5]. Static power can be kept low by choosing a low-leakage CMOS target technology. Many macrocells are available to implement microcontroller cores [1]-[4], but they typically adopt architectural-level power-saving techniques that entail performance reduction: power consumption is reduced if the microcontroller is in power-down mode, i.e., the clock to the entire core is stopped, or in idle mode, i.e., the clock to the main processing unit is stopped, but some peripherals are still active. In this trivial way, power is saved only when the macrocell is not used. Power optimization is also achieved by reducing the system clock frequency [3], but at the expense of reduced computational capability.
With reference to an 8051-core case study, this paper presents a clustered clock-gating technique to reduce dynamic power consumption at the architectural level without any performance loss while preserving macrocell reusability. Clustered-gating techniques have been proposed in the literature [6], [7], but the number of clusters is fixed a priori, independent of the macrocell complexity and switching activity. As further discussed in this paper, only suboptimal solutions are obtained in this way. In contrast, the proposed approach derives the optimal number of clusters, considering both microcontroller complexity and switching activity. In Section II, the target 8051 macrocell is analyzed in its typical application scenario to highlight the building blocks with significant power cost. In Section III, the novel clustered clock gating is presented and applied at the architectural level. Section IV discusses the results achieved when comparing the optimized core to known 8051 macrocells. The proposed technique is also compared to automatic insertion of clock gating in commercial CAD tools, such as the widely used Synopsys, and to other clustered-gating strategies presented in the literature. To assess the full functionality of the new power-optimized microcontroller, the macrocell has been proven in silicon, integrated in an embedded system for automotive sensor interface/control described in Section V. Conclusions are drawn in Section VI.
II. MICROCONTROLLER CORE CHARACTERIZATION
The considered 8-bit 8051 core is distributed by Oregano [2]. It features a fully synchronous architecture that executes most of the instructions in one clock cycle, up to a maximum of four cycles for multiplication and division. The macrocell can be customized by selecting which instructions are implemented and by instantiating a user-defined number of timers/counters and universal asynchronous receiver/transmitter (UART) serial interfaces. As an example, this paper refers to a configuration with one timer/counter, one UART, four bidirectional I/O ports, and two interrupt levels that implements all instructions. The relevant hardware description language (HDL) code has been synthesized with Synopsys, targeting a 0.18-µm CMOS low-leakage technology using a standard-cell library. It resulted in a maximum clock speed of roughly 50 MHz, which is typical for applications using an 8051 core [1]. The circuit complexity amounts to 10.5 kgates distributed among the main blocks of the core as follows: 7% for the timer/counter, 8% for the UART, 13% for the arithmetic logic unit (ALU), and 72% for the main control unit. The latter block includes the interrupt controller, the instruction decoder, a finite-state machine (FSM), the memory interface, and a block of registers. To identify the most power-consuming block and to evaluate power reduction on the modified architecture, the well-known Dhrystone 1.1 test code has been used. It provides a balanced instruction mix, as would be found in a generic microcontroller application. Power consumption has been estimated at gate level with Synopsys Power Compiler, using the aforementioned 0.18-µm CMOS library at 1.95-V supply voltage. The overall macrocell consumes roughly 0.125 mW/MHz, i.e., 1.25 and 6.25 mW at 10 and 50 MHz, respectively, of which 64% is due to the control unit, 4% to the timer/counter, 5% to the UART, 1% to the ALU, and the remaining 26% to the switching activity of the clock distribution net.
0278-0046/$25.00 © 2007 IEEE
The contribution of static power is a few µW. Besides Dhrystone 1.1, other custom test codes have been adopted to simulate the macrocell and estimate its power cost, leading to similar results. The impact of the aforementioned power costs at system level can be orders of magnitude higher: as an example, sensor/control networks in automotive applications are made up of hundreds of nodes, each containing an electronic control unit with its own microcontroller cores [8].
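The power figures quoted above can be turned into per-block numbers with a few lines of arithmetic. The following Python sketch is only an illustration: the function and constant names are hypothetical, and the effective switched capacitance is estimated under the assumption that the whole 0.125 mW/MHz is switching power, i.e., that the short-circuit and leakage terms are negligible.

```python
# Figures quoted in Section II (Dhrystone 1.1, 0.18-um CMOS, 1.95-V supply).
POWER_PER_MHZ_MW = 0.125  # overall macrocell power, mW/MHz
V_DD = 1.95               # supply voltage, V

# Fraction of the total power attributed to each block (sums to 100%).
POWER_SHARE = {
    "control unit": 0.64,
    "clock distribution net": 0.26,
    "UART": 0.05,
    "timer/counter": 0.04,
    "ALU": 0.01,
}


def total_power_mw(f_mhz):
    """Total macrocell power in mW at a clock frequency given in MHz."""
    return POWER_PER_MHZ_MW * f_mhz


def block_power_mw(f_mhz):
    """Per-block power in mW at a clock frequency given in MHz."""
    total = total_power_mw(f_mhz)
    return {name: share * total for name, share in POWER_SHARE.items()}


def effective_switched_capacitance_pf():
    """Estimate alpha * C_L in pF from P = alpha * C_L * V_dd^2 * f_clk.

    Assumption: all of the measured power is switching power.
    """
    energy_per_cycle_j = POWER_PER_MHZ_MW * 1e-3 / 1e6  # W per Hz = J per cycle
    return energy_per_cycle_j / V_DD ** 2 * 1e12


if __name__ == "__main__":
    for f_mhz in (10, 50):
        print(f"{f_mhz} MHz: total {total_power_mw(f_mhz):.2f} mW")
        for name, power in block_power_mw(f_mhz).items():
            print(f"  {name}: {power:.3f} mW")
    print(f"alpha * C_L ~ {effective_switched_capacitance_pf():.1f} pF")
```

At 50 MHz, this attributes roughly 4 mW of the 6.25 mW to the control unit, which is why that block is the main optimization target.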
III. ARCHITECTURAL-LEVEL POWER SAVING
The macrocell analysis in Section II shows that the ALU power cost is negligible, and no dedicated optimization has been applied to it. The contribution of the UART plus the timer/counter is less than 10%, and a simple power-saving strategy is adopted for them: since they are not used full time in typical applications, the control unit gates all the registers of a timer/counter and/or UART whenever their use is not scheduled. From Section II, it emerges that the control unit is the most power-hungry block; since it is an "always on" module, the simple ON/OFF gating strategy, as in [1]-[4], is not optimal, and a novel clustered clock-gating technique has been investigated. Conventional clock gating is based on the principle that substantial power saving is obtained if registers are not clocked when they do not have to produce valid outputs. The more gating structures are inserted, the higher the potential power saving, since each register has different clocking conditions. The HDL code of the macrocell is scanned to find descriptions of enable registers, i.e., conditional statements inside a synchronous process with signals that do not change under certain conditions, such as $r\_sig_i$ in Fig. 1. The original process, whose equivalent circuit is sketched in Fig. 1(a), is then split into a combinatorial statement plus a latch process and a synchronous process sensitive to the gated clock gclk [see Fig. 1(b)].
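The equivalence between an enable register and its latch-based gated-clock counterpart can be illustrated with a small cycle-level model. This Python sketch is not the HDL of the macrocell: the function names are hypothetical, and the model assumes that the enable signal is stable during the low clock phase, so that the latch output seen by the AND gate equals the enable value of that cycle.

```python
def simulate_enable_register(samples, reset_value=0):
    """Enable register: the clock pin toggles every cycle.

    samples is a list of (enable, data) pairs, one per clock cycle.
    Returns the output trace and the number of cycles in which the
    register clock pin was driven.
    """
    q = reset_value
    trace = []
    clocked_cycles = 0
    for enable, data in samples:
        clocked_cycles += 1  # ungated clock reaches the register every cycle
        if enable:
            q = data         # new value loaded only when enabled
        trace.append(q)
    return trace, clocked_cycles


def simulate_gated_register(samples, reset_value=0):
    """Gated register: gclk = clk AND latched enable."""
    q = reset_value
    trace = []
    clocked_cycles = 0
    for enable, data in samples:
        # Low clock phase: the latch is transparent and captures the enable.
        latched_enable = bool(enable)
        # High clock phase: gclk rises only if the latched enable is set.
        if latched_enable:
            clocked_cycles += 1
            q = data         # plain register, no enable multiplexer needed
        trace.append(q)
    return trace, clocked_cycles


if __name__ == "__main__":
    stimulus = [(1, 5), (0, 7), (0, 9), (1, 3), (0, 1)]
    ref_trace, ref_clocks = simulate_enable_register(stimulus)
    gated_trace, gated_clocks = simulate_gated_register(stimulus)
    print(ref_trace == gated_trace, ref_clocks, gated_clocks)
```

Both models produce the same output sequence, but the gated register sees a clock edge only in the cycles where the enable is asserted; the fraction of such cycles plays the role of the gclk switching activity.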
However, each gating structure also introduces area and power overheads. Therefore, an optimal clustering has to be found: dividing the total number of registers R into M clusters of K registers, where each cluster is clocked by its own gclk signal, the optimal value of K has to be determined. The optimal number of clusters M follows directly, since the number of registers R = K · M is already known from the macrocell synthesis. The power saving due to clock gating can be expressed for each register as

$$P_{saved} = f_{clk} \cdot V_{dd}^2 \cdot \left[ C_{gated} \cdot (1 - \alpha) - C_{add} \right]$$

where, referring to Fig. 1(b), $C_{gated}$ is the input capacitance of the gated register, $\alpha$ is the switching activity of gclk, and $C_{add}$ is the clock input port capacitance of the AND and LATCH gates. In the case of M subsets of K registers, the power saved is expressed as in (2), where all the K registers belonging to the same ith cluster share the same $gclk_i$ signal and hence the same switching activity $\alpha_i$:

$$P_{saved} = f_{clk} \cdot V_{dd}^2 \cdot \sum_{i=1}^{M} \left[ K \cdot C_{gated} \cdot (1 - \alpha_i) - C_{add} \right] \tag{2}$$

Logic synthesis results show that $C_{add}$ is roughly equal to the capacitance of a 1-bit register clock pin. For the considered core, most of the registers have 8 bits; hence, $C_{gated} \approx 8 C_{add}$. Thus, the total power saving can be expressed as in (3), where the number KM = R is known a priori from the macrocell synthesis:

$$P_{saved} = f_{clk} \cdot V_{dd}^2 \cdot C_{add} \cdot \left[ 8KM - 8K \sum_{i=1}^{M} \alpha_i - M \right] = f_{clk} \cdot V_{dd}^2 \cdot C_{add} \cdot \left[ 8R - 8K \sum_{i=1}^{M} \alpha_i - \frac{R}{K} \right] \tag{3}$$

HDL simulations of the microcontroller core using the Dhrystone test code plus other custom codes show that the dependence of the parameter $c = \sum_{i=1}^{M} \alpha_i$ on the arrangement of gating clusters is negligible. Under this assumption, the saved power in (3) can be maximized using clusters of $K_{opt}$ registers, with

$$K_{opt} = \sqrt{\frac{R}{8c}}$$
This expression yields real numbers; for a feasible implementation of the clustered clock-gating approach, the relevant $K_{opt}$ value is rounded to the nearest integer. Through HDL simulations, it has been evaluated that c = 0.48 for a reference class of applications modeled by the Dhrystone test code. In the considered case study, R = 41, leading to $K_{opt}$ = 3.26, i.e., the optimal solution uses clusters of three registers. In particular, for the 8051 core, the R = 41 registers with conditional statements have been grouped in M = 14 clock-gating clusters: 13 clusters of three registers and one cluster of two registers. In general, if R is the integer number of registers to be grouped, K is the integer number of registers for each cluster determined as previously described, and q and r are the integer quotient and remainder of the division R/K, respectively, then the clock-gating clusters are organized as follows: q clusters of K registers plus, if r is nonzero, one cluster of r registers. To assess the robustness of the preceding choice, Fig. 2 presents the achievable power saving versus K also for applications different from the Dhrystone test, i.e., c ≠ 0.48. The curves in Fig. 2 are obtained from (3), varying the parameters K and c for the 8051 case study. Reported data are normalized with respect to the maximum achievable power saving $P_{saved,max} = f_{clk} \cdot V_{dd}^2 \cdot C_{gated} \cdot K \cdot M$, which is derived from (2) in the ideal case of null overhead for the gating logic, i.e., $C_{add}$ and $\alpha_i$ are null for each ith cluster. Comparing in Fig. 2 the power saving achieved for K = 1 and K = 3 demonstrates the higher efficiency of the clustered-gating approach using clusters of three registers with respect to a solution in which a dedicated gating signal is applied to each register, i.e., K = 1. The latter is performed automatically by commercial CAD tools such as Synopsys. The results in Fig. 2 for K = 41, i.e., all registers grouped in one cluster, show the power saving achievable with a single gating signal for the whole macrocell, as in the power-down mode in [1]-[4]. The latter approach, which was already discussed in Section I, is suitable only for applications characterized by a very low value of the parameter c, e.g., c = 0.01 in Fig. 2, i.e., applications where the microcontroller is off most of the time. When the microcontroller is on and is running a typical code such as the Dhrystone test, which is characterized by c = 0.48, powering down the whole cell is not an efficient solution.
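The cluster sizing described in this section can be sketched numerically. The Python code below is a minimal illustration under stated assumptions: the saved power is taken to be proportional to 8R − 8Kc − R/K (with C_gated ≈ 8 C_add, KM = R, and c treated as independent of K), which gives K_opt = sqrt(R / (8c)) and is consistent with the quoted K_opt = 3.26 for R = 41 and c = 0.48. The function names are hypothetical, and M = R/K is treated as a real number in the normalized saving.

```python
import math

C_GATED_OVER_C_ADD = 8  # Section III: C_gated is roughly 8 * C_add


def k_opt(r, c, ratio=C_GATED_OVER_C_ADD):
    """Real-valued cluster size that maximizes the assumed saved power."""
    return math.sqrt(r / (ratio * c))


def cluster_sizes(r, k):
    """q clusters of k registers plus one cluster of the remainder, if any."""
    q, rem = divmod(r, k)
    return [k] * q + ([rem] if rem else [])


def normalized_saving(k, r, c, ratio=C_GATED_OVER_C_ADD):
    """Saved power divided by the ideal maximum (C_add = 0, alpha_i = 0).

    Assumed form: (ratio*R - ratio*K*c - R/K) / (ratio*R).
    """
    return 1.0 - k * c / r - 1.0 / (ratio * k)


if __name__ == "__main__":
    R, C = 41, 0.48
    k_real = k_opt(R, C)
    k_int = round(k_real)
    sizes = cluster_sizes(R, k_int)
    print(f"K_opt = {k_real:.3f}, rounded to {k_int}, M = {len(sizes)}")
    for c in (0.01, 0.48):
        for k in (1, 3, 41):
            print(f"c = {c}, K = {k}: {normalized_saving(k, R, c):.3f}")
```

Under these assumptions, K = 3 gives the highest normalized saving among integer cluster sizes for c = 0.48, whereas for c = 0.01 a single cluster of 41 registers is better, in line with the discussion of the power-down mode.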
IV. POWER OPTIMIZATION RESULTS
The clustered clock gating has been implemented in the original 8051 RTL description. The macrocell has been synthesized using the 0.18-µm low-leakage CMOS library and simulated at gate level for functional/timing checks and power-cost evaluation at 1.95-V supply. The power saving permitted by the clustered clock gating amounts to 37%. Area and timing overheads are limited to 3% and 7%, respectively. Table I compares the modified microcontroller core with known 8051 macrocells in terms of computational performance, measured as Dhrystone code executions per megahertz, and power cost, expressed in microwatts per megahertz.
The modified 8051 core keeps the same functionality and computational performance as known microcontrollers [2], [3] with a remarkable power saving. The macrocell is still technology and CAD independent and highly reusable. Thus, our architectural-level optimization can be combined with other techniques proposed in the literature, at different design levels, to increase microcontroller performance in terms of 1) energy efficiency and 2) reliability for applications in hostile environments such as the automotive one (in that respect, as discussed in [10], reducing the clock switching activity contributes to the reduction of the microcontroller electromagnetic emission spectra).
Clock-gating insertion can also be performed automatically by commercial tools such as Synopsys [7], which uses a dedicated gating signal for each register, i.e., K = 1 in Fig. 1. Applied to the 8051 cell, this yields a power reduction of 29% instead of the 37% of our approach using K = 3. The clustering optimization cannot be performed automatically by CAD tools since the parameter c is not known a priori. Moreover, most CAD tools insert clock gating after logic synthesis at gate level, while the proposed flow directly modifies the RTL source code and thus also allows the use of the modified cell for high-level design exploration, since RTL simulations are much faster than gate-level ones. Automatic insertion of clustered clock gating has been investigated in the literature, e.g., in [6], but the number of clusters is fixed a priori, independent of the complexity and switching activity of the macrocell, and the focus is only on FSMs rather than on complete microcontroller circuits. In [6], the number of clusters is two: the original FSM is divided into two sub-FSMs, one of which should be significantly smaller than the other and contains the states with high stationary-state probability. The objective is to obtain a small "always on" sub-FSM that disables the rest of the circuit. From the results provided in [6], it emerges that the potential power saving of this clustering method depends on the number of states, i.e., the complexity of the FSM, and on the percentage of high stationary states. Poor optimization results are expected if the total number of states is small and/or many of them have a high stationary-state probability. Applying the clustered technique of [6] to the control unit of the 8051 core, a power saving of less than 10% is obtained, which is well below the 37% of our proposed technique. The latter (see the definition of $K_{opt}$ and Fig. 2) derives the optimal cluster organization, accounting for both cell circuit complexity and average activity.
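To put the three percentages side by side, the following Python sketch converts them into power per megahertz. It is only indicative: it assumes that each saving applies to the 0.125 mW/MHz baseline of Section II, and it uses 10% as an upper bound for the technique of [6], which is reported as "less than 10%". All names are hypothetical.

```python
BASELINE_UW_PER_MHZ = 125.0  # 0.125 mW/MHz from Section II, assumed baseline

# Relative power savings quoted in Section IV.
SAVINGS = {
    "clustered gating, K = 3 (proposed)": 0.37,
    "automatic gating, K = 1": 0.29,
    "two-cluster FSM gating [6]": 0.10,  # reported as "less than 10%"
}


def power_after_saving(baseline, saving):
    """Residual power after applying a relative saving to the baseline."""
    return baseline * (1.0 - saving)


if __name__ == "__main__":
    for name, saving in SAVINGS.items():
        residual = power_after_saving(BASELINE_UW_PER_MHZ, saving)
        print(f"{name}: {residual:.2f} uW/MHz")
```

Under these assumptions, the proposed clustering leaves roughly 79 µW/MHz, against roughly 89 µW/MHz for per-register automatic gating.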
V. AUTOMOTIVE SYSTEM EMBEDDING THE 8051 MICROCONTROLLER
The full functionality of the power-optimized macrocell has been proven in silicon by embedding the 8051 core in a gyrosensor conditioning system for automotive applications. The architecture of the sensor control system is sketched in Fig. 3 and briefly described in this section (for further details, see [9]). It is composed of an analog front end and a digital processing section with a Joint Test Action Group (JTAG) standard interface between the two signal domains. The gyrosensor is used to measure the angular rate, i.e., how quickly an object turns. Its operating principle is based on the Coriolis force acting on a vibrating mass when a rotary movement is applied. A current source is applied to a pair of electrodes in order to keep the gyrosensor in vibrating mode, with a typical resonance frequency in the range of 10-20 kHz, while a current pickup is used to control the amplitude. Whenever a rotational movement is applied, a vibration arises perpendicular to both the primary vibrating mode and the rotation motion. This effect can be sensed by a further electrode pair that provides an angular rate measure in open-loop mode. An additional pair of electrodes can be used to react to the Coriolis effect, thus allowing a closed-loop configuration. Since the cost of an electronic system involving a sensor is mainly due to the sensor itself and to its conditioning analog circuitry, the basic idea behind the architecture in Fig. 3 is to use a low-cost sensor and as little analog signal processing as possible while compensating nonidealities through digital signal conditioning. The analog front end in Fig. 3 only performs the functions of driving the sensor's electrodes through pairs of digital-to-analog converters and of signal acquisition by means of successive-approximation-register-type analog-to-digital converters and amplifiers. It also provides a regulated power supply to the digital section. All modules are digitally controlled since the gain coefficients, offset values, and reference voltages are set by means of dedicated registers accessed via the JTAG bridge by the digital processor.
Fig. 3. Sensor control system [9] using the 8051 core and layout of the prototyping digital chip.
All nontrivial signal processing required for sensor conditioning, i.e., filtering, function generation, and demodulation, is performed by the digital section, which also monitors system activity and handles communication with external devices. Both dedicated and general-purpose computing resources are available in the digital part to achieve a good tradeoff between power consumption and flexibility. The DSP unit in Fig. 3 contains dedicated circuits for digital signal processing: FIR/IIR filters to remove noise/interference sources and a digital phase-locked loop for demodulation of the sensor response and for function generation, such as the sine wave that keeps the gyro in resonance. General-purpose tasks are managed by the power-optimized 8051 core, provided in a configuration with on-chip program/data memories and standard parallel I/O plus UART and serial peripheral interfaces (SPIs) for communication. In this system, the 8051 core is in charge of monitoring the DSP chain and managing communication/control flows among the mixed-signal part via JTAG, the DSP unit, the internal memories, and the external devices. The silicon prototype features one chip for the digital part and one chip for the analog part, with the latter also integrating the gyrosensor, realized as a microelectromechanical device achieving the following performance [9]: ±300 °/s dynamic range, 5 mV/°/s sensitivity, and 0.1% of full-scale nonlinearity. A single-chip realization with both analog and digital circuitry is currently being integrated. Fig. 3 reports the layout of a prototype chip implementing the whole digital part in 0.35-µm CMOS technology. A configuration with a 32-kb program memory and two 4-kb data memories is sketched. The circuit complexity amounts to roughly 100 kgates of logic, with 12.5 kgates due to the 8051 core with timer/counter, UART, and SPI. The chip has an overall area of 19.9 mm² and works at a 20-MHz clock frequency.
VI. CONCLUSION
The single-chip integration of control systems is pushing the demand for reusable and power-optimized macrocells, particularly for microcontroller cores. Commercially available microcontrollers typically save power at the expense of decreased computational capability. With reference to an 8051 core, in this paper, a novel clustered clock-gating technique is applied at RTL level to reduce dynamic power consumption without any performance loss while preserving macrocell reusability. When implementing the 8051 core in CMOS technology, the proposed approach leads to a 37% power saving, which is higher than the 10% permitted by state-of-the-art clustered-gating strategies or the 29% of automatic clock-gating insertion in commercial CAD tools, such as the widely used Synopsys. To assess its full functionality, the power-optimized 8051 has been proven in silicon, embedded in a gyrosensor conditioning system.
