Introduction
An important class of digital systems includes applications, such as video image processing and speech recognition, which are extremely memory-intensive. In such systems, much of the power is consumed by memory accesses. In most of today's microprocessors, the instruction memory, including the cache memory, is one of the main power consumers. The on-chip caches of the DEC Alpha 21164 chip dissipate 25% of the total power of the processor. The StrongARM SA-110 processor from DEC, which specifically targets low-power applications, dissipates about 27% of its power in the instruction cache [1]. Thus, employing low-power memory can greatly reduce the overall power consumption of digital systems. The most effective way to reduce the power dissipation in digital systems is to lower the supply voltage. However, lowering the supply voltage increases the memory access delay. Although lowering the threshold voltage is effective in improving the memory access time, it cannot be applied to a large-scale memory, because lowering the threshold voltage causes an explosive increase in subthreshold leakage current. The increase in subthreshold current is especially critical for a large-scale memory, because it has many leakage paths from the power supply to ground. Some techniques have been proposed that cut off the leakage current of logic circuits with MT-CMOS (Multiple Threshold CMOS) when systems are inactive [2]. However, these techniques are not applicable to memory, because they cannot retain the data in memory when the power source is cut off. Optimization techniques that consider both static and dynamic power dissipation are therefore required.
Especially for embedded systems, low-power applications also require design flexibility, which results in the need for implementation using pre-optimized cores. Current semiconductor technology allows the integration of processor cores and memory modules on a single die, which enables the implementation of a system on a single chip. However, design productivity under the traditional synthesis process has not kept pace with the exponential growth of both applications and implementation technology. Shrinking time-to-market has made this situation worse. There is a wide consensus that only the reuse of highly optimized cores can meet the demands of pending applications and ultra-large-scale integration. Therefore, core-based system design methodology attracts much interest from silicon vendors.
In this paper, we propose an optimization technique for power reduction that uses a small subprogram memory together with multiple supply and threshold voltages. Frequently executed sequences of object code are allocated to the subprogram memory. A low supply voltage and a low threshold voltage are assigned to the subprogram memory so as to minimize power dissipation under a time constraint. Our technique targets a typical application-specific system-on-chip consisting of a simple processor core and two simple instruction memories: a main program memory and a subprogram memory. A compiler simultaneously determines the sizes of the two memories, their supply voltages, and their threshold voltages.
The rest of the paper is organized as follows. In Section 2, we discuss the motivation for our work and present our concept for optimizing the instruction memory. A power optimization technique based on code allocation and voltage scaling is proposed in Section 3. Section 4 presents experimental results and discusses the effectiveness of the approach. Section 5 concludes the paper.
Memory Partitioning
In today's memory circuits, the array is usually partitioned into several segments, and only one of them is activated at a time so as to reduce the power consumption, as shown in Figure 1 [4, 5, 6]. In such memories, the power consumption can be calculated as the sum of the power dissipated in a single segment and the power dissipated in charging the global bit lines. Let us assume that the power dissipated in charging the global bit lines is proportional to the number of segments, and that the power dissipated in a single segment is proportional to the size of the segment. Under these assumptions, the memory power consumption can be approximated by (5), where $N_{seg}$ and $N_{word}$ denote the number of segments and the number of words, and $\alpha_e$, $\beta_e$, and $\gamma_e$ are the coefficients of the respective terms.
The first and second terms of (5) represent the energy dissipated in charging the global bit lines and the energy dissipated in a single memory segment, respectively. The last term represents a constant factor in the memory power consumption. From (5), it is easy to derive that the number of memory segments which minimizes the memory power consumption is $\sqrt{(\beta_e/\alpha_e) \cdot N_{word}}$. Therefore, the power consumption of a memory whose array is optimally partitioned into segments is formulated by (6). Memory access time can also be formulated as (8). We use (6) and (8) in this paper as the energy and delay models of memory.
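The intermediate steps can be sketched as follows. The form of the first line is an assumption reconstructed from the verbal description of (5): a term proportional to the number of segments, a term proportional to the segment size $N_{word}/N_{seg}$, and a constant. It is not a quotation of the original equations, which may carry additional factors.

```latex
\begin{align*}
E(N_{seg}) &= \alpha_e N_{seg} + \beta_e \frac{N_{word}}{N_{seg}} + \gamma_e
  && \text{(assumed form of (5))} \\
\frac{dE}{dN_{seg}} &= \alpha_e - \beta_e \frac{N_{word}}{N_{seg}^{2}} = 0
  \;\Longrightarrow\;
  N_{seg}^{*} = \sqrt{\frac{\beta_e}{\alpha_e} N_{word}} \\
E(N_{seg}^{*}) &= \sqrt{\alpha_e \beta_e N_{word}} + \sqrt{\alpha_e \beta_e N_{word}} + \gamma_e
  = 2\sqrt{\alpha_e \beta_e N_{word}} + \gamma_e
\end{align*}
```

Under this assumed form, the energy of an optimally partitioned memory grows with the square root of its size, which is why moving hot code into a much smaller memory pays off.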
Our Approach
Memory reference locality is a well-known property of many kinds of applications [7, 8]: only a few addresses in memory are accessed frequently. Therefore, allocating these frequently executed sequences of object code to a low-power subprogram memory reduces the total memory energy [7, 8]. However, if a low Vdd is used to reduce the energy dissipation, we have to employ a low Vth to compensate for the performance degradation. Under such conditions, the subprogram memory should be small, so that the stand-by leakage current does not grow even though a low Vth is used for it. Our power optimization technique finds the optimal point in these trade-offs, where the total energy consumption is minimized under a performance constraint. In many embedded systems, the object code of the application programs need not be modified after system design is completed, so the object code is stored in ROM. Our approach likewise targets systems that employ embedded ROM. The most important merit of our approach is its suitability for IP-based system design: it requires no modification to the internal processor architecture. All that is required is for the compiler to insert extra jump instructions for subprogram calls and for returns from the subprogram. A compiler measures the execution count of each basic block using some sample data.
A basic block is characterized by its execution count and its number of instructions. We present a detailed explanation of the memory optimization problem in Section 3.2.
For a given set of basic blocks and models of ROM, a compiler optimizes the code allocation, Vth, and Vdd simultaneously under a time constraint.
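To make the locality argument concrete, the sketch below shows one way a profile of basic blocks could be represented and how much of the instruction-fetch traffic a small subprogram memory can capture. The `BasicBlock` type, the `hot_coverage` helper, the selection rule, and the sample profile are all hypothetical illustrations, not the authors' compiler or data.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class BasicBlock:
    name: str
    exec_count: int  # times the block ran on the sample data
    num_instr: int   # number of instructions (words) in the block

    @property
    def fetches(self) -> int:
        # Instruction fetches this block contributes over the whole run.
        return self.exec_count * self.num_instr


def hot_coverage(blocks: Sequence[BasicBlock],
                 word_budget: int) -> Tuple[List[BasicBlock], float]:
    """Pick the most frequently executed blocks that fit in word_budget words.

    Returns the chosen blocks and the fraction of all fetches they serve.
    This simple rule only illustrates locality; it is not the optimization
    described in the paper.
    """
    total = sum(b.fetches for b in blocks)
    chosen: List[BasicBlock] = []
    used = 0
    covered = 0
    for b in sorted(blocks, key=lambda blk: blk.exec_count, reverse=True):
        if used + b.num_instr <= word_budget:
            chosen.append(b)
            used += b.num_instr
            covered += b.fetches
    return chosen, (covered / total if total else 0.0)


# Hypothetical profile: one hot loop surrounded by rarely executed code.
profile = [
    BasicBlock("init", 1, 40),
    BasicBlock("loop_body", 10_000, 12),
    BasicBlock("loop_tail", 10_000, 4),
    BasicBlock("error_path", 2, 60),
    BasicBlock("cleanup", 1, 20),
]
chosen, fraction = hot_coverage(profile, word_budget=16)
print([b.name for b in chosen], round(fraction, 4))
```

In this made-up profile, 16 of the 136 words serve more than 99% of the fetches, which is the kind of skew the subprogram memory is meant to exploit.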
Problem Formulation
In this section, we present a problem formulation for the memory optimization problem.
First, we give the notation used in the formulation. Next, we present the problem formulation. The energy and access time of each memory follow the models (6) and (8).
The objective function of this optimization problem is (9). It represents the total energy consumption, including dynamic energy consumption and stand-by energy consumption. The constraints are (10) and (11); the first is the timing constraint. The variables to be determined are the supply voltages $v_{dm}$ and $v_{ds}$ and the threshold voltages $v_{tm}$ and $v_{ts}$ of the main program and subprogram memories, and the set of allocations $a_i$.
The memory optimization problem is formally defined as follows: "For a given time constraint $T_{const}$ and a given set of basic blocks $B$, find $v_{dm}$, $v_{ds}$, $v_{tm}$, $v_{ts}$, and a set of $a_i$ which minimize the objective function (9) under the time constraint."
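As a rough illustration of what evaluating the objective and the timing constraint could look like, the sketch below computes total energy (dynamic plus stand-by) and execution time for one candidate allocation and one set of voltages. Everything in it is an assumption made for illustration: the coefficient values, the square-root energy model scaled by Vdd squared, the alpha-power-law delay, and the exponential leakage model stand in for the paper's equations (6), (8), and (9) to (11), which are not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

# All coefficients are hypothetical placeholders, not values from the paper.
ALPHA_E, BETA_E, GAMMA_E = 1.0, 1.0, 5.0  # energy-model coefficients
K_DELAY, VELOCITY_EXP = 1.0, 1.3          # alpha-power-law delay parameters
K_LEAK, SWING = 1e-3, 0.1                 # leakage scale, subthreshold swing (V/decade)


@dataclass(frozen=True)
class Block:
    count: int  # X_i: execution count measured on sample data
    size: int   # N_i: number of instructions


def access_energy(n_words: int, vdd: float) -> float:
    """Dynamic energy per fetch of an optimally partitioned memory (assumed Vdd^2 scaling)."""
    if n_words == 0:
        return 0.0
    return (2.0 * math.sqrt(ALPHA_E * BETA_E * n_words) + GAMMA_E) * vdd ** 2


def access_time(vdd: float, vth: float) -> float:
    """Access delay from the alpha-power law; requires vdd > vth, ignores memory size."""
    return K_DELAY * vdd / (vdd - vth) ** VELOCITY_EXP


def leak_power(n_words: int, vdd: float, vth: float) -> float:
    """Stand-by power: proportional to size, exponential in -vth."""
    return K_LEAK * n_words * vdd * 10.0 ** (-vth / SWING)


def evaluate(blocks: Sequence[Block], alloc: Sequence[int],
             vdm: float, vtm: float, vds: float, vts: float) -> Tuple[float, float]:
    """Return (total_energy, exec_time); alloc[i] == 1 places block i in the subprogram memory."""
    n_main = sum(b.size for b, a in zip(blocks, alloc) if a == 0)
    n_sub = sum(b.size for b, a in zip(blocks, alloc) if a == 1)
    f_main = sum(b.count * b.size for b, a in zip(blocks, alloc) if a == 0)
    f_sub = sum(b.count * b.size for b, a in zip(blocks, alloc) if a == 1)
    exec_time = f_main * access_time(vdm, vtm) + f_sub * access_time(vds, vts)
    dynamic = f_main * access_energy(n_main, vdm) + f_sub * access_energy(n_sub, vds)
    standby = exec_time * (leak_power(n_main, vdm, vtm) + leak_power(n_sub, vds, vts))
    return dynamic + standby, exec_time


def meets_deadline(exec_time: float, t_const: float) -> bool:
    """Timing constraint: the program must finish within t_const."""
    return exec_time <= t_const


blocks = [Block(count=1, size=1000), Block(count=10_000, size=16)]
e_all_main, t_all_main = evaluate(blocks, [0, 0], 1.5, 0.4, 1.5, 0.4)
e_split, t_split = evaluate(blocks, [0, 1], 1.5, 0.4, 1.5, 0.4)
print(e_split / e_all_main, meets_deadline(t_split, t_all_main * 1.01))
```

With these placeholder models, moving the hot block into a 16-word memory lowers the energy at equal voltages, while lowering a threshold voltage shortens the access time at the cost of higher leakage.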
Algorithm
The worst-case computation time to solve the memory optimization problem depends strongly on the time needed to search for an optimal code allocation $\{a_i\}$, because finding the optimal $v_{dm}$, $v_{ds}$, $v_{tm}$, and $v_{ts}$ is trivial compared with finding the optimal code allocation. If a naive algorithm is applied, the worst-case computation time is $O(2^{|B|})$, where $B$ is the set of basic blocks appearing in a given program. In this paper we use the heuristic algorithm described in Fig.
The complexity of this heuristic algorithm is
The input to the algorithm Memory optimization is a set of basic blocks $B$, where each basic block $b_i \in B$ is characterized by its execution count $X_i$, its size $N_i$, and its allocation $a_i$. In the first step, every $a_i$ is set to zero, which means that all the basic blocks are allocated to the main program memory. Next, the algorithm selects the basic block that most improves the objective function when it is relocated to the subprogram memory and the optimal $v_{dm}$, $v_{ds}$, $v_{tm}$, and $v_{ts}$ are assigned.
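The selection step described above can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: it assumes the step is repeated until no single relocation improves the objective, and it takes the objective as a caller-supplied function `cost(alloc)` that is expected to return the energy after the best voltages have been chosen for that allocation (or `math.inf` if the time constraint cannot be met). The names `greedy_allocate`, `toy_cost`, `BLOCKS`, and `LEAK_PENALTY` are hypothetical, and the toy cost is a stand-in rather than the paper's objective.

```python
import math
from typing import Callable, List, Sequence, Tuple

Cost = Callable[[Sequence[int]], float]


def greedy_allocate(n_blocks: int, cost: Cost) -> Tuple[List[int], float]:
    """Start with every block in main memory, then repeatedly move the
    single block whose relocation lowers the cost the most."""
    alloc = [0] * n_blocks          # a_i = 0: block i is in the main program memory
    best = cost(alloc)
    while True:
        best_i, best_val = None, best
        for i in range(n_blocks):
            if alloc[i] == 1:
                continue
            alloc[i] = 1            # tentatively relocate block i
            val = cost(alloc)
            alloc[i] = 0            # undo the tentative move
            if val < best_val:
                best_i, best_val = i, val
        if best_i is None:          # no relocation improves the objective
            return alloc, best
        alloc[best_i] = 1           # commit the best move of this round
        best = best_val


# Toy stand-in objective: (execution count, size) per block.
BLOCKS = [(1, 1000), (10_000, 16), (5_000, 8), (2, 500)]
LEAK_PENALTY = 50.0  # hypothetical stand-by cost per word of low-Vth subprogram memory


def toy_cost(alloc: Sequence[int]) -> float:
    n_main = sum(size for (_, size), a in zip(BLOCKS, alloc) if a == 0)
    n_sub = sum(size for (_, size), a in zip(BLOCKS, alloc) if a == 1)
    energy = 0.0
    for (count, size), a in zip(BLOCKS, alloc):
        n_words = n_sub if a == 1 else n_main
        # Energy per fetch grows with the square root of the memory size.
        energy += count * size * math.sqrt(n_words)
    return energy + LEAK_PENALTY * n_sub


allocation, final_cost = greedy_allocate(len(BLOCKS), toy_cost)
print(allocation, final_cost < toy_cost([0, 0, 0, 0]))
```

Each round evaluates the cost once per block still in main memory, so the loop trades optimality for far fewer evaluations than enumerating all $2^{|B|}$ allocations.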
Experimental Results
We use the three benchmark programs shown in Table 1 in these experiments. The benchmark programs are compiled by the gcc-dlx compiler, which is based on GNU CC Ver. 2.7.2 for the DLX architecture [9]. Table 2 describes the three kinds of sample video images used as input to the MPEG2 program. Three kinds of sample input data were also used for the Arithmetic calculator and the TV remote controller.
First, we evaluate the following four cases using the MPEG2 decoder and Image1.
Case1
Optimizing Vdd and Vth of a program memory without using a subprogram memory.
Case2
Optimizing memory allocation to main program and subprogram memory without voltage scaling. Vdd = 1.5V and Vth = 0.4V are used.
Case4
Solving the memory optimization problem by our algorithm described in section 3.3.
The results are shown in Figure 6. The vertical axis represents energy consumption normalized to the minimum energy consumption in Case4. The horizontal axis represents the time constraint normalized to the execution time of the program when Vdd = 1.5V, Vth = 0.5V, and the subprogram memory is not used. The results show that the most effective way to reduce the memory power consumption is to use the subprogram memory, which reduces the memory power consumption to less than one third. In addition, voltage scaling can halve the memory power consumption. If the voltages are not scaled, relaxing the time constraint does not lead to an energy reduction, because a code allocation optimized for high performance is almost the same as an allocation optimized for power reduction. The results of Case3 and Case4 demonstrate that about 30% power reduction can be achieved by using multiple voltages. From another point of view, we can achieve a 10% speed-up without increasing energy consumption by using multiple supply voltages. Especially under a very strict time constraint, using different threshold voltages for the two memories is effective, because using a low threshold voltage for the large main program memory leads to an explosive increase in subthreshold leakage current. In total, the energy reduction achieved by our approach is up to 80% when the time constraint is "×0.9".
The optimized memory sizes in words, and the Vdd and Vth values of the main program and subprogram memories, are shown in Table 3. A time constraint of "×0.7" is used. The results indicate that the sizes of the subprogram memories are from 4% to 10% of those of the main program memories. This is a key reason for the drastic energy reduction in memory. In some cases, the supply voltage of the subprogram memory is higher than that of the main program memory, because assigning a higher voltage to the subprogram memory reduces the total execution time, and the resulting slack allows the supply voltage of the main program memory to be lowered. As the results show, the optimal size, Vdd, and Vth of the memories depend strongly on the time constraint, the data, and the application program. Therefore, design flexibility that enables system designers to easily optimize design trade-offs is very important.
Conclusion
As device sizes shrink, the supply voltage of CMOS circuits is also scaled down. In the near future, a 1.5V or 1.0V supply voltage will become common. In such an era, scaling the threshold voltage will have a strong impact on both energy consumption and circuit delay. In this paper, we have proposed a system-level memory power reduction technique based on voltage scaling and code allocation. Experimental results demonstrated that the energy consumption of memories optimized by our approach can be less than 20% of the energy of memory that does not employ our approach. Since the optimal size, Vdd, and Vth of the memories depend strongly on the time constraint, the kind of data, and the kind of application, easy programmability of design parameters is indispensable for sophisticated design. The design flexibility of our approach should be a key technology for large, low-power SoC design.
Our future work will be devoted to extending the proposed optimization technique to take the data memory into account.
