Abstract-Power consumption is an increasingly pressing problem in modern processor design. Because on-chip caches usually consume a significant amount of power, power and energy consumption have become some of the most important design constraints, and the cache is one of the most attractive targets for power reduction. This paper presents an approach to reducing the dynamic power consumption of a CPU cache using an inverted cache architecture. Our approach reduces dynamic write power dissipation based on the number of ones and zeros in the incoming cache block data, using a bit to indicate whether the block is mostly ones or mostly zeros. This architecture reduces the dynamic write power by 17%. We use the Proteus simulator to test the proposed circuit and perform the experiments on a modified version of the CACTI 6.0 simulator.
I. INTRODUCTION
Modern microprocessors employ on-chip caches because caches can significantly reduce the speed gap between the processor and main memory. To support high clock frequencies, on-chip caches are usually implemented using densely packed static random access memory (SRAM) cells. The number of transistors in these caches is increasing, resulting in an increase in power consumption.
A. Static and Dynamic Power
Electrical power can be defined as the product of the electrical current through a power consumer and the voltage at its terminals. The static power (P_static) is dissipated due to leakage currents and amounts to less than 5% of the total power dissipated at 0.25 μm. It has been observed that the leakage power increases by about a factor of 7.5 for each technology generation and is expected to account for a significant portion of the total power in deep sub-micron technologies [1]. Accordingly, the leakage power component grows to 20-25% at 130 nm [2]. The dynamic component (P_dynamic) of the total power is dissipated during switching between logic levels, due to the charging and discharging of capacitance and due to a small short-circuit current. For example, when the input signal of a CMOS inverter switches from one logic level to the opposite, there is a short instant when both the pMOS and nMOS transistors conduct. During that instant a small short-circuit current I_sc flows from V_dd to Gnd. Short-circuit power can consume up to 30% of the total power budget if the circuit is active and the transition times of the transistors are substantially long. However, through careful design of the transition edges, the short-circuit power component can be kept below 10-15% [3]. In CMOS circuits, the dynamic component accounts for 70-90% of the total power dissipation [3].
In recent years, the heat resulting from power consumption in the cache, due to the dynamic power caused by transactions in the SRAM cache, has restricted the development of caches. For example, the DEC 21164 dissipates 25% and the StrongARM-110 dissipates 42% of its total power in caches. Moreover, power consumption is getting worse with emerging nano-scale technologies: the decrease of the threshold voltage increases leakage, i.e., static power, and hence the total power consumption. For this reason, numerous techniques have been proposed to reduce cache memory power. These techniques target both dynamic and static power consumption.
B. Power Consumption Trends In Integrated Circuits
Power poses challenges as processors are scaled: power must be brought in and distributed around the chip, modern processors use hundreds of pins and multiple interconnect layers just for power and ground, and power is wasted as heat and must be ...
Access time relative to the L1 cache:
Higher Cache Levels 10x
Main Memory 100x
What this means is that using data already in a Level 1 (L1) cache is 100 times faster than fetching the data from the main memory.
It is important to realize that performance optimizations can be very specific: they depend on the exact architecture of the machine (processor, memory, etc.), the exact version of the compiler, the exact version of the operating system, and the particular configuration of the program being optimized.
The rest of the paper is organized as follows. Related work is discussed in Section II. Section III describes the SRAM cell design. Section IV describes the proposed design and implementation. Section V presents the simulation and experiments. Finally, conclusions are presented in Section VI.
II. RELATED WORK
Many approaches have been proposed to reduce cache power. Low Power Cache Architecture [5] separates the cache into two banks: one stores data that contains mostly zeros and the other stores data that contains mostly 1s. This separation aims to reduce switching activity when replacing data in cache lines. The paper shows up to 35% power reduction for small caches and about 6-10% for medium-sized caches. However, this approach has a problem with cache size, which is either halved or doubled: halving increases the miss rate, and doubling increases the cost. The increased miss rate is one of its major problems.
A Variable Bitline Data Cache for low power design [5] proposes a Variable Bitline Data Cache (VBDC), which exploits the popularity of NWV stored in the cache. In the VBDC design, the cache data array is divided into several sub-arrays so that each data pattern is accessed with a different bitline length. The VBDC can shut off the corresponding unused high arrays to reduce its dynamic and static power consumption. The VBDC achieves low power consumption by reducing the bitline length.
Low power cache architecture with security mechanism [7] presents a novel, easily implemented cache architecture that adds a small cache and adopts a certain operation mechanism. Compared with a traditional cache architecture, it reduces the miss rate by between 20% and 50%, reduces power consumption by about 8.5%, and is secure at the same time. The paper presents both theoretical analysis and experimental results.
Frequent Value Data Cache [8] examines how the frequent value phenomenon can be exploited in designing a cache that trades off performance with energy efficiency. It proposes the Frequent Value Cache (FVC), in which storing a frequent value requires few bits because it is stored in encoded form, while all other values are stored in unencoded form using 32 bits. The data array is partitioned into two arrays such that if a frequent value is accessed only the first data array is accessed; otherwise an additional cycle is needed to access the second data array.
Low power architecture cache for embedded systems [9] introduces a novel low-power cache architecture for embedded systems, based on the low power architecture with a modification: the cache is separated associatively into two banks, mostly zeros and mostly ones, in order to reduce cache misses.
Dynamic Zero Compression for Cache Energy Reduction [10] introduces a novel technique for cache energy reduction, dynamic zero compression (DZC), which exploits the prevalence of zero bytes stored in the cache. DZC adds a zero indicator bit (ZIB) to each cache byte that indicates whether the byte contains all zero bits.
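The zero indicator bit idea can be illustrated with a small software model. This is only a sketch under our own assumptions: the function names `dzc_encode` and `dzc_decode` are hypothetical, and the model simply separates zero bytes from non-zero bytes; it does not reproduce the circuit-level design of [10].

```python
from typing import List, Tuple


def dzc_encode(block: bytes) -> Tuple[List[int], bytes]:
    """Split a block into per-byte zero indicator bits and its non-zero bytes."""
    # ZIB is 1 when the byte is all zeros, 0 otherwise (illustrative convention).
    zib = [1 if byte == 0 else 0 for byte in block]
    nonzero = bytes(byte for byte in block if byte != 0)
    return zib, nonzero


def dzc_decode(zib: List[int], nonzero: bytes) -> bytes:
    """Rebuild the original block from the indicator bits and non-zero bytes."""
    remaining = iter(nonzero)
    return bytes(0 if flag else next(remaining) for flag in zib)


block = bytes([0x00, 0x12, 0x00, 0x00, 0xFF, 0x00, 0x00, 0x07])
zib, nonzero = dzc_encode(block)
print(zib)           # one indicator bit per byte of the block
print(len(nonzero))  # bytes that still need a full data access
```

In this model only the bytes whose indicator bit is 0 require a full data-array access; the zero bytes are represented by their indicator bit alone.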
Saving register-file static power by monitoring instruction sequence in ROB [11] introduces a monitoring mechanism built into the ROB and the register file to identify when each register is in use. This mechanism can be integrated with a DVS approach on the datapath to power down (or up) the supply voltage to a register when it is idle (or active).
A leakage-aware L2 cache management technique for producer-consumer sharing in low-power chip multiprocessors [12] proposes a novel leakage management technique for applications with producer-consumer sharing patterns. By exploiting particular access sequences observed in producer-consumer sharing patterns and the spatial locality of shared buffers, the technique enables a more aggressive turnoff of the L2 cache blocks of these buffers.
On the design of low-power cache memories for homogeneous multi-core processors [13] investigates the impact of level-1 cache (CL1) parameters, level-2 cache (CL2) parameters, and cache organizations on the power consumption and performance of multi-core systems. The authors simulate two 4-core architectures, both with private CL1s, but one with a shared CL2 and the other with private CL2s.
III. SRAM CELL DESIGN
Static random access memory (SRAM) has been widely used as the representative memory for logic LSIs. This is because an SRAM array operates as fast as logic circuits do and consumes little power in standby mode. Another advantage of the SRAM cell is that it is fabricated by the same process as logic, so it incurs no extra process cost. These features cannot be attained by other memories such as DRAM and Flash. The SRAM cell array normally occupies around 40% of a logic LSI nowadays, so characteristics of the logic LSI such as operating speed, power, supply voltage, and chip size are limited by the characteristics of the SRAM array. Therefore, good design of the SRAM cell and cell array is essential to obtain a high-performance, low-power, low-cost, and reliable logic LSI. An SRAM cell is the key SRAM component, storing binary information. A typical SRAM cell uses two cross-coupled inverters forming a latch, plus access transistors. The access transistors enable access to the cell during read and write operations and provide cell isolation when the cell is not accessed. An SRAM cell is designed to provide non-destructive read access, write capability, and data storage for as long as the cell is powered [13]. The 6T CMOS SRAM cell is the most popular SRAM cell due to its superior robustness, low power, and low-voltage operation.
Power consumption in a digital integrated circuit is governed by (1):

$$P = \alpha C V^{2} f + V I_{off} \qquad (1)$$

where α is the average switching activity factor of the transistors, C is the capacitance, V is the power supply voltage, f is the clock frequency, and I_off is the leakage current. The first term of the equation is the dynamic power and the second term is the static power.
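To make the roles of the two terms concrete, the following sketch evaluates them numerically. It assumes the common form P = αCV²f + V·I_off for (1); the function names and all numeric values are hypothetical and chosen only for illustration, not taken from our experiments.

```python
def dynamic_power(alpha: float, capacitance: float, voltage: float, frequency: float) -> float:
    """Dynamic term: alpha * C * V^2 * f, in watts."""
    return alpha * capacitance * voltage ** 2 * frequency


def static_power(voltage: float, i_off: float) -> float:
    """Static term: V * I_off, in watts."""
    return voltage * i_off


def total_power(alpha: float, capacitance: float, voltage: float,
                frequency: float, i_off: float) -> float:
    """Sum of the dynamic and static terms."""
    return dynamic_power(alpha, capacitance, voltage, frequency) + static_power(voltage, i_off)


# Hypothetical operating point: 1 nF switched capacitance, 1 V supply, 1 GHz clock.
p_dyn = dynamic_power(alpha=0.1, capacitance=1e-9, voltage=1.0, frequency=1e9)
p_dyn_half = dynamic_power(alpha=0.05, capacitance=1e-9, voltage=1.0, frequency=1e9)
print(p_dyn, p_dyn_half)
```

Because the dynamic term is linear in α, halving the switching activity halves the dynamic power while leaving the static term unchanged, which is why techniques that reduce switching on writes target the first term.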
Copyright © 2013 MECS I.J. Modern Education and Computer Science, 2013, 2, 12-18

A. SRAM Read Operation
In the 6-transistor circuit depicted in Fig. 3, during the read operation one node of the RAM cell pulls the bit line up through the access transistor and the PFET load, and the other node pulls the bit line down through the pass transistor and the NFET load.
B. SRAM Write Operation
When the word line is selected, Q5 and Q6 are on and the levels at Q5 and Q6 are passed to the bit lines. Logic "1": when Q1 is off and Q2 is on, Q5 is at V_dd and Q6 is at V_ss. Logic "0": when Q3 is off and Q4 is on, Q5 is at V_ss and Q6 is at V_dd.
C. Static Noise Margin of SRAM Cells
The noise margin high and noise margin low are defined as in (2). A lower V_th of the driver MOSFET decreases the SNM. The ratio of the gate width of the driver MOS to that of the access MOS is called the β ratio. A larger β ratio increases the SNM, because the rise of the N0 voltage by ΔV_N can be restrained by increasing the driver MOS current and decreasing the access MOS current. Generally, in an SRAM cell, the β ratio is over 1.5.
Because the full CMOS 6-T memory cell statically retains data, it does not need special treatment such as refresh. During a read operation, fast access time is achieved because the complementary bit line signals make it possible to use a differential amplifier. Therefore, SRAM is used as cache memory.
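For reference, the noise margins of an inverter are conventionally written in terms of its transfer-curve voltages. The symbols below are the standard textbook ones and are an assumption on our part about the notation intended in (2):

```latex
NM_H = V_{OH} - V_{IH}, \qquad NM_L = V_{IL} - V_{OL}
```

Here V_OH and V_OL are the high and low output levels, and V_IH and V_IL are the minimum input voltage recognized as high and the maximum input voltage recognized as low.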
IV. PROPOSED DESIGN AND IMPLEMENTATION
Leakage power consumption becomes a major design issue in realizing low-power microprocessors as the process technology advances. We introduce a novel low-power on-chip cache architecture that reduces dynamic power dissipation in the cache. Our design is based on the low power architecture of [1], with modification.
We assume that all data entering the cache should be mostly zeros; if the data is mostly ones, we invert it and set a flag bit in the cache to indicate that the data is inverted. We call this additional bit the invert bit flag. It is calculated before new data is added to the cache, when a cache miss or a cache write occurs, and it has the value 0 if the data is mostly zeros or 1 if the data is mostly ones, meaning that the data must be inverted. This value is calculated using a comparator (or a simple decoder), which has the data bus as input and a single output bit, as shown in Fig. 1. The comparator circuit does not exceed a few hundred transistors, while today's cache circuits contain millions of transistors, so the power consumed by the comparator is negligible compared with the total power of the cache.
We can use the circuit in Fig. 1 to classify data as mostly ones or mostly zeros. Our design depends on the voltage divider concept. We connect each input bit to the V_dd of a transistor and connect all of the transistor outputs to the input of one further transistor; whether this combined input is enough to switch that transistor on distinguishes mostly-ones data from mostly-zeros data. For example, for the 3-bit input 110, T1 and T2 conduct and T3 does not; the resulting input to T4 is enough to switch it on, so the output of T4 (my_bit) is 0, which means that the input data is mostly ones.
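The classification and inversion step can be modeled in software as follows. This is a behavioral sketch only, not the transistor-level comparator: the names `invert_flag` and `encode` are hypothetical, the flag follows the convention that 1 means mostly ones (the circuit output my_bit uses the opposite polarity in the 110 example), and treating a tie as "not inverted" is our own assumption.

```python
from typing import Tuple


def invert_flag(word: int, width: int) -> int:
    """Return 1 if the word has more ones than zeros, else 0 (ties give 0)."""
    ones = bin(word).count("1")
    return 1 if ones > width - ones else 0


def encode(word: int, width: int) -> Tuple[int, int]:
    """Return (stored_word, flag); the word is inverted when it is mostly ones."""
    flag = invert_flag(word, width)
    mask = (1 << width) - 1          # keep only the low `width` bits
    stored = (~word & mask) if flag else word
    return stored, flag


stored, flag = encode(0b110, 3)      # the 3-bit example from the text
print(format(stored, "03b"), flag)   # prints "001 1"
```

For the input 110 the word is mostly ones, so it is stored inverted as 001 with the flag set; an input such as 100 is stored unchanged with the flag cleared.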
The inverting operation itself consumes no additional power, because the SRAM cell holds both the data and the inverted data. The memory core is composed of memory cells arranged in rows and columns. Fig. 2 shows the typical 6-transistor memory cell design.
In our design, as shown in Fig. 6, we enhance the SRAM cell by adding two transistors (M7 and M8), one at the end of Data (M7) and the other at the end of Data' (M8), and connect their V_DD to the inverted bit and to its complement (inverted bit'), respectively. The four cases, shown in Table 1, are as follows. When the data input is 1 and the inverted bit is 0, M7 is OFF and M8 is ON, so the data out is 1 (B" = B). When the data input is 1 and the inverted bit is 1, M7 is ON and M8 is OFF, so the data out is 0. When the data input is 0 and the inverted bit is 0, M7 is OFF and M8 is ON, so the data out is 0. When the data input is 0 and the inverted bit is 1, M7 is ON and M8 is OFF, so the data out is 1. Fig. 7a shows an example using the typical model of an SRAM cache and Fig. 7b shows our enhancement. In our design we assume that the block is divided into 8 parts, each with its own inverted bit indicator.
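The four read-out cases listed above amount to an exclusive-OR of the stored bit and the inverted bit. The sketch below is a logical model of that behavior only; the function name `read_out` is hypothetical, and the M7/M8 states are reproduced from the text rather than simulated.

```python
def read_out(data_bit: int, inverted_bit: int) -> int:
    """Logical data out of the enhanced cell: the data bit XOR the inverted bit."""
    return data_bit ^ inverted_bit


print("data inv | M7  M8  | out")
for data_bit in (1, 0):
    for inverted_bit in (0, 1):
        # Per the text, M7 is ON exactly when the inverted bit is 1, and M8 is its complement.
        m7 = "ON " if inverted_bit else "OFF"
        m8 = "OFF" if inverted_bit else "ON "
        print(f"   {data_bit}   {inverted_bit} | {m7} {m8} |  {read_out(data_bit, inverted_bit)}")
```

When the inverted bit is 0 the cell returns the data bit unchanged, and when it is 1 the cell returns the complement, so data that was inverted on write is restored on read.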
V. SIMULATION AND EXPERIMENTS
To evaluate the effectiveness of the proposed approach, we use the Proteus simulator to test our circuit, and we perform our experiments on a modified version of the CACTI 6.0 [14] simulator using four threads. In the first experiment we use a 64-byte block size (512 bits) and divide the block into 8 parts, each with its own inverted bit, so the overhead of our design is 8/512 = 1/64 (1.5625%). We vary the cache size with a set associativity of 2 and 32 nm technology and obtain the results shown in Fig. 8. In the second experiment we use the same block size, partitioning, and overhead, and vary the set associativity with a 64 KB cache and 32 nm technology; the results are shown in Fig. 9. As Fig. 8 and Fig. 9 show, we achieve a 17% overall enhancement.
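The overhead figure and the per-part inversion can be checked with a short behavioral model. This is a sketch under stated assumptions: the names `overhead`, `encode_block`, and `count_ones` are hypothetical, ties are treated as "not inverted", and the number of ones written is used only as an illustrative proxy; it is not the power model used in the CACTI experiments.

```python
from typing import List, Tuple

BLOCK_BITS = 512   # 64-byte block
PARTS = 8          # each part carries its own inverted bit


def overhead(block_bits: int = BLOCK_BITS, parts: int = PARTS) -> float:
    """Storage overhead of the inverted bits as a fraction of the block size."""
    return parts / block_bits


def count_ones(data: bytes) -> int:
    """Number of 1 bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def encode_block(block: bytes, parts: int = PARTS) -> Tuple[bytes, List[int]]:
    """Invert every part that is mostly ones; return stored data and flags.

    Assumes len(block) is a multiple of `parts`.
    """
    part_len = len(block) // parts
    stored = bytearray()
    flags = []
    for i in range(parts):
        part = block[i * part_len:(i + 1) * part_len]
        ones = count_ones(part)
        flag = 1 if ones > 8 * part_len - ones else 0
        flags.append(flag)
        stored += bytes(byte ^ 0xFF for byte in part) if flag else part
    return bytes(stored), flags


block = bytes([0xFF] * 32 + [0x00] * 32)   # hypothetical 64-byte block
stored, flags = encode_block(block)
print(overhead())                          # 0.015625, i.e. 1.5625%
print(count_ones(block), count_ones(stored) + sum(flags))
```

For this hypothetical block, the four all-ones parts are stored inverted, so the 256 ones of the original data are replaced by four set flag bits.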
VI. CONCLUSION
In this paper, we proposed an inverted low-power cache architecture that reduces dynamic write power dissipation based on the number of ones and zeros in the incoming cache block data, using a bit to indicate whether the block is mostly ones or mostly zeros. The architecture reduces the dynamic write power by 17%, and this value increases as the block size decreases.
Fig. 8. Simulation result (power consumption in mW) using a 64-byte block size, a set associativity of 2, and 32 nm technology.
Fig. 9. Simulation result using a 64-byte block size and 32 nm technology.
