vEC: Virtual Energy Counters
I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M.J. Irwin, and A. Sivasubramaniam  
Microsystems Design Lab 
Pennsylvania State University 
University Park, PA 16802 
1 814 865 9505 
mdl@cse.psu.edu 
 
 
ABSTRACT 
Energy has become a critical issue in processor design, especially 
in embedded environments. Thus, there is a need for tools that
provide accurate and fast energy estimates. In this paper,
we present the design and use of a tool, Virtual Energy Counters 
(vEC), for estimating the energy consumption of user programs. 
vEC is built on top of the Perfmon user library for the
UltraSPARC platform and provides an interface that can
be called from within user programs to estimate their energy
consumption. Estimates are provided for the energy consumed in the
data, instruction, and extended caches, main memory, address bus,
data bus, address pads, and data pads.
Keywords 
Hardware Performance Counters, System Energy Consumption, 
Embedded Systems, Optimizations, Signal Processing. 
1. INTRODUCTION 
Energy has become a critical issue in processor design, especially 
in embedded environments. When designing for embedded 
systems, designers have to take into account both high 
performance and low energy consumption. With this in mind,
power/energy estimation tools that provide fast and accurate
estimates are becoming increasingly important
during code development. Simulators such as SimplePower [2]
provide accurate energy estimates, yet they generally take a
significant amount of time to produce them.
Software-based techniques (e.g., [8]), on the other
hand, are not robust enough to be relied on.
In this paper, we present the design and use of a tool, Virtual 
Energy Counters (vEC), for estimating (measuring) the energy 
consumption of user programs. The tool provides a fast estimation 
of the energy consumption of the main components of modern 
processors such as cache, main memory, and buses.  It is built on 
top of the Perfmon user library [5] for the UltraSPARC platform, 
and may be easily extended for other platforms such as the MIPS 
R10K and Intel’s Pentium processors. Perfmon is a tool that can 
be used by the user-level code to access hardware performance 
counters in the Pentium and UltraSPARC series microprocessors. 
Hardware performance counters are special hardware registers of 
modern microprocessors that monitor the occurrence of hardware 
events in a microprocessor without affecting the performance of 
the program [5, 7]. Some of the typical events that may be 
monitored are cycle count, instruction count, cache references and 
hits/misses, main memory writebacks and references, and branch 
mispredictions. By monitoring these events designers can improve 
program performance. For example, using the branch 
mispredictions, a designer can optimize her/his program to reduce 
the number of mispredicted branches and wasted cycles, thereby 
improving the program execution time. Similarly, by monitoring 
the data cache misses, a designer can modify the data layout 
(dynamically at run-time) to improve program performance. For 
the purposes of this work, the events monitored are used to
estimate memory system energy consumption for a running
program.
vEC provides a means to profile user programs and measure the 
memory system energy consumption. The analysis
applies the analytical energy formulas from [1], which
are primarily based on the number of cache references, hits, and
misses and on capacitance values. The energy estimations provided
by vEC are absolute values (in Joules) and can also be used to
compare the energy consumption of different code
implementations such as those presented in [3]. However, since
the values are derived from event counts and analytical formulas,
they are estimates and can differ from the actual values.
The rest of this paper consists of six sections. Section 2 presents 
related work on energy estimation and optimization. Section 3 
discusses the UltraSPARC hardware performance counters. We 
present the analytical formulas used to compute our energy 
estimations as well as the events monitored in Section 4. Section 5 
presents the vEC user interface, its usage, and the experiments
performed to evaluate our tool. Sections 6 and 7 provide a 
discussion and conclusion, respectively, of the work presented in 
this paper. 
2. RELATED WORK
Numerous compiler transformations (optimizations) have been 
proposed to make user programs automatically run their fastest. 
Some of these transformations are applied at source-level, where 
program access patterns imposed by loop and other control 
structures are visible. Loop nest transformations constitute an
important part of source-level transformations and specifically
target loop nests, where most of the execution time is spent
(especially in multidimensional signal processing and video
applications). Techniques such as loop permutation, loop tiling,
and loop fusion have proven very useful in optimizing the
performance of loop nests, e.g., by enhancing cache performance
and/or improving parallelism [4].

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
PASTE ’01, June 18-19, 2001, Snowbird, Utah, USA.
Copyright 2001 ACM 1-58113-413-4/01/0006…$5.00.

Switching activities of system components determine the dynamic 
power consumption of CMOS VLSI circuits [9]. The switching 
activity depends on the execution patterns of applications. Source-
level compiler transformations determine the execution patterns of 
these applications. Such transformations were originally designed 
to optimize the performance of code and did not focus on 
optimizing the power consumption, an essential design parameter 
when developing power-conscious systems such as embedded 
systems. There has been little effort to analyze their energy
impact. In embedded systems, we need energy estimation tools
that measure the impact of these compiler transformations on
energy consumption. An earlier evaluation of the effect
of compilation techniques on energy consumption can be found
in [8].
Energy measuring tools can be either transition-sensitive or based 
on analytical formulas. Since transition-sensitive simulators 
estimate the energy consumption based on bit-switching activities, 
they take a significant amount of time to generate energy 
estimates. For example, SimplePower is a transition-sensitive 
energy simulator, which estimates the data path energy 
consumption [2]. 
3.  UltraSPARC HARDWARE PERFORMANCE COUNTERS
The UltraSPARC CPU [6] has two 64-bit registers that can be 
used to monitor and collect statistics about different hardware 
events: the Performance Control Register (PCR) and Performance 
Instrumentation Counter (PIC). These registers collect statistics 
on the major events that occur on a per-processor basis, at the user 
and system levels. The PCR controls which events will be 
monitored and the level of monitoring (i.e., system or user). The 
PIC accumulates the number of occurrences of at most two events. 
The first 32 bits of the PIC are used for one event, and the second 
32 bits are used for the other. Each half of the PIC can monitor 
one of 16 events (at a time) and only 2 events are common to each 
half.  Thus, a total of 30 different events can be monitored. 
Accessing the performance counters requires privileged
instructions. Perfmon provides a user library component that
allows user programs to access the performance counters through C
function calls.
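
As a rough illustration of the register layout described above, the following Python sketch splits a 64-bit PIC value into two 32-bit event counts and reproduces the 30-event total. Which half of the register holds which event is an assumption here (the low half is treated as the first counter), and the helper names split_pic and distinct_events are hypothetical; they are not part of Perfmon.

```python
MASK32 = 0xFFFFFFFF


def split_pic(pic_value):
    """Split a 64-bit PIC value into two 32-bit event counts.

    Assumption: the low 32 bits hold the first event and the high
    32 bits hold the second; the text only says "first" and "second".
    """
    first = pic_value & MASK32
    second = (pic_value >> 32) & MASK32
    return first, second


def distinct_events(per_half=16, shared=2):
    """Number of distinct events selectable across both PIC halves."""
    return 2 * per_half - shared


# Made-up register value: 1234 in the high half, 5678 in the low half.
pic = (1234 << 32) | 5678
print(split_pic(pic))
print(distinct_events())
```

With 16 selectable events per half and 2 of them shared, the sketch gives 2 * 16 - 2 = 30, matching the count stated above.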
4.  COMPUTING ENERGY VALUES 
4.1 Energy  Formulations 
We use the analytical memory energy model from [1] to estimate
the energy. This model was found to be accurate to within 2.4%
compared with circuit-level simulation.
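\begin{align*}
Energy &= E_{bus} + E_{cell} + E_{pad} + E_{main} \\
E_{bus} &= E_{add\_bus} + E_{data\_bus} \\
E_{cell} &= \beta \cdot Word\_line\_size \cdot (Bit\_line\_size + 4.8) \cdot (N_{hit} + 2 N_{miss}) \\
E_{pad} &= E_{add\_pad} + E_{data\_pad} \\
E_{main} &= E_m \cdot 8L \cdot N_{miss} \cdot (1 + dirty\_r) \\
E_{add\_bus} &= 0.5 \times 10^{-12} \cdot Pr_1 \cdot V^2 \cdot (N_{hit} + N_{miss}) \cdot W_{add} \\
E_{data\_bus} &= 0.5 \times 10^{-12} \cdot Pr_2 \cdot V^2 \cdot (N_{hit} + N_{miss}) \cdot 32 \\
E_{add\_pad} &= 20 \times 10^{-12} \cdot Pr_3 \cdot V^2 \cdot N_{miss} \cdot W_{add} \\
E_{data\_pad} &= 20 \times 10^{-12} \cdot Pr_4 \cdot V^2 \cdot (1 + dirty\_r) \cdot N_{miss} \cdot 64 \\
Word\_line\_size &= m \cdot (8L + T + St) \\
Bit\_line\_size &= C / (m \cdot L) \\
\beta &= 1.44 \times 10^{-14} \quad \text{(technology parameter)} \\
E_m &= 4.95 \times 10^{-9} \quad \text{(per-access off-chip energy cost)}
\end{align*}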
where C = cache size, L = cache line size,
m = set associativity, T = tag size in bits,
St = number of status bits per block,
Nhit = number of hits,
Nmiss = number of misses,
Wadd = the width of the address bus,
dirty_r = the fraction of blocks written back into memory on
replacement.
Pr1 through Pr4 are the switch rates for the add_bus, data_bus,
add_pad, and data_pad, respectively. All four are
assumed to be 0.25 for the purposes of this study.
In this formulation, Ebus represents the data and address bus energy
between processor and cache, Ecell represents the cache energy, Epad
represents the data and address pad energy between cache and main
memory, and Emain represents the main memory energy.
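
The formulation above translates fairly directly into code. The following Python sketch evaluates each term for one cache; it is only an illustration of the arithmetic, not the vEC implementation. The names CacheConfig and memory_energy are hypothetical, V is taken here to be the supply voltage (an assumption, since the symbol is not defined in the list above), and the example numbers (tag bits, status bits, address width, voltage, hit and miss counts) are made up.

```python
from dataclasses import dataclass

BETA = 1.44e-14  # technology parameter from the model
E_M = 4.95e-9    # per-access off-chip energy cost from the model


@dataclass
class CacheConfig:
    size_bytes: int   # C
    line_bytes: int   # L
    assoc: int        # m
    tag_bits: int     # T
    status_bits: int  # St


def memory_energy(cfg, n_hit, n_miss, dirty_r, w_add, v,
                  pr=(0.25, 0.25, 0.25, 0.25)):
    """Evaluate the energy terms of the model for one cache.

    v is assumed to be the supply voltage; pr holds Pr1..Pr4.
    Returns a dict with the bus, cell, pad, main, and total energy.
    """
    pr1, pr2, pr3, pr4 = pr
    word_line = cfg.assoc * (8 * cfg.line_bytes + cfg.tag_bits + cfg.status_bits)
    bit_line = cfg.size_bytes / (cfg.assoc * cfg.line_bytes)

    e_cell = BETA * word_line * (bit_line + 4.8) * (n_hit + 2 * n_miss)
    e_add_bus = 0.5e-12 * pr1 * v ** 2 * (n_hit + n_miss) * w_add
    e_data_bus = 0.5e-12 * pr2 * v ** 2 * (n_hit + n_miss) * 32
    e_add_pad = 20e-12 * pr3 * v ** 2 * n_miss * w_add
    e_data_pad = 20e-12 * pr4 * v ** 2 * (1 + dirty_r) * n_miss * 64
    e_main = E_M * 8 * cfg.line_bytes * n_miss * (1 + dirty_r)

    parts = {
        "bus": e_add_bus + e_data_bus,
        "cell": e_cell,
        "pad": e_add_pad + e_data_pad,
        "main": e_main,
    }
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical inputs: a 16KB direct-mapped cache with 32B lines.
example = CacheConfig(size_bytes=16 * 1024, line_bytes=32, assoc=1,
                      tag_bits=50, status_bits=2)
result = memory_energy(example, n_hit=1000, n_miss=100, dirty_r=0.5,
                       w_add=32, v=3.3)
for name, joules in result.items():
    print(f"{name}: {joules:.3e} J")
```

Keeping the terms separate makes it easy to see which component dominates for a given hit/miss profile; with these made-up inputs the main memory term is the largest.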
4.2  Events of Interest
We determined which events would be monitored in order to 
calculate the memory energy consumption (using the model given 
above). The events and their corresponding symbolic names in 
Perfmon are listed below: 
1.  Data cache read hits – PCR_S1_DC_READ_HIT
2.  Data cache read references – PCR_S0_DC_READ 
3.  Data Cache write hits – PCR_S1_DC_WRITE_HIT 
4.  Data cache write references – PCR_S0_DC_WRITE
5.  Instruction cache hits – PCR_S1_IC_HIT 
6.  Instruction cache references –PCR_S0_IC_REF 
7.  Extended cache references – PCR_S1_EC_REF 
8.  Extended cache misses with writebacks –  
 PCR_S1_EC_WRITEBACK 
The Perfmon package used is for the UltraSPARC platform that 
has a two-level cache architecture. Level 1 corresponds to the 
instruction and data caches while Level 2 corresponds to the 
extended cache for both instructions and data (i.e., unified). 
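
The energy model needs hit and miss counts, while the counters report references and hits. A minimal Python sketch of that conversion is shown below, assuming that misses are simply references minus hits. The function name hits_and_misses is hypothetical and the counter values are made up; only the event names come from the list above.

```python
def hits_and_misses(counts, ref_event, hit_event):
    """Derive (hits, misses) from a reference count and a hit count.

    counts maps symbolic event names to accumulated counter values.
    Assumption: misses = references - hits.
    """
    refs = counts[ref_event]
    hits = counts[hit_event]
    if hits > refs:
        raise ValueError("hit count exceeds reference count")
    return hits, refs - hits


# Made-up counter readings for illustration only.
counts = {
    "PCR_S0_IC_REF": 1_000_000,
    "PCR_S1_IC_HIT": 998_500,
    "PCR_S0_DC_READ": 400_000,
    "PCR_S1_DC_READ_HIT": 380_000,
}

print(hits_and_misses(counts, "PCR_S0_IC_REF", "PCR_S1_IC_HIT"))
print(hits_and_misses(counts, "PCR_S0_DC_READ", "PCR_S1_DC_READ_HIT"))
```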
4.3  UltraSPARC Cache Configuration 
The UltraSPARC has a 16KB L1 instruction cache with a 16B line size and
a 16KB L1 data cache with a 32B line size. Its L2 extended cache is
4MB with a 48B line size. The virtual address space is 64 bits, so
the tag sizes were calculated accordingly. Using the cache parameters,
the number of hits and misses returned by Perfmon, and the
energy formulations given above, we can calculate the energy
consumption of the data cache, instruction cache, extended cache,
main memory, data/address buses, and data/address pads.
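
One common way to derive a tag size from an address width is to subtract the index and offset bits. The Python sketch below does this for power-of-two geometries only; it is not necessarily how the tag sizes were computed for vEC. The function name tag_bits is hypothetical, and the associativity values in the example are assumptions, since they are not stated above. The 48B line size given for the extended cache is not a power of two, so this simple method does not cover it.

```python
def tag_bits(cache_bytes, line_bytes, assoc, addr_bits=64):
    """Tag width = address bits - index bits - block offset bits.

    Only handles power-of-two line sizes and set counts.
    """
    if cache_bytes % (assoc * line_bytes) != 0:
        raise ValueError("cache size is not a multiple of assoc * line size")
    sets = cache_bytes // (assoc * line_bytes)
    for value in (line_bytes, sets):
        if value <= 0 or value & (value - 1) != 0:
            raise ValueError("line size and set count must be powers of two")
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return addr_bits - index_bits - offset_bits


# Assumed associativities, for illustration only.
print(tag_bits(16 * 1024, 32, assoc=1))  # 16KB, 32B lines, direct-mapped
print(tag_bits(16 * 1024, 16, assoc=2))  # 16KB, 16B lines, 2-way
```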
5. vEC  DETAILS 
5.1  User Interface and Library Routines 
vEC provides a user interface that can be used to compare the
energy consumption of different code implementations and
optimizations. The programmer first calls vc_energy(events) to
initialize internal variables. The parameter events
passed to vc_energy() is an array of the
energy values to be calculated, each representing one of the
virtual energy counters listed below:
•  ICACHE_ENERGY 
•  DCACHE_ENERGY 
•  ECACHE_ENERGY 
•  MMEMORY_ENERGY 
•  DBUS_ENERGY 
•  DPAD_ENERGY 
The hardware performance counters provided by the UltraSPARC 
platform allow only two events to be monitored simultaneously. 
Our implementation may require the code (being analyzed) to be 
executed more than once in order to monitor the necessary events 
for the energy calculations. By determining which events will 
need to be monitored to calculate the passed energy values, 
vc_energy() returns the number of times the code will be executed 
(num_events) in order to get the energy estimation results. When 
vc_energy_begin()  is called by the programmer, the PCR is 
initialized to monitor the corresponding events, the PIC is cleared 
and accumulation of the monitored events begins. 
vc_energy_begin() returns 0 upon success and –1 upon failure. 
vc_energy_end() gets the accumulated values and performs the 
energy calculations. vc_quit() is called by the programmer to get 
the resulting energy calculations. It returns a pointer to struct 
vc_data_energy containing the energy values. 
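
To illustrate why the number of required executions depends on the requested counters, the following Python sketch packs events into runs under the constraint that each run monitors at most one event per PIC half. The mapping from energy counters to events is a hypothetical reading of Section 4.2 (it is incomplete and only meant to show the mechanism), plan_runs is not a vEC function, and the actual logic inside vc_energy() is not described in this paper.

```python
# Hypothetical mapping from virtual energy counters to hardware events.
REQUIRED_EVENTS = {
    "ICACHE_ENERGY": ["PCR_S0_IC_REF", "PCR_S1_IC_HIT"],
    "DCACHE_ENERGY": ["PCR_S0_DC_READ", "PCR_S1_DC_READ_HIT",
                      "PCR_S1_DC_WRITE_HIT"],
    "ECACHE_ENERGY": ["PCR_S1_EC_REF", "PCR_S1_EC_WRITEBACK"],
}


def plan_runs(requested):
    """Return a list of (s0_event, s1_event) pairs, one pair per run.

    Events whose name starts with PCR_S0_ go to the first PIC half,
    all others to the second half. None marks an unused half.
    """
    slot0, slot1 = [], []
    for counter in requested:
        for event in REQUIRED_EVENTS[counter]:
            slot = slot0 if event.startswith("PCR_S0_") else slot1
            if event not in slot:
                slot.append(event)
    runs = []
    for i in range(max(len(slot0), len(slot1))):
        s0 = slot0[i] if i < len(slot0) else None
        s1 = slot1[i] if i < len(slot1) else None
        runs.append((s0, s1))
    return runs


for run in plan_runs(["ICACHE_ENERGY", "DCACHE_ENERGY"]):
    print(run)
```

Under this reading, the number of runs is the larger of the two per-half event counts, which is why requesting more energy counters can increase the number of times the code must be executed.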
5.2  Experiments and Results 
In our experiments, we evaluate the ability of vEC to provide
memory system energy estimates for different versions of the
matrix multiplication code. The optimizations include loop
interchange, iteration space tiling (blocking), and loop unrolling
[4]. The charts in Figure 1 show the number of
references and misses for the mxm (original), mxml (loop
interchanged), mxmt (tiled), and mxmu (loop unrolled) versions
of the matrix multiplication code, together with their energy
comparisons. The cache reference and miss results are summarized below.
Instruction Cache: Tiling increases the number of references since 
the actual dynamic number of instructions increases. The number 
of misses also increases because of an increase in the code size 
(i.e., reduced instruction reuse). Loop interchange has nearly the same
number of instruction references as the original code (the small
difference comes from array address calculations).
Unrolling decreases the number of dynamic instructions.
However, it incurs more misses than the original
because of the increase in code size.
Data Cache: Tiling increases the number of references. We 
speculate that this behavior is due to the backend optimizations 
that do not interact well with tiling. Loop interchange has almost 
the same number of data references as the original code. Unrolling 
decreases both the number of references and the number of 
misses. All of the optimizations are beneficial in terms of data
cache misses, with tiling being the most beneficial because it
eliminates more misses than the other optimizations by exploiting
temporal reuse across multiple loop levels.
Extended Cache: All of the optimizations are beneficial with regard
to extended cache references and misses, with tiling being the most
beneficial. The number of references to the extended cache is not
necessarily equal to the total number of instruction and data cache
misses. This is due to store buffer compression between the L1
(instruction/data) caches and the L2 (extended) cache [5].
We also make the following energy observations from Figure 1.
Instruction Cache: For instruction cache, unrolling is most 
effective and tiling is the worst (as it increases the number of 
dynamic instructions executed by the code). 
Data Cache: For data cache, unrolling is most effective and tiling 
is the worst. 
Extended Cache: For extended cache, unrolling is most effective 
and tiling outperforms the original matrix multiplication, which 
has the worst performance. 
Main Memory: For main memory, unrolling is most effective. 
Tiling and loop interchange outperform the original matrix 
multiplication, which has the worst performance. 
Total Energy: All the optimized versions of matrix multiplication 
outperform the original. Even with the significant energy savings 
of the original version in instruction and data cache (compared to 
that of tiling) the absolute energy consumed by extended cache 
gives the original matrix multiplication higher total energy 
consumption than the tiled version. 
6. DISCUSSION 
Our observation is that vEC is a useful tool for determining 
whether optimized code is also energy efficient, since most code 
optimizations fail to strike a balance between low energy 
consumption and high performance [8]. Such a balance is 
important in embedded systems since, most of the time, the
system demands the lowest possible energy consumption.
Yet, the code with the lowest energy consumption does not necessarily
have high runtime performance. Similarly, the best performance
does not necessarily mean minimum energy consumption. For
example, while latency hiding techniques can help reduce the
impact of accesses to slower off-chip memory, the higher
per-access energy cost associated with off-chip memory cannot be
masked. While the goal of finding these tradeoffs is orthogonal to
the topic of this paper, we believe that the proposed tool could be
helpful in studying them. Using vEC, an optimizing
compiler could generate multiple versions of the same code and
then select the best energy alternative depending on
current conditions.
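
As a sketch of how such a selection might look, the Python fragment below picks the lowest-energy version among those that meet an optional cycle budget. The function name pick_version, the version labels, and all numbers are hypothetical; the paper does not define a selection policy.

```python
def pick_version(estimates, max_cycles=None):
    """Return the name of the lowest-energy version within a cycle budget.

    estimates maps a version name to (energy_joules, cycles).
    """
    candidates = {
        name: (energy, cycles)
        for name, (energy, cycles) in estimates.items()
        if max_cycles is None or cycles <= max_cycles
    }
    if not candidates:
        raise ValueError("no version satisfies the cycle budget")
    return min(candidates, key=lambda name: candidates[name][0])


# Made-up estimates: (energy in Joules, execution cycles).
estimates = {
    "version_a": (2.0, 100),
    "version_b": (1.0, 300),
    "version_c": (1.5, 150),
}
print(pick_version(estimates))                  # no performance constraint
print(pick_version(estimates, max_cycles=200))  # with a cycle budget
```

A tighter cycle budget can rule out the lowest-energy version, which is the kind of energy/performance tradeoff discussed above.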
7.  CONCLUSION AND FUTURE WORK 
vEC provides a means by which the user can obtain an estimate of
the energy consumption of a given code segment and
compare different code versions in terms of energy
consumption. While not demonstrated in our experiments
(due to lack of space), vEC also reports the instruction count and
the number of clock cycles. The clock
cycles and the number of instructions can be used to get an
indication of the energy consumption of the system clock and the
data path. Additionally, the number of clock cycles gives the
performance of the code. A drawback of our implementation
stems from the restriction that only two events can be monitored
simultaneously. This requires the code versions to be executed
more than once in order to obtain different energy estimations.
Other hardware counter interface tools (such as CPC for Solaris
2.8) could easily be integrated and used without major changes
to the vEC interface calls. We also plan to implement vEC
on Intel’s Pentium platform.
8. ACKNOWLEDGEMENTS 
This work was supported in part by grants from the Pittsburgh Digital
Greenhouse and NSF grants CCR-0093082, CCR-0093085,
CCR-0073419 and CCR-0082064.
9. REFERENCES 
[1]  H.Y. Kim, N. Vijaykrishnan, M. Kandemir, M. J. Irwin.
Multiple access caches: Energy implications. In Proc. The
IEEE CS Annual Workshop on VLSI, Orlando, FL, April 27-28,
2000, pp. 53-58.
[2]  W. Ye, N. Vijaykrishnan, M. Kandemir, M.J. Irwin. The
design and use of SimplePower: A cycle-accurate energy
estimation tool. In Proc. The 37th Design Automation
Conference, Los Angeles, California, U.S.A., June 5-9, 2000,
pp. 340-345.
[3]  David A. Padua, Michael J. Wolfe. Advanced compiler
optimizations for supercomputers. Communications of the
ACM, Volume 29, Number 12, December 1986, pp. 1184-1201.
[4]  Michael E. Wolf, D. Maydan, Ding-Kai Chen.
Combining loop transformations considering caches and
scheduling. IEEE 1999, pp. 274-284.
[5]  Perfmon users guide. 
http://www.cse.msu.edu/~enbody/perfmon/ 
[6]  The UltraSPARC processor – Technology white paper:
The UltraSPARC architecture.
www.sun.com/microelectronics/whitepapers/UltraSPARCtechnology
[7]  Performance tuning optimization for Origin2000 and Onyx2.
http://techpubs.sgi.com/library/manuals/3000/007-3511-001/html/
[8]  M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye.
Influence of compiler optimizations on system power.
In Proc. The 37th Design Automation Conference,
2000, pp. 304-307.
[9]  A. Chandrakasan, R. Brodersen. Low power digital CMOS
design. Kluwer Academic Publishers, 1995.
[Figure 1 panels: Instruction Cache References, Instruction Cache Misses, Data Cache References, Data Cache Misses, Extended Cache References, and Extended Cache Misses for mxm, mxml, mxmt, and mxmu; energy percentage increase vs. mxmu for mxm, mxml, and mxmt, with the bar labels listed below.]
Instruction Cache Energy Percentage Increase vs. mxmu: mxm 40.92, mxml 47.73, mxmt 67.6
Data Cache Energy Percentage Increase vs. mxmu: mxm 44.89, mxml 40.23, mxmt 65.75
Extended Cache Energy Percentage Increase vs. mxmu: mxm 93.46, mxml 62.02, mxmt 82.99
Main Memory Energy Percentage Increase vs. mxmu: mxm 44.51, mxml 13.03, mxmt 14.97
Total Energy Percentage Increase vs. mxmu: mxm 81.94, mxml 58.06, mxmt 79.32
Figure 1. Cache References/Misses and Energy Results With Respect to the Unrolled Matrix Multiplication Code