Power Models for Multicore Processor Simulators with Multiple Levels of Abstraction by Triviño Valls, Josep
Power models for multicore
processor simulators with multiple
levels of abstraction
Author: Josep Trivin˜o Valls
Degree in Informatics Engineering
Specialization in Computer Engineering
Facultat d’Informa`tica de Barcelona (FIB)
Universitat Polite`cnica de Catalunya (UPC) - BarcelonaTech
Director:
Miquel Moreto´
Department:
BSC-CNS
Co-director:
Marc Casas
Department:
BSC-CNS
Deponent:
Mateo Valero
Department:
Computer Architecture
UNIVERSITAT POLITE`CNICA DE CATALUNYA (UPC) - BarcelonaTech
BARCELONA SUPERCOMPUTING CENTER (BSC-CNS)
June 30, 2015
i
Abstract
Nowadays, the power consumption is one of the biggest problems of the supercomputing
centers because of the big consumption those have. In this project, the issue has been
addressed and I have tried to find a way to reduce it. A good manner to reduce it, is
to study the energetic behavior of the applications and adapt them to our hardware. In
order to do this, we can use different tools, for example, an architecture simulator to
obtain the hardware statistics or a framework to obtain power information.
In this project I have created a tool capable of generating power models for different ap-
plications given a determined hardware. With this tool, it is possible to obtain the total
consumption of any application and the different components involved in the execution.
Moreover, we can observe the consumption of the different application tasks and how it
is distributed along the execution. It is possible to use this information to optimize any
application in terms of power consumption or to find the optimal hardware in order to
obtain the best performance.
Resum
Avui en dia, el consum energe`tic e´s un dels grans problemes dels centres de supercom-
putacio´ degut al gran consum que tenen. En aquest treball s’ha adrec¸at el problema
i s’ha intentat buscar algun me`tode per tal de reduir-lo. Una bona manera de reduir
aquest consum e´s estudiar el comportament energe`tic de les aplicacions i adaptar-les
al nostre hardware. Per tal d’estudiar aquest comportament, es poden utilitzar difer-
ents eines, com per exemple, simuladors d’arquitectura per tal d’obtenir estad´ıstiques
hardware o utilitzar un framework per tal d’obtenir informacio´ energe`tica.
En aquest projecte he creat una eina capac¸ de generar models energe`tics per diferents
aplicacions donat un hardware determinat. Amb aquesta eina, es pot obtenir el con-
sum energe`tic total de qualsevol aplicacio´ i dels diferents components involucrats en
l’execucio´. A me´s a me´s de poder observar el consum de les diferents tasques d’una apli-
cacio´ i com aquest esta` distribu¨ıt al llarg de l’execucio´. De manera que es pot utilitzar
aquesta informacio´ per optimitzar qualsevol aplicacio´ en termes de consum energe´tic o
buscar el hardware o`ptim per extreure’n el ma`xim rendiment.
Resumen
Hoy en d´ıa, el consumo energe´tico es uno de los grandes problemas de los centros de
supercomputacio´n debido al gran consumo que tienen. En este trabajo se ha tratado el
problema y se ha intentado buscar algu´n me´todo para reducirlo. Una buena manera de
reducirlo es estudiar el comportamiento energe´tico de nuestras aplicaciones y adaptarlas
a nuestro hardware. Para estudiar este comportamiento, podemos utilizar diferentes
ii
herramientas, como por ejemplo, simuladores de arquitectura para obtener estad´ısticas
hardware o utilizar un framework para obtener informacio´n energe´tica.
En este proyecto he creado una herramienta capaz de generar modelos energe´ticos para
diferentes aplicaciones dado un hardware determinado. Con esta herramienta, se puede
obtener el consumo energe´tico total de cualquier aplicacio´n y de los diferentes com-
ponentes involucrados en la ejecucio´n. Adema´s de poder observar el consumo de las
diferentes tascas de una aplicacio´n y como este esta´ distribuido al largo de la ejecucio´n.
De manera que podemos utilizar esta informacio´n para optimizar cualquier aplicacio´n en
te´rminos de consumo energe´tico o buscar el hardware o´ptimo para conseguir el ma´ximo
rendimiento.
iii
Contents
List of Figures viii
List of Tables xiii
1 Introduction 1
1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Developer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Director, co-director and support . . . . . . . . . . . . . . . . . . . 2
1.2.3 RoMoL team . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.4 TaskSim developers . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.5 Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Document structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 State of art 4
2.1 Computer architecture simulators . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Gem5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 TaskSim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Profiling tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Paraver and Extrae . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 McPAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Project scope 8
3.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4.2.1 Jinja2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4.2.2 Gnuplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4.2.3 GREASY . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Revision control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.6 Validation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.6.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.7 Risk management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
iv
4 Project management, budget and sustainability 13
4.1 Project management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.1 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.1.1 Work environment . . . . . . . . . . . . . . . . . . . . . . 13
4.1.1.2 Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.1.3 Documentation . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.1.4 Task dependencies . . . . . . . . . . . . . . . . . . . . . . 15
4.1.1.5 Gantt chart . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.2 Alternatives and work plan . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Budget control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.1 Budget . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.1.1 Human resources . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.1.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.1.3 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1.4 General expenses . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1.5 Taxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.1.6 Total cost . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Control management . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Sustainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3.1 Economic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3.2 Social . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3.3 Environmental . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3.4 Sustainability evaluation . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Experimental setup 23
5.1 TaskSim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 Configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.2 Results file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.3 Additional information . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 McPAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.1 Configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.2 Results file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.2.1 Power gating . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2.2.2 Processor modeling . . . . . . . . . . . . . . . . . . . . . 28
5.2.2.3 Cache modeling . . . . . . . . . . . . . . . . . . . . . . . 29
5.3 TaskSim-McPAT script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.4 Paraver script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6 Evaluation 34
6.1 Vector copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.1.1 Increasing vector size with private L2 cache . . . . . . . . . . . . . 34
6.1.2 Comparison between private and shared L2 cache . . . . . . . . . . 36
6.1.3 Increasing the memory footprint . . . . . . . . . . . . . . . . . . . 37
6.2 Vector Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 PARSEC Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.3.1 Ferret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.3.2 Blackscholes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
v
6.3.3 Swaptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3.4 Dedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.5 Streamcluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3.6 Average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.3.7 Further results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3.7.1 Static power . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3.7.2 DL1 dynamic energy consumption . . . . . . . . . . . . . 52
6.3.7.3 L2 dynamic energy consumption . . . . . . . . . . . . . . 54
6.4 Paraver visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4.1 Copy vector with different vector sizes . . . . . . . . . . . . . . . . 55
6.4.2 Swaptions with different cache sizes . . . . . . . . . . . . . . . . . 59
7 Conclusions 64
7.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Appendices 67
A TaskSim configuration file 68
B TaskSim results file 70
C McPAT configuration file 73
D McPAT results file 78
E PARSEC benchmark graphics 83
E.1 Ferret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
E.1.1 1 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
E.2 Blackscholes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
E.2.1 1 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
E.2.2 4 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
E.2.3 16 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
E.3 Swaptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
E.3.1 1 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
E.3.2 4 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
E.3.3 16 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
E.4 Dedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
E.4.1 1 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
E.5 Streamcluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
E.5.1 1 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
E.5.2 4 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
E.5.3 16 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
E.6 Average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
E.6.1 1 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
E.6.2 4 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
E.6.3 16 core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
vi
Bibliography 107
vii
List of Figures
4.1 Gantt chart with the time duration of each task . . . . . . . . . . . . . . . 16
5.1 Block diagram of the McPAT framework . . . . . . . . . . . . . . . . . . . 26
5.2 Static power comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Area comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Peak dynamic power difference . . . . . . . . . . . . . . . . . . . . . . . . 31
5.5 Static power comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.6 Gate leakage comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.7 Dynamic power comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.1 Processor dynamic energy comparison . . . . . . . . . . . . . . . . . . . . 35
6.2 Processor dynamic energy comparison . . . . . . . . . . . . . . . . . . . . 36
6.3 Processor static power comparison . . . . . . . . . . . . . . . . . . . . . . 37
6.4 Increasing the memory footprint in DL1 dynamic energy consumption . . 37
6.5 Increasing the memory footprint in L2 dynamic energy consumption . . . 38
6.6 Dynamic energy comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.7 Ferret processor dynamic energy consumption with fixed associativity and
latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.8 Ferret processor dynamic energy consumption with unfixed associativity
and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.9 Blackscholes processor dynamic energy consumption with fixed associa-
tivity and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
viii
6.10 Blackscholes processor dynamic energy consumption with unfixed asso-
ciativity and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.11 Swaptions processor dynamic energy consumption with fixed associativity
and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.12 Swaptions processor dynamic energy consumption with unfixed associa-
tivity and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.13 Dedup processor dynamic energy consumption with fixed associativity
and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.14 Dedup processor dynamic energy consumption with unfixed associativity
and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.15 Streamcluster processor dynamic energy consumption with fixed associa-
tivity and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.16 Streamcluster processor dynamic energy consumption with unfixed asso-
ciativity and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.17 Average processor dynamic energy consumption with fixed associativity
and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.18 Average processor dynamic energy consumption with unfixed associativ-
ity and latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.19 Processor static power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.20 DL1 dynamic energy consumption . . . . . . . . . . . . . . . . . . . . . . 53
6.21 Swaptions DL1 dynamic energy consumption . . . . . . . . . . . . . . . . 53
6.22 L2 dynamic energy consumption . . . . . . . . . . . . . . . . . . . . . . . 54
6.23 Streamcluster L2 dynamic energy consumption . . . . . . . . . . . . . . . 55
6.24 Core burst . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.25 Function line dynamic power consumption (0W - 0.094W) . . . . . . . . . 57
6.26 Gradient color dynamic power consumption (0W - 0.094W) . . . . . . . . 58
6.27 Aggregated dynamic power consumption (0W - 0.226W) . . . . . . . . . . 59
6.28 Core burst . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.29 Function line dynamic power consumption (0W - 0.056W) . . . . . . . . . 61
ix
6.30 Gradient color dynamic power consumption (0W - 0.056W) . . . . . . . . 62
6.31 Aggregated dynamic power consumption (0W - 0.381W) . . . . . . . . . . 63
E.1 Ferret 1 core processor dynamic energy consumption . . . . . . . . . . . . 83
E.2 Ferret 1 core processor dynamic power consumption . . . . . . . . . . . . 84
E.3 Ferret 1 core processor static power consumption . . . . . . . . . . . . . . 84
E.4 Ferret 1 core DL1 dynamic energy consumption . . . . . . . . . . . . . . . 84
E.5 Ferret 1 core L2 dynamic energy consumption . . . . . . . . . . . . . . . . 85
E.6 Blackscholes 1 core processor dynamic energy consumption . . . . . . . . 85
E.7 Blackscholes 1 core processor dynamic power consumption . . . . . . . . . 86
E.8 Blackscholes 1 core processor static power consumption . . . . . . . . . . 86
E.9 Blackscholes 1 core DL1 dynamic energy consumption . . . . . . . . . . . 86
E.10 Blackscholes 1 core L2 dynamic energy consumption . . . . . . . . . . . . 87
E.11 Blackscholes 4 core processor dynamic energy consumption . . . . . . . . 87
E.12 Blackscholes 4 core processor dynamic power consumption . . . . . . . . . 87
E.13 Blackscholes 4 core processor static power consumption . . . . . . . . . . 88
E.14 Blackscholes 4 core DL1 dynamic energy consumption . . . . . . . . . . . 88
E.15 Blackscholes 4 core L2 dynamic energy consumption . . . . . . . . . . . . 88
E.16 Blackscholes 16 core processor dynamic energy consumption . . . . . . . . 89
E.17 Blackscholes 16 core processor dynamic power consumption . . . . . . . . 89
E.18 Blackscholes 16 core processor static power consumption . . . . . . . . . . 89
E.19 Blackscholes 16 core DL1 dynamic energy consumption . . . . . . . . . . 90
E.20 Blackscholes 16 core L2 dynamic energy consumption . . . . . . . . . . . 90
E.21 Swaptions 1 core processor dynamic energy consumption . . . . . . . . . . 91
E.22 Swaptions 1 core processor dynamic power consumption . . . . . . . . . . 91
E.23 Swaptions 1 core processor static power consumption . . . . . . . . . . . . 92
x
E.24 Swaptions 1 core DL1 dynamic energy consumption . . . . . . . . . . . . 92
E.25 Swaptions 1 core L2 dynamic energy consumption . . . . . . . . . . . . . 92
E.26 Swaptions 4 core processor dynamic energy consumption . . . . . . . . . . 93
E.27 Swaptions 4 core processor dynamic power consumption . . . . . . . . . . 93
E.28 Swaptions 4 core processor static power consumption . . . . . . . . . . . . 93
E.29 Swaptions 4 core DL1 dynamic energy consumption . . . . . . . . . . . . 94
E.30 Swaptions 4 core L2 dynamic energy consumption . . . . . . . . . . . . . 94
E.31 Swaptions 16 core processor dynamic energy consumption . . . . . . . . . 94
E.32 Swaptions 16 core processor dynamic power consumption . . . . . . . . . 95
E.33 Swaptions 16 core processor static power consumption . . . . . . . . . . . 95
E.34 Swaptions 16 core DL1 dynamic energy consumption . . . . . . . . . . . . 95
E.35 Swaptions 16 core L2 dynamic energy consumption . . . . . . . . . . . . . 96
E.36 Dedup 1 core processor dynamic energy consumption . . . . . . . . . . . . 96
E.37 Dedup 1 core processor dynamic power consumption . . . . . . . . . . . . 97
E.38 Dedup 1 core processor static power consumption . . . . . . . . . . . . . . 97
E.39 Dedup 1 core DL1 dynamic energy consumption . . . . . . . . . . . . . . 97
E.40 Dedup 1 core L2 dynamic energy consumption . . . . . . . . . . . . . . . 98
E.41 Streamcluster 1 core processor dynamic energy consumption . . . . . . . . 98
E.42 Streamcluster 1 core processor dynamic power consumption . . . . . . . . 99
E.43 Streamcluster 1 core processor static power consumption . . . . . . . . . . 99
E.44 Streamcluster 1 core DL1 dynamic energy consumption . . . . . . . . . . 99
E.45 Streamcluster 1 core L2 dynamic energy consumption . . . . . . . . . . . 100
E.46 Streamcluster 4 core processor dynamic energy consumption . . . . . . . . 100
E.47 Streamcluster 4 core processor dynamic power consumption . . . . . . . . 100
E.48 Streamcluster 4 core processor static power consumption . . . . . . . . . . 101
E.49 Streamcluster 4 core DL1 dynamic energy consumption . . . . . . . . . . 101
xi
E.50 Streamcluster 4 core L2 dynamic energy consumption . . . . . . . . . . . 101
E.51 Streamcluster 16 core processor dynamic energy consumption . . . . . . . 102
E.52 Streamcluster 16 core processor dynamic power consumption . . . . . . . 102
E.53 Streamcluster 16 core processor static power consumption . . . . . . . . . 102
E.54 Streamcluster 16 core DL1 dynamic energy consumption . . . . . . . . . . 103
E.55 Streamcluster 16 core L2 dynamic energy consumption . . . . . . . . . . . 103
E.56 Average 1 core processor dynamic energy consumption . . . . . . . . . . . 104
E.57 Average 1 core processor dynamic power consumption . . . . . . . . . . . 104
E.58 Average 4 core processor dynamic energy consumption . . . . . . . . . . . 105
E.59 Average 4 core processor dynamic power consumption . . . . . . . . . . . 105
E.60 Average 16 core processor dynamic energy consumption . . . . . . . . . . 106
E.61 Average 16 core processor dynamic power consumption . . . . . . . . . . . 106
xii
List of Tables
4.1 Costs of human resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Costs of hardware resources . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Costs of electricity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Costs summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.5 Sustainability matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.1 Increasing vector size with 1 core . . . . . . . . . . . . . . . . . . . . . . . 35
6.2 Increasing vector size with 4 core . . . . . . . . . . . . . . . . . . . . . . . 35
6.3 Increasing vector size with 8 core . . . . . . . . . . . . . . . . . . . . . . . 35
xiii
Chapter 1
Introduction
1.1 Context
The project is a TFG (final degree project) of the Facultat d’Informa`tica from Barcelona
and has been developed in collaboration with Barcelona Supercomputing Center (BSC)
inside the RoMoL team (Riding on Moore’s Law) which is a 5-year project funded by
an European Research Council (ERC) Advanced Grant.
The power consumption of the hardware has always been a headache for the supercom-
puting centers and the different hardware companies. In order to reduce this power
consumption we have to know how it is distributed during the application execution or
which are the most power consuming components. This projects aims to create a tool
capable of observing the power consumption using different applications from BSC.
Nowadays, there are different multicore processor simulators which can be used to collect
information from an application running on them. With this statistics we are able to
check the behaviour of the hardware which is being simulated or figure out its most time
consuming parts (dependencies, cache misses...) and improve them. With these kind
of statistics we can go further and use them to obtain more information, for example
power models.
The main idea of this project comes out from McPAT[15], a state-of-the-art framework
which can create power models for a program with the hardware statistics that are ob-
tained by a simulator. With the power models we can observe our application behaviour
in terms of power. This means that with this knowledge we should be able to observe the
power consumption of the hardware components involved in the application’s execution.
Profiling tools allow us to obtain the hardware statistics by task and how they are
distributed along the execution. As we have a framework which allow us to compute the
1
power consumption given hardware statistics, we can also compute how it is distributed
during the execution.
The selected simulator is TaskSim [19], a trace-driven cycle-accurate simulator that is
still being developed at BSC. The profiling tool is Paraver[8], a tool that has been also
developed at BSC. We are going to see them more accurately on the state of art section.
This project is motivated by the fact that power consumption is nowadays a big problem
that it is needed to be solved. Creating a tool capable of showing the power consumption
of the hardware components and applications is a good contribution to the cause.
1.2 Stakeholders
This section aims to describe all the people involved in the project, directly or indirectly.
1.2.1 Developer
The person in charge to carry out the project, is the most important one because he is
the only one. In this case I have been the developer.
1.2.2 Director, co-director and support
These are the persons in charge to help and guide the developer and supervise that the
objectives are being accomplished. In this project the director and co-director have been
Miquel Moreto´ and Marc Casas respectively.
1.2.3 RoMoL team
As I have commented, the project has been done inside the RoMoL team, which is
a research group in BSC involved in the research for new computer architectures and
parallel software. There are some researchers that are using TaskSim and Paraver in
their research, this project can allow them to have more tools in order to obtain more
information from their applications and the hardware used.
2
1.2.4 TaskSim developers
The TaskSim developers can be interested in this project in order to integrate it inside
the TaskSim project. Moreover, having other people working with TaskSim has helped
to find bugs.
1.2.5 Users
Nowadays the only persons who can use TaskSim are the members of BSC, so they are
the users of the project and the main beneficiaries as well. Once TaskSim has been
realised, everybody interested in the project can become a user.
1.3 Document structure
The document can be divided in two different parts. The first part is where all the
project management has been explained, including the budget with all the costs, the
scope with the objectives, methodology and the tools and finally the sustainability. The
second part is the technical report. There I have explained what I have done and the
different results obtained. Finally I have added the project conclusions and the future
work to do.
3
Chapter 2
State of art
2.1 Computer architecture simulators
A computer architecture simulator is a piece of software which is used to predict the
behaviour of hardware models given an input[10]. Architectural simulators have different
purposes, per example:
• Compare different hardware designs without having physically the hardware.
• Obtain detailed performance metrics.
• Debug the code in real-time.
The simulators can be classified depending on the context:
• Micro-architecture or full-system simulators. A simulator can model the behaviour
of the whole computer system (processor, a memory system) or can be more limited
and focus in a specific part.
• Functional or performance simulators. Performance simulators try to reproduce
the performance features of the hardware (when it is done) while functional sim-
ulators try to be as similar as possible to component’s function (what it is done).
• Trace-driven or execution-driven simulators. Traces are prerecorded streams of
instructions with fixed input. Execution-driven simulators allow dynamic change
of instructions to be executed depending on different input data.
4
2.1.1 Gem5
Gem5 simulator[2] is a modular discrete event driven computer system simulator for
computer system architecture research, capable of evaluate different systems. This sim-
ulator can simulate different CPU models, system execution modes and memory system
models.
The gem5 simulator[2] is the union of two simulators, each one with different functions.
The first one is called M5[3] and provides a highly configurable simulation framework,
focused on the network simulation and different CPU models. GEMS[16], a general
execution-driven multiprocessor simulation, complements these features with a detailed
memory system simulation, including support for multiple cache coherence protocols
and interconnect models.
2.1.2 TaskSim
The simulator used in this project is TaskSim[19] which is still under-development.
We can say that TaskSim is a state-of-art trace-driven computer architecture simulator
designed for:
• Architecture exploration of multicore multiprocessors with large numbers of cores.
• Research on parallel programming models.
TaskSim arose from the problem that the current simulators(for example gem5) have
with the increasing gap performance between simulation and real execution. TaskSim
allows different levels of abstraction to deal with this problem. The highest level of
abstraction, burst mode, is useful for scalability studies to hundreds of cores. The
lower-level abstractions, memory mode, provides accurate simulation at a greater level
of detail.
2.2 Profiling tools
In order to obtain more information about our applications we can use profiling tools or
profilers. Profiling is a form of dynamic program analysis that measures our applications
in terms of memory, time or usage of particular instructions... We can use this infor-
mation to optimize our applications. Profiling tools use different techniques to obtain
data, for example hardware interrupts, code instrumentation or performance counters.
5
2.2.1 Paraver and Extrae
Paraver[8] is a performance analyzer based on traces with a great flexibility to explore
the collected data by visual inspection. The principal features of Paraver are:
• Detailed analysis of program performance.
• Customizable semantic of the visualized information.
• Concurrent comparative analysis of different traces.
Paraver has been proofed to be very useful for performance analysis studies, giving much
more details about applications behaviour than most performance tools.
In order to generate Paraver traces, Extrae[5] is required. Extrae is the package created
to generate Paraver traces for a posterior analysis. Extrae is a tool which translates
all internal events into a Paraver events, this way, Extrae gathers all the information
regarding the application performance. Extrae is configured through a XML file.
2.3 McPAT
McPAT (Multicore Power, Area, and Timing)[14] is the first integrated power, area,
and timing modeling framework for multithreaded and multicore/many-core processors.
It is designed to work with a variety of performance simulators over many technology
generations. McPAT allows the user to specify low-level configuration details. Also
includes models for the components of a complete chip multiprocessor, including in-
order and out-of-order processor cores, networks-on-chip, shared caches, and integrated
memory controllers.
The McPAT idea comes out from a tool called Wattch[4] which was created in the
year 2000. This tool is able to compute the CPU power consumption in order to make
power analysis of the applications. However, several factors drive the need for new tools
to address changes in architecture and technology, for example the need to accurately
model multicore and many-core architectures or the need to model and evaluate power,
area, and timing simultaneously. And McPAT addresses these challenges.
2.4 Related work
As this project is specific for a simulator, there is not any project involved in the same
simulator, but there are similar projects for other simulators, for example a project that
6
has focused in gem5 and McPAT[12] instead of TaskSim and McPAT. We can say that
the objectives of both projects are the same (obtain power models) but the code part
is different because of the input and output of each simulator. Although the code, the
tools used in both projects are similar and also the problems that happened. Before
starting the project, I made a thoroughly evaluation of the project in order to learn
about the used tools and the problems which occurred in the similar project mentioned
above.
There are other simulators which have integrated McPAT inside of the program, for
example the Graphite multicore simulator[17]. This simulator, additionally of giving
the statistics of the hardware simulation, gives the power consumption of the processor.
Graphite uses McPAT for core and cache power modeling and DSENT[20] for network
power modeling. Graphite has been created by Massachusetts Institute of Technology
(MIT) and it has been designed for exploration of future multicore processors containing
thousands of cores. It provides high performance for fast design space exploration and
software development.
Moreover, there are other simulators like Graphite, for example Sniper simulator[9].
Sniper multicore simulator has integrated McPAT inside the structure too. Sniper simu-
lator is based on the interval core model[13] and the Graphite simulation infrastructure.
7
Chapter 3
Project scope
This project aims to generate power models and study the results in order to know how
the power consumption is distributed between the different hardware components or the
different application tasks. In order to generate them, it is necessary to create a tool
and different benchmarks in order to test it. To do this, I have divided the project scope
in different steps:
1. The first part is based only on the TaskSim statistics. TaskSim nowadays just
provides hierarchical memory and burst statistics, so the power models are based
only on them. To obtain the power models, a bridge between the simulator and the
framework has been created. This bridge or tool joins all the simulator information
(the output and the simulation configuration file), so that a summary is presented
at the end of the execution and parsed for the McPAT framework.
2. Create different benchmarks in order to generate different power models. Explore
different processor specifications (different sizes of DL1 and L2) for each benchmark
and check the optimal combination for each one.
3. Use Paraver in order to observe how power consumption is distributed along the
execution. Represent it in different plots and study the results.
3.1 Objectives
The main objectives of this project are:
• To create a tool to parse information from TaskSim to McPAT and create power
models.
8
• To study the behaviour of the hardware components in terms of power consump-
tion.
• To create a Paraver configuration file to plot the TaskSim simulations.
3.2 Requirements
The requirements of this project are:
• To develop the tool using open-source tools.
• The tool has to execute McPAT without errors, has to be easy to use, consistent,
fast and adaptable to user input/output.
• The code has to be reliable, maintainable, well-structured and readable.
3.3 Methodology
In order to achieve all the objectives, I have used the scrum methodology. The scrum
methodology has allowed me to split the project in different goals and focus on an specific
goal for 2-4 weeks (in scrum terminology is called sprint). Although in this project I
have not had meetings every day, there have been meetings once a week.
3.4 Tools
The tools which have been used in this project can be categorized as hardware tools or
software tools.
3.4.1 Hardware
The hardware tools that I have used to carry on this project have been the indispensable
ones to develop the code and to launch the executions. I have used a desktop in BSC to
code the scripts. I have also used a virtual machine on MareNostrum III to launch the
TaskSim executions. MareNostrum III has been really useful because I could parallelize
the different executions.
9
3.4.2 Software
The project has been developed in a Linux distribution, OpenSuSe for being precise.
The Linux distribution has all the tools to compile, execute or edit. Apart from the
usual tools, I have used the tools detailed in the state-of-art chapter(TaskSim, McPAT
and Paraver). Moreover, I have used other tools such as:
3.4.2.1 Jinja2
I have used Jinja2 to parse information between a Python script and a xml template.
Jinja2 is a modern and designer-friendly templating language for Python.
3.4.2.2 Gnuplot
In order to generate heat maps I have used Gnuplot, a portable command-line driven
graphing utility for Linux and other SO.
3.4.2.3 GREASY
GREASY is a tool developed by BSC which is able to run in parallel a group of different
tasks, schedule them and run them using the available resources. This is really useful
when we have to execute different tasks in MareNostrum III and we do not want to
execute them sequentially.
3.5 Revision control
There have been two types of revision control in this project:
• In order to control all the application versions, a version control has been used to
be able to recover the data or version in case a black down had happened or an
error had appeared in a late version.
• The project has been supervised and reviewed once a week by the project director.
10
3.6 Validation methods
In order to check if the power models have been created properly and the results make
sense, I have used an example of McPAT configuration file given by HP. These config-
uration files are extracted from real processors specifications. I have filled the example
file with the information from TaskSim and I have generated the power model. I have
compared my power model (generated from my template) with the power model gener-
ated by the example. This comparison has allowed me to generate power models that
are similar to the power models generated by a real processor.
I have also tested my tool with different verified benchmarks in order to check if the
behaviour of the power consumption was correct.
3.6.1 Benchmarks
I have used different benchmarks for this project in order to generate different power
models and to check the correctness of the tool.
• Vector-Matrix multiplication: Useful benchmark to check how the code can
affect the power consumption (going throw a matrix by rows or by columns).
• Copy vector:
– Increasing vector size: Application test that considers different vector sizes
in order to show how the power consumption increases also in the different
components.
– Increasing the memory footprint of DL1 and L2: Increase the number
of misses in both caches and how that affect the power consumption.
• PARSEC: Different tests from a benchmark suite composed of multithreaded
programs.
3.7 Risk management
There have been different risks on this project.
• TaskSim is a tool which is still being developed, this means that it has been con-
stantly evolving during the project so more functions (extra statistic information)
have been added. Besides, the application has not been a stable version so prob-
lems and bugs have been found.
11
• Although I have had almost all the applications in a local desktop, I have used
Marenostrum III to execute TaskSim or to use some tools which are only installed
there. If an Internet black down had happened, I would not have been able to do
so.
• McPAT is a really complex framework, the input of the program is huge and has
a lot of variables in order to configure it. The simulator does not provide all
the necessary information to configure McPAT, so in some fields it was necessary
to make some approximations. These approximations have been taken from real
processors to be as accurate as possible.
12
Chapter 4
Project management, budget and
sustainability
4.1 Project management
4.1.1 Planning
In the following chapter we are going to see the overview of the planning. The project
has been done during the spring semester of 2015. It started on February the first and
has been finished at the end of June. I have disposed of 5 months to work in it. I have
dedicated 5 hours a day from Monday to Friday which makes a total of 500 hours.
I have divided the project in three main tasks:
1. Work environment: Set up the entire work environment in order to start the
project with all the tools working.
2. Coding: Code the tool in a way that satisfies all necessities and achieves all the
objectives. Besides, I had to create power models and study the results.
3. Documentation: Write down the work done in order to be presented as a thesis.
Finally, I have split all the main tasks in sub-tasks in order to detail every step clearly:
4.1.1.1 Work environment
1. Set up work environment: Get into the project, talk about it with the project
director and introduce myself to the work team. Other tasks such as picking up
the key and the credentials can also be included here.
13
The resources used have been the PC and a junior developer.
2. Learn how to use the work environment: Configure the work environment.
Using MareNostrum III is not a trivial matter because it works with a queue
system which has to be learned before using it. The executions have to be sent to
a job queue using specific commands and there are some papers which have to be
read in order to achieve the knowledge to use it.
The resources used have been the PC and a junior developer.
3. Configure the desktop: Configure the desktop in order to make the work easier,
for example, install another console or editor to which I am more used.
The resources used have been the PC and a junior developer.
4. Install and learn McPAT: Install McPAT from the HP website. Read the
papers in order to know how it works. A member of the team has taught me more
features about it. I have also studied the input of the program in order to learn
how to run it.
The resources used have been the PC, a junior developer and a senior developer.
5. Install and learn TaskSim: Install TaskSim from the repository. A member of
the team has showed me how TaskSim works because there is not much information
available. Besides, a more precise study about the output of the program has been
done.
The resources used have been the PC, a junior developer and a senior developer.
6. Learn python: Learn how to code in python and search for a tool to parse
information from TaskSim to a xml file.
The resources used have been the PC and a junior developer.
7. Learn about Paraver: Learn how Paraver works, how to get information from
it and how to create configuration files.
The resources used have been the PC and a junior developer.
4.1.1.2 Coding
8. Create the Template: Create a xml template file with the structure needed to
launch McPAT.
The resources used have been the PC and a junior developer.
14
9. Code the script: Code the tool using the python script language. The tool
generates the input of McPAT given the TaskSim statistics and creates the power
model.
The resources used have been the PC and a junior developer.
10. Debug: Debug the application in order to find bugs or polish some template
values to generate more consistent power models.
The resources used have been the PC and a junior developer.
11. Create power models: Create different benchmarks in order to generate power
models. Study and represent the different results.
The resources used have been the PC and a junior developer.
12. Integrate Paraver with the script: Add the information showed by Paraver
to the project.
The resources used have been the PC and a junior developer.
4.1.1.3 Documentation
13. GEP: Write down the introduction of the project through 7 deliverables at the
beginning of the project.
The resources used have been the PC and a junior developer.
14. Write the thesis: Write down all the work done and the results. The result of
this task has been added to the paper obtained from GEP.
The resources used have been the PC and a junior developer.
15. Defense the project: Create the slides to defense the project, prepare the defense
and finally defense the project in front of the tribunal.
The resources used have been the PC and a junior developer.
4.1.1.4 Task dependencies
Sub-tasks have the following dependencies(dependencies are represented as ←) :
T1 ← T2← T3← T4, T5, T6, T7← T8← T9← T10← T11← T12.
T12, T13 ← T14← T15.
There are some tasks which could have been done at the same time, but because of
resources reasons (the project has been a one person project) they could not have been
done in parallel.
15
4.1.1.5 Gantt chart
Figure 4.1: Gantt chart with the time duration of each task
16
4.1.2 Alternatives and work plan
As in any project, I created an alternative plan before starting the project in order to
deal with the different problems that could had happened during the project. The main
objectives of the project had to be met in order to finish the project, so cutting the
objectives was not contemplated as a solution. In case that a huge problem appeared,
my plan was to work some extra hours in order to deal with it, for example, instead of
assigning 5h a day, increment these hours to 7 or 8.
I had also an alternative plan regarding the simulator. If there had been any problems,
I would have changed it for another one, for example Gem5. Other tools like Paraver or
McPAT were used in similar projects, so that their correct working was demonstrated.
The alternative plan has worked for me. Actually, 500 hours have been enough to finish
the project on time, including the dealing of eventual problems. Moreover, the simulator
has worked as expected.
4.2 Budget control
In order to check the viability of the project, I computed the total cost of the project
before starting. In this way, I was able to see if the budget was enough to cover all the
expenses.
4.2.1 Budget
In this chapter we can see the costs of all the resources involved in the project. The
resources have been split in five types:
• Human: The cost of the staff involved in the project.
• Software: The cost of the software tools used in the project.
• Hardware: The cost of the hardware resources needed to carry out the project.
• General expenses: Involves the indirect resources used in the project.
• Taxes: The taxes paid during the project.
17
4.2.1.1 Human resources
The biggest part of the cost of human resources in the project has been the junior
developer. Also we have to take into account that sometimes a senior developer has
been required to help the junior developer as well as the hours that the project director
has spent supervising the project (half an hour a week).
Human resource Task Hours(h) Cost(e /h) Total(e )
Junior developer
1-Set up work environ-
ment
5 10 50
2-Learn how to use the
work environment
5 10 50
3-Configure the desktop 5 10 50
4-Install and learn Mc-
PAT
10 10 100
5-Install and learn
TaskSim
20 10 200
6-Learn python 2.5 10 25
7-Learn about Paraver 7.5 10 75
8-Create the Template 20 10 200
9-Code the script 50 10 500
10-Debug 30 10 300
11-Create power models 135 10 1350
12-Integrate Paraver
with the script
70 10 700
13-GEP 75 10 750
14-Write the thesis 60 10 600
15-Defense the project 5 10 50
Senior developer
Teaching 10 20 200
Project director
Supervising the project 10 30 300
Total cost 5500
Table 4.1: Costs of human resources
4.2.1.2 Software
All the software needed to carry out the project has been free (from the OS which is
OpenSuse up to the more specific software such McPAT), so the total cost of the software
has been 0.
18
4.2.1.3 Hardware
As we could see on the previous chapter, I have spent 500 hours in the project. All the
hours have been done in the same hardware (BSC desktop). I have not computed the
work done in Marenostrum III because it is a public resource which can be used free by
the members of the team. To compute the deprecation of the hardware, I have chosen
240 working days during the whole year. I have used the following equation:
Deprecation = Time× Cost
UsefulLife(y)
× 1(y)
240(d)
× 1(d)
8(h)
(4.1)
The results can be seen in table 4.2.
Hardware resource Cost(e ) Useful life(y) Time(h) Total(e )
Desktop 900 4 500 58.6
Mouse 8.95 4 500 0.58
Screen 84.95 4 500 5.53
Keyboard 9.95 4 500 0.65
Total cost 65.36
Table 4.2: Costs of hardware resources
4.2.1.4 General expenses
There have been two kinds of expenses here:
• Indirect costs
The most costly part here has been the electricity. Nowadays a kwh costs 0.14e in
Spain. There have been only two hardware resources that can be used to compute
electricity cost, these hardware resources are the desktop and the screen.
Hardware resource Power(W) Price(e /kwh) Hours(h) Total(e )
Desktop 0.4 0.14 500 28
Screen 0.03 0.14 500 2.1
Total cost 30.1
Table 4.3: Costs of electricity
• Unforeseen costs
To deal with eventual problems and unexpected events which could had occurred,
I decided to increment the total cost in a 5%. In this way, If something had gone
19
wrong (a hardware resource breaks down or a payment program) I would have had
some margin.
4.2.1.5 Taxes
The revenue tax in Spain is 21%.
4.2.1.6 Total cost
The total cost can be seen in table 4.4.
Resource type Total(e )
Human 5500
Hardware 65.36
Software 0
General expenses (without contingency) 30.1
Total cost (without contingency and without taxes) 5595.46
Total cost (with 5% of contingency and without taxes) 5875.23
Taxes (21% of the total cost) 1233.80
Total cost 7109.03
Table 4.4: Costs summary
4.2.2 Control management
At the end of each task, in order to calculate the derivation, I compared the planned
cost and the real cost. If the derivation of the task cost was bigger than expected then
I made a more accurately study about what had happened during the task in order to
avoid that the same issues happen again.
As 500 hours have been enough to finish the project and there has not been any problem
involving the hardware, there has not been any derivation of the planned budget.
4.3 Sustainability
It is necessary to evaluate the project’s impact on society and in which way it is going
to improve the people’s quality of life. To study the sustainability, I have focused on
economic, social and environmental points of view. In each point of view, I have obtained
a score which is computed following the Socratic Method.
20
4.3.1 Economic
• As we could see in previous chapters, all the costs have been assessed and properly
explained.
• In order to deal with unexpected issues, a contingency percent was added to the
total cost.
• The project could have been done with less time by an experienced developer
but the cost per hour of an experienced developer would has been higher than
the present developer. So the total cost of the project probably would has been
increased rather than decreased. The resources would not have varied.
• The most important parts have been the most time consuming parts as we can see
in the Gantt diagram.
• The project has been conducted in cooperation with BSC and it is linked to a
bigger project.
4.3.2 Social
• The present social situation of the country is correct and the political situation
stable. The sector which involves the project is healthy and growing.
• The project is not going to influence the social and political situation in any way.
• The project was a need from BSC to have a tool to create power models.
• The project is neither going to improve consumer’s quality of life, nor is it going
to influence the consumer. This project has just added a new feature to the BSC
project.
• No collective is going to be harmed by the project.
4.3.3 Environmental
• The resources involving environmental impact have been electricity and the desk-
top components.
• The environmental impact of the activity if I had not conducted the project would
has been the same.
• The desktop, mouse and keyboard have been reused from older projects.
21
• The same as in the previous point, all the hardware resources involved in the
project are going to be reused in future projects.
• The project has used manufactured products (desktop, mouse, keyboard) which
are made from raw resources extracted from problematic countries(coltan). These
resources have been extracted in poor working conditions.
• The only pollution generated during the project has been for the production of
electricity.
• The project is about the power consumption of our programs, If we know the con-
sumption of our programs, we can improve them to be energetically more efficient,
so the project might influence the energetically footprint by reducing it.
4.3.4 Sustainability evaluation
Sustainable? Economic Social Environmental
Planification Economical viability Improvement in life quality Resource analysis
Valuation 5 8 9
Results Final cost versus prediction Social environment impact Resource usage
Valuation 10 9 9
Risks Adaptation to changes in scenario Social damage Environmental damage
Valuation 0 0 0
Total valuation 50
Table 4.5: Sustainability matrix
22
Chapter 5
Experimental setup
This chapters aim to explain how the different tools have been used and the setup used
in the experiments.
5.1 TaskSim
The architecture simulator to run the tests and obtain the statistics has been TaskSim.
The main information about TaskSim has been explained in previous chapters, but there
is still information which has to be cleared up.
TaskSim is a trace-driven architecture simulator which means that works with traces.
Before executing TaskSim, it is necessary to instrument the code (automatically done by
the compiler Mercurium[6]) and to generate the trace with Nanox[7]) (Nanox provides
a basic tracing infrastructure to ease the task of creating and reading different kinds
of traces). The traces are a collection of events (run-time events, memory accesses, ...)
which are sequentially stored. Once we have the trace, we can execute TaskSim with
the trace and a configuration file in order to simulate the application and obtain the
statistics.
The statistics generated by TaskSim can have different levels of abstractions. In the
TaskSim configuration file, we can select which level we wish. We can select burst
mode or memory mode. The burst mode only generates CPU bursts and memory mode
generates CPU bursts and simulates the memory accesses when available.
Nowadays, TaskSim does not provide information about the entire processor, so I have
not computed the power consumption of a processor with all the components involved
in it. I have selected those components that TaskSim simulates. These components are
mainly the components involved in memory accesses.
23
5.1.1 Configuration file
As we have seen in the previous section, there is a configuration file in order to configure
TaskSim with the processor specifications. The configuration file is composed by different
parts such as DRAM, MMU, Cache... Each part is a component that can be configured
by the user.
The variables that I have used in this project and also to configure McPAT are the
following ones:
• Number of cores.
• ROB size (Re-order Buffer).
• DRAM RAS size (Return Address Stack).
• Cache configuration (DL1 and L2).
– Size.
– Associativity.
– Block width.
– Latency.
– Number of w/r ports.
• MSHR (Miss status holding registers).
• Virtual memory page size.
Moreover of this values, there are other important ones which I have not parsed to
McPAT. These values have a direct impact to total number of cycles:
• Memory CPU commit rate.
• Memory latency.
• Processor idle cycles: How many cycles will be inserted whenever the scheduler
cannot provide a task for a CPU.
I have included an example of TaskSim configuration file in appendix A.
24
5.1.2 Results file
The results file generated by TaskSim gives us all the information collected during the
simulation. The information is given in an specific format. I have included an example
of results file in the appendix B.
The information which I have extracted from the results file and I have used to create
the power models can be seen below.
• Committed instructions.
• Total number of cycles.
• For each core:
– Number of loads/stores.
– For each L2 and DL1:
∗ Number of write/read.
∗ Number of hit/miss.
5.1.3 Additional information
I have used some additional information in order to configure McPAT that does not
appear on the results file or configuration file. This information is used during the
TaskSim execution and although it is not provided by the user, it is part of TaskSim
infrastructure.
• Frequency: TaskSim works at 1GHz.
• Cache policy: The cache policy used by TaskSim is copy back.
• Private or shared L2: TaskSim allows using private L2 or shared L2. Nowadays,
the selection has to be done through the TaskSim code. I have selected private L2
to do the tests.
• Heterogeneous cores: Each core differs from others.
• Out of order cores: TaskSim simulates out-of-order processors.
• Additional cache levels: TaskSim can incorporate as many cache levels (private
or not) as the user wants. These cache levels can be private or shared. In this
project, I have used two private cache levels (DL1 and L2).
25
5.2 McPAT
As I have explained in the previous chapters, McPAT can be considered a really complex
framework which allows us to calculate the power consumption of a processor. As it has
been a really important tool for the project, I have to explain how McPAT works in
order to get a rough idea of what I have done with it.
Figure 5.1: Block diagram of the McPAT framework
As we can see in the figure, given a configuration file, McPAT uses a sequence of steps
to generate the timing, area and power (not just dynamic power but also static and
short-circuit power) of a multicore processor. To generate all this information, McPAT
models the chip technology, architecture and circuits.
5.2.1 Configuration file
In order to configure McPAT and compute the power, it is necessary to create a xml file.
In the configuration file, McPAT allows us to specify the low-level configuration details
of our processor. It also provides default values If we want to specify only high-level
architectural parameters. The xml file has to contain all components of our processor
from which we want to obtain the power consumption with McPAT.
We can see an example of McPAT configuration file in the appendix C. The example
just contains the components that I have simulated.
5.2.2 Results file
For each component and sub-component (each component also has different sub-components),
McPAT gives us the following statistics:
26
• Area (mm2): Size of the component.
• Peak Dynamic (W): The absolute worst case power (maximum number of ac-
cesses).
• Subthreshold Leakage (W): The power wasted when a transistor that is sup-
posedly in the off state actually allows a small current to pass between its source
and drain.
• Subthreshold Leakage with power gating (W): The subthreshold leakage
with a technique to reduce the power consumption.
• Gate Leakage (W): The current that leaks through the gate terminal.
• Runtime Dynamic (W): The power spent in charging and discharging the ca-
pacitive loads when the circuit switches state.
I have included an example of McPAT results file in the appendix D.
As the timing of different executions with different cache sizes or number of CPUs it is
not equal, I have decided to compute the runtime dynamic energy for each test too. As
I have the frequency in which the processor works and the total number of cycles (both
things extracted from TaskSim) I can apply the formula below.
E(J) = P (W )× T (s) = P (W )× TotalCycles
Frequency(Hz)
(5.1)
5.2.2.1 Power gating
McPAT can report static power (subthreshold leakage) with power gating. Power gating
is a technique to reduce the power consumption. The technique consist in shutting off
the current to blocks of the circuit that are not in use. I have used static power with
power gating instead of without it due to the fact that nowadays all the actual processors
work with it and I have wanted to be as accurate as possible to a real consumption of a
processor.
Usually, the static power with power gating is the half of static power without power
gating. Here is an example:
27
32,768 65,536 131,072
12
14
16
18
20
22
24
26
DL1 Cache Size (bytes)
P
ow
er
(W
)
Without power gating
With power gating
Figure 5.2: Static power comparison
5.2.2.2 Processor modeling
The main components and sub-components which TaskSim models and I have used can
be seen below.
• Renaming Unit (RU)
• Memory Management Unit (MMU)
– Itlb
– Dtlb
• Execution Unit (EU)
– ALU
– Floating Point Unit
– Register Files
– Instruction Scheduler
• Instruction Fetch Unit (IFU)
– Instruction Cache
– Branch Target Buffer
– Branch Predictor
28
– Instruction Buffer
– Instruction Decoder
• Load Store Unit (LSU)
– Data Cache (DL1)
– LoadQ
– StoreQ
• L2 Cache
5.2.2.3 Cache modeling
In this project, I am going to focus on the power consumption of the memory components
(Cache DL1 and L2). To compute the power model of a cache, McPAT uses a tool called
CACTI[21]. CACTI is an integrated cache and memory access time, cycle time, area,
leakage and dynamic power model. Cacti is in charge to compute the power consumption
of data/tag cache array when it is read, searched or written.
In order to configure the cache, McPAT has the following variables:
• Capacity
• Block width
• Associativity
• Bank
• Throughput w.r.t core clock
• Latency w.r.t core clock
• Output width
• Cache policy
Also, for each cache it is needed:
• Number of write accesses.
• Number of read accesses.
• Number of read misses.
29
• Number of hit misses.
Besides, McPAT models the buffers involved in the cache, these buffers are:
• MSHR buffer.
• Write back buffer.
• Fill buffer.
• Prefetch buffer.
All the elements affect the power consumption of the cache. For example, when the cache
size is increased, the number of transistors increases too, so the area, peak dynamic and
the static power rise. It also increases the cache dynamic consumption although we might
be reducing the number of cache misses. The reason is because we are also increasing
the cache access time and the number of transistors to read/write the tag.
There are other variables which just affect the dynamic consumption, for example the
variables involved in cache accesses.
I have done a test in order to check the behaviour of the cache when cache size is
increased (All cache sizes have the same number of accesses).
First of all, how area is increased:
32,768 65,536 131,072
223
224
225
226
227
228
229
DL1 Cache Size (bytes)
A
re
a
(m
m
2
)
(a) Processor Total Area
32,768 65,536 131,072
6
8
10
12
14
DL1 Cache Size (bytes)
A
re
a
(m
m
2
)
(b) DL1 Total Area
Figure 5.3: Area comparison
Secondly, how power consumption is increased:
30
32,768 65,536 131,072
31.8
31.85
31.9
31.95
DL1 Cache Size (bytes)
P
ow
er
(W
)
(a) Processor Peak dynamic
32,768 65,536 131,072
2.15
2.2
2.25
2.3
DL1 Cache Size (bytes)
P
ow
er
(W
)
(b) DL1 Peak dynamic
Figure 5.4: Peak dynamic power difference
32,768 65,536 131,072
11.75
11.8
11.85
11.9
11.95
12
DL1 Cache Size (bytes)
P
ow
er
(W
)
(a) Processor Static power
32,768 65,536 131,072
0.3
0.35
0.4
0.45
0.5
0.55
DL1 Cache Size (bytes)
P
ow
er
(W
)
(b) DL1 Static power
Figure 5.5: Static power comparison
32,768 65,536 131,072
2.92
2.93
2.93
2.94
2.94
2.95
2.95
DL1 Cache Size (bytes)
P
ow
er
(W
)
(a) Processor Gate leakage
32,768 65,536 131,072
0.06
0.07
0.07
0.08
0.08
0.09
DL1 Cache Size (bytes)
P
ow
er
(W
)
(b) DL1 Gate leakage
Figure 5.6: Gate leakage comparison
31
32,768 65,536 131,072
6.24
6.26
6.28
6.3
DL1 Cache Size (bytes)
P
ow
er
(W
)
(a) Processor Dynamic power
32,768 65,536 131,072
0.08
0.1
0.12
0.14
0.16
0.18
DL1 Cache Size (bytes)
P
ow
er
(W
)
(b) DL1 Dynamic power
Figure 5.7: Dynamic power comparison
5.3 TaskSim-McPAT script
The main coding part of this project came from creating a script to parse the informa-
tion from TaskSim to McPAT. The script had to read the parameters from a TaskSim
results file and configuration file, and generate a McPAT configuration file with all the
information required to run it without problems.
First of all, I created a template with the required information to execute McPAT. The
template contained the basic components to run McPAT without problems, moreover, I
deleted the unnecessary components (network, cache directories, etc...).
After that, I created a python script to collect the information from TaskSim and store
it in a two-level dictionary. The first dictionary level contained the data roots (core,
configuration file or information of the whole system). The second level contained the
data name.
In order to add the information to the template I used a templating language for Python
called Jinja2. Jinja2 allowed me to generate a xml file with all the information from a
python dictionary.
All the information I collected from TaskSim was not enough to run McPAT, so more
information had to be added in order to run it. The biggest part of this information
was about the processor and core configuration. For example, the processor technology
used or the number of ALUs for each core.
32
McPAT has different configuration files extracted from real processor specifications. I
used real processors configuration files to fill the unknown information from my template.
The most part of the unknown information comes from an Intel Xeon.
5.4 Paraver script
With the TaskSim-McPAT script it is possible to know the total power consumption
of our application and the components involved in the execution. TaskSim also has
a feature which allows us to generate a Paraver trace. We can extract from Paraver
trace all the cache accesses for task and calculate the power consumption for each task.
I added this feature to the project, which allows us to observe the behavior of the different
application tasks in terms of power and how the power consumption is distributed along
the execution.
In order to do this, I have created a Paraver configuration file with the different cache
accesses of the different cache levels, then I have multiplied the different cache accesses
for the power consumption of a cache access (calculated by McPAT).
For each trace, the Paraver configuration file generates four different plots:
• Gradient color dynamic power consumption (W).
• Function line dynamic power consumption (W).
• Aggregated dynamic power consumption (W).
• Core burst.
33
Chapter 6
Evaluation
The results obtained from the different benchmarks have been explained in the following
chapter.
6.1 Vector copy
In all the following tests I have used a vector copy code. Vector copy has allowed me to
have for each vector iteration, a write and a read access.
6.1.1 Increasing vector size with private L2 cache
I have created this benchmark to check the behaviour of the power consumption from
the different components of a processor when the vector size is increased.
I have used the code below.
1 void computo ( i n t ∗x , i n t ∗y , i n t t imes ) {
2 i n t k , i ;
3 f o r ( k = 0 ; k < t imes ; k++ )
4 f o r ( i = 0 ; i < vecELEMENTS ; i++ )
5 x [ i ] = y [ i ] ;
6 }
Listing 6.1: Increasing vector size code
I have increased the vecELEMENTS size gradually, from 32768 elements to 262144 (each
element is an integer of 4 bytes). I have done the same test for 1 core, 4 cores and 8
cores. Moreover, I have selected a DL1 cache size of 128KB and a L2 cache size of 1MB.
34
I have also transformed the power consumption into energy consumption because the
difference of time between 1, 4 and 8 cores is different. The energy consumption for the
whole processor and the different components involved can be seen below.
VecSize Total Energy IFU RU LSU MMU EU L2
32768 0.112 0.011 0.013 0.034 0.008 0.044 0.001
65536 0.220 0.022 0.026 0.068 0.016 0.086 0.002
131072 0.444 0.044 0.053 0.137 0.033 0.174 0.003
262144 1.412 0.112 0.202 0.298 0.126 0.666 0.007
Table 6.1: Increasing vector size with 1 core
VecSize Total Energy IFU RU LSU MMU EU L2
32768 0.144 0.012 0.019 0.036 0.012 0.063 0.001
65536 0.275 0.024 0.036 0.071 0.023 0.119 0.002
131072 0.569 0.050 0.076 0.143 0.047 0.250 0.003
262144 1.752 0.127 0.265 0.314 0.166 0.873 0.007
Table 6.2: Increasing vector size with 4 core
VecSize Total Energy IFU RU LSU MMU EU L2
32768 0.196 0.015 0.029 0.038 0.018 0.095 0.001
65536 0.367 0.029 0.053 0.075 0.033 0.175 0.002
131072 0.774 0.059 0.114 0.152 0.071 0.375 0.003
262144 2.378 0.156 0.381 0.343 0.238 1.253 0.007
Table 6.3: Increasing vector size with 8 core
50,000 100,000 150,000 200,000 250,000
0
0.5
1
1.5
2
2.5
VecSize (Integers)
E
n
er
gy
(J
)
1 core
4 core
8 core
Figure 6.1: Processor dynamic energy comparison
35
Usually the test with most number of cores has the most energy consumption, this is
due the fact that the speed up obtained by that number of cores it is not the maximum
speed up possible.
6.1.2 Comparison between private and shared L2 cache
In this test I have changed the TaskSim code in order to test the same application with
shared L2 and with private L2. I have tested the application seen before with extra
vectors in order to create different tasks with different workloads.
1 f o r ( i = 0 ; i < TASKS ; i++ ) {
2 computo ( x , y , repsOUT ) ;
3 }
4 computo ( a , b , repsOUT ) ;
Listing 6.2: Shared and private cache code
The main difference between both configurations is the number of L2s, while a shared
L2 is one cache L2 shared between all the cores, in the private L2 case, all the cores
have their own L2 caches.
100,000 200,000 300,000 400,000 500,000
0
0.2
0.4
0.6
0.8
VecSize (Integers)
E
n
er
gy
(J
)
Shared
Private
(a) 1 core
100,000 200,000 300,000 400,000 500,000
0
0.2
0.4
0.6
0.8
VecSize (Integers)
E
n
er
gy
(J
)
Shared
Private
(b) 4 core
100,000 200,000 300,000 400,000 500,000
0
0.2
0.4
0.6
0.8
1
1.2
VecSize (Integers)
E
n
er
gy
(J
)
Shared
Private
(c) 8 core
Figure 6.2: Processor dynamic energy comparison
We can see that shared L2 gets problems when all the cores share the same cache and
are accessing to different vectors. The main difference between both configurations is
when both vectors do not fit into the cache L2, although one vector can fit.
Whether one of the vectors does not fit into the cache, or whether both of them fit, the
result is the same for all the configurations.
We can also see the difference of static power.
36
1 4 8
5
10
15
20
Core Size
P
ow
er
(W
)
Shared
Private
Figure 6.3: Processor static power comparison
6.1.3 Increasing the memory footprint
I have created a benchmark in order to increase the memory footprint for both cache
levels (DL1 and L2). The code which I have used can be seen below.
1 f o r ( k = 0 ; k < reps ; k++ )
2 f o r ( i = 0 ; i < vecELEMENTS ; i++ )
3 x [ i ∗BLOCK] = y [ i ∗BLOCK] ;
Listing 6.3: Increasing the memory footprint code
In this code, I have increased the variable BLOCK until the number of misses have been
the same as the number of accesses. The results for both cache levels can bee seen below.
1 4 8 12 16
0.12
0.14
0.16
0.18
BLOCK size
E
n
er
gy
(J
)
(a) Processor dynamic energy
1 4 8 12 16
0.01
0.015
0.02
0.025
0.03
BLOCK size
E
n
er
gy
(J
)
(b) DL1 dynamic energy
Figure 6.4: Increasing the memory footprint in DL1 dynamic energy consumption
37
1 4 8 12 16
0.2
0.4
0.6
0.8
BLOCK size
E
n
er
gy
(J
)
(a) Processor dynamic energy
1 4 8 12 16
0.012
0.0125
0.013
0.0135
BLOCK size
E
n
er
g
y
(J
)
(b) L2 dynamic energy
Figure 6.5: Increasing the memory footprint in L2 dynamic energy consumption
Both caches have more or less the same figure. We can see that in BLOCK size 1, 4 and
8 the number of misses does not increase too much and also that the vector still fits into
the cache. After BLOCK size 8, the vector does not fit inside the cache and the energy
consumption rises.
The big difference between DL1 and L2 is the processor energy consumption. While a
miss in cache DL1 means an access to L2, a miss in L2 means a memory access, which is
much slower than an access to L2. This is why the difference on figure 11.1.A between
BLOCK size 1 and BLOCK size 16 is smaller than on figure 11.2.A.
Moreover, we can see that L2 is less consuming than DL1. This is due the fact that in
a copy back policy, DL1 has to update in more levels than L2.
6.2 Vector Matrix multiplication
In this section, I have compared the power consumption between going throw a matrix
by columns and going throw it by rows. In order to compare both things, I have created
a vector-matrix multiplication test, which allows me to test that.
I have used the code below.
1 f o r ( i = 0 ; i < e lements ; i++ )
2 f o r ( j = 0 ; j < e lements ; j++ )
3 z [ i ] += x [ j ] ∗ y [ j ∗column+i ] ;
Listing 6.4: Columns code
38
1 f o r ( j = 0 ; j < e lements ; j++ )
2 f o r ( i = 0 ; i < e lements ; i++ )
3 z [ i ] += x [ j ] ∗ y [ j ∗column+i ] ;
Listing 6.5: Rows code
In this test, I have used the same configuration for both simulations, so the static power
is the same for both. The power dynamic is also not representative because of the time
difference between both tests.
Processor DL1 L2
0
0.5
1
1.5
2
2.5
3
3.5
0.
8
0
.1
4
2.
7
·1
0−
3
2
.9
8
0.
3
1
.4
7
·1
0−
2
E
n
er
gy
(J
)
Rows Columns
Figure 6.6: Dynamic energy comparison
We can see that the difference between both tests is huge. I have used a 4 core processor,
this is why the difference between the total dynamic energy and DL1 or L2 is so high.
6.3 PARSEC Benchmarks
I have created power models from the PARSEC (Princeton Application Repository for
Shared-Memory Computers) suite[1]. The suite focuses on emerging workloads and was
designed to be representative of the next-generation shared-memory programs for chip-
multiprocessors.
I have selected the following PARSEC benchmarks:
• Ferret: This application is used for content-based similarity search.
39
• Blackscholes: This application calculates the prices for a portfolio of European
options analytically with the Black-Scholes partial differential equation.
• Swaptions: This application uses the Heath-Jarrow-Morton framework to price
a portfolio of swaptions. Swaptions employs Monte Carlo simulation to compute
the prices.
• Dedup: This application compresses a data stream with a combination of global
and local compression.
• Streamcluster: Solves the online clustering problem.
For each benchmark, I have done the following:
• I have executed each benchmark for 1, 4 and 16 cores (I have not been able to
create power models of 4 and 16 cores for Dedup and Ferret benchmarks).
• For each core size, I have configured TaskSim with different L2 sizes (128KB,
256KB, 516KB, 1MB and 2MB).
• For each L2 size, I have executed TaskSim with different DL1 sizes (2KB, 4KB,
8KB, 16KB, 32KB, 64KB and 128KB).
• Moreover, I have done the test two times. The first time I have done it without
changing the latency and associativity values of cache DL1, those values have been
2 and 4 respectively. In the second time I have changed these values according to
the cache size:
– 2KB and 4KB: Latency 1 and associativity 2.
– 8KB and 16KB: Latency 2 and associativity 4.
– 32KB and 64KB: Latency 4 and associativity 8.
– 128KB: Latency 8 and associativity 16.
For the L2, I have used always the same values. The latency value has been 20 and the
associativity value has been 16.
For each benchmark I have created a heat map of:
• Processor dynamic energy consumption (J)
• Processor dynamic power consumption (W)
• Processor static power consumption (W)
40
• DL1 dynamic energy consumption (J)
• L2 dynamic energy consumption (J)
All the heat maps can be seen in appendix E. In the following sections, I have just
included the processor dynamic energy consumption for 1 core with fixed associativity
and latency and without it.
6.3.1 Ferret
Figure 6.7: Ferret processor dynamic energy consumption with fixed associativity
and latency
41
Figure 6.8: Ferret processor dynamic energy consumption with unfixed associativity
and latency
The graphic result it is due to the fact that the misses in cache L2 are being reduced
while L2 size increases. In small L2 size, there are a lot of cache misses so the memory
access time increases considerably so this affects the total number of program cycles.
So I can say, that in this benchmark, the best L2 cache size is the biggest one possible
(2MB in this case).
If we are looking for the best DL1 cache size, it is 8KB. In the DL1 8KB cache the
misses are reduced considerably from the previous smaller cache size. After this cache
size, cache misses reduce gradually but there is not an improvement due to the fact that
the power consumption of the cache rises too.
42
6.3.2 Blackscholes
Figure 6.9: Blackscholes processor dynamic energy consumption with fixed associa-
tivity and latency
43
Figure 6.10: Blackscholes processor dynamic energy consumption with unfixed asso-
ciativity and latency
In this benchmark it is easy to observe the best cache configuration. First of all, cache
misses in L2 are stable from the beginning (128KB size). So in this application only
matters the number of DL1 misses. In cache size 8KB the number of misses stops
decreasing. In the unfixed graphic, the best cache size is 4KB, this is due to the fact
that has smaller latency and associativity than 8KB.
44
6.3.3 Swaptions
Figure 6.11: Swaptions processor dynamic energy consumption with fixed associativ-
ity and latency
45
Figure 6.12: Swaptions processor dynamic energy consumption with unfixed associa-
tivity and latency
In this application we can observe that L2 size is not important, due to the fact that the
number of misses remains stable from the very beginning. The number of DL1 misses
is reduced by half while the cache size is increased. The best size if we are taking into
account the associativity and latency is 16KB, while if we are not, the best solution is
32KB.
46
6.3.4 Dedup
Figure 6.13: Dedup processor dynamic energy consumption with fixed associativity
and latency
47
Figure 6.14: Dedup processor dynamic energy consumption with unfixed associativity
and latency
In Dedup application both caches are important. The misses of L2 stop decreasing in
512KB size. We can see that bigger cache sizes are not the best ones. In cache DL1,
the number of misses is reduced by half as the cache size is increased. The best size for
the unfixed associativity and latency is 4KB while for fixed associativity and latency is
8KB.
48
6.3.5 Streamcluster
Figure 6.15: Streamcluster processor dynamic energy consumption with fixed asso-
ciativity and latency
49
Figure 6.16: Streamcluster processor dynamic energy consumption with unfixed as-
sociativity and latency
In this application, L2 misses are the most important thing. As we can see, in cache
sizes smaller than 1MB, the number of misses does not decrease really much. Once L2
size reaches 1MB, the number of misses decrease and becomes stable. For 1MB and
2MB the number of misses are the same.
In order to choose a DL1 size, the best solution would be to pick up the smaller one,
because the number of misses remains equal for all the DL1 cache sizes.
6.3.6 Average
I have computed the average of the processor power and energy consumption within the
different benchmarks (the average of processor power can be seen in the appendix).
Before doing this, and in order to compute the average energy consumption, I have
normalized its values.
With the average, we can check the best cache size combination for the PARSEC suite.
50
Figure 6.17: Average processor dynamic energy consumption with fixed associativity
and latency
Figure 6.18: Average processor dynamic energy consumption with unfixed associa-
tivity and latency
51
6.3.7 Further results
All the graphics included in this section, are the same for all benchmarks.
6.3.7.1 Static power
As bigger caches are used, more transistors are used, so more static power is consumed.
Figure 6.19: Processor static power
6.3.7.2 DL1 dynamic energy consumption
We can see that a smaller cache is usually the best option. The rate misses/hits has not
much influence on DL1 energy consumption although it has on other factors (L2).
52
Figure 6.20: DL1 dynamic energy consumption
Moreover, when misses are reduced by half in each cache size (if the number is really
big), then something like this can happen:
Figure 6.21: Swaptions DL1 dynamic energy consumption
53
6.3.7.3 L2 dynamic energy consumption
A bigger cache means a bigger consumption. In this graphic, for the same number of
accesses (vertical line), the best solution is always the smaller one. Moreover, we can
see that as we increase the DL1 size, we are decreasing the cache L2 number of accesses
(horizontal line).
Figure 6.22: L2 dynamic energy consumption
This is the result when the number of accesses in L2 remains equal in all DL1 sizes (we
have seen this in the streamcluster benchmark).
54
Figure 6.23: Streamcluster L2 dynamic energy consumption
6.4 Paraver visualization
6.4.1 Copy vector with different vector sizes
I have selected this test in order to check the behaviour of the power consumption along
the execution. This test allows me to see what happens in the different tasks when a
vector fits into the DL1 cache and when not. I have used the following code:
1 f o r ( i = 0 ; i < TASKS ; i++ )
2 computo ( x , y , repsOUT ) ;
3 #pragma omp taskwai t
4 f o r ( i = 0 ; i < TASKS ; i++ )
5 notask ( x , y ) ;
6 f o r ( i = 0 ; i < TASKS ; i++ )
7 computo ( x , y , repsOUT ) ;
8 #pragma omp taskwai t
Listing 6.6: Paraver code
I have selected a 4 core processor with shared L2 cache. I have selected a shared L2
cache because it is easier to observe what happens into the cache. I have also reduced
55
the vector size due to the fact that the Paraver trace is much bigger than the TaskSim
trace.
The first graphic is the core burst, in this graphic we can see how the tasks are divided
and how they are executed in the different CPUs.
(a) Vector does not fit
(b) Vector fits
Figure 6.24: Core burst
In the following figure, we can check how the cache power consumption is distributed.
56
(a) Vector does not fit
(b) Vector fits
Figure 6.25: Function line dynamic power consumption (0W - 0.094W)
Paraver can also create a gradient color plot.
57
(a) Vector does not fit
(b) Vector fits
Figure 6.26: Gradient color dynamic power consumption (0W - 0.094W)
Finally, I have generated the aggregated power consumption of the caches.
58
(a) Vector does not fit
(b) Vector fits
Figure 6.27: Aggregated dynamic power consumption (0W - 0.226W)
6.4.2 Swaptions with different cache sizes
I have extracted the power consumption distribution from swaptions. In this test I have
selected a 4 core processor with private L2 cache. The size of L2 cache has been 1MB. I
have done a test with a DL1 cache of 8KB and another one with a DL1 cache of 32KB
in order to corroborate the results seen in the previous chapter.
59
(a) 8KB DL1 Cache
(b) 32KB DL1 Cache
Figure 6.28: Core burst
60
(a) 8KB DL1 Cache
(b) 32KB DL1 Cache
Figure 6.29: Function line dynamic power consumption (0W - 0.056W)
61
(a) 8KB DL1 Cache
(b) 32KB DL1 Cache
Figure 6.30: Gradient color dynamic power consumption (0W - 0.056W)
62
(a) 8KB DL1 Cache
(b) 32KB DL1 Cache
Figure 6.31: Aggregated dynamic power consumption (0W - 0.381W)
As we can see in the function line figure and in the gradient figure, the power consump-
tion of 32KB DL1 cache is bigger than the 8KB cache although that the miss rate is
lower. This is due to the fact that the power consumption for access is bigger in 32KB
than in 8KB cache.
The aggregate power consumption is bigger in cache 8KB because of the power con-
sumption of L2. We can check the difference of L2 power consumption in the gradient
figure and in the function line figure.
63
Chapter 7
Conclusions
With all the work done, I have accomplished all the main objectives mentioned on the
project scope chapter:
• First of all, I have created a consistent tool in order to generate power models for
different applications and secondly I have evaluate the correct working of the tool
with different benchmarks as we can see in the evaluation chapter.
• I have explored different applications with different workloads. Moreover, I have
explored the best hardware for the PARSEC benchmark.
• I have created a tool in order to use Paraver to compute the power consumption.
Moreover, as we can see in the benchmarks, the variable which more influences the power
consumption of an application is the number of accesses (in the cache component). But
as we can observe in PARSEC benchmarks, we can vary our hardware in order to improve
the power consumption. Although in the test I have only varied the cache size, there
are a plenty of variables which can be modified in order to be more optimal. The tool
allows us to easily explore the best hardware suite to launch our executions in power
terms.
But if what we want is to change the behaviour of our application in order to reduce the
power consumption, we can check with Paraver the most power consuming parts and
try to optimize them.
64
7.1 Future work
Although I have done what has been possible with TaskSim, in a near future, TaskSim
is going to offer more functions like pipelining operations. The future work to do would
be to maintain the tool in order to add all the new statistics to the tool. With more
information about the executions, more accurate and real power models could be gen-
erated.
Moreover, it would be possible to integrate McPAT inside the TaskSim code, like other
simulators we have seen before. This would allow us to generate the power consumption
of an application immediately and give to TaskSim more functions.
It would be also reasonable to test the tool with all the verified benchmarks that TaskSim
project has. In this way, we could analyze which is the best hardware in order to support
all the benchmarks.
65
66
Appendices
67
Appendix A
TaskSim configuration file
1 ncpus = 1
2 mode se l e c to r = MEMORY
3 i d l e c y c l e s = 0
4 f o rwa rd t a sk i n ou t t o cpu = f a l s e
5 measure = FULL APPLICATION
6 maste r speedup ra t i o = 1 .0
7
8 [ Burst ]
9 p e r f r a t i o = 1
10
11 [MemCPU]
12 o u t b u f f s i z e = 4
13 r o b s i z e = 192
14 commit rate = 4
15
16 [ DL1Cache ]
17 mshr = DL1MSHR
18 l e v e l = 1
19 num−por t s = 1
20 l a t ency = 2
21 s i z e = 131072
22 l i n e−s i z e = 32
23 word−s i z e = 4
24 as soc = 1
25 vict im− l i n e s = 0
26 #vict im−cache−a f f i n i t y = ANY
27
28 [DL1MSHR]
29 s i z e = 128
30
31 [ L2Cache ]
32 mshr = L2MSHR
68
33 l e v e l = 2
34 num−por t s = 1
35 l a t ency = 10
36 s i z e = 1048576
37 l i n e−s i z e = 32
38 word−s i z e = 4
39 as soc = 1
40 vict im− l i n e s = 0
41
42 [L2MSHR]
43 s i z e = 128
44
45 [Memory ]
46 l a t ency = 10000
47
48 [ L2Bus ]
49 l a t ency = 1
50 width = 8
51
52 [DRAM]
53 f requency−d i v i d e r = 5
54 cas = 16
55 ra s = 16
56 precharge = 16
57 access−mask = rrrrrrrrrrrrrrRRRRBBBBhhhhhhhhhhbbb
58 input−bu f f e r = 4
59
60 [MMU]
61 page−s i z e = 8192
62 access−p r i o r i t y = t o t a l
63 a l l o c a t i o n−po l i c y = dynamic 2
64 empty−page−th r e sho ld = 16
65 backward−migrat ion−f a c t o r = 0 .4
66
67 [ Paraver ]
68 b u f f e r s i z e = 1000
69 hardware sampl ing in t e rva l = 100000
Listing A.1: TaskSim configuration file
69
Appendix B
TaskSim results file
1 Fina l c y c l e count : 762130109
2 0 : 1 : 3 3 : 4 : 2 7 : 3 : 2 : 3 : 1 8 : 4 : 4 5 : 2 : 1 7 : 2 : 4 2 : 2 : 3 2 : 1 5 : burs t count : Burst : 0
3 0 : under f lows :Memory CPU s t o r e s i z e histogram :Memory : 1
4 35240785:0 to 2 :Memory CPU s t o r e s i z e histogram :Memory : 1
5 0 :2 to 4 :Memory CPU s t o r e s i z e histogram :Memory : 1
6 0 :4 to 6 :Memory CPU s t o r e s i z e histogram :Memory : 1
7 0 :6 to 8 :Memory CPU s t o r e s i z e histogram :Memory : 1
8 0 :8 to 10 :Memory CPU s t o r e s i z e histogram :Memory : 1
9 0 :10 to 12 :Memory CPU s t o r e s i z e histogram :Memory : 1
10 0 :12 to 14 :Memory CPU s t o r e s i z e histogram :Memory : 1
11 0 :14 to 16 :Memory CPU s t o r e s i z e histogram :Memory : 1
12 0 : ove r f l ows :Memory CPU s t o r e s i z e histogram :Memory : 1
13 :Memory CPU s t o r e s i z e histogram :Memory : 1
14 0 : under f lows :Memory CPU load s i z e histogram :Memory : 1
15 35767668:0 to 2 :Memory CPU load s i z e histogram :Memory : 1
16 0 :2 to 4 :Memory CPU load s i z e histogram :Memory : 1
17 0 :4 to 6 :Memory CPU load s i z e histogram :Memory : 1
18 0 :6 to 8 :Memory CPU load s i z e histogram :Memory : 1
19 0 :8 to 10 :Memory CPU load s i z e histogram :Memory : 1
20 0 :10 to 12 :Memory CPU load s i z e histogram :Memory : 1
21 0 :12 to 14 :Memory CPU load s i z e histogram :Memory : 1
22 0 :14 to 16 :Memory CPU load s i z e histogram :Memory : 1
23 0 : ove r f l ows :Memory CPU load s i z e histogram :Memory : 1
24 :Memory CPU load s i z e histogram :Memory : 1
25 243743511:ROA number o f c y c l e s spent wai t ing f o r load re sponse s :ROA:0
26 1 :ROA t o t a l number o f output f u l l :ROA:0
27 2565160:ROA to t a l number o f s t a l l e d c y c l e s :ROA:0
28 344871603:ROA t o t a l number o f committed i n s t r u c t i o n s :ROA:0
29 35240785:ROA number o f s t o r e s sent :ROA:0
30 35241435:ROA number o f s t o r e s t o t a l :ROA:0
31 35767668:ROA number o f l oads sent :ROA:0
32 92295283:ROA number o f l oads t o t a l :ROA:0
70
33 0 : setup−task−io−r eque s t s : Cache : 2
34 71008453: r eques t s−proce s sed : Cache : 2
35 71008453: acks−sent : Cache : 2
36 7994975: r eques t s−sent : Cache : 2
37 7994975: acks−r e c e i v ed : Cache : 2
38 71008453: r eques t s−r e c e i v ed : Cache : 2
39 22 . 0646 : miss−time : Cache : 2
40 0 : wr i te noncompulsory miss : Cache : 2
41 0 : wr i t e compul sory mis s : Cache : 2
42 905377: w r i t e h a l f m i s s : Cache : 2
43 897348: wr i t e m i s s : Cache : 2
44 0 : w r i t e s h a r e d h i t : Cache : 2
45 0 : w r i t e e x c l u s i v e h i t : Cache : 2
46 33438060: w r i t e h i t : Cache : 2
47 0 : read noncompulsory miss : Cache : 2
48 0 : read compulsory miss : Cache : 2
49 1101851: r e ad ha l f m i s s : Cache : 2
50 4976888: read mis s : Cache : 2
51 0 : r e ad sha r ed h i t : Cache : 2
52 0 : r e a d e x c l u s i v e h i t : Cache : 2
53 29688929: r e ad h i t : Cache : 2
54 0 : setup−task−io−r eque s t s : Cache : 3
55 7994975: r eques t s−proce s sed : Cache : 3
56 7994975: acks−sent : Cache : 3
57 44270 : r eques t s−sent : Cache : 3
58 44270 : acks−r e c e i v ed : Cache : 3
59 7994975: r eques t s−r e c e i v ed : Cache : 3
60 200 . 359 : miss−time : Cache : 3
61 0 : wr i te noncompulsory miss : Cache : 3
62 0 : wr i t e compul sory mis s : Cache : 3
63 0 : w r i t e h a l f m i s s : Cache : 3
64 0 : wr i t e m i s s : Cache : 3
65 0 : w r i t e s h a r e d h i t : Cache : 3
66 0 : w r i t e e x c l u s i v e h i t : Cache : 3
67 2120739: w r i t e h i t : Cache : 3
68 0 : read noncompulsory miss : Cache : 3
69 0 : read compulsory miss : Cache : 3
70 2560 : r e ad ha l f m i s s : Cache : 3
71 33901 : read mis s : Cache : 3
72 0 : r e ad sha r ed h i t : Cache : 3
73 0 : r e a d e x c l u s i v e h i t : Cache : 3
74 5837775: r e ad h i t : Cache : 3
75 206 . 554 : access−time :DRAM:5
76 35016 : open :DRAM:5
77 35016 : precharge :DRAM:5
78 10369 : wr i t e :DRAM:5
79 33901 : read :DRAM:5
80 0 : AvgDataSize : Bus : 4
71
81 0 :Ack sent : Bus : 4
82 88540 :Ack r e c e i v ed : Bus : 4
83 88540 : Requests sent : Bus : 4
84 0 : Requests r e c e i v ed : Bus : 4
85 Clear count : 0
86 Forward migrat ion count : 0
87 Backward migrat ion count : 0
88 Total f o o t p r i n t : 0
Listing B.1: TaskSim results file
72
Appendix C
McPAT configuration file
1
2 <?xml ve r s i on =”1.0” ?>
3 <component id=”root ” name=”root”>
4 <component id=”system” name=”system”>
5
6 <param name=”number o f cores ” value=”1”/>
7 <param name=”number o f L1Di rec to r i e s ” va lue=”0”/>
8 <param name=”number o f L2Di rec to r i e s ” va lue=”0”/>
9 <param name=”number of L2s ” value=”1”/>
10 <!−−1 Private , 0 shared / coherent −−>
11 <param name=”Pr ivate L2 ” value=”1”/>
12 <param name=”number of L3s ” value=”0”/>
13 <param name=”number of NoCs” value=”0”/>
14 <!−−1 means homo −−>
15 <param name=”homogeneous cores ” value=”0”/>
16 <param name=”homogeneous L2s” value=”0”/>
17 <param name=”homogeneous L1Director ies ” va lue=”0”/>
18 <param name=”homogeneous L2Director ies ” va lue=”0”/>
19 <param name=”homogeneous L3s” value=”0”/>
20 <!−−cache coherence hardware −−>
21 <param name=”homogeneous ccs ” value=”1”/>
22 <param name=”homogeneous NoCs” value=”0”/>
23 <param name=”cor e t e ch node ” value=”45”/><!−− nm −−>
24 <param name=”t a r g e t c o r e c l o c k r a t e ” value=”1000”/> <!−− MHz −−>
25 <param name=”temperature ” value=”380”/> <!−− Kelvin −−>
26 <param name=”number cache l eve l s ” va lue=”2”/>
27 <!−−0: a g g r e s s i v e wire techno logy ; 1 : c on s e rva t i v e wire techno logy −−>
28 <param name=”i n t e r c onn e c t p r o j e c t i o n t yp e ” value=”0”/>
29 <!−−0: HP(High Performance Type ) ; 1 : LSTP(Low standby power ) 2 : LOP (
Low Operating Power ) −−>
30 <param name=”dev i c e type ” value=”0”/>
31 <param name=”l ong e r channe l d ev i c e ” value=”1”/>
73
32 <!−− 0 not enabled ; 1 enabled −−>
33 <param name=”power gat ing ” value=”1”/>
34 <param name=”mach ine b i t s ” va lue=”64”/>
35 <param name=”v i r t ua l add r e s s w id th ” value=”64”/>
36 <param name=”phys i c a l add r e s s w id th ” value=”52”/>
37 <param name=”v i r tua l memory page s i z e ” value=”8192”/>
38 <s t a t name=” t o t a l c y c l e s ” value=”247189664”/>
39 <s t a t name=” i d l e c y c l e s ” value=”100”/>
40 <s t a t name=”busy cyc l e s ” value=”247189564”/>
41
42 <!−− ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ co r e s ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ −−>
43 <component id=”core0 ” name=”core0”>
44 <!−− Core property −−>
45 <param name=”c l o c k r a t e ” value=”1000”/>
46 <!−− 0 means us ing ITRS de f au l t vdd −−>
47 <param name=”vdd” value=”0”/>
48 <!−− ”−1” means us ing d e f au l t power gat ing v i r t u a l power supply
vo l tage cons t ra ined by technology and computed automat i ca l l y −−>
49 <param name=”power gat ing vcc ” value=”−1”/>
50 <!−− f o r co r e s with unknown timing , s e t to 0 to f o r c e o f f the
opt f l a g −−>
51 <param name=”op t l o c a l ” va lue=”0”/>
52 <param name=” i n s t r u c t i o n l e n g t h ” value=”32”/>
53 <param name=”opcode width ” value=”16”/>
54 <param name=”x86” value=”1”/>
55 <param name=”micro opcode width ” value=”8”/>
56 <!−− i no rde r /OoO; 1 ino rde r ; 0 OOO−−>
57 <param name=”machine type ” value=”0”/>
58 <param name=”number hardware threads ” value=”1”/>
59 <param name=”fe t ch w id th ” value=”4”/>
60 <param name=”numbe r i n s t r u c t i on f e t ch po r t s ” value=”1”/>
61 <param name=”decode width ” value=”4”/>
62 <param name=”i s su e w id th ” value=”4”/>
63 <param name=”peak i s sue w id th ” value=”6”/>
64 <!−− commit width determines the number o f por t s o f r e g i s t e r f i l e s
−−>
65 <param name=”commit width” value=”4”/>
66 <param name=”f p i s s u e w i d t h ” value=”2”/>
67 <param name=”pred i c t i on w id th ” value=”1”/>
68 <param name=”p i p e l i n e s p e r c o r e ” value=”1,1”/>
69 <param name=”p ip e l i n e d ep th ” value=”31,31”/>
70 <param name=”ALU per core ” value=”6”/>
71 <param name=”MUL per core” value=”1”/>
72 <!−− For MUL and Div −−>
73 <param name=”FPU per core ” value=”2”/>
74 <!−− bu f f e r between IF and ID stage −−>
75 <param name=” i n s t r u c t i o n b u f f e r s i z e ” value=”32”/>
76 <!−− bu f f e r between ID and sche / exe s tage −−>
74
77 <param name=”de cod ed s t r e am bu f f e r s i z e ” value=”16”/>
78 <!−− 0 PHYREG based , 1 RSBASED−−>
79 <param name=”instruct ion window scheme ” value=”0”/>
80 <!−− McPAT support 2 types o f OoO cores , RS based and phys i c a l reg
based−−>
81 <param name=”in s t r u c t i on w indow s i z e ” value=”64”/>
82 <param name=”fp i n s t r u c t i o n w i ndow s i z e ” value=”64”/>
83 <!−− the i n s t r u c t i o n i s s u e Q as in Alpha 21264 ; The RS as in I n t e l P6
−−>
84 <param name=”ROB size” value=”192”/>
85 <!−− each in− f l i g h t i n s t r u c t i o n has an entry in ROB −−>
86 <!−− r e g i s t e r s −−>
87 <!−− X86−64 has 16GPR −−>
88 <param name=”ar ch i Reg s IRF s i z e ” value=”16”/>
89 <!−− MMX + XMM −−>
90 <param name=”arch i Regs FRF s ize ” value=”32”/>
91 <!−− i f OoO proces sor , phy reg number i s needed f o r renaming l o g i c ,
92 renaming l o g i c i s f o r both i n t e g e r and f l o a t i n g po int i n s t s . −−>
93 <param name=”phy Regs IRF size ” value=”256”/>
94 <param name=”phy Regs FRF size ” value=”256”/>
95 <!−− rename l o g i c −−>
96 <param name=”rename scheme” value=”0”/>
97 <param name=”r e g i s t e r w i ndow s s i z e ” value=”0”/>
98 <param name=”LSU order” value=”out−of−order”/>
99 <param name=” s t o r e b u f f e r s i z e ” value=”96”/>
100 <param name=” l o a d b u f f e r s i z e ” value=”48”/>
101 <!−− number o f por t s r e f e r to sus ta in−ab le concurrent memory a c c e s s e s
−−>
102 <param name=”memory ports” value=”1”/>
103 <param name=”RAS size ” value=”16”/>
104 <s t a t name=” t o t a l i n s t r u c t i o n s ” value=”246614721”/>
105 <s t a t name=” l o a d i n s t r u c t i o n s ” value=”45505417”/>
106 <s t a t name=”s t o r e i n s t r u c t i o n s ” value=”11327085”/>
107 <s t a t name=”commit t ed in s t ruc t i ons ” value=”246614721”/>
108 <s t a t name=”p i p e l i n e du t y c y c l e ” value=”1”/>
109 <s t a t name=” t o t a l c y c l e s ” value=”247189664”/>
110 <s t a t name=” i d l e c y c l e s ” value=”100”/>
111 <s t a t name=”busy cyc l e s ” value=”247189564”/>
112 <s t a t name=”IFU duty cyc le ” value=”0.25”/>
113 <s t a t name=”LSU duty cycle ” value=”0.25”/>
114 <s t a t name=”MemManU I duty cycle” value=”0.25”/>
115 <s t a t name=”MemManU D duty cycle” value=”0.25”/>
116 <s t a t name=”ALU duty cycle ” value=”1”/>
117 <s t a t name=”MUL duty cycle” value=”0.3”/>
118 <s t a t name=”FPU duty cycle ” value=”0.3”/>
119 <s t a t name=”ALU cdb duty cycle ” value=”1”/>
120 <s t a t name=”MUL cdb duty cycle ” value=”0.3”/>
121 <s t a t name=”FPU cdb duty cycle ” value=”0.3”/>
75
122 <param name=”number of BPT” value=”0”/>
123
124 <component id=”core0 . i c a che ” name=”i cache”>
125 <!−− the parameters are capac i ty , b lock width , a s s o c i a t i v i t y , bank ,
throughput w. r . t . core c lock , l a t ency w. r . t . core c lock , output width ,
cache po l i cy , −−>
126 <!−− c a ch e po l i c y ;//0 no wr i t e or write−though with non−wr i t e
a l l o c a t e ; 1 write−back with write−a l l o c a t e −−>
127 <param name=” i c a c h e c o n f i g ” value =”131072 ,32 ,8 ,1 ,8 ,3 ,32 ,1”/>
128 <!−− cache c o n t r o l l e r bu f f e r s i z e s : m i s s b u f f e r s i z e (MSHR) ,
f i l l b u f f e r s i z e , p r e f e t c h b u f f e r s i z e , wb bu f f e r s i z e−−>
129 <param name=”b u f f e r s i z e s ” value=”16 , 16 , 16 , 16”/>
130 </component>
131
132 <component id=”core0 . dcache” name=”dcache”>
133 <param name=”dcache con f i g ” value =”4096 , 32 , 4 , 1 , 4 , 2 , 32 , 1”/>
134 <param name=”b u f f e r s i z e s ” value =”128 ,128 ,128 ,128”/>
135 <s t a t name=”r e ad a c c e s s e s ” value=”22596345”/>
136 <s t a t name=”w r i t e a c c e s s e s ” value=”12902258”/>
137 <s t a t name=”read mi s s e s ” value=”1578157”/>
138 <s t a t name=”wr i t e m i s s e s ” value=”3528872”/>
139 <s t a t name=” c o n f l i c t s ” va lue=”0”/>
140 </component>
141 <param name=”number of BTB” value=”0”/>
142 <component id=”system . core0 .BTB” name=”BTB”>
143 <param name=”BTB config” value =”5120 ,4 ,2 ,1 ,1 ,3”/>
144 </component>
145 </component>
146
147 <component id=”system . L20” name=”L20”>
148 <param name=”L2 con f i g ” value =”524288 , 64 , 16 , 1 , 16 , 20 , 64 , 1”/>
149 <param name=”b u f f e r s i z e s ” value =”128 ,128 ,128 ,128”/>
150 <!−− number o f r , w, and rw por t s −−>
151 <param name=”por t s ” value=”1 ,1 ,1”/>
152 <param name=”c l o c k r a t e ” value=”1000”/>
153 <param name=”vdd” value=”0”/><!−− 0 means us ing ITRS de f au l t vdd
−−>
154 <param name=”power gat ing vcc ” value=”−1”/>
155 <param name=”dev i c e type ” value=”0”/>
156 <s t a t name=”r e ad a c c e s s e s ” value=”5107029”/>
157 <s t a t name=”w r i t e a c c e s s e s ” value=”3595533”/>
158 <s t a t name=”read mi s s e s ” value=”6327”/>
159 <s t a t name=”wr i t e m i s s e s ” value=”0”/>
160 <s t a t name=” c o n f l i c t s ” va lue=”0”/>
161 <s t a t name=”duty cyc l e ” value=”0.5”/>
162 </component>
163
164
76
165 <!−−∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗−−>
166 <component id=”system .mc” name=”mc”>
167 <param name=”number mcs” value=”0”/>
168 </component>
169 <!−−∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗−−>
170 <component id=”system . niu ” name=”niu”>
171 <param name=”number units ” va lue=”0”/>
172 </component>
173 <!−−∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗−−>
174 <component id=”system . pc i e ” name=”pc i e”>
175 <param name=”number units ” va lue=”0”/>
176 </component>
177 <!−−∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗−−>
178 <component id=”system . f l a s h c ” name=” f l a s h c”>
179 <param name=”number f la shcs ” value=”0”/>
180 </component>
181 <!−−∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗−−>
182 </component>
183 </component>
Listing C.1: McPAT configuration file
77
Appendix D
McPAT results file
1 McPAT ( ve r s i on 1 .3 o f Feb , 2015) r e s u l t s ( cur rent p r i n t l e v e l i s 3 , p l e a s e
i n c r e a s e p r i n t l e v e l to s ee the d e t a i l s in components ) :
2 ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗
3 Technology 45 nm
4 Using Long Channel Devices When Appropriate
5 In t e r connec t metal p r o j e c t i o n= agg r e s s i v e i n t e r connec t techno logy
p r o j e c t i o n
6 Core c l o ck Rate (MHz) 1000
7
8 Proces sor :
9 Area = 23.2491 mmˆ2
10 Peak Power = 10.0036 W
11 Total Leakage = 4.72842 W
12 Peak Dynamic = 5.27521 W
13 Subthreshold Leakage = 4.4027 W
14 Subthreshold Leakage with power gat ing = 2.04596 W
15 Gate Leakage = 0.325718 W
16 Runtime Dynamic = 1.20404 W
17
18 Total Cores : 1 co r e s
19 Device Type= ITRS high performance dev i c e type
20 Area = 23.2491 mmˆ2
21 Peak Dynamic = 5.27521 W
22 Subthreshold Leakage = 4.4027 W
23 Subthreshold Leakage with power gat ing = 2.04596 W
24 Gate Leakage = 0.325718 W
25 Runtime Dynamic = 1.20404 W
26
27 Core :
28 Area = 23.2491 mmˆ2
29 Peak Dynamic = 5.27521 W
78
30 Subthreshold Leakage = 4.4027 W
31 Subthreshold Leakage with power gat ing = 2.04596 W
32 Gate Leakage = 0.325718 W
33 Runtime Dynamic = 1.20404 W
34
35 I n s t r u c t i o n Fetch Unit :
36 Area = 3.78963 mmˆ2
37 Peak Dynamic = 0.741947 W
38 Subthreshold Leakage = 0.569678 W
39 Subthreshold Leakage with power gat ing = 0.279599 W
40 Gate Leakage = 0.0342316 W
41 Runtime Dynamic = 0.162322 W
42
43 I n s t r u c t i o n Cache :
44 Area = 1.31432 mmˆ2
45 Peak Dynamic = 0.150314 W
46 Subthreshold Leakage = 0.184268 W
47 Subthreshold Leakage with power gat ing = 0.102491 W
48 Gate Leakage = 0.00843643 W
49 Runtime Dynamic = 3.58793 e−09 W
50
51 Branch Target Buf f e r :
52 Area = 0.478297 mmˆ2
53 Peak Dynamic = 0.0219604 W
54 Subthreshold Leakage = 0.0252612 W
55 Subthreshold Leakage with power gat ing = 0.0148829 W
56 Gate Leakage = 0.000985866 W
57 Runtime Dynamic = 2.89928 e−10 W
58
59 Branch Pred i c to r :
60 Area = 0.0142151 mmˆ2
61 Peak Dynamic = 0.00360149 W
62 Subthreshold Leakage = 0.000765153 W
63 Subthreshold Leakage with power gat ing = 0.000414063 W
64 Gate Leakage = 6.53926 e−05 W
65 Runtime Dynamic = 3.6871 e−11 W
66
67 I n s t r u c t i o n Buf f e r :
68 Area = 0.0245435 mmˆ2
69 Peak Dynamic = 0.161309 W
70 Subthreshold Leakage = 0.0011165 W
71 Subthreshold Leakage with power gat ing = 0.000590575 W
72 Gate Leakage = 5.69632 e−05 W
73 Runtime Dynamic = 0.0268189 W
74
75 I n s t r u c t i o n Decoder :
76 Area = 1.85799 mmˆ2
77 Peak Dynamic = 0.358719 W
79
78 Subthreshold Leakage = 0.325606 W
79 Subthreshold Leakage with power gat ing = 0.146523 W
80 Gate Leakage = 0.0185411 W
81 Runtime Dynamic = 0.0894599 W
82
83 Renaming Unit :
84 Area = 0.23512 mmˆ2
85 Peak Dynamic = 0.414558 W
86 Subthreshold Leakage = 0.0349715 W
87 Subthreshold Leakage with power gat ing = 0.0160445 W
88 Gate Leakage = 0.00629456 W
89 Runtime Dynamic = 0.184171 W
90
91 Load Store Unit :
92 Area = 1.25411 mmˆ2
93 Peak Dynamic = 0.362141 W
94 Subthreshold Leakage = 0.139991 W
95 Subthreshold Leakage with power gat ing = 0.0632416 W
96 Gate Leakage = 0.0133767 W
97 Runtime Dynamic = 0.121196 W
98
99 Data Cache :
100 Area = 0.864812 mmˆ2
101 Peak Dynamic = 0.279442 W
102 Subthreshold Leakage = 0.0910443 W
103 Subthreshold Leakage with power gat ing = 0.0412158 W
104 Gate Leakage = 0.00610289 W
105 Runtime Dynamic = 0.0228989 W
106
107 LoadQ :
108 Area = 0.083675 mmˆ2
109 Peak Dynamic = 0.0164852 W
110 Subthreshold Leakage = 0.00692433 W
111 Subthreshold Leakage with power gat ing = 0.00311595 W
112 Gate Leakage = 0.000499133 W
113 Runtime Dynamic = 0.0151588 W
114
115 StoreQ :
116 Area = 0.100465 mmˆ2
117 Peak Dynamic = 0.0201706 W
118 Subthreshold Leakage = 0.00936056 W
119 Subthreshold Leakage with power gat ing = 0.00421225 W
120 Gate Leakage = 0.000628884 W
121 Runtime Dynamic = 0.0370955 W
122
123 Memory Management Unit :
124 Area = 0.166096 mmˆ2
125 Peak Dynamic = 0.119143 W
80
126 Subthreshold Leakage = 0.0377489 W
127 Subthreshold Leakage with power gat ing = 0.016987 W
128 Runtime Dynamic = 0.115107 W
129
130 I t l b :
131 Area = 0.0329161 mmˆ2
132 Peak Dynamic = 0.00201826 W
133 Subthreshold Leakage = 0.00254371 W
134 Subthreshold Leakage with power gat ing = 0.00114467 W
135 Gate Leakage = 0.000193075 W
136 Runtime Dynamic = 6.90441 e−11 W
137
138 Dtlb :
139 Area = 0.0329161 mmˆ2
140 Peak Dynamic = 0.00201826 W
141 Subthreshold Leakage = 0.00254371 W
142 Subthreshold Leakage with power gat ing = 0.00114467 W
143 Gate Leakage = 0.000193075 W
144 Runtime Dynamic = 6.90441 e−11 W
145
146 Execution Unit :
147 Area = 8.20112 mmˆ2
148 Peak Dynamic = 2.88025 W
149 Subthreshold Leakage = 1.30495 W
150 Subthreshold Leakage with power gat ing = 0.591728 W
151 Runtime Dynamic = 0.606439 W
152
153 I n t eg e r ALUs (Count : 6 ) :
154 Area = 0.47087 mmˆ2
155 Peak Dynamic = 0.651519 W
156 Subthreshold Leakage = 0.295671 W
157 Subthreshold Leakage with power gat ing = 0.133052 W
158 Gate Leakage = 0.0221076 W
159 Runtime Dynamic = 0.070378 W
160
161 Float ing Point Units (FPUs) (Count : 2 ) :
162 Area = 4.6585 mmˆ2
163 Peak Dynamic = 0.237683 W
164 Subthreshold Leakage = 0.731296 W
165 Subthreshold Leakage with power gat ing = 0.329083 W
166 Gate Leakage = 0.0546797 W
167 Runtime Dynamic = 0.211134 W
168
169 Complex ALUs (Mul/Div ) (Count : 1 ) :
170 Area = 0.235435 mmˆ2
171 Peak Dynamic = 0.100341 W
172 Subthreshold Leakage = 0.147835 W
173 Subthreshold Leakage with power gat ing = 0.066526 W
81
174 Gate Leakage = 0.0110538 W
175 Runtime Dynamic = 0.140756 W
176
177 L2
178 Area = 3.67669 mmˆ2
179 Peak Dynamic = 0.757168 W
180 Subthreshold Leakage = 0.454718 W
181 Subthreshold Leakage with power gat ing = 0.241067 W
182 Gate Leakage = 0.0256912 W
183 Runtime Dynamic = 0.0148072 W
Listing D.1: McPAT results file
82
Appendix E
PARSEC benchmark graphics
E.1 Ferret
E.1.1 1 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.1: Ferret 1 core processor dynamic energy consumption
83
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.2: Ferret 1 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.3: Ferret 1 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.4: Ferret 1 core DL1 dynamic energy consumption
84
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.5: Ferret 1 core L2 dynamic energy consumption
E.2 Blackscholes
E.2.1 1 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.6: Blackscholes 1 core processor dynamic energy consumption
85
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.7: Blackscholes 1 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.8: Blackscholes 1 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.9: Blackscholes 1 core DL1 dynamic energy consumption
86
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.10: Blackscholes 1 core L2 dynamic energy consumption
E.2.2 4 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.11: Blackscholes 4 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.12: Blackscholes 4 core processor dynamic power consumption
87
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.13: Blackscholes 4 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.14: Blackscholes 4 core DL1 dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.15: Blackscholes 4 core L2 dynamic energy consumption
88
E.2.3 16 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.16: Blackscholes 16 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.17: Blackscholes 16 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.18: Blackscholes 16 core processor static power consumption
89
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.19: Blackscholes 16 core DL1 dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.20: Blackscholes 16 core L2 dynamic energy consumption
90
E.3 Swaptions
E.3.1 1 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.21: Swaptions 1 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.22: Swaptions 1 core processor dynamic power consumption
91
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.23: Swaptions 1 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.24: Swaptions 1 core DL1 dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.25: Swaptions 1 core L2 dynamic energy consumption
92
E.3.2 4 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.26: Swaptions 4 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.27: Swaptions 4 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.28: Swaptions 4 core processor static power consumption
93
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.29: Swaptions 4 core DL1 dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.30: Swaptions 4 core L2 dynamic energy consumption
E.3.3 16 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.31: Swaptions 16 core processor dynamic energy consumption
94
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.32: Swaptions 16 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.33: Swaptions 16 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.34: Swaptions 16 core DL1 dynamic energy consumption
95
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.35: Swaptions 16 core L2 dynamic energy consumption
E.4 Dedup
E.4.1 1 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.36: Dedup 1 core processor dynamic energy consumption
96
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.37: Dedup 1 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.38: Dedup 1 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.39: Dedup 1 core DL1 dynamic energy consumption
97
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.40: Dedup 1 core L2 dynamic energy consumption
E.5 Streamcluster
E.5.1 1 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.41: Streamcluster 1 core processor dynamic energy consumption
98
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.42: Streamcluster 1 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.43: Streamcluster 1 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.44: Streamcluster 1 core DL1 dynamic energy consumption
99
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.45: Streamcluster 1 core L2 dynamic energy consumption
E.5.2 4 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.46: Streamcluster 4 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.47: Streamcluster 4 core processor dynamic power consumption
100
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.48: Streamcluster 4 core processor static power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.49: Streamcluster 4 core DL1 dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.50: Streamcluster 4 core L2 dynamic energy consumption
101
E.5.3 16 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.51: Streamcluster 16 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.52: Streamcluster 16 core processor dynamic power consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.53: Streamcluster 16 core processor static power consumption
102
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.54: Streamcluster 16 core DL1 dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.55: Streamcluster 16 core L2 dynamic energy consumption
103
E.6 Average
E.6.1 1 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.56: Average 1 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.57: Average 1 core processor dynamic power consumption
104
E.6.2 4 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.58: Average 4 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.59: Average 4 core processor dynamic power consumption
105
E.6.3 16 core
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.60: Average 16 core processor dynamic energy consumption
(a) Fixed associativity and latency (b) Unfixed associativity and latency
Figure E.61: Average 16 core processor dynamic power consumption
106
Bibliography
[1] C. Bienia. Benchmarking modern multiprocessors. January 2011.
[2] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness,
D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish,
M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput.Archit.News,
39(2):1–7, aug 2011.
[3] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Rein-
hardt. The m5 simulator: Modeling networked systems. Micro, IEEE, 26(4):52–60,
2006.
[4] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-
level power analysis and optimizations. In Computer Architecture, 2000. Proceedings
of the 27th International Symposium on, pages 83–94, 2000.
[5] BSC. Extrae. http://www.bsc.es/computer-sciences/extrae. Visited on 03/18/2015.
[6] BSC. Mercurium. https://pm.bsc.es/projects/mcxx. Visited on 03/18/2015.
[7] BSC. Nanos. https://pm.bsc.es/projects/nanox. Visited on 03/18/2015.
[8] BSC. Paraver: a flexible performance analysis tool. http://www.bsc.es/computer-
sciences/performance-tools/paraver/general-overview. Visited on 03/18/2015.
[9] T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the level of abstrac-
tion for scalable and accurate parallel multi-core simulation. In High Performance
Computing, Networking, Storage and Analysis (SC), 2011 International Conference
for, pages 1–12, 2011.
[10] L. J. Colmenar JM, Risco Martin JL. an overview of computer architecture and
system simulation. Technical report, 2011.
[11] A. Duran, E. Ayguade´, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and
J. Planas. Ompss: A proposal for programming heterogeneous multi-core architec-
tures. Parallel Processing Letters, 21(02):173–193, 2011.
107
[12] F. A. Endo, C. Damien, and H.-P. Charles. Micro-architectural simulation of embed-
ded core heterogeneity with gem5 and mcpat. In Proceedings of the 2015 Workshop
on Rapid Simulation and Performance Evaluation: Methods and Tools, RAPIDO
’15, pages 7:1–7:6, 2015.
[13] D. Genbrugge, S. Eyerman, and L. Eeckhout. Interval simulation: Raising the
level of abstraction in architectural simulation. In High Performance Computer
Architecture (HPCA), 2010 IEEE 16th International Symposium on, pages 1–12,
2010.
[14] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi.
Mcpat: An integrated power, area, and timing modeling framework for multicore
and manycore architectures. In Proceedings of the 42Nd Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, MICRO 42, pages 469–480, 2009.
[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi.
The mcpat framework for multicore and manycore architectures: Simultaneously
modeling power, area, and timing. ACM Trans.Archit.Code Optim., 10(1):5:1–5:29,
apr 2013.
[16] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R.
Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet’s gen-
eral execution-driven multiprocessor simulator (gems) toolset. SIGARCH Com-
put.Archit.News, 33(4):92–99, nov 2005.
[17] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio,
J. Eastep, and A. Agarwal. Graphite: A distributed parallel simulator for mul-
ticores. In High Performance Computer Architecture (HPCA), 2010 IEEE 16th
International Symposium on, pages 1–12, 2010.
[18] A. Rico, F. Cabarcas, C. Villavieja, M. Pavlovic, A. Vega, Y. Etsion, A. Ramirez,
and M. Valero. On the simulation of large-scale architectures using multiple ap-
plication abstraction levels. ACM Trans.Archit.Code Optim., 8(4):36:1–36:20, jan
2012.
[19] A. Rico, A. Duran, F. Cabarcas, Y. Etsion, A. Ramirez, and M. Valero. Trace-driven
simulation of multithreaded applications. In Performance Analysis of Systems and
Software (ISPASS), 2011 IEEE International Symposium on, pages 87–96, 2011.
[20] C. Sun, C. H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and
V. Stojanovic. Dsent - a tool connecting emerging photonics with electronics for
opto-electronic networks-on-chip modeling. In Networks on Chip (NoCS), 2012
Sixth IEEE/ACM International Symposium on, pages 201–210, 2012.
108
[21] S. J. E. Wilton and N. P. Jouppi. Cacti: an enhanced cache access and cycle time
model. Solid-State Circuits, IEEE Journal of, 31(5):677–688, 1996.
109
