Development of Energy Models for Design Space Exploration of Embedded
  Many-Core Systems by Klarhorst, Christian et al.
Development of Energy Models for Design Space Exploration of
Embedded Many-Core Systems
Christian Klarhorst†, Martin Flasskamp†, Johannes Ax†,
Thorsten Jungeblut†, Wayne Kelly*, Mario Porrmann†, and Ulrich Rückert†
† Cognitronics and Sensor Systems Group, * Science and Engineering Faculty
CITEC, Bielefeld University, Queensland University of Technology
Bielefeld, Germany Brisbane, Australia
ABSTRACT
This paper introduces a methodology to develop energy models for
the design space exploration of embedded many-core systems. The
design process of such systems can benefit from sophisticated mod-
els. Software and hardware can be specifically optimized based on
comprehensive knowledge about application scenario and hardware
behavior. The contribution of our work is an automated framework
to estimate the energy consumption at an arbitrary abstraction
level without the need to provide further information about the sys-
tem. We validated our framework with the configurable many-core
system CoreVA-MPSoC. Compared to a simulation of the CoreVA-
MPSoC on gate level in a 28 nm FD-SOI standard cell technology,
our framework shows an average estimation error of about 4 %.
KEYWORDS
energy modeling, many-core architecture, design space exploration
1 INTRODUCTION
Embedded system designers often have to deal with trade-offs. One
is to optimize a system that performs acceptable in the general
case instead of building an application-specific system. A general
purpose CPU excels an ASIC in flexibility but underlies it in perfor-
mance. However, a high configurability of hardware and software
increases the available design space tremendously.
An embedded system passes through many different design
stages, involving different groups of developers from the hardware,
software and system design domain. All have their own perspec-
tive on the system at different abstraction levels. For instance a
software developer does not know in detail how the usage of dif-
ferent hardware components affects the performance or energy
consumption of an application. Just as a hardware developer may
not have enough information about specific application require-
ments in an early design stage. Each group relies on estimations
of specific system metrics that are gathered by tools to perform
design decisions. Building optimized designs for specific use cases
often requires an information flow between all involved groups in
order to find solutions that are holistically optimal, which leads to
HIP3ES’18, January 2018, Manchester, United Kingdom
2018. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
a hardware software co-developing process. Therefore, it is appro-
priate to postpone trade-off decisions during the design process
in order to decide at the right time, as informed as possible and
for the right use case scenario. However, keeping more degrees of
freedom lead to a growing design space on all different abstraction
levels and the available and explored space better has to be handled
automatically.
Our target platform is the embeddedMPSoC architecture CoreVA-
MPSoC, which is highly configurable at design time. The CoreVA-
MPSoC toolchain has the objective to create an interface between
the different developer groups. It consists of an MPSoC system
specification in XML, parameterized VHDL SoC components, a
graphical user interface to configure MPSoCs, a cycle accurate sys-
tem simulator, a software backend based on the LLVM compiler
framework and our CoreVA-MPSoC compiler together with a broad
range of single and many-core applications. The VHDL SoC com-
ponents are bundled with their API specification into a component
library. The interaction between these components is depicted in
Figure 1.
The system needs to be accurately modeled during the design
process of hardware and software despite of the high configurability.
This can be achieved by combining partial models of individual
components like execution units or memories to a complete system
model. There are many different metrics to evaluate the system
depending on the specific abstraction level. In [5] we presented a
MPSoC 
Compiler
MPSoC 
GUI
LLVM
MPSoC Solution
SW + HW + Estimation
Prototype
ASICSimulator
Synth Tools
FPGA
MPSoC
Specification
Application Component LibraryAPI VHDL
Figure 1: Interaction of CoreVA-MPSoC toolchain compo-
nents
ar
X
iv
:1
80
1.
04
24
2v
2 
 [c
s.D
C]
  1
5 J
an
 20
18
HIP3ES’18, January 2018, Manchester, United Kingdom C. Klarhorst et al.
model to estimate the throughput of an MPSoC. In this work we
focus on modeling the energy consumption and how these models
influence the whole MPSoC design process. Energy efficiency is a
crucial design aspect for all kinds of processor systems. Obviously,
battery powered embedded systems are limited regarding their run
time. Additionally, the power consumption of high performance
systems limits their performance due to heat dissipation. The power
consumption can be addressed at all design stages. On the hardware
development side by choosing or dimensioning components and on
the software side by energy-aware compiler optimization. Hence,
we introduce an energy estimation framework for the CoreVA-
MPSoC toolchain.
Current research typically focuses on modeling effects like cache
behavior [11] or components like NoC routers [23] of a specific
system. In contrast, we want to cover a broad range of systems as for
instance the CoreVA-MPSoC as a highly configurable architecture.
A huge design space implies to trade off the required accuracy
against the setup time of the model. Therefore, our framework
generalizes the model definition and automates its creation. It is
able to create an energy model based on the needs of different user
groups that variate in granularity, accuracy and available time.
This work is organized as follows. The hardware and software
parts of the CoreVA-MPSoC project are introduced in Sections 2.1
and 2.2. An overview of related work in the domain of modeling
and estimating energy consumption is presented in Section 3. Sec-
tion 4 describes how energy estimation correlates with system
state transitions and Section 5 goes into details how the model cre-
ation is automatically done by our energy estimation framework.
The analysis of our energy estimation framework is presented in
Section 6.
2 THE COREVA-MPSOC PROJECT
As a target platform for our estimation framework presented in this
work, we use the self-developed many-core system CoreVA-MPSoC.
The CoreVA-MPSoC is used for application domains like signal
processing, vision processing, and encryption in embedded and
energy-limited systems. In [19] the CoreVA-MPSoC is presented as
a platform for Software-Defined-Radio applications.
2.1 Hardware Architecture
The CoreVA-MPSoC can be classified into the domain of many-
core architectures. Many-core architectures are defined by a very
...
CPU Cluster
NI
CPU NCPU 1CPU 0
Bus Interconnect
In
s
tr. 
M
e
m
CPU
Cluster
CPU
Cluster
CPU
Cluster
CPU
Cluster
D
a
ta
 
M
e
m
In
s
tr. 
M
e
m
D
a
ta
 
M
e
m
In
s
tr. 
M
e
m
D
a
ta
 
M
e
m
Shared L1 Data Mem.
CPU
Cluster
CPU
Cluster
CPU
Cluster
CPU
Cluster
CPU
Cluster
Figure 2: Hierarchical CoreVA-MPSoC architecture
high number of small-sized CPU cores, which are able to run ap-
plications with high resource efficiency. Kalray’s MPPA-256 [4]
and Adapteva’s Epiphany are typical examples for such many-core
architectures. In contrast to the Epiphany, our CoreVA-MPSoC fea-
tures a hierarchical communication infrastructure. The hierarchical
infrastructure consists of a Network-on-Chip (NoC) interconnect
that couples several CPU clusters (cf. Figure 2). Within each CPU
cluster several VLIW CPU cores are tightly coupled via a bus-based
interconnect.
The hardware architecture of our CoreVA-MPSoC is highly con-
figurable at design time on all different hierarchy levels.
Basic building block and lowest hierarchy level of our MPSoC is
the VLIW CPU CoreVA that is designed to provide high resource
efficiency [12]. According to the application domain, the number
of VLIW slots, ALUs, multiplication, and division units can be
configured at design time. Each CPU has local instruction and
data memories, which are used as software-managed scratchpad
memories, without the use of caches.
A CPU cluster is used to tightly couple up to 32 of those CPU
cores using a bus-based interconnect. The number of CPUs and
the interconnect fabric can be configured at design time and sup-
ports either AMBA AXI4 or Wishbone standard, each featuring
a crossbar or a shared bus [17]. Within a cluster, a CPU can ac-
cess the local data memory of every other CPU in a Non-Uniform
Memory Access (NUMA) fashion. Flexibility and communication
bandwidth can further be improved by integrating a shared data
memory into a cluster, which is tightly coupled to each CPU of the
same cluster [18].
For realizing MPSoCs with dozens or hundreds of CPU cores a
Network on Chip (NoC) is introduced to the CoreVA-MPSoC. The
NoC acts as a second interconnection hierarchy level and connects
multiple CPU clusters. The NoC features a packet based wormhole
switching. Packets are segmented into flits with 64 bit payload data.
The NoC interconnect is built up of routers, each having a config-
urable number of ports. This flexibility permits the implementation
of most common network topologies. Each CPU cluster is connected
to the NoC via a network interface (NI) [1], which bridges between
the address-based communication of the clusters and the flow- and
packet-based communication of the NoC. Therefore, it provides
a certain number of independent uni-directional channels. These
channels are synchronized and the packet data is directly stored
to and read from each CPU’s local data memory or the cluster’s
shared L1 data memory. Hence, the CPUs can benefit from the low
access latency of its local memory.
2.2 Programming Infrastructure
To program the MPSoC, the inter-CPU communication is encap-
sulated by our communication C-library (API), transparently for
both cluster and NoC communication. The CPUs communicate via
a block-based synchronization model and unidirectional commu-
nication channels [9]. To run C applications, we have created a C
compiler tool chain based on the LLVM compiler infrastructure.
Our custom backend supports VLIW and SIMD vectorization.
Nevertheless, efficiently programming a complete MPSoC with
many CPUs is a challenge for the programmer. For this reason we
have developed a parallelizing compiler for streaming applications
Development of Energy Models for Design Space Exploration HIP3ES’18, January 2018, Manchester, United Kingdom
to assist in programming the CoreVA-MPSoC [5, 9]. This compiler,
called CoreVA-MPSoC compiler, processes applications written in
the StreamIt language [20]. An application is represented by a struc-
tured data flow graph, which describes the inherent parallelism of
its tasks. Our compiler for streaming applications searches for a
valid partition with the best performance. It utilizes an approach
based on simulated annealing to map the tasks of a program onto
the individual CPUs of the MPSoC. A typical optimization goal is
to find a placement for the application’s tasks which maximizes the
throughput or minimizes the latency of an application. During the
partitioning process the compiler exploits three degrees of freedom
to alter an application’s data flow graph. Firstly, the compiler de-
cides on which processor a task is placed. Secondly, a task can be
cloned to exploit data parallelism. Thirdly, the granularity of work
done in each iteration can be increased to reduce the overhead of
communication. These changes are called mutations and further
increase the search space for the partitioning algorithm.
Every partition is judged regarding the achieved performance,
by a Simulation Based Estimation (SBE) model, which combines
a single execution-based simulation and an analytic approach [5].
MPSoC configurations with the same number of CPUs can differ
in many ways like topology, communication infrastructure and
memory architecture. All these characteristics are modeled in our
SBE to achieve accurate performance estimations. In addition, a
partition is also checked if any hardware limits of a specifiedMPSoC
configuration are exceeded. Therefore, our SBE model estimates
the consumption of data memory and enables the partitioning
algorithm to reject partitions exceeding the physically available
memory.
Finally, the best partition is used to generate an individual C
code file for each CPU of the MPSoC. With the energy estimation
framework presented in this paper, our SBE model can be extended
to estimate the expected energy consumption of an application and
perform energy-aware compilers optimization.
3 RELATEDWORK
Energy consumption is still a research topic of much interest. On
the one hand power consumption imposes limits on the system
performance due to e.g. heat dissipation. On the other hand power
efficiency often limits the diversity of target applications. Research
focuses on energy models, simulators and configuration space ex-
ploration, often to characterize various system aspects. Differences
are in the purpose, target user group, level of detail and the method
to infer and analyze those models. The purpose might be mainly
driven by the needs of the various development process levels to
provide benefits for the involved user groups. Research was often
used to limit the design-space and point out design inefficiencies
as early as possible [21].
In the area of hardware development research focuses on dif-
ferent levels of detail. On gate level Roselló et al. [16] present an
energy model for domino CMOS gates based on a detailed descrip-
tion of internal capacitance switching and discharging currents.
Many models target specific processor components for example
Dramsim [15] and Cacti [22] are well-known tools used for timing
and energy analysis of DRAMS and caches. In [23]Wolkotte et al. fo-
cus on NoC routers while the Orion2 [8] network simulator models
complete Network on Chips. Energy simulation tools for different
kinds of processors are provided by the frameworks Watch [3] for
CPUs and GPUWatch [10] for GPUs. Just as by the work of Vasilakis
et al. [21] for ARM processors and Jordans et al. [7] for VLIW proces-
sors. An instruction-level energy characterization of the Adapteva
Epiphany SoC was done by Ortiz et al. in [13]. They measured
the energy consumption of the SoC with a digital current meter
while executing a number of microbenchmarks. The McPat frame-
work [11] focuses on complete SoCs by describing fundamental
components of multi-core and many-core processor configurations
in a flexible XML interface. Specific properties and implementation
details like cache behavior and timing are described quantitatively.
By evaluating the model McPat estimates power, area and timing.
Hesse et al. created a power consumption model of a complete
circuit board of a wireless sensor node [6]. The model containing
sensor, microcontroller, wireless transceiver and power supply is
utilized for choosing components at an early design stage. In the
compiler research area the configuration space of compiler options
and its impact on power consumption was explored by Pallister
et al. [14]. They performed power measurements on five embedded
platforms to ensure all architectural effects are captured. All user
groups focus on different levels of detail or granularity and admit
different error margins.
Besides the target platform or components the aforementioned
work further differs in theirmethodology. TheMcPat framework [11]
estimates energy based on a model specification while other work
gathers data from various sources to fit models like Ortiz et al.
did by measuring the MPSoC [13]. Multiple options to create and
gather data for the second option may be found in research. Mi-
crobenchmarks which are often handwritten are used for the be-
havior creation in contrast to real applications or benchmark suits.
Furthermore data can be gathered by simulating those systems or
by measuring final ASICs under real world conditions. The simula-
tion method provides fine granular power analysis, e.g. it is possible
to gather power data for all components of a CPU using gate level
analysis at the cost of an often compute intensive but scalable simu-
lation Finally, the simulation method is able to provide data before
tape out and it is able to simulate various environment conditions.
While the measurement method provides better accuracy because
it incorporates all physical effects and uses real word conditions.
IS-Simulator EnergyEstimation
Model
Constants
Function
Gatelevel
Simulation
Application
Hardware
Specification
Figure 3: Exemplary energy estimation of instruction set
simulation by model created at gate level
HIP3ES’18, January 2018, Manchester, United Kingdom C. Klarhorst et al.
4 ENERGY ESTIMATION METHODOLOGIES
Energy consumption of digital circuits originates from different
physical effects. The static power consumption is the result of leak-
age current of the transistors. While the dynamic power consump-
tion is composed of switching current, a transistor switches from
one logic state to another, and the switching frequency dependent
load of charging external load capacitance. The state of the entire
system can be described by the states of all components.
Our idea is to describe the system state and correlate energy con-
sumption with the system state transitions. This also gives us the
ability to correlate energy consumption from a lower abstraction
level with high-level state transitions. Figure 3 shows exemplary
how the result of gate level simulations can be utilized to develop
an energy model on logic level. This model can be used together
with a cycle accurate system simulator to estimate the energy con-
sumption.
Model Constants and Functions
The state of a system can be described at all abstraction levels. An
overview of abstraction levels shows Table 1. On circuit level the
system state is described by differential equations of the transistors.
Similarly, instructions define the system state at the higher cycle
abstraction level. Without additional information we could now
perform energy measurements and correlate them with state transi-
tions. A comprehensive model containing all possible state transfor-
mations could be created by performing nmeas = nstates ×nstates
measurements. But unfortunately an incalculable amount of ways
to describe the system state exist.
This correlation is visualized in Figure 4 on the basis of the
CoreVA-MPSoC. Figure 4 (a) show a sparse model which only dis-
tinguishes whether a component was used (coloured) or unused
(grey) during the period of measurement. In contrast the model in
(c) breaks down the state of each component to a very fine-grained
state description, visualized by multicolored bar charts on the CPU
components and pie charts on the NoC routers. For instance this
model considers the variation of power consumption for different
CPU instructions. A trade-off between both attempts depicts model
(b), which considers the time a used components was in active or
idle state. In this case the state description of a component contains
no redundant states.
On cycle level system state transitions might be caused by a
sequence of instructions as a result of an execution of a function.
The energy consumption per instruction forms the model constants
on cycle level (cf. Figure 3). Setting up the model includes the
creation of system behavior, the capturing of the system’s energy
Table 1: Abstraction levels
Abstraction Level System State Granularity
Circuit Transistors Differential Equations
Logic Gates Boolean Functions
RTL Registers RTL Spec.
Cycle Instructions Instruction
Task Functions System Spec.
Application Program Application Spec.
consumption during simulation and the final model determination.
Accordingly, the setup time for a simple CPU core model would
include the generation of micro benchmarks, gate level simulations
and fitting the model constants.
On each abstraction level the system state is modeled with a
different granularity. Therefore, we added a function to the model
to transform the state description from simulation into the state
description used by the model. It is able to add, change and remove
state information and provides the necessary flexibility while it
may also include additional system assumptions to simplify e.g.
reduce the system description. This reduction results in faster model
evaluation and setup time. The same model constants can be reused
on different abstraction levels by transforming the system state
with different model functions.
The model is evaluated by executing the model function and
combining the resulting state with the model parameters. Since no
simulation is required the model evaluation is only a fast computa-
tion.
Exemplary MPSoC Component Models
As an example the following paragraph describes the creation of an
model for an ideal NoC communication. The NoC of the CoreVA-
MPSoC is organized as a 2D-mesh, described in Section 2.1. A
transmission can be characterized by the coordinates of the involved
CPUs and the size of the transmitted buffer. So a naive model would
contain all possible permutations of the coordinates and all possible
packet sizes resulting in a huge model.
The path on which a packet passes through the NoC can be
characterized by a sequence of independent hops (routers). That
way, the foregoing model can be drastically reduced by removing
the redundancy of individual links. The reduced model would then
contain all number of hops and possible packet sizes together with
a function to define the new distance metric number of hops, for
example by nhops = manhattan_dist(x ,y).
More knowledge about the hardware could be incorporated into
the model to reduce it even further. For example, one could assume
System State Description FullSparse
Redundancy No GainLoss of Accuracy
(a) (b) (c)
Figure 4: Abstraction levels of system state description per
period of measurement
Development of Energy Models for Design Space Exploration HIP3ES’18, January 2018, Manchester, United Kingdom
that the energy consumption will increase linearly with the packet
size (under the assumption of no congestion). A model containing
all available number of hops together with two constants for the
linear functions would be reduced again but possibly at the expense
of an increasing estimation error. More detailed results and further
analysis follow in Section 6.
5 ENERGY MODELING FRAMEWORK
The contribution of our work is a framework to estimate the energy
consumption at an arbitrary abstraction level without the need
to provide further information about the system. However, the
framework allows for adding additional knowledge to the system
to improve estimation accuracy or to accelerate the model setup
time.
Back annotated simulations on gate level could be employed to
create a model, as long as one is in possession of the RTL (VHDL,
Verilog) sources of the system and a process design kit or corre-
sponding IP-Cores. Afterwards the simulation results can be an-
alyzed by commercial tools like Cadence© Voltus™ to obtain an
energy estimation. Unfortunately, comprehensive gate level simu-
lations for a complex system like the configurable CoreVA-MPSoC
architecture would imply a large number of simulations. Accord-
ingly, these require a lot of simulation time and resources as well
as tool licenses. Therefore, our framework enables the developer
to obtain an energy estimation faster and with an acceptable loss
of accuracy with the help of reducing the number of necessary
simulations by including system knowledge into the model.
We require a set of system states and energy information to find
the correlation parameters for the specific energy model. The sys-
tem states (the system behavior) are created by a previously selected
application or bymicrobenchmarks and a gate level simulation. This
simulation provides the necessary state information together with
the occurred switching activity data that is fed into the tools for
energy measurement. We further provide a trace infrastructure,
implemented up to the C level, in order to provide a method to
create additional, high level system state information that could be
used in the model or by the model function. We developed a tool to
automatically create a broad range of microbenchmarks based on
user preferences and on the available information in the toolchain.
The tool is able to variate the use pattern of all SoC components
that have a specified API description. This includes for example
all different communication and memory components. It is further
able to create microbenchmarks for the CoreVA CPU cores. This
is done automatically by parsing the instruction set description al-
ready available in the software toolchain. The tool always generates
setup code in front of every test case in order to bring the CoreVA
into a defined state to have a reproductive measuring environment.
6 VALIDATION
There are many effects in a System on Chip that have an impact
on the energy consumption. This work focuses on a many-core
system with a 2D-mesh NoC which connects multiple clusters each
containing four fully connected CoreVA CPUs. The CoreVA CPU
cores include two VLIW slots with one load/store and multiplica-
tion/division unit. Each CPU is configured with 16 kB of instruction
memory organized in two banks and 16 kB of local data memory
equally organized. Each cluster additionally contains 64 kByte of
shared data memory, tightly coupled to each CPU and organized in
eight banks. Each memory bank in the system contains 2 k words
with 32 bit word-size. Each network interface (NI) supports 128
synchronous channels with an effective bandwidth of 64 bit per
clock cycle (cf. Section 2.1). The cluster interconnects are carried
out as crossbars with a width of 64 bit.
One CPU cluster occupies 0.817mm2 of chip area using a 28 nm
FD-SOI standard cell technology: STMicroelectronics, 10 metal
layer, worst case corner 1.0 V, 125 ◦C. Allows for maximum clock
frequency of 700MHz. One CoreVA CPU dissipates 4.5mW during
idle state and up to 30mW during active computation. A full CPU
cluster has a total power consumption of up to 150mW. The layout
corresponds with the design shown in Figure 5. In addition to the
CPUs, memory and the crossbar, each cluster layout includes NoC
components, like the NI and a router with connection ports to all
four directions. Multiple of these clusters can be directly strung
together on a top level P&R step to build a scalable MPSoC with
a 2D-Mesh NoC, as we have presented in [2]. As we pointed out
earlier, many scenarios exist where accurate energy metrics are
important for the final result. This already begins in an early design
stage, e.g. during the specification of the CPU’s feature and instruc-
tion set. With our toolchain it is possible to get a fast indicator for
the energy consumption of the instruction set. We automatically
create microbenchmarks, which variate just the instructions while
setting all registers and the memory content to a specific value.
For instance, the description of CoreVA CPU instruction set (cf.
Section 5) contains 155 different definitions of instructions, which
results in 20,093 different instruction groups for the two VLIW slots
(not all instructions could be placed on an arbitrary slot). All in-
struction groups were analyzed for three data patterns. This part of
the model leads to Figure 6 which shows the mean power consump-
tion of the CoreVA CPU instructions for the given configuration.
The energy consumption is shown in pico joule and broken down
into the CPU components core, instruction and data memory as
well as a small constant unclassified section. Instructions of similar
behavior are categorized into the groups NOP, SIMD, Load/Store
and Miscellaneous due to illustrative purposes. An empty VLIW
slot is depicted as Empty. Non SIMD arithmetic instructions are
included in the Miscellaneous instruction group.
Shared 
Mem.
Shared 
Mem.
CPU CPU
CPUCPU
B
u
s
NI
RouterNI-Mem.
P
o
rt
 3
Port 4
Port 2
P
o
rt
 1
Figure 5: Physical layout of a CPU cluster node. Left: Placed
design, Right: Routed design
HIP3ES’18, January 2018, Manchester, United Kingdom C. Klarhorst et al.
As one might expect, the NOP instruction leads to the lowest
energy consumption of the CPU core. Whereas the most energy is
consumed when a SIMD instruction is executed due to the highest
core activity. The data memory shown in blue has obviously an
impact only on the power consumption of Load/Store instructions.
The worst case difference between the three data patterns men-
tioned above is 2.6mW. The energy consumption of the instruction
memory depends on the VLIW instruction format. A VLIW instruc-
tion for a CoreVA CPU with two slots may contain instructions for
both or only one slot. The latter case results in instruction com-
pression to save space in memory and energy on access as seen in
Figure 6. The provided hardware macro of a single CoreVA CPU has
a minimum power usage of 15.3mW and a maximum of 38.0mW.
The model creation needs 55 h of gate level simulation and 70 days
of power analysis time to run 60,279 different tests on a single In-
tel® Xeon® E5-1650 v4 CPU core. Actually, the tests are distributed
onto 26 CPU cores in parallel.
Many detailed effects on the power consumption were left aside,
like the position of the instruction or the data dependency. There-
fore, we investigated how the specific instruction memory address
effects the energy consumption of the instruction memory. We
create microbenchmarks where the position of the instruction is
variated while the rest of the system state remains constant. An
16 kb instruction memory is able to store 2048 2-Slot and up to
4096 1-Slot CoreVA instructions because of instruction compres-
sion. The energy consumption for the first 800 instruction memory
addresses is shown in Figure 7. All addresses were filled with the
same instruction hence the energy consumption only depends on
the accessed address and is not a function on the memory content.
The position creates an energy consumption range of 1.7 pJ for two
VLIW slot instructions and 3.2 pJ for a single slot, which can be
explained by the internal interconnect of the memory blocks. We
assume that the arrangement of the data points is caused by the
tree like structure of the memory address decoder.
No
p-E
mp
ty
Mi
sc
-E
mp
ty
SIM
D-
Em
pty
No
p-N
op
No
p-M
isc
Mi
sc
-N
op
No
p-S
IM
D
Mi
sc
-M
isc
SIM
D-
No
p
Mi
sc
-S
IM
D
Ld
/St
-E
mp
ty
Ld
/St
-N
op
Ld
/St
-M
isc
Ld
/St
-S
IM
D
CoreVA Instruction Group
0
10
20
30
40
50
60
Av
era
ge
 En
erg
y C
on
su
mp
tio
n i
n p
J
Unclassified
Core
Instruction Memory
Data Memory
Figure 6: Average power consumption of CoreVA CPU in-
struction groups for two VLIW slots divided into CPU com-
ponents
A communication transfer between two CPU cores across the
NoC involves the two CPU cores itself, routers, the interfaces be-
tween CPU cluster and NoC (NI) and other communication in-
frastructure like the bus interconnect within a CPU cluster. Our
toolchain is able to generate benchmarks with the tool presented
in Section 5 to analyze communication characteristics in an auto-
mated way based on an API description. Therefore, it is possible to
variate the packet size for a NoC communication between two cores
on different clusters. Figure 8 shows the energy consumption of
specific transfer sizes. The packet size variates from 4 to 1024 byte
in 4 byte increments. The zoomed view includes 36 data points from
374 to 446 in 2 byte increments. Due to the 64 bit flit size, into which
all NoC packets are segmented (cf. Section 2.1), the plot shows a
staircase function. One bar depicts the energy consumption in nJ
and is separated into the shares of the involved components. CPU
synchronization, data memories, bus, router and the NoC interface
are shown in red, blue, green, bright red and gray. The unclassified
part shown in yellow contains, among others, especially the parts
of the CPU cluster interconnect, which are not assigned to the
bus part. The costs for CPU synchronization are software costs to
synchronize the communication channel. More details about these
software costs for communication are presented in [1].
Generating the results of 256 data points for the plot of Figure 8
took 40 h of simulation time and 170 h for power analysis. The data
was further approximated with a linear and a staircase function as
shown in Section 4. We used 16 data points around the center for
the fit in order to reduce the simulation time to 3 h and the power
analysis time to 11 h. The linear function shown as a blue line had
a maximum error of 62 pJ while the staircase function shown as a
green line had a maximum error of 52 pJ.
We validated the applicability of our approach by estimating
the energy consumption for a set of five many-core benchmarks
from the StreamIt benchmark suite [20]. The analyzed benchmarks
are chosen from the signal processing, matrix multiplication and
sorting algorithm domain. Compared to a gate level simulation the
average estimation error is about 4%with amaximum absolute error
21.0
21.5
22.0
22.5
23.0
Nop   -   Nop Nop   -  None
0 100 200 300 400 500 600 700 800
Instruction Memory Address
9.0
9.5
10.0
10.5
11.0
11.5
12.0
12.5
13.0
En
er
gy
 C
on
su
m
pt
ion
 in
 p
J
Figure 7: Spatial correlation between energy consumption
of nop instruction and address in instruction memory
Development of Energy Models for Design Space Exploration HIP3ES’18, January 2018, Manchester, United Kingdom
4 128 256 384 512 640 768 896 1024
Packet Size in byte
5
6
7
8
9
10
En
er
gy
 C
on
su
m
pt
io
n 
in
 n
J
CPU Synchronization
Data Memories
Unclassified
Bus
Router
NI
Zoom: 8x
Figure 8: Energy consumption in nJ of NoC transmissions
with varying packet size separated into components
of 10% for a simplified model. The model considers all instruction
types and three data pattern but not the instruction position in the
memory. However, the accuracy depends on how many effects are
taken into account when establishing the model.
7 CONCLUSION
This paper presents an approach to develop and use energy models
for a design space exploration of embedded many-core systems.
The contribution of our work is a framework to estimate the energy
consumption at an arbitrary abstraction level without the need
to provide further information about the system. Therefore, we
developed an automatic tool flow to fill the energy model with
energy information gathered from gate level simulations of basic
system states of different components of a many-core system. Once
the energy model is filled, it can be used on higher abstraction levels
to get fast and accurate energy estimations for specific applications
and hardware configurations of the many-core system.
To validate our framework, we used our, at design time, highly
configurable many-core system CoreVA-MPSoC as a target archi-
tecture. Compared to a gate level simulation of the CoreVA-MPSoC
in a 28 nm FD-SOI standard cell technology, the average estima-
tion error of our energy model framework is about 4 %. Compared
to other approaches, like Ortiz et al. [13], we are able to analyze
the energy consumption more fine-grained without the restriction
to measure an existing ASIC. Hence, our model contains internal
MPSoC components like instruction memory and VLIW slots as
well. However, Ortiz et al. achieve a more accurate total energy
estimation at instruction level. The design process of highly con-
figurable embedded systems like the CoreVA-MPSoC benefits from
such a sophisticated model. Applications may be specifically opti-
mized for tailored systems by modeling the different variants and
characteristics of system components. A trade-off can be found
based on comprehensive knowledge about application scenario
and hardware behavior. In future work, our MPSoC compiler will
use the presented energy model to automatically map streaming
applications to our CoreVA-MPSoC, optimized for energy.
ACKNOWLEDGMENT
The reported research was supported through project grants Kogni-
Home (German Federal Ministry of Education and Research (BMBF)
Grant No. 16SV7054K), the Cluster of Excellence Cognitive Inter-
action Technology "CITEC" (EXC 277) at Bielefeld University and
the BMBF Leading-Edge Cluster "Intelligent Technical Systems
OstWestfalenLippe" (it’s OWL).
REFERENCES
[1] Johannes Ax et al. 2015. In press. System-Level Analysis of Network Interfaces for
Hierarchical MPSoCs. In NoCArc. ACM. https://doi.org/10.1145/2835512.2835513
[2] Johannes Ax et al. 2017. Comparing synchronous, mesochronous and asynchro-
nous NoCs for GALS based MPSoC. In MCSoC ’17.
[3] D. Brooks, V. Tiwari, and M. Martonosi. 2000. Wattch: a framework for
architectural-level power analysis and optimizations. In Proceedings of 27th Inter-
national Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[4] Benoit Dupont de Dinechin et al. 2013. A clustered manycore processor
architecture for embedded and accelerated applications. In HPEC ’13. IEEE.
https://doi.org/10.1109/HPEC.2013.6670342
[5] Martin Flasskamp et al. 2016. Performance estimation of streaming applications
for hierarchical MPSoCs. In RAPIDO ’16, Vol. 18-January. ACM Press. https:
//doi.org/10.1145/2852339.2852342
[6] Marc Hesse, Michael Adams, Timm Hormann, and Ulrich Rückert. 2016. Towards
a comprehensive power consumption model for wireless sensor nodes. In BSN
’16. https://doi.org/10.1109/BSN.2016.7516293
[7] R. Jordans, R. Corvino, L. Józwiak, and H. Corporaal. 2013. An Efficient Method
for Energy Estimation of Application Specific Instruction-Set Processors. In
Euromicro DSD ’13. https://doi.org/10.1109/DSD.2013.120
[8] A. B. Kahng, Bin Li, L. S. Peh, and K. Samadi. 2009. ORION 2.0: A fast and accurate
NoC power and area model for early-stage design space exploration. In DATE
’09. https://doi.org/10.1109/DATE.2009.5090700
[9] Wayne Kelly et al. 2014. A Communication Model and Partitioning Algorithm
for Streaming Applications for an Embedded MPSoC. In SoC ’14. IEEE. https:
//doi.org/10.1109/ISSOC.2014.6972436
[10] Jingwen Leng et al. 2013. GPUWattch: Enabling EnergyOptimizations in GPGPUs.
In ISCA ’13. https://doi.org/10.1145/2485922.2485964
[11] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. 2009.
McPAT: An integrated power, area, and timing modeling framework for multicore
and manycore architectures. MICRO ’09. https://doi.org/10.1145/1669112.1669172
[12] Sven Lütkemeier et al. 2013. A 65 nm 32 b Subthreshold Processor With 9T
Multi-Vt SRAM and Adaptive Supply Voltage Control. IEEE Journal of Solid-State
Circuits 48, 1 (2013).
[13] Gabriel Ortiz, Lars Svensson, Erik Alveflo, and Per Larsson-Edefors. 2017. In-
struction level energy model for the Adapteva Epiphany multi-core processor. In
CF ’17. ACM Press. https://doi.org/10.1145/3075564.3078892
[14] James Pallister et al. 2013. Identifying Compiler Options to Minimise Energy
Consumption for Embedded Platforms. CoRR abs/1303.6485 (2013).
[15] P. Rosenfeld et al. 2011. DRAMSim2: A Cycle Accurate Memory System Simulator.
IEEE L-CA ’11 10, 1 (2011). https://doi.org/10.1109/L-CA.2011.4
[16] J.L. Rossello, C. de Benito, and J. Segura. 2005. A compact gate-level energy and
delay model of dynamic CMOS gates. IEEE Transactions on Circuits and Systems
II: Express Briefs 52. https://doi.org/10.1109/TCSII.2005.851992
[17] Gregor Sievers et al. 2015. Comparison of Shared and Private L1 Data Memories
for an Embedded MPSoC in 28nm FD-SOI. In MCSoC ’15. https://doi.org/10.1109/
MCSoC.2015.25
[18] Gregor Sievers et al. 2015. Evaluation of Interconnect Fabrics for an Embedded
MPSoC in 28nm FD-SOI. In ISCAS ’15. IEEE.
[19] Gregor Sievers et al. 2017. The CoreVA-MPSoC: A Multiprocessor Platform for
Software-Defined Radio. Springer International Publishing, Cham. https://doi.
org/10.1007/978-3-319-49679-5_3
[20] William Thies et al. 2002. StreamIt: A Language for Streaming Applications. In
CC ’02. Springer. https://doi.org/10.1007/3-540-45937-5_14
[21] Evangelos Vasilakis. 2015. An Instruction Level Energy Characterization of ARM
Processors. Technical Report March.
[22] S. J. E. Wilton and N. P. Jouppi. 1996. CACTI: an enhanced cache access and
cycle time model. IEEE Journal of Solid-State Circuits 31, 5 (May 1996). https:
//doi.org/10.1109/4.509850
[23] P.T. Wolkotte, G.J.M. Smit, N. Kavaldjiev, J.E. Becker, and Jürgen Becker. 2005.
Energy Model of Networks-on-Chip and a Bus. In SoC ’05. IEEE. https://doi.org/
10.1109/ISSOC.2005.1595650
