Low power memory allocation and mapping for area-constrained systems-on-chips by Strobel, Manuel et al.
EURASIP Journal on
Embedded Systems
Strobel et al. EURASIP Journal on Embedded Systems  (2017) 2017:2 
DOI 10.1186/s13639-016-0039-5
RESEARCH Open Access
Low power memory allocation and
mapping for area-constrained systems-on-chips
Manuel Strobel*, Marcus Eggenberger and Martin Radetzki
Abstract
Large fractions of today’s embedded systems’ power consumption can be attributed to the memory subsystem. In
order to reduce this fraction, we propose a mathematical model to optimize on-chip memory configurations for
minimal power. We exploit the power reduction effect of splitting memory into subunits with frequently accessed
addresses mapped to small memories. The definition of an integer linear programming model enables us to solve the
twofold problem of allocating an optimal set of memory instances with varying size on the one hand and finding an
optimal mapping of application segments to allocated memories on the other hand. Experimental results yield power
reductions of up to 82 % for instruction memory and 73 % for data memory. Area usage, at the same time, deteriorates
by only 2.1 % and 1.2 % on average, respectively, and even improves in some cases. Flexibility and performance of our
model make it a valuable tool for low power system-on-chip design, either for efficient design space exploration or as
part of a HW/SW codesign synthesis flow.
Keywords: Integer linear programming, ILP, Low power, On-chip memory, SRAM, System-on-chip, SoC
1 Introduction
The ubiquitous nature of embedded systems substanti-
ates the need for design and development methods that
yield the lowest possible power consumption. Fortunately,
such specialized systems often perform only known tasks,
which allows engineers to optimize specifically for those
tasks without compromise. Since up to 60 % of an embed-
ded system’s power consumption is attributed to memory
[1], optimizing the memory subsystems is an evident
design goal. A commonly used method to reduce mem-
ory power consumption is splitting memory into several
individual memory instances [1–6].
A break-down analysis of the energy consumption
caused by reading from on-chip static random-access
memory (SRAM) shows that less than 1 % is consumed by
the actual memory cells and about 90 % by components
such as precharge unit, sense amplifiers, and address tran-
sition detection [7]. The energy consumption of these
components is heavily affected by the overall size of the
SRAM instance being accessed. One can make use of this
aspect to reduce the total energy consumption by splitting
on-chip memory into multiple instances such that frequently
accessed segments of the address space reside in
separate small memory instances.
*Correspondence: manuel.strobel@informatik.uni-stuttgart.de
Embedded Systems Department, Institute of Computer Architecture and
Computer Engineering, University of Stuttgart, Pfaffenwaldring 5b, 70569
Stuttgart, Germany
However, splitting memory into multiple units requires
an interconnect, e.g., a bus or a custom fabric, that for-
wards read and write requests to the individual memory
instances. Obviously, this interconnect consumes on-chip
area and energy itself and may offset the benefits of the
split memories. Furthermore, using multiple small memo-
ries instead of a single large one implies an increase of area
requirements and thus can become prohibitive. To achieve
the highest power reduction possible, it is also necessary
to reorganize the logical address space such that the most
frequently accessed segments are grouped together and
can be mapped to the same physical memory instance.
For example, if the most frequently accessed memory
addresses are uniformly distributed over the application’s
address space, frequent and infrequent addresses will
inevitably be mapped to the same memory instances, voiding
any benefit of a split memory architecture. Altogether,
this makes the search for an optimal memory configura-
tion a non-trivial task.
In this work, we propose an integer linear programming
(ILP) model that solves the twofold problem of finding
the optimal allocation of memory instances as well as the
optimal mapping of address ranges to allocated memories.
The model is parameterized to allow user constraints such
as limiting the maximum number of memory instances or
the available area. The model can be used to optimize the
on-chip memory architecture for a single dedicated software
application, multiple but replaceable applications, or coexisting
applications in a multitasking environment.
© 2016 Strobel et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made.
The rest of this work is organized as follows. Section 2
discusses existing research in the field of split memories.
We declare our problem statement in Section 3 and set up
the design space, which is used for the elaboration of our
formal ILP model in Section 4. The integration with sys-
tem synthesis is outlined in Section 5. Section 6 discusses
evaluation results, and we conclude with Section 7.
2 Related work
Heuristics for optimized memory configurations have
been investigated for different design goals. Mai et al. [1]
enable manual algorithm execution by largely simplify-
ing the low power optimization problem. The authors of
[5, 8] target the combined optimization problem of mem-
ory and bus partitioning for multi-master, multi-memory
systems. Zhuge et al. [9] distribute variables between different
memory instances to increase digital signal processor
(DSP) performance. In contrast to these heuristics, we aim for an optimal
solution.
Optimal algorithms have been employed within a
confined problem scope or with reduced complexity. Such
algorithms are either limited to a fixed address layout
[2, 10]; only work with two memory instances [11];
coarsely quantize memory accesses over time [10]; or do
not consider leakage or deselect power [2]. Furthermore,
some allow only splitting into equally sized sub-banks
[10, 12] or only estimate the segmentation overhead
[2, 12, 13].
A closely related field of interest lies in the optimization
of scratchpad memories (SPM), which are an efficient
replacement for caches in embedded systems used in con-
junction with external memory [14]. To identify the opti-
mal selection of address ranges to be mapped to the SPM,
ILP models have been developed [4, 6] and dynamic pro-
gramming has been used [13]; all of which only consider a
single fixed size SPM and cannot be used to optimize the
SPM’s sub-organization.
Another common problem in the SPM domain is the
logical partitioning of address spaces to reduce the number
of SPM fills [15–17]. While related, we solve a different
problem: the partitioning of the physical on-chip memory
organization to minimize power consumption.
On-chip memory configurations have also been opti-
mized for application-specific purposes, e.g., for DSP [18]
or video processing applications [19, 20]. Our work differs
as we intend to provide a generic optimization methodol-
ogy independent of the targeted application specifics.
The main contribution of this work is a mathematical
model to identify a low power memory configuration for
a set of applications. The model differentiates between
read and write accesses and supports leakage and deselect
power. From an almost arbitrary list of memory types, the
model allocates the optimal number and types of memories
and yields an optimal mapping of address space ranges
to the selected memories to achieve the lowest possible
power consumption. User-defined constraints provide the
ability to efficiently explore the design space in the area
and power domain. Furthermore, an automated optimization flow
covering both HW and SW optimizations makes our model
a valuable tool for system synthesis.
3 Problem statement
The targeted hardware platforms are single-CPU systems-
on-chips (SoC) with support for multiple on-chip mem-
ories of varying structure, i.e., sub-banking, and size (cf.
Fig. 1).
We distinguish between the following three modes of
operation:
1. Single-App —The system is dedicated to the
execution of a single application.
2. Combined —Consideration of multiple
applications that are exchangeable, e.g., via
firmware update but executed by the system
exclusively (only one application resides in
on-chip memory at any point in time).
3. Multitasking —Multitasking with static or
dynamic scheduling.
We assume a set $M = [1, m]$ of different memory types
and a set of applications $A = [1, a]$. Each application $a$ is
characterized by a corresponding set of application profiles
$P_a = [1, p]$, which will be detailed later in this section.
Fig. 1 Exemplary single-CPU SoC with heterogeneous on-chip
memory structure
In case of Multitasking, each application is constantly
executed by one and the same task.
The problem statement is defined as follows: Find an
allocation α of memory instances and a mapping β that
assigns each application profile to exactly one memory
instance such that α and β yield the lowest power con-
sumption of all possible allocations and mappings while
satisfying area and user-defined constraints. For the sake
of readability and clarity, we focus on instruction memory.
However, the model can also be used for data memory
with small modifications, which we point out whenever
applicable. In order to emphasize whether memory is
read only or can be written to, we refer to instruction
memory as ROM and to data memory as RAM in the
following.
3.1 Design space
The basic elements of the design space are the individual
memory types $m_i$, whereas the actual design space is composed
of all possible combinations of one or more memory
types, with multiple selections of individual memory types
being possible. That is, it comprises all possible allocations $\alpha \in \mathbb{N}_0^m$
with $\alpha_i$ being the number of instances of memory type
$m_i$. A memory type refers to a specific instance of on-chip
SRAM according to a technology library. The individual
memory types can differ in various aspects such as size,
area, or sub-banking organization.
In our model, each memory type $m_i$ is defined by a set of
relevant physical parameters. For ROM, these parameters
are:
• The size in kilobytes
• The area in technology size units (e.g., mm²)
• The read current in μA/MHz, consumed when
accessing the ROM
• The deselect current in μA/MHz, consumed when
the memory is idle
• The leakage current in μA, which is permanently
consumed
For RAM, the write current must be accounted for as an additional
parameter, also given in μA/MHz. Note that read,
write, and deselect current are given with respect to the
operational clock frequency of the system, which we con-
sider to be fixed. The interconnect fabric is not directly
part of the design space. However, it contributes to over-
all power consumption depending on the total number of
allocated memory instances and is therefore considered in
our model as well.
This power model is in line with vendor-specific
datasheets (e.g., [21]), making it suitable for the explo-
ration of the design space, which we prune from any
solution that is infeasible or does not meet the designer’s
needs using three constraints. One constraint ensures that
enough memory is provided for the embedded software
application. The other two constraints allow limiting the
available area for the split memory organization and the
total number of memory instances.
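As an illustration of this design space, the following Python sketch represents a memory type by the parameters listed above and checks a candidate allocation against the three pruning constraints. All names (`MemoryType`, `is_candidate`, the field names) and all numeric values in the usage note are hypothetical. The capacity test is only a coarse total-size check, and interconnect area is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MemoryType:
    """One technology-library entry (field names are illustrative)."""
    name: str
    size_kib: float                   # capacity in kilobytes
    area_mm2: float                   # on-chip area
    i_read: float                     # read current, uA/MHz
    i_desel: float                    # deselect current, uA/MHz
    i_leak: float                     # leakage current, uA
    i_write: Optional[float] = None   # write current, uA/MHz (RAM only)


def is_candidate(alloc: Dict[MemoryType, int], required_kib: float,
                 area_max: float, mems_max: int) -> bool:
    """Check the three pruning constraints for an allocation {type: count}."""
    total_size = sum(t.size_kib * n for t, n in alloc.items())
    total_area = sum(t.area_mm2 * n for t, n in alloc.items())
    total_count = sum(alloc.values())
    return (total_size >= required_kib          # enough memory for the software
            and total_area <= area_max          # area limit (fabric area omitted)
            and 1 <= total_count <= mems_max)   # limit on memory instances
```

For example, with a hypothetical 4 KB type of 0.01 mm² and a 32 KB type of 0.06 mm², the allocation of one instance each passes for a 30 KB application under `area_max = 0.1` and `mems_max = 2`, but fails once `area_max` drops to 0.05.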
An actual memory configuration for a given set of
application profiles can be identified by a pair of power
consumption on the one hand and on-chip area require-
ment on the other hand. For each individual dimension,
power and area, a single minimal solution exists in the
design space, while both together consequently define the
range of the solution space. Within these borders, further
Pareto-optimal solutions can be obtained through
variation of the above-mentioned constraints.
3.2 Application profiles
Applications are described in terms of a set $P_a$ of application
profiles, each representing the possibly periodic behavior of a part
of the application. Our model
does not presume a certain granularity for application
profiles, which can thus be chosen according to individual
needs. For example, ranging from fine to coarse grained,
application profiles can represent individual instructions,
basic blocks, functions, or groups thereof.
The relevant characteristics of an exemplary applica-
tion profile are shown in Fig. 2a. Each profile assumes
a fixed period, which can be divided into two parts: an
active phase, the duty cycle, and an idle phase. While
memory can only be accessed during the duty cycle, it
is not necessarily accessed throughout the entire duty
cycle as indicated by the individual peaks in Fig. 2a. How-
ever, the power consumption does not depend on the
individual points in time when the memory is accessed
but only on the duration of those accesses. This allows
us to simplify the application profile by combining the
individual short memory accesses and modeling them
as a fraction of the duty cycle called access probability.
Figure 2b shows the resulting, simplified application pro-
file. For RAM, the access probability is replaced by a pair
of read and write probabilities, representing the fraction
of time for read and write operations during the duty cycle
(cf. Fig. 2c, d).
Note that while application profiles are designed to sup-
port the periodic nature often found in embedded appli-
cations, they can easily be used to describe non-periodic
behavior by setting the period to the total application
runtime.
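The simplification described in this section can be sketched in a few lines of Python. The function name `simplify_profile` and its argument names are hypothetical. The sketch assumes that the accumulated access durations have already been measured, e.g., in clock cycles.

```python
def simplify_profile(period, active, read_time, write_time=0.0):
    """Reduce a profile to (duty cycle, read probability, write probability).

    All arguments are durations in the same unit (e.g. clock cycles):
    period     : length of one period
    active     : length of the active phase within the period
    read_time  : accumulated duration of read accesses in the active phase
    write_time : accumulated duration of write accesses (RAM only)
    """
    if not 0 < active <= period:
        raise ValueError("active phase must be non-empty and fit into the period")
    if read_time + write_time > active:
        raise ValueError("accesses cannot exceed the active phase")
    duty_cycle = active / period
    # Probabilities are fractions of the duty cycle, not of the whole period.
    return duty_cycle, read_time / active, write_time / active


d, p_r, p_w = simplify_profile(period=1000, active=400, read_time=100)
```

Here the memory is read for 100 of the 400 active cycles, so the profile has a duty cycle of 0.4 and an access probability of 0.25; the product of both is the fraction of the whole period spent on accesses.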
Fig. 2 a Exemplary application profile with period and duty cycle (ROM). Blue bars represent read memory accesses. b Simplified application profile
using an access probability. c Exemplary application profile (RAM). Blue bars represent read, red bars write memory accesses. d Simplified application
profile using read and write probabilities
3.3 Schedule
In case of Single-App and Combined operation modes (cf.
Section 3), the overall system runtime is attributed to one
application only. In a Multitasking environment, however,
we assume one task per application and a schedule that
determines the share of total system runtime that each
application consumes. This aspect is modeled in a flexible way
by a set $S \in [0, 1]^{|A|}$ with each element $s_a$ representing
the share of total execution time of application $a$ and
$\sum_{a=1}^{|A|} s_a = 1$. In a static schedule, application runtimes
and the hypercycle suffice to determine this set. In case
of dynamic scheduling, reasonable values for $S$ can be
obtained from system simulation.
3.4 Power model
In this work, we focus on reducing the average power con-
sumption Pavg = Etot/T depending on the total energy
consumed Etot and the runtime T of the application(s).
In this section, we establish a component-wise definition
of the required energy Ep for a single application profile.
This model forms the basis of our ILP model, presented in
Section 4.
Due to the periodic nature of application profiles, the
total energy consumption of a single profile only depends
on the energy consumed in one period. For a given mem-
ory instance, energy consumption depends on whether it
is currently being accessed or not. When reading from
memory, standby current plus read current is consumed,
and otherwise, standby current plus deselect current is
consumed. A single period of a profile p accordingly
consumes the sum of read, deselect, and standby energy:
$$E_p = E_{\mathrm{read}} + E_{\mathrm{desel}} + E_{\mathrm{stdby}} \tag{1}$$
The three components $E_{\mathrm{read}}$, $E_{\mathrm{desel}}$, and $E_{\mathrm{stdby}}$ are
defined as follows:
$$E_{\mathrm{read}} = d \cdot p_r \cdot I_r(f) \cdot V \cdot t_p \tag{2}$$
$$E_{\mathrm{desel}} = (1 - d \cdot p_r) \cdot I_d(f) \cdot V \cdot t_p \tag{3}$$
$$E_{\mathrm{stdby}} = I_s \cdot V \cdot t_p \tag{4}$$
with duty cycle $d$, access probability $p_r$, and period $t_p$ of
the profile; $I_r$, $I_d$, and $I_s$ representing read, deselect, and
standby current as defined by the technology library; and
$V$ as the supply voltage of the memory.
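A worked instance of Eqs. 1 to 4 may help to see the orders of magnitude. The sketch below is a direct transcription into Python; the function name and all numeric values are hypothetical, and the currents are assumed to be already evaluated at the fixed clock frequency and given in amperes.

```python
def profile_energy(d, p_r, t_p, i_read, i_desel, i_stdby, v):
    """Energy in joules consumed by one period of a profile (Eqs. 1 to 4).

    d, p_r  : duty cycle and access probability (both in [0, 1])
    t_p     : period in seconds
    i_*     : currents in amperes at the operating frequency
    v       : supply voltage in volts
    """
    e_read = d * p_r * i_read * v * t_p             # Eq. 2
    e_desel = (1.0 - d * p_r) * i_desel * v * t_p   # Eq. 3
    e_stdby = i_stdby * v * t_p                     # Eq. 4
    return e_read + e_desel + e_stdby               # Eq. 1


# Hypothetical profile: 1 ms period, 50 % duty cycle, 20 % access probability.
e_p = profile_energy(d=0.5, p_r=0.2, t_p=1e-3,
                     i_read=2e-3, i_desel=1e-4, i_stdby=1e-5, v=1.0)
```

With these values, the memory is read for 10 % of the period, which yields 0.2 μJ of read energy, 0.09 μJ of deselect energy, and 0.01 μJ of standby energy, i.e., 0.3 μJ per period.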
To determine the overall power consumption of an
application a, all application profiles must be considered.
Note that we use uppercase indices for the combined ener-
gies and lowercase indices for the energy consumption of
individual profiles. According to Eq. 1, the total energy
consumption of an application is the sum of overall read,
deselect, and standby energy:
Etot = EREAD + EDESEL + ESTDBY (5)
The read energy of an application is defined as the sum
of all individual application profiles’ read energies (Eq. 6).
By substituting $E_{\mathrm{read},i}$ with Eq. 2, we find in Eq. 7 that the
individual profile periods $t_{p,i}$ can be eliminated to obtain
Eq. 8.
$$E_{\mathrm{READ}}(a) = \sum_{i=1}^{|P_a|} E_{\mathrm{read},i} \cdot \frac{T}{t_{p,i}} \tag{6}$$
$$= \sum_{i=1}^{|P_a|} d_i \cdot p_{r,i} \cdot I_r(f) \cdot V \cdot t_{p,i} \cdot \frac{T}{t_{p,i}} \tag{7}$$
$$= T \cdot \sum_{i=1}^{|P_a|} d_i \cdot p_{r,i} \cdot I_r(f) \cdot V \tag{8}$$
The combined deselect energy cannot be expressed in
an additive form as deselect current only applies when
no profile is accessing the memory (cf. Eq. 9). Again,
the individual profile periods tpi can be eliminated to get
Eq. 11.
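$$E_{\mathrm{DESEL}}(a) = \left(T - \sum_{i=1}^{|P_a|} d_i \cdot p_{r,i} \cdot t_{p,i} \cdot \frac{T}{t_{p,i}}\right) \cdot I_d(f) \cdot V \tag{9}$$
$$= \left(T - T \cdot \sum_{i=1}^{|P_a|} d_i \cdot p_{r,i}\right) \cdot I_d(f) \cdot V \tag{10}$$
$$= T \cdot \left(1 - \sum_{i=1}^{|P_a|} d_i \cdot p_{r,i}\right) \cdot I_d(f) \cdot V \tag{11}$$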
Standby energy is consumed independently of any appli-
cation profile (Eq. 12).
$$E_{\mathrm{STDBY}}(a) = T \cdot I_s \cdot V \tag{12}$$
Altogether, Eqs. 8, 11, and 12 prove that the total energy
consumption Etot does not depend on the individual pro-
file periods but only on the corresponding duty cycles and
access probabilities. With Pavg = Etot/T , one can further
eliminate the application runtime T , i.e.:
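$$P_{\mathrm{READ}}(a) = \sum_{i=1}^{|P_a|} d_i \cdot p_{r,i} \cdot I_r(f) \cdot V \tag{13}$$
$$P_{\mathrm{DESEL}}(a) = \left(1 - \sum_{i=1}^{|P_a|} d_i \cdot p_{r,i}\right) \cdot I_d(f) \cdot V \tag{14}$$
$$P_{\mathrm{STDBY}}(a) = I_s \cdot V \tag{15}$$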
For RAM, the average power increases by the write
power PWRITE and the formula for PDESEL must be
amended such that deselect current applies when no pro-
file is reading or writing to memory.
$$P_{\mathrm{WRITE}}(a) = \sum_{i=1}^{|P_a|} d_i \cdot p_{w,i} \cdot I_w(f) \cdot V \tag{16}$$
$$P_{\mathrm{DESEL}}(a) = \left(1 - \sum_{i=1}^{|P_a|} d_i \cdot (p_{r,i} + p_{w,i})\right) \cdot I_d(f) \cdot V \tag{17}$$
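The average power of Eqs. 13 to 17 can be evaluated without knowing any profile period, as the following Python sketch illustrates. The function name and the example values are hypothetical; currents are assumed to be evaluated at the fixed clock frequency and given in mA, so that the result is in mW.

```python
def average_power(profiles, i_read, i_desel, i_stdby, v, i_write=0.0):
    """Average power of one memory according to Eqs. 13 to 17.

    profiles : iterable of (d, p_r, p_w) tuples; use p_w = 0 for ROM
    i_*      : currents in mA at the operating frequency
    v        : supply voltage in V
    """
    profiles = list(profiles)
    read_share = sum(d * p_r for d, p_r, _ in profiles)
    write_share = sum(d * p_w for d, _, p_w in profiles)
    p_read = read_share * i_read * v                            # Eq. 13
    p_write = write_share * i_write * v                         # Eq. 16
    p_desel = (1.0 - read_share - write_share) * i_desel * v    # Eq. 14 / 17
    p_stdby = i_stdby * v                                       # Eq. 15
    return p_read + p_write + p_desel + p_stdby


# Two hypothetical RAM profiles as (duty cycle, read prob., write prob.).
p_avg = average_power([(0.5, 0.4, 0.0), (0.25, 0.4, 0.2)],
                      i_read=2.0, i_desel=0.1, i_stdby=0.01, v=1.0, i_write=1.5)
```

In this example, the memory is read 30 % and written 5 % of the time, giving 0.6 mW read, 0.075 mW write, 0.065 mW deselect, and 0.01 mW standby power, i.e., 0.75 mW in total. For ROM, leaving `i_write` at its default reduces the function to Eqs. 13 to 15.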
4 ILP model
For a complete model of the whole memory architecture,
not only the individual memory instances but also the
interconnect fabric must be accounted for (cf. Section 3.1).
As the fabric grows in complexity with the number of
connected memories, its power consumption and area
requirements grow. To capture this effect, we model both
power and area requirements of the interconnect fabric
as piecewise linear functions of the number of connected
memories. This allows using exact data points for desired
configurations and relying on linear interpolation otherwise.
Let $P_F : \mathbb{N}_0 \to \mathbb{R}$ and $A_F : \mathbb{N}_0 \to \mathbb{R}$ denote the
functions for the interconnect fabric’s power consumption
and area requirements, respectively.
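A piecewise linear fabric model of this kind can be sketched as follows. The helper `piecewise_linear` and the data points are hypothetical, and clamping outside the measured range is an assumption of this sketch rather than something the model prescribes.

```python
from bisect import bisect_right


def piecewise_linear(points):
    """Return f(n) that interpolates linearly between measured {n: value} points."""
    xs, ys = zip(*sorted(points.items()))

    def f(n):
        if n <= xs[0]:
            return ys[0]
        if n >= xs[-1]:
            return ys[-1]           # assumption: clamp outside the measured range
        k = bisect_right(xs, n)     # now xs[k - 1] <= n < xs[k]
        x0, x1, y0, y1 = xs[k - 1], xs[k], ys[k - 1], ys[k]
        return y0 + (y1 - y0) * (n - x0) / (x1 - x0)

    return f


# Hypothetical fabric power in mW, measured for 1, 2, and 4 connected memories.
fabric_power = piecewise_linear({1: 0.0, 2: 0.1, 4: 0.3})
```

Calling `fabric_power(2)` returns the measured data point, whereas `fabric_power(3)` is interpolated between the values for two and four memories.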
4.1 Memory allocation
One part of the ILP problem is finding an allocation
$\alpha \in \mathbb{N}_0^{|M|}$ with $M$ being the set of available memory types and
$\alpha_i$ representing the number of allocated instances of memory
type $m_i$. In the ILP model, $\alpha$ is constrained to be of
type non-negative integer: $\forall i \in [1, |M|] : \alpha_i \geq 0$.
Using the allocation $\alpha$ and a user-defined parameter
$\mathit{mems}_{\max}$, we limit the maximum number of memory
instances:
$$\sum_{i=1}^{|M|} \alpha_i \leq \mathit{mems}_{\max} \tag{18}$$
With $A_M \in \mathbb{R}^{|M|}$ representing the area requirements
of the individual memory types, parameter $\mathit{area}_{\max}$ constrains
the total area available for all memory instances
and the interconnect:
$$\sum_{i=1}^{|M|} \alpha_i \cdot A_{M,i} + A_F\left(\sum_{i=1}^{|M|} \alpha_i\right) \leq \mathit{area}_{\max} \tag{19}$$
4.2 Application mapping
For each application $a$, we represent the mapping as a
binary matrix $\beta^a \in \{0, 1\}^{|P_a| \times |M|}$, with the elements $\beta^a_{ij}$
indicating whether application profile $i$ of application $a$
has been mapped to memory type $j$ ($\beta^a_{ij} = 1$) or not
($\beta^a_{ij} = 0$). To ensure a correct solution, each application
profile must be bound to exactly one memory type:
$$\forall a \in [1, |A|], \forall i \in [1, |P_a|] : \sum_{j=1}^{|M|} \beta^a_{ij} = 1 \tag{20}$$
Furthermore, enough instances of a given memory type
must be provided to fit all application profiles mapped
to that memory type. Let $\sigma^{P_a} \in \mathbb{N}_0^{|P_a|}$ and $\sigma^M \in \mathbb{N}_0^{|M|}$ be
the vectors representing memory required by application
profiles and memory provided by memory types, respectively.
With this, we specify the memory requirements as
follows.
$$\forall a \in [1, |A|], \forall j \in [1, |M|] : \sum_{i=1}^{|P_a|} \beta^a_{ij} \cdot \sigma^{P_a}_i \leq \alpha_j \cdot \sigma^M_j \tag{21}$$
Note that Eq. 21 only ensures that the total allocated
memory is sufficient for each application individually and
not for all applications at the same time. This reflects the
Single-App and Combined operation mode (cf. Section 3)
with a single but interchangeable software application
such as a firmware. Thus, Eq. 21 allows each application
exclusive access to the whole memory.
For the consideration of Multitasking, enough mem-
ory has to be allocated in order to satisfy the require-
ments of all applications at the same time. Equation 21
consequently has to be altered by replacing the universal
quantifier $\forall a \in [1, |A|]$ with a summation $\sum_{a=1}^{|A|}$ to
support concurrent applications that share the memory.
Accordingly, we get Eq. 22 as follows:
$$\forall j \in [1, |M|] : \sum_{a=1}^{|A|} \sum_{i=1}^{|P_a|} \beta^a_{ij} \cdot \sigma^{P_a}_i \leq \alpha_j \cdot \sigma^M_j \tag{22}$$
It is worth mentioning that neither Eq. 20 nor Eq. 21/22
explicitly bind application profiles to memory instances
but only to memory types. This significantly reduces the
complexity of the ILP problem without sacrificing preci-
sion or correctness.
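The difference between Eq. 21 and Eq. 22 can be made concrete with a small feasibility check in Python. The function names and numbers are hypothetical. The mapping of an application is stored as a list of memory type indices, one per profile, which satisfies Eq. 20 by construction.

```python
def used_per_type(mapping, sizes, num_types):
    """Memory demand per type for one application; mapping[i] is the type of profile i."""
    used = [0] * num_types
    for size, j in zip(sizes, mapping):
        used[j] += size
    return used


def fits_exclusive(mappings, sizes, alloc, type_sizes):
    """Eq. 21: every application fits on its own (Single-App, Combined)."""
    m = len(type_sizes)
    return all(
        u <= alloc[j] * type_sizes[j]
        for mp, sz in zip(mappings, sizes)
        for j, u in enumerate(used_per_type(mp, sz, m))
    )


def fits_shared(mappings, sizes, alloc, type_sizes):
    """Eq. 22: all applications fit at the same time (Multitasking)."""
    m = len(type_sizes)
    total = [0] * m
    for mp, sz in zip(mappings, sizes):
        for j, u in enumerate(used_per_type(mp, sz, m)):
            total[j] += u
    return all(total[j] <= alloc[j] * type_sizes[j] for j in range(m))


# Two hypothetical applications with two profiles each; sizes in KB.
type_sizes = [4, 32]
mappings = [[0, 1], [0, 1]]
sizes = [[3, 20], [2, 10]]
```

With one instance per type, each application fits on its own (3 KB and 2 KB in the 4 KB memory), but both together need 5 KB of the small type, so the Multitasking constraint only holds after a second 4 KB instance is allocated.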
4.3 Optimization goal
From Section 3.4 and Eqs. 13 to 15, we already know that
the average read power consumption neither depends on
the application runtime nor on the period of the indi-
vidual profiles and that the same holds for the deselect
and standby power. Note that we assume the application
runtime T to be independent of the memory selection
and, thus, to be constant. We justify this assumption
by only splitting a single memory into multiple instances
of the same timing characteristics. Furthermore, the interconnect
we use barely affects the critical path of memory
accesses. Consequently, we do not have to consider common
periods of the individual profiles in our power model.
Based thereon, we can now derive the individual power
components for each memory j and application a as given
in Eqs. 23 to 25.
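$$P_{\mathrm{read},j}(a) = \sum_{i=1}^{|P_a|} \beta^a_{ij} \cdot d_i \cdot p_{r,i} \cdot I_{r,j}(f) \cdot V \tag{23}$$
$$P_{\mathrm{desel},j}(a) = \left(\alpha_j - \sum_{i=1}^{|P_a|} \beta^a_{ij} \cdot d_i \cdot p_{r,i}\right) \cdot I_{d,j}(f) \cdot V \tag{24}$$
$$P_{\mathrm{stdby},j} = \alpha_j \cdot I_{s,j} \cdot V \tag{25}$$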
Furthermore, combining Eqs. 23 to 25 and the intercon-
nect fabric’s power consumption PF(n) allows us to pos-
tulate the average power consumption Pavg for Single-App
and Combined mode to be minimized by the ILP solver
by choosing suitable variable assignments for allocation α
and mapping β .
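$$P_j(a) = P_{\mathrm{read},j}(a) + P_{\mathrm{desel},j}(a) + P_{\mathrm{stdby},j} \tag{26}$$
$$P_{\mathrm{avg}} = P_F\left(\sum_{i=1}^{|M|} \alpha_i\right) + \frac{1}{|A|} \sum_{a=1}^{|A|} \sum_{j=1}^{|M|} P_j(a) \tag{27}$$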
In the Single-App case with $|A| = 1$, as well as in Combined
operation mode, we assume all applications to be of
equal importance and thus average their individual power
contributions $\sum_{j=1}^{|M|} P_j(a)$ (cf. Eq. 27).
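To make the objective tangible, the following Python sketch minimizes Eq. 27 by exhaustive enumeration on a toy instance. This is not an ILP formulation and not the solution method used in this work; it only evaluates Eqs. 18, 21, and 23 to 27 for every candidate. All names and numbers are hypothetical, currents are in mA at the fixed clock frequency, and the area constraint (Eq. 19) is omitted.

```python
from itertools import product


def memory_power(alloc, mapping, profiles, mem_types, v):
    """Sum over all memory types j of Eq. 26 for one application, in mW."""
    total = 0.0
    for j, mem in enumerate(mem_types):
        busy = sum(d * p_r for (_, d, p_r), t in zip(profiles, mapping) if t == j)
        p_read = busy * mem["i_r"] * v                  # Eq. 23
        p_desel = (alloc[j] - busy) * mem["i_d"] * v    # Eq. 24
        p_stdby = alloc[j] * mem["i_s"] * v             # Eq. 25
        total += p_read + p_desel + p_stdby
    return total


def fits(alloc, mapping, profiles, mem_types):
    """Capacity constraint of Eq. 21 for one application."""
    used = [0] * len(mem_types)
    for (size, _, _), j in zip(profiles, mapping):
        used[j] += size
    return all(u <= a * m["size"] for u, a, m in zip(used, alloc, mem_types))


def brute_force(apps, mem_types, fabric_power, v, mems_max):
    """Return (P_avg, allocation, mappings) minimizing Eq. 27, or None."""
    best = None
    m = len(mem_types)
    for alloc in product(range(mems_max + 1), repeat=m):
        n = sum(alloc)
        if not 1 <= n <= mems_max:                      # Eq. 18
            continue
        total, mappings = fabric_power(n), []
        for profiles in apps:
            # Eq. 21 holds per application, so each mapping is chosen independently.
            cands = [(memory_power(alloc, mp, profiles, mem_types, v), mp)
                     for mp in product(range(m), repeat=len(profiles))
                     if fits(alloc, mp, profiles, mem_types)]
            if not cands:
                break
            power, mp = min(cands)
            total += power / len(apps)
            mappings.append(mp)
        else:
            if best is None or total < best[0]:
                best = (total, alloc, mappings)
    return best


# Toy instance: a small and a large memory type, one application with a
# frequently accessed ("hot") and a rarely accessed ("cold") profile.
mem_types = [{"size": 4, "i_r": 0.5, "i_d": 0.0, "i_s": 0.01},
             {"size": 32, "i_r": 2.0, "i_d": 0.0, "i_s": 0.05}]
apps = [[(2, 1.0, 0.8), (20, 1.0, 0.1)]]     # (size, duty cycle, access prob.)
best = brute_force(apps, mem_types, lambda n: 0.02 * (n - 1), v=1.0, mems_max=2)
```

For this instance, the unpartitioned solution (one large memory) consumes 1.85 mW, whereas the best split solution maps the hot profile to the small memory and the cold profile to the large one for 0.68 mW including the fabric. Exhaustive enumeration grows exponentially with the number of profiles, which is why an ILP solver is the appropriate tool for realistic problem sizes.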
In the case of Multitasking, Eq. 27 is replaced by Eq. 28
in order to prioritize the power contribution of each appli-
cation. To this end, we assign different weights according
to the scheduled execution times as specified by sa ∈ S for
each application (cf. Section 3.3).
$$P_{\mathrm{avg}} = P_F\left(\sum_{i=1}^{|M|} \alpha_i\right) + \sum_{a=1}^{|A|} \sum_{j=1}^{|M|} s_a \cdot P_j(a) \tag{28}$$
5 System synthesis
In this section, we briefly illustrate how the ILP model
can be incorporated into a low power system synthesis
flow. Since the model not only yields the optimal alloca-
tion of memories but also provides an optimal mapping
of profiles to memory instances, our process is ideal for
hardware/software codesign. As illustrated in Fig. 3, the
general synthesis flow can be divided into a sequence
of four steps: cross compilation of the application (a),
extraction of application profiles (b), ILP solving (c), and
optimization of hardware and software domains (d, e).
In the first step (a), the application’s source files are
cross compiled for the target architecture, and the result-
ing object files are linked to an executable binary. The
binary is then analyzed to set up the list of application
profiles. While application profiles can be modeled at dif-
ferent levels of granularity, a reasonable starting point is
creating application profiles on a basis of symbols, which
correspond to functions or objects. Objects not only represent
global variables but also stack and heap, for which
we only extract the initial sizes as their eventual sizes are
determined at runtime. For each symbol in the binary, the
name, starting address, and size are extracted.
Fig. 3 Synthesis flow for low power HW/SW codesign: a compilation of source files into object files and a binary executable, b ISS simulation, c ILP solver yielding memory allocation and object mapping, d HW optimization (memory and fabric synthesis, optimized on-chip memory), e SW optimization (linker script, optimized binary)
In the second step (b), an instruction set simulator (ISS)
is used to complete the application profiles. For func-
tions, the simulator records the number of instruction
fetches and time spent in the function. For all other sym-
bols, reads and writes are recorded. Furthermore, stack
and heap sizes are tracked to determine their maximum
extent.
The resulting profiles are then fed to the ILP solver
(c), which solves two independent ILP problems, one for
instruction memory and one for data memory. The result-
ing allocation is then used to instantiate the required
memory IP blocks and to synthesize the interconnect fab-
ric (d). Based on the optimal mapping, a linker script is
created to re-link the previously compiled object files with
an optimal address space layout (e).
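As a rough illustration of step (e), the Python sketch below emits a linker script fragment from a given mapping of functions to memory regions. The source does not specify the linker or the script format; GNU ld syntax is assumed here, as is the convention that every function resides in its own input section `.text.<name>` (e.g., when compiling with GCC's `-ffunction-sections`). Region names, addresses, and function names are hypothetical.

```python
def linker_script(memories, mapping):
    """Emit a GNU ld style script fragment.

    memories : {region name: (origin address, length in bytes)}
    mapping  : {region name: [function names placed in that region]}
    """
    lines = ["MEMORY", "{"]
    for name, (origin, length) in memories.items():
        lines.append(f"  {name} (rx) : ORIGIN = {origin:#010x}, LENGTH = {length:#x}")
    lines += ["}", "", "SECTIONS", "{"]
    for name, functions in mapping.items():
        # One output section per memory region, filled with the mapped functions.
        lines.append(f"  .text.{name} :")
        lines.append("  {")
        lines += [f"    *(.text.{fn})" for fn in functions]
        lines.append(f"  }} > {name}")
    lines.append("}")
    return "\n".join(lines)


script = linker_script(
    {"rom0": (0x0000, 0x1000), "rom1": (0x1000, 0x8000)},
    {"rom0": ["md5_transform"], "rom1": ["main", "init"]},
)
```

Since the ILP binds profiles to memory types rather than to instances, a real flow would first have to distribute the profiles of one type over its allocated instances before such a script can be generated.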
6 Evaluation
We evaluated the applicability of our approach using the
following four applications from the Embedded Micro-
processor Benchmark Consortium (EEMBC) MultiBench
benchmark suite [22]: IP reassembly, IP check, MD5, and
Huffman. IP reassembly reflects the work of a network
router when reassembling fragmented packets, IP check
performs IP header validation, MD5 performs checksum
calculation, and Huffman implements the decoding algo-
rithm commonly found in image and video compression
standards. In Single-App operation mode, each applica-
tion was evaluated individually, i.e., |A| = 1. In Combined
operation mode, we considered the optimization of all
four applications with |A| = 4. In case of Multitasking, the
above applications were considered as parts of a simple
network benchmark that processes incoming data from a
network with one task per application. IP check is per-
formed per incoming packet while IP reassembly, MD5,
and Huffman are only executed for each fully received
fragmented packet. As an exemplary case, we assumed a
mean packet fragmentation of 4 and, accordingly, the following
schedule: $S = [\frac{1}{7}, \frac{4}{7}, \frac{1}{7}, \frac{1}{7}]$.
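The schedule above can be reproduced with a few lines of Python. The sketch assumes, as a simplification that the text does not state explicitly, that every task execution takes the same time, so that the share of runtime equals the share of executions. The function name is hypothetical.

```python
from fractions import Fraction


def schedule_from_executions(executions):
    """Shares s_a proportional to the number of executions per application.

    Assumption: every execution takes the same time, so the share of
    runtime equals the share of executions.
    """
    total = sum(executions.values())
    return {app: Fraction(n, total) for app, n in executions.items()}


# One packet split into 4 fragments: IP check runs once per fragment,
# the other three tasks once per fully received packet.
shares = schedule_from_executions(
    {"IP reassembly": 1, "IP check": 4, "MD5": 1, "Huffman": 1}
)
```

Seven executions take place per fragmented packet, four of which belong to IP check, which gives the shares 1/7, 4/7, 1/7, and 1/7.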
To provide the reader with a consistent terminology, we
refer to the set of all applications and all operation modes,
as described above, whenever using the term evaluated
scenarios.
Profiling data, as basis for the application profiles Pa,
was extracted using the ppc405 ISS from the SoCLib
platform [23].
We used CACTI 6.5 [24] to generate a set of 79 different
memory types M ranging from 512 bytes to 16 Mbytes
in the 45 nm technology node. For each size, multiple
versions with different numbers of sub-banks were
created to allow the ILP solver to choose between low
active power and low standby power memories. Note
that the number of sub-banks is not an ILP variable
as the CACTI-generated memory data is not paramet-
ric. Instead, two memories differing in their sub-banking
organization correspond to two different memory types
available to the ILP solver. From the set of information that
CACTI provides per memory type, dynamic read/write
energy per access and leakage power were utilized to
derive the corresponding currents as required by our
model.
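The text does not spell out this conversion. One plausible way to perform it, shown here as a sketch under the assumption that one access occupies exactly one clock cycle, follows from E = I · V · t with t = 1/f, so that the current per MHz is the access energy divided by the supply voltage. The function name and the example values are hypothetical.

```python
def currents_from_energy(e_read_pj, e_write_pj, p_leak_uw, v):
    """Convert per-access energies and leakage power into model currents.

    e_read_pj, e_write_pj : dynamic energy per access in pJ
    p_leak_uw             : leakage power in uW
    v                     : supply voltage in V
    Returns (I_r, I_w) in uA/MHz and I_s in uA.
    Assumption: one access occupies exactly one clock cycle.
    """
    i_read = e_read_pj / v      # 1 pJ / 1 V corresponds to 1 uA/MHz
    i_write = e_write_pj / v
    i_stdby = p_leak_uw / v     # P = I * V
    return i_read, i_write, i_stdby


# Hypothetical memory: 3.0 pJ per read, 2.0 pJ per write, 0.5 uW leakage at 1.0 V.
i_r, i_w, i_s = currents_from_energy(3.0, 2.0, 0.5, v=1.0)
```

Multiplying the resulting per-MHz currents by the operating frequency yields absolute currents; at 100 MHz, 3.0 μA/MHz corresponds to 0.3 mA.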
The interconnect fabric was designed as a parameterized,
multiplexer-based VHDL model and has been synthesized
using the NanGate 45 nm Open Cell Library [25].
Power simulations were performed using actual memory
access traces for individually synthesized fabrics. While
the interconnect prolongs the critical path for memory
accesses, all investigated setups were still able to run at
clock frequencies of up to 800 MHz.
For our experiments, we assumed an arbitrary but rea-
sonable memory operation frequency of 100 MHz and
a system operation voltage of 1.0 V. Since all equations
in our model have linear character, these two parameters
can easily be modified to values as dictated by a system
design at hand. Together with this basic information, the
application profiles, data about the interconnect, and the
memory data from CACTI, all required input parameters
for the optimization model can be provided. To get an
idea about the impact of the central optimization model
parameters, i.e., read, write, and standby current, Table 1
provides exemplary values for a subset of single-banked
memories, as they are available to the ILP solver. Please
note that the deselect current Id is not listed in this table.
Since our interconnect model was designed in a way that
keeps the interfacing signals of the connected memory
components stable during non-access phases, we are able
to ignore the impact of Id. However, the deselect current is
partly present in vendor data sheets to keep track of signal
line toggling during idle periods of a memory component
and therefore supported by our model.
Table 1 Optimization model parameters for an exemplary set of
single-banked memories with 32 bit bus width
Size (bytes)   Ir (mA)     Iw (mA)     Is (mA)
512            0.309996    0.288318    0.000110234
4K             0.649919    0.597679    0.000932013
32K            2.27616     1.64435     0.00649379
256K           10.7093     2.63148     0.0453433
2M             46.6984     14.5283     0.368005
16M            111.438     45.8345     2.86068
6.1 Power optimality
For all evaluated scenarios, separate optimizations for
instruction memory and data memory were performed.
For both, the ILP model was solved with different
limits for the allowed number of partitions ranging
from 1 (unpartitioned) to 8. This allows investigating the
effect of split memory configurations of different sizes.
The resulting optimal power consumptions for each limit
are plotted in Fig. 4. For instruction memory, a split mem-
ory configuration of only two instances already causes a
drastic power reduction (80.5 % for IP reassembly, 67.6 %
on average). Increasing the number of partitions shows
slight improvements for up to four memories (82.9 % for
IP reassembly), but no evaluated scenario can benefit from
more than four instances. With a still very good power
reduction of 60.2 %, the MD5 benchmark benefits the least
from split memories.
Increasing the number of data memories reduces power
consumption more gradually, and all applications can benefit
from up to eight memory instances. The average
power reduction of data memory is 60.8 %, with a min-
imum of 37.8 % for Huffman decoding and a maximum
of 73.2 % for IP check. While still yielding good results,
splitting data memory was not as beneficial as splitting
instruction memory, which we attribute to the large heap
requirements of the applications. It is worth noting that
6 or 7 partitions is not favorable for the IP reassem-
bly benchmark, but allowing eight partitions eventually
reduces the power consumption by another 5 %. This
highlights that the optimal number of memory partitions
cannot be known in advance and should be a free variable
in the optimization process.
Table 2 shows a detailed power consumption breakdown
for instruction and data memory of two exemplary
applications comparing unpartitioned and optimal solu-
tions. For a given type of memory, characterized by Size
and number of sub-banks (Banks), Num represents the
number of instances allocated by the ILP solver, and Objs
is the number of functions or global variables mapped by
the ILP to that memory type. The average power con-
sumption P (mW) is given under consideration of the
mapped objects and their memory access patterns. Reads
and Writes state the relative amount of memory accesses
caused by the mapped objects.
Fig. 4 Power consumptions of instruction memory (top) and data memory (bottom) with varying limits for memory instances
[Plot: average power in mW over the maximum number of allowed memory instances (1 to 8); panels: Instruction Memory, Data Memory; legend: Memory Power, Interconnect Power; scenarios: IP reassembly, IP check, MD5, Huffman, Combined, Multitasking]
Table 2 Memory power consumption details for instruction
memory (IP reassembly) and data memory (MD5)
Instruction memory (IP reassembly)
Mems  Num  Size  Banks  P (mW)  Reads (%)  Funcs
1     1    64K   16     1.6708  100        241
4     –    –     –      0.2856  –          –
      1    512   1      0.2057  95.1       9
      1    2K    2      0.0108  3.1        9
      1    4K    2      0.0081  1.7        8
      1    32K   16     0.0022  0.1        215
      Interconnect      0.0588
Data memory (MD5)
Mems  Num  Size  Banks  P (mW)  Reads (%)  Writes (%)  Objs
1     1    2M    16     0.7350  100        100         37
8     –    –     –      0.2442  –          –           –
      1    2K    2      0.0081  33.4       30.9        33
      2    8K    8      0.0068  20.4       0           1
      5    256K  16     0.1412  46.2       69.1        3
      Interconnect      0.0882
For instruction memory, the IP reassembly benchmark
profits heavily from the split memory configuration as
it features a highly non-uniform memory access pattern.
Here, only 9 of 241 functions make up 95.1 % of all instruction
fetches. Consequently, the major power reduction
is achieved by moving these 9 functions into a separate,
small 512-byte memory. This clearly shows that separating
the most frequent functions into low read-power memo-
ries significantly reduces overall power consumption (in
this case, 82.9 % considering interconnect power).
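The 82.9 % figure can be re-derived from the P (mW) column of Table 2. The short Python check below is a sketch that only re-uses the tabulated values; the helper name reduction_percent is introduced here for illustration. The second value, for the MD5 data memory, follows from the same table in the same way.

```python
def reduction_percent(p_reference, p_optimized):
    """Relative power saving in percent versus the reference configuration."""
    return 100.0 * (1.0 - p_optimized / p_reference)


# Instruction memory (IP reassembly), values from Table 2 in mW
instr_parts = [0.2057, 0.0108, 0.0081, 0.0022]  # 512, 2K, 4K and 32K memories
instr_interconnect = 0.0588
instr_total = sum(instr_parts) + instr_interconnect  # listed as 0.2856 in Table 2
instr_saving = reduction_percent(1.6708, instr_total)

# Data memory (MD5), totals from Table 2 in mW
data_saving = reduction_percent(0.7350, 0.2442)

print(f"instruction memory (IP reassembly): {instr_saving:.1f} %")
print(f"data memory (MD5): {data_saving:.1f} %")
```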
For data memory, the causes for power reduction are
not as evident because a large number of reads and writes
address the very large heap (up to almost 2 Mbytes). Thus,
power reduction is mainly achieved by spreading the heap
across 5 equally-sized, smaller memories. However, 33
fairly small objects still account for 32 % of all reads and
writes and thus have been moved to a small, low power
memory.
Also shown in Fig. 4 are the results for the Combined
optimization of all applications. The savings for the aver-
age power consumption in this operation mode amount
to 79.9 % in the mean for instruction memory. This large
power reduction is attributed to the relatively inefficient
reference case memsmax = 1. Since the size of the single
memory is determined by the largest application, the
power consumption for applications with a small mem-
ory footprint increases. As a result, already using two
memory instances significantly reduces power consump-
tion as the most frequent functions no longer reside in
the large memory with a high read power consump-
tion. For data memory, the power consumption can be
reduced by 59.7 %, which is even better than the average
power reduction of 58.7 % when performing individual
optimizations (Single-App). Again, this is attributed to all
applications using a large heap dominating the memory
organization and thus power consumption.
The investigation of the computed power figures for the
Multitasking operation mode (cf. Fig. 4) reveals a max-
imum power consumption reduction of 81.5 % in case
of instruction memory. As for the Combined setup, an
increase of the memsmax constraint from 1 to 2 allowed
memory instances already results in an improvement of
over 70 %. Once more, this is due to a particularly ineffi-
cient reference case with a single unpartitioned memory
that has to fit the sum of all memory footprints of all applications
in case of Multitasking. This aspect also influences
the optimization of data memory where a significant over-
all improvement of 72.5 % is obtained. This value is close
to the best power reduction of all evaluated RAM scenar-
ios as achieved for the IP check benchmark in Single-App
operation mode with 73.2 %.
Altogether for the aspect of power consumption, an
average improvement of 73.7 % for instruction memory
and 61.2 % for data memory can be presented through
all evaluated scenarios. Compared to results from liter-
ature, these values constitute a solid improvement. The
authors of [1] provide average power savings of 17.8 %
for instruction and 47.8 % for data memory in compar-
ison with a single-banked memory configuration. Benini
et al. [2] consider SRAM uniformly, i.e., without clearly
distinguishing between code and data sections, and
state an average improvement of 41.7 % versus monolithic
memory. With savings between 44.7 and 64.8 %
[12], respectively, 59.2 and 73.9 % [10], there is also
related work that yields comparable or even slightly better
saving ratios than our approach. However, note that both
methods take memory components with retention mode
into account. Accordingly, idle memories are put into
sleep state with negligible leakage power consumption,
which is considered a beneficial factor. For that reason,
a direct comparison of these approaches with our work is
not possible.
6.2 Area requirement
In order to discuss the aspect of area requirement, Fig. 5
depicts the area footprints in mm² that correspond to
the optimal power configurations as given in Fig. 4. Inter-
estingly, we can observe that power consumption and
area are not correlated. The assumption of an increas-
ing area requirement with increasing number of utilized
memory instances, as mentioned in the introduction (cf.
Section 1), is consequently disproved. However, please
note that only memories of size 2^N are available to the
ILP solver. Depending on the evaluated scenario, we can
observe that the area requirement either improves over
the number of allowed memory instances (IP check, data
memory), deteriorates (Multitasking), or does not follow
any trend at all (Combined operation mode). Hence, area
should also be a free variable in the optimization process.
Fig. 5 Area requirement of instruction memory (top) and data memory (bottom) with varying limits for memory instances
[Plot: area consumption in mm² over the maximum number of allowed memory instances (1 to 8); panels: Instruction Memory, Data Memory; legend: Memory Area; scenarios: IP reassembly, IP check, MD5, Huffman, Combined, Multitasking]
The area impact of the interconnect fabric for up to
eight connected subscribers amounts in any case to less
than 0.002 mm² and is therefore considered as negligible
and not illustrated in Fig. 5.
Averaging all experiments in the range of 1 to 8 mem-
ories yields a deterioration of area consumption by 2.1 %
for instruction memory and 1.2 % in case of data mem-
ory. In comparison, the partitioning method of Mai et al.
[1] introduces an area overhead of 80.33 % for instruction
and 44.64 % in case of data memory. With respect to these
values and the accompanying considerable improvements
in power consumption (cf. Section 6.1), the presented
small increase in on-chip area emphasizes the strength
and relevance of our approach.
Altogether, finding the optimal power/area trade-off is
not trivial. Therefore, especially in power and area critical
designs, an efficient design space exploration is crucial, in
order to reduce effort and costs.
6.3 Pareto optimality
Even though our model is basically designed for power
optimization, we are also able to obtain area optimal
results from it. To this end, the power cost function (cf.
Eq. 27 or Eq. 28 in Section 4) is replaced by
an area cost function that is derived from Eq. 19. The
former power cost function is further incorporated as
user constraint. In this setup, another experimental series
was carried out together with the ILP solver minimizing
area requirement. With the above-presented results, we
get one solution with minimal power, and one solution
with minimal area requirement per evaluated scenario.
These two results span a solution range in the two-dimensional
power/area design space. All other obtained
configurations represent a trade-off in one or the other
dimension and therefore always reside between the bor-
ders of our solution range. Through variation of user
constraints, i.e., maximum number of allowed memo-
ries (memsmax), maximum power consumption (Pmax),
respectively, area budget (areamax), we are able to influ-
ence the ILP solver in order to obtain even more valid
implementations. Eventually, the exploration of the result-
ing design space allows the identification of pareto-
optimal solutions. All implementations that belong to this
solution subset represent a trade-off that is not dominated
by any other solution in the design space, i.e., a better
value for one criterion is automatically bound to a wors-
ening on the second criterion. The curve that connects all
pareto-optimal points is referred to as pareto front.
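The dominance test described above can be sketched in a few lines of Python. The (power, area) pairs below are made-up placeholder values, not measured results, and the function name pareto_front is introduced here for illustration; both criteria are minimized.

```python
def pareto_front(points):
    """Return the non-dominated (power, area) points, sorted by power.

    A point is dominated if another point is no worse in both criteria
    and strictly better in at least one (both criteria are minimized).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical (power in mW, area in mm^2) pairs, for illustration only
candidates = [(0.30, 0.50), (0.35, 0.42), (0.40, 0.45), (0.55, 0.40), (0.60, 0.41)]
front = pareto_front(candidates)
print(front)
```

In this toy data set, (0.40, 0.45) is dominated by (0.35, 0.42) and (0.60, 0.41) by (0.55, 0.40), so three points remain on the front. The quadratic pairwise comparison is sufficient for the small number of configurations that a series of ILP runs typically produces.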
As an example, design space and pareto front are illustrated
for the IP check instruction memory in Fig. 6. Single
marks represent all valid implementations as obtained
from the optimization process; however, only the solutions
on the pareto front are of actual relevance. The
highlighted points on this curve are especially interesting
as they identify local extrema and thus mark the most
reasonable solutions to choose from. For the example
illustrated in Fig. 6, two such points can be identified. One is
preferable in terms of area whereas the other is superior
in terms of power consumption. Further observation
reveals that, except for three solutions in the middle of
the depicted range, all other points are either identical or
close to the minimum on one criterion, which facilitates
the selection process significantly.
Fig. 6 Power/area design space exploration example for instruction
memory (IP check)
In the second example, as given in Fig. 7, the indi-
vidual memory configurations are more distributed and
not as close to the solution range borders as in the pre-
vious example. Exploration of this design space for IP
reassembly data memory nevertheless also reveals two
local extrema that identify the most reasonable configura-
tions on the pareto front.
From Sections 6.1 and 6.2, we already know that power
and area do not correlate. The presented combined con-
sideration of both criteria in a pareto investigation con-
sequently closes the gap of finding a reasonable trade-off
configuration.
6.4 ILP performance
To show that our model is fast enough to be used in
an optimized synthesis flow, we measured the execution
times of the ILP solver for the different problems. We
used the IBM CPLEX ILP solver on an Intel Xeon E5-2660 v2
system clocked at 2.2 GHz with 128 GB of RAM.
The individual execution times of the most complex cases
(memsmax = 8) are listed in Table 3.
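As a cross-check of the timing data, the following Python sketch averages the four single-application rows of Table 3. The dictionary layout and variable names are chosen here for illustration; the values are copied from the table.

```python
# Execution times in seconds from Table 3 (single applications only):
# benchmark -> (instruction memory, data memory)
SINGLE_APP_TIMES = {
    "IP reassembly": (47.47, 0.25),
    "IP check": (16.48, 2.37),
    "MD5": (9.24, 1.19),
    "Huffman": (13.55, 3.97),
}

n = len(SINGLE_APP_TIMES)
avg_instr = sum(t[0] for t in SINGLE_APP_TIMES.values()) / n
avg_data = sum(t[1] for t in SINGLE_APP_TIMES.values()) / n

print(f"instruction memory: {avg_instr:.3f} s, data memory: {avg_data:.3f} s")
```

The two averages, 21.685 s and 1.945 s, correspond to the rounded values of 21.7 s for the ROM models and 1.95 s for the RAM models quoted in this section.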
Fig. 7 Power/area design space exploration example for data memory
(IP reassembly)
Optimizing individual applications had average execu-
tion times of 1.95 s for RAM models and 21.7 s for ROM
models, which proves the efficiency of our approach. We
attribute the high performance to the fact that we do not
map application profiles to individual memory instances
but only to memory types and the ILP solver chooses the
ideal number of instances. The consideration of multi-
ple applications in the other operation modes results in
more ILP variables and thus in a slight increase of execu-
tion times, e.g., for the Multitasking operation mode to
6.71 s for the RAM model and 138.13 s in case of ROM.
Strikingly, even more time is required for the Combined
optimization processes. With 19.92 s for data memory,
we are still in a reasonable range; however, 97.36 min
for instruction memory appears to be disproportionately
large, compared to the other values. The main explanation
for this lengthy execution time is the depletion of the host
machine’s main memory. Even though 128 GB constitutes
a considerable amount of RAM resources, some optimiza-
tion problems go beyond this scope, which results in swap
operations to the file system. This severely slows down
the whole optimization process, which results in execution
times in the range of over an hour, as for the example
above. Nevertheless, as our main focus lies on
application-specific and highly optimized systems-on-chips, we
still consider those times a reasonable commitment in
exchange for a power-optimal memory configuration.
Table 3 ILP execution times for eight allowed memory instances
Benchmark      Instr. Mem  Data Mem
IP reassembly  47.47 s     0.25 s
IP check       16.48 s     2.37 s
MD5            9.24 s      1.19 s
Huffman        13.55 s     3.97 s
Combined       97.36 min   19.92 s
Multitasking   138.13 s    6.71 s
7 Conclusions
In this article, we have provided a mathematical model to
determine an optimal on-chip memory organization with
respect to low power. The model is highly flexible, allow-
ing applications to be modeled with different degrees of
precision, supports optimization for multiple applications
at the same time, and works with a large list of mem-
ory types. A particular advantage of our approach is given
by the ability to provide optimal allocation and mapping
at once. Hence, number and type of memory instances
as well as mapping of address space ranges to the allo-
cated set of memories can be obtained from one and the
same workflow. Achieved power savings of up to 82 % for
instruction memory and 73 % for data memory in a set
of industrial grade benchmarks prove the benefits of our
model. Its value is further emphasized through a particu-
larly small deterioration of area requirement that is bound
to these power savings. On average, only 2.1 % additional
area for instruction memory and 1.2 % for data memory
is required. Furthermore, we showed how our model
can be used for efficient design space exploration in the
power/area domain and, on top of that, how an incorpora-
tion into an automated synthesis flow makes it a valuable
tool in low power HW/SW codesign.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
We would like to thank Nouman Naim Hasan for his valuable input on
modeling the memory utilization of applications.
Received: 23 January 2016 Accepted: 20 June 2016
References
1. S Mai, C Zhang, Y Zhao, J Chao, Z Wang, in Proc. International Conference
on ASIC (ASICON). An application-specific memory partitioning method
for low power (IEEE, Guilin, China, 2007)
2. L Benini, A Macii, M Poncino, in Proc. International Symposium on Low
Power Electronics and Design (ISLPED). A recursive algorithm for low power
memory partitioning (IEEE, Rapallo, Italy, 2000)
3. S Krishnamoorthy, U Catalyurek, J Nieplocha, A Rountev, P Sadayappan, in
Proc. The International Conference for High Performance Computing,
Networking, Storage, and Analysis (SC). Hypergraph Partitioning for
Automatic Memory Hierarchy Management (IEEE, Tampa, FL, USA, 2006)
4. F Menichelli, M Olivieri, Static minimization of total energy consumption
in memory subsystem for scratchpad-based systems-on-chips. IEEE Trans.
Very Large Scale Integr. (VLSI) Syst. 17(2), 161–171 (2009)
5. S Pasricha, ND Dutt, A framework for cosynthesis of memory and
communication architectures for MPSoC. IEEE Trans. Comput.-Aided
Design Integr. Circuits Syst. (TCAD). 26(3), 408–420 (2007)
6. S Steinke, L Wehmeyer, B-S Lee, P Marwedel, in Proc. Design, Automation
and Test in Europe Conference and Exhibition (DATE). Assigning program
and data objects to scratchpad for energy reduction (IEEE, Paris, France,
2002), pp. 409–415
7. SL Coumeri, DE Thomas, in Proc. International Symposium on Low Power
Electronics and Design (ISLPED). Memory modeling for system synthesis
(IEEE, Monterey, CA, USA, 1998), pp. 179–184
8. S Srinivasan, F Angiolini, M Ruggiero, L Benini, V Narayanan, in Proc. SOC
Conference. Simultaneous memory and bus partitioning for SoC
architectures (IEEE, Herndon, VA, USA, 2005)
9. Q Zhuge, EH-M Sha, B Xiao, C Chantrapornchai, Efficient variable
partitioning and scheduling for DSP processors with multiple memory
modules. IEEE Trans. Signal Process. (SP). 52(4), 1090–1099 (2004)
10. M Loghi, O Golubeva, E Macii, M Poncino, Architectural leakage power
minimization of scratchpad memories by application-driven subbanking.
IEEE Trans. Comput. 59(7), 891–904 (2010)
11. T Liu, Y Zhao, CJ Xue, M Li, Power-aware variable partitioning for DSPs
with hybrid PRAM and DRAM main memory. IEEE Trans. Signal Process.
61(14), 3509–3520 (2013)
12. L Steinfeld, M Ritt, F Silveira, L Carro, in Proc. IFIP TC 10 Int’l Embedded
Systems Symposium (IESS). Low power processors require effective
memory partitioning (Springer, Paderborn, Germany, 2013)
13. F Angiolini, L Benini, A Caprara, An efficient profile-based algorithm for
scratchpad memory partitioning. IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst. (TCAD). 24(11), 1660–1676 (2005)
14. H Takase, H Tomiyama, G Zeng, H Takada, in Proc. Embedded Software and
Systems (ICESS). Energy efficiency of scratch-pad memory at 65 nm and
below: an empirical study (IEEE, Sichuan, 2008)
15. A Kannan, A Shrivastava, A Pabalkar, J-E Lee, in Proc. Asia and South Pacific
Design Automation Conference (ASP-DAC). A software solution for dynamic
stack management on scratch pad memory (IEEE, Yokohama, 2009)
16. CJ Seung, A Shrivastava, K Bai, Dynamic code mapping for limited local
memory systems. Proc. IEEE Int’l Conf. on Application-specific Systems
Architectures and Processors (ASAP) (2010)
17. A Shrivastava, A Kannan, J Lee, A Software-Only Solution to Use Scratch
Pads for Stack Data. IEEE Trans. Comput.-Aided Design Integr. Circuits
Syst. (TCAD). 28, 1719–1727 (2009)
18. F Balasa, CV Gingu, II Luican, H Zhu, in Proc. Embedded and Real-Time
Computing Systems and Applications (RTCSA). Design space exploration for
low-power memory systems in embedded signal processing applications
(IEEE, Taipei, 2013)
19. F Sampaio, M Shafique, B Zatt, S Bampi, J Henkel, in Proc. Design,
Automation and Test in Europe Conference and Exhibition (DATE). dSVM:
Energy-efficient distributed Scratchpad Video Memory Architecture for
the next-generation High Efficiency Video Coding (IEEE, Dresden,
Germany, 2014)
20. B Zatt, M Shafique, S Bampi, J Henkel, A low power memory architecture
with application-aware power management for motion & disparity
estimation in Multiview Video Coding. Proc. IEEE/ACM Int’l Conference on
Computer-Aided Design (ICCAD) (2011)
21. STMicroelectronics, M48Z35 256 Kbit (32 Kbit x 8) SRAM Datasheet, 2011.
http://www.st.com/web/en/resource/technical/document/datasheet/
CD00000550.pdf. Last visited on 01/11/2016
22. EEMBC, EEMBC Multibench 1.0 Multicore Benchmark Software. http://
www.eembc.org/benchmark/multi_sl.php. Last visited on 01/11/2016
23. SOCLIB Consortium, Projet SOCLIB: Plate-forme de modélisation et de
simulation de systèmes integrés sur puce (SOCLIB project: An integrated
system-on-chip modelling and simulation platform) Technical report,
CNRS, 2003. http://www.soclib.fr/
24. N Muralimanohar, R Balasubramonian, NP Jouppi, CACTI 6.0: A Tool to
Model Large Caches. HP Laboratories, HPL-2009-85 (2009). http://www.
hpl.hp.com/techreports/2009/HPL-2009-85.pdf
25. NanGate Inc, NanGate FreePDK45 Open Cell Library. http://www.nangate.
com/?page_id=2325. Last visited on 01/11/2016
