SARA: Self-Aware Resource Allocation for Heterogeneous MPSoCs by Song, Yang et al.
arXiv:1804.02068v1 [cs.DC] 5 Apr 2018
SARA: Self-Aware Resource Allocation
for Heterogeneous MPSoCs
Yang Song†, Olivier Alavoine‡, Bill Lin†
†Electrical and Computer Engineering Department, University of California at San Diego
‡Qualcomm Inc., San Diego, CA
y6song@ucsd.edu
ABSTRACT
In modern heterogeneous MPSoCs, the management of shared
memory resources is crucial in delivering end-to-end QoS.
Previous frameworks have either focused on singular QoS
targets or the allocation of partitionable resources among
CPU applications at relatively slow timescales. However,
heterogeneous MPSoCs typically require instant response
from the memory system where most resources cannot be
partitioned. Moreover, the health of different cores in a
heterogeneous MPSoC is often measured by diverse perfor-
mance objectives. In this work, we propose a Self-Aware Re-
source Allocation (SARA) framework for heterogeneous MP-
SoCs. Priority-based adaptation allows cores to use differ-
ent target performance and self-monitor their own intrinsic
health. In response, the system allocates non-partitionable
resources based on priorities. The proposed framework meets
a diverse range of QoS demands from heterogeneous cores.
1. INTRODUCTION
Modern heterogeneous MPSoCs [1, 2] have been widely
deployed in mobile devices thanks to their energy efficiency.
These MPSoCs typically integrate a diverse collection of
cores. Fig. 1 depicts an example of a heterogeneous MPSoC.
Besides general-purpose cores like the CPU for running ap-
plications, most heterogeneous cores are dedicated to certain
functions, such as the GPU, the DSP and the display. These
cores have diverse notions of Quality-of-Service (QoS). For
example, the GPU measures target real-time performance in
terms of frame rate; the DSP demands the memory latency
to remain below a certain limit; and the display requires
sufficient bandwidth to refresh frames at a constant rate.
To save cost and energy, heterogeneous cores commonly
share resources, among which, the sharing of the memory
system (including the on-chip network and the memory con-
troller) is the most challenging because memory performance
often has a direct and substantial impact on the system per-
formance. As data is being shared through memory, com-
peting memory requests from different cores interfere with
each other, and these memory interferences can cause the
memory system to fail in meeting the target performance of
some cores. Fig. 2 depicts a camcorder application, which
represents a typical use case in that it involves many cores at
the same time. With ineffective memory scheduling, a real-time
core (e.g., the display) may not achieve the target real-time
performance due to inadequate memory bandwidth.
Moreover, as latency-sensitive cores such as the DSP share
memory with other cores, they can be easily overwhelmed
by real-time cores consuming high bandwidth.
Figure 1: Heterogeneous system architecture example.
[Diagram: CPUs, DSP, GPU, display, JPEG, frame rotator, video codec,
modem, USB, WiFi and GPS blocks, grouped into media cores and system
cores, connected through the on-chip network to the memory controller
and DRAM.]
Figure 2: Simplified dataflow of a camcorder application.
Shared memory is represented by boxes in dashed lines and
cores by boxes in solid lines.
[Diagram: three flows (recording a video, taking a snapshot, previewing
the video) connect the sensor, camera, image processor, JPEG, rotator,
video codec, GPU and display through shared buffers (video, snapshot,
preview and reference-frame buffers, icons) to storage and the LCD panel.]
QoS-aware management for specific types of memory re-
sources has been well-studied by previous work [3, 4, 5, 6,
7]. In [3], a QoS-aware scheduling policy was proposed for
CPU-GPU systems. The concept of frame progress was in-
troduced for monitoring GPU performance. Although the
policy can be extended to include more media cores, it can-
not be applied to real-time cores whose target QoS can-
not be assessed in terms of frame rate. Moreover, holis-
tic memory management frameworks for CPU-centric ho-
mogeneous systems have also been explored recently [8, 9,
10]. This line of work typically constructs a management
model based on control theory to partition computing
and memory resources. These frameworks accept flexible
QoS targets, as clients are allowed to define their own target
performance. Nonetheless, such approaches operate
at a relatively slow timescale (e.g., on the order
of milliseconds) due to their computational complexity. In
comparison, real-time cores in heterogeneous MPSoCs often
demand much more instant response from the memory sys-
tem. Besides, communication between heterogeneous cores
is mainly conducted through shared memory as shown in
Fig. 2, because multimedia data is generally too large to fit
in caches. Therefore, DRAM plays a more crucial role in
heterogeneous systems. However, previous frameworks can-
not handle DRAM effectively because its bandwidth is not
partitionable. Specifically, available DRAM bandwidth re-
lies on the memory access pattern, as higher spatial locality
results in fewer redundant precharge operations and better
memory efficiency.
So far, there has not been a QoS-aware resource man-
agement model for heterogeneous MPSoCs which is capable
of allocating non-partitionable resources to fleeting QoS de-
mands. In this work, we propose the Self-Aware Resource
Allocation (SARA) framework as a solution. The contribu-
tions of our work can be summarized as follows.
• We propose a QoS-aware holistic resource management
framework for heterogeneous systems. The SARA model
accepts diverse notions of QoS and monitors performance
distributively with lightweight meters to guarantee end-
to-end QoS.
• We introduce priority-based self-adaptations for the man-
agement of non-partitionable resources, such as DRAM
and on-chip network, which constitute most of the shared
resources in heterogeneous MPSoCs.
• We evaluate the proposed framework using memory traffic
of next-generation MPSoCs and show that the proposed
SARA model delivers target performance to all cores. In
contrast, the performance of critical cores can fall below
10% of their targets without the SARA framework. Fur-
ther, memory system optimization is performed without
QoS degradations.
The rest of this paper is organized as follows: Section 2
briefly reviews related work. Section 3 describes the pro-
posed SARA framework. Experimental results and conclu-
sions follow in Sections 4 and 5.
2. RELATED WORK
Most previous work on QoS-aware resource management
in heterogeneous MPSoCs has focused on a single layer of
the memory system. In [3], a novel scheduling policy was
introduced to dynamically balance bandwidth between the
CPU and the GPU based on the frame progress of real-
time workloads. To achieve QoS-aware memory scheduling,
the staged memory scheduler [4] was presented as the first
QoS-aware scheduler for CPU-GPU systems. Further, the
single-tier virtual queuing memory controller [5] was pro-
posed to overcome the limitation of two-tier schedulers in
QoS-aware scheduling. Besides memory scheduling, QoS-
aware cache management [7] and on-chip network design [6]
have also been well-explored in recent years. Nonetheless,
these works cannot guarantee end-to-end QoS because they
only deal with certain parts of the memory system. For
example, the QoS provided in the memory controller could
be degraded by the interconnect if it does not apply the
same QoS policy. In addition, implementing a centralized
QoS monitor in the memory system can be prohibitive since
it needs to collect runtime information from all cores. More
limiting still, these works assume specific notions of QoS, which
are not applicable to modern heterogeneous MPSoCs where
the health of different cores is often evaluated by diverse
performance objectives.
METE [8] is a multi-level framework for end-to-end resource
management based on control theory. It utilizes
runtime information to predict application behaviors. Ap-
plication controllers calculate the amounts of resources re-
quired to achieve target application performance. A global
resource broker determines the final resource partitions for
applications. SEEC [9] is a self-aware computing framework
designed for a many-core processor. It follows the control
loop of observe-decide-act for resource allocations. The performance
of CPU applications is observed by the decision engine,
which decides resource partitions using available actions
defined by system designers. ARCC [10] is a self-computing
framework implemented in the Tessellation many-core OS.
It performs two-level scheduling: first the resource allocation
broker distributes global resources, and then user-level
scheduling policies are customized separately.
The aforementioned frameworks were intended for CPU-centric
multi-core systems. These frameworks are aimed at allocating
partitionable resources, such as CPU cores and cache
ways, to applications at the software/OS level. They are not
suitable for heterogeneous MPSoCs for the following reasons.
First, complicated control models may not be fast enough
for heterogeneous cores (e.g., these software/OS-level approaches
operate at millisecond timescales). For example,
the DSP sets a limit on memory latency at the nanosecond level,
but prior frameworks need more time to adapt through control-theory
computations in the OS. Second, prior work assumes that
all memory resources are partitionable. However, DRAM
bandwidth cannot be simply partitioned like cache ways. In
DRAM, data storage of a memory bank is organized into
rows and columns. To access a column, the row where this
column is located will be loaded into the row-buffer (i.e.
row activation operation) after the other rows are closed
(i.e. precharge operation) [11]. These row activation and
precharge operations cause time penalty without contribut-
ing to actual data transfer, which makes DRAM bandwidth
inconstant and unpredictable.
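To make the dependence on access pattern concrete, the following minimal sketch models a single bank with one open row. The function name, the 4-cycle burst and the simplified cost model are illustrative assumptions rather than part of the SARA design; the default precharge and activation costs simply reuse the tRP and tRCD values listed in Table 1.

```python
def bank_cycles(rows, t_rp=34, t_rcd=34, t_burst=4):
    """Return (total_cycles, row_hits) for a sequence of row accesses to one bank.

    Assumed simplified model: a hit on the open row costs only the data
    burst; a miss first precharges the open row (if any) and then
    activates the requested row before the burst.
    """
    open_row = None
    cycles = 0
    hits = 0
    for row in rows:
        if row == open_row:
            hits += 1
        else:
            if open_row is not None:
                cycles += t_rp   # precharge: close the currently open row
            cycles += t_rcd      # activation: load the requested row
            open_row = row
        cycles += t_burst        # actual data transfer
    return cycles, hits


sequential = [0, 0, 0, 0, 1, 1, 1, 1]   # high spatial locality
interleaved = [0, 1, 0, 1, 0, 1, 0, 1]  # same requests, poor locality

print(bank_cycles(sequential))
print(bank_cycles(interleaved))
```

Under these assumed timings the same eight requests take roughly four times longer when they alternate between rows, which is why a fixed share of DRAM bandwidth cannot be promised to a core in advance.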
3. SELF-AWARE RESOURCE
ALLOCATION FRAMEWORK
Figure 3: The proposed SARA framework for heterogeneous
MPSoCs. Each core self-monitors its performance
and self-adapts its priority, and each resource performs
priority-based allocation in a distributed manner.
[Diagram: in each core (A to D), a performance meter compares actual
performance with target performance to produce an NPI (distributed
monitoring); the NPI is mapped to a priority attached to DMA requests
(distributed priority-based adaptation); the on-chip network and the
memory controller serve requests with priorities on the way to DRAM
(distributed system response).]
Figure 4: Examples of priority-based adaptation in heterogeneous
cores, including the DSP, GPU and display.
[Panels: (a) DSP, priority level versus average latency relative to the
maximum latency; (b) GPU, priority level versus frame progress relative
to reference progresses at x1, x0.75 and x0.5; (c) display, priority
level versus ΔOccupancy.]
The proposed architecture of the SARA framework is shown
in Fig. 3. The resource management model consists of three
stages: distributed monitoring, priority-based runtime
adaptation and system response. In the rest of this section,
we go through the SARA framework stage by stage.
3.1 Distributed Self-Monitoring
In the first stage, each core self-monitors its own per-
formance. The distributed monitoring relieves the memory
system from the burden of monitoring heterogeneous cores
with various notions of QoS. Self-monitoring also provides
more accurate feedback on the end-to-end QoS compared
with centralized monitoring in the memory system. In addi-
tion, implementing lightweight performance meters is good
for scalability, because a new core can be added or modified
without updating the rest of the system.
Every core customizes its own internal performance me-
ter to measure its own performance or progression against
a given target, and the measurement gets normalized into
a fractional number called a Normalized Performance In-
dicator (NPI), which is used as an indicator of the core’s
intrinsic health. In the DSP, the performance meter mon-
itors the average latency of its transactions, while in the
display the meter counts the occupancy level in the read
buffer. The deviation from the target performance (e.g., la-
tency, occupancy level, etc) produces the NPI metric. In our
framework, each independent DMA (Direct Memory Access)
unit is equipped with a performance meter. Note that there
are usually multiple DMAs in a single core. For simplicity,
we only show one DMA per core in Fig. 3.
3.2 Distributed Priority-Based Adaptation
In the second stage, each core adapts the relative priority
of its transactions based on its NPI value. The NPI value
delivered by the performance meter is translated into a rela-
tive priority level which is attached to memory transactions
from the same DMA. The priority level will be evaluated
within on-chip network arbiters and the memory controller,
as the transaction travels along the way to DRAM. Priority-
based arbitrations allow the memory system to provide QoS
without specifying the heterogeneous QoS for all cores and
DMAs. As with performance meters, the formulation of
the NPI metric and the adaptations of priority can be im-
plemented differently from core to core, depending on the
local target performance. Fig. 4 shows three examples of
priority-based adaptation in different cores.
As for the DSP, the target performance is to have the
average memory latency lower than the maximum latency
limit. The average latency is measured and compared with
a pre-set limit to produce the NPI value (see Eqn. 1), which
remains above or equal to 1 when the target performance
is achieved. This NPI value is then translated to a relative
priority level (Fig. 4(a)). The priority level increases along
with average latency.
$$\mathrm{NPI}_{\mathrm{DSP}} = \frac{\text{maximum latency limit}}{\text{average latency}} \tag{1}$$
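A latency-based performance meter following Eqn. 1 might look like the sketch below. The class name, the sliding-window average and the behavior before any sample arrives are assumptions made for illustration; the text does not specify how the average latency is accumulated.

```python
from collections import deque


class LatencyMeter:
    """Hypothetical latency meter producing the NPI of Eqn. 1."""

    def __init__(self, max_latency, window=4):
        self.max_latency = max_latency
        # Assumption: the average is taken over the last `window` transactions.
        self.samples = deque(maxlen=window)

    def record(self, latency):
        self.samples.append(latency)

    def npi(self):
        if not self.samples:
            return float("inf")  # assumption: no traffic means no QoS pressure
        average = sum(self.samples) / len(self.samples)
        return self.max_latency / average


meter = LatencyMeter(max_latency=200)
for latency in (100, 100, 100, 100):
    meter.record(latency)
print(meter.npi())  # NPI >= 1: the latency target is met
```

An NPI below 1 means the average latency has exceeded the limit, which is the condition under which the priority level should rise.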
Similarly, cores requesting bandwidth produce NPI
metrics by computing the ratio between the average and the
target bandwidth. However, frame rate differs from band-
width, because frame size can be variable and thus a con-
stant frame rate can lead to variable bandwidth. Hence
frame progress [3, 5] is used instead to produce NPI metrics
for frame-rate-based cores. Taking the GPU as an example,
the target is to let the frame progress reach 100% as the cur-
rent frame period comes to an end. The GPU’s NPI value is
produced at any time by comparing the frame progress with
reference progresses which grow proportionally with frame
time. The NPI value is then translated to a relative prior-
ity level of GPU transactions. Fig. 4(b) shows the reference
progresses achieving 1, 0.75 and 0.5 times the average data
rate of target performance.
$$\mathrm{NPI}_{\mathrm{GPU}} = \frac{\text{frame progress}}{\text{reference progress}} \tag{2}$$
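The frame-progress comparison of Eqn. 2 can be sketched as follows. Measuring progress in bytes transferred, the function name and the handling of the very start of a frame are assumptions for illustration; the reference progress is taken to grow linearly so that it reaches 100% at the end of the frame period.

```python
def gpu_npi(bytes_done, frame_bytes, elapsed, frame_period):
    """Eqn. 2 with a linear reference progress (hypothetical helper)."""
    frame_progress = bytes_done / frame_bytes
    reference_progress = elapsed / frame_period
    if reference_progress == 0:
        return float("inf")  # assumption: no deadline pressure at frame start
    return frame_progress / reference_progress


# A quarter of the frame period has elapsed and half the frame is done.
print(gpu_npi(bytes_done=50, frame_bytes=100, elapsed=10, frame_period=40))
```

A value above 1 indicates that the core is ahead of the reference and can tolerate a lower priority; a value below 1 indicates it is falling behind.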
In the display, the LCD panel reads data from a read buffer
at a constant frame rate, while the display controller DMA
tries to refill this buffer from DRAM so that it never becomes empty.
Its health (see Eqn. 3) relies on keeping the refill rate
($R_{\mathrm{refill}}$) no lower than the read data rate ($R_{\mathrm{read}}$), and
can be indicated by the variation of the buffer occupancy level
($\Delta\mathrm{occupancy}$). Compared with an initial level (e.g., 50%),
the lower the occupancy level of this buffer gets, the worse
the NPI value becomes, which is in turn translated to a
higher priority level (Fig. 4(c)).
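$$\mathrm{NPI}_{\mathrm{display}} = \frac{R_{\mathrm{refill}}}{R_{\mathrm{read}}} = 1 + \frac{\Delta\mathrm{occupancy}}{R_{\mathrm{read}} \cdot \mathrm{time}} \tag{3}$$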
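The second equality in Eqn. 3 follows because the occupancy change over an interval equals the refill rate minus the read rate, multiplied by the interval length. The short sketch below checks this numerically; the function name and the example rates are hypothetical, and all quantities are assumed to use consistent units.

```python
def display_npi(delta_occupancy, read_rate, elapsed):
    """Right-hand side of Eqn. 3: 1 + delta_occupancy / (read_rate * elapsed)."""
    return 1.0 + delta_occupancy / (read_rate * elapsed)


# Hypothetical rates in bytes per microsecond over a 10 us interval.
refill_rate, read_rate, elapsed = 300.0, 400.0, 10.0
delta = (refill_rate - read_rate) * elapsed   # the buffer drains by 1000 bytes

print(display_npi(delta, read_rate, elapsed))
print(refill_rate / read_rate)
```

The practical benefit of the occupancy form is that the meter only needs to watch the buffer level, not measure the refill rate directly.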
Intuitively, one might be concerned that every core would
intentionally raise the priority to the maximum level to obtain
as many resources as possible. However, this situation
should not happen because the priority level is only max-
imized when the actual performance is far below the tar-
get. The system designer has the responsibility to make
sure cores have realistic performance targets and enough re-
sources to satisfy all possible combinations of QoS demands.
Once the system is fabricated in hardware, heterogeneous
cores cannot change their target performance arbitrarily, es-
pecially because most of them are fixed-function IP blocks
with invariable QoS targets and little programmability.
In our evaluations, the priority levels are quantized into
$2^k$ levels, which can be encoded using k bits. We found that
k = 3 bits provides sufficient granularity in priority levels
to produce satisfactory results (i.e., the priority levels range
from 0 to 7).
3.3 Distributed System Response
As transactions travel through the memory system, the
system responds to QoS demands by providing resource man-
agement based on their priority levels. The priority-based
management is performed correspondingly in different parts
of the memory system. In on-chip network routers, trans-
actions with higher priorities are preferentially selected dur-
ing switch allocation. In the memory controller, when a
priority-based scheduler arbitrates among transactions going
to available memory banks, the ones with higher priorities
have more chances to be served. An example of such memory
scheduling policies is the priority-based round-robin shown
in Policy 1. To avoid starvation of transactions with low pri-
orities, the scheduler also needs to consider the aging factor
during arbitration. In our evaluations, the scheduler peri-
odically clears the backlog of transactions that have waited
for at least T cycles (e.g., T = 10000 cycles).
• Policy 1: Suppose $P_A$ and $P_B$ are the priorities of transactions
A and B. If $P_A > P_B$, choose A; if $P_A < P_B$, choose B;
otherwise choose between A and B in a round-robin manner.
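Policy 1 together with the aging rule can be sketched as below. The `Txn` record, the function name and the exact aging behavior (aged transactions are served first, oldest first, on every pick) are simplifying assumptions; the evaluated scheduler is described only as periodically clearing the backlog of transactions that have waited at least T cycles. The list of pending transactions is assumed to be non-empty.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    source: int    # index of the requesting queue or core
    priority: int  # 0 (lowest) to 7 (highest)
    arrival: int   # cycle at which the transaction was enqueued


def pick_policy1(pending, last_source, num_sources, now, aging_limit=10000):
    """Pick the next transaction: aging first, then priority, then round-robin."""
    aged = [t for t in pending if now - t.arrival >= aging_limit]
    if aged:
        return min(aged, key=lambda t: t.arrival)
    top = max(t.priority for t in pending)
    candidates = [t for t in pending if t.priority == top]
    # Round-robin tiebreak: prefer the first source after the one served last.
    return min(candidates, key=lambda t: (t.source - last_source - 1) % num_sources)


pending = [Txn(0, 3, 100), Txn(1, 5, 120), Txn(2, 5, 110)]
print(pick_policy1(pending, last_source=1, num_sources=3, now=200).source)
```

Keeping the tiebreak state as a single "last served" index keeps the arbiter cheap, in line with the aim of responding without heavy computation.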
Priorities notify the system whether the cores have urgent
QoS demands. That gives the memory system an opportunity
to optimize memory performance without undermining
the QoS. Specifically, when transactions are in low urgency,
the system can improve memory performance such as row-
buffer hit rate, instead of focusing on serving QoS demands.
Row-buffer hits refer to the number of memory accesses
to the same active row-buffer before precharge. More row-
buffer hits means less time and power are wasted on row
activation and precharge operations. Thus increasing row-
buffer hits helps lower memory latency and improve DRAM
total bandwidth.
To increase row-buffer hits, the memory controller reorders
transactions to favor the ones hitting open rows. This reordering
may degrade the QoS when transactions in
high urgency are postponed due to row-buffer hits optimization.
Yet, with priorities, the memory controller is aware of
the urgency levels of transactions and able to avoid delay-
ing urgent transactions during optimization. Policy 2 shows
an extension of Policy 1 to increase row-buffer hits with-
out QoS degradations. The parameter δ is an adjustable
threshold to balance row-buffer hits optimization and QoS-
aware scheduling. When the priority level is lower than δ,
the scheduler focuses on row-buffer hits, otherwise the QoS
comes first. A higher δ value gives more favor to DRAM
bandwidth, but also potentially causes more disturbance to
the QoS. We found δ = 6 a good setting to achieve high
DRAM bandwidth without causing QoS degradations.
• Policy 2: Suppose transaction A is going to an active
row-buffer and B is not. If $P_A, P_B < \delta$ or $P_A = P_B$,
choose A. Otherwise, perform priority-based round-robin.
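The pairwise decision in Policy 2 can be written as a small function. The function name and the string return values are hypothetical; the default threshold uses the δ = 6 setting reported above.

```python
def choose_policy2(p_a, p_b, delta=6):
    """Choose between A (row-buffer hit) and B (row-buffer miss).

    Returns "A" or "B". Equal priorities always favor the row hit, so the
    fallback to Policy 1 never needs its round-robin tiebreak here.
    """
    if (p_a < delta and p_b < delta) or p_a == p_b:
        return "A"                      # both non-urgent, or tied: favor the open row
    return "A" if p_a > p_b else "B"    # otherwise the higher priority wins


print(choose_policy2(2, 5))  # both below delta: the row hit is served
print(choose_policy2(2, 6))  # B is urgent: QoS comes first
```

Raising `delta` widens the range of priorities treated as non-urgent, which trades QoS protection for more row-buffer hits.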
The priority-based resource allocation is able to handle
non-partitionable resources with little computation in comparison with
previous management models [8, 9, 10]. This facilitates instant
response from the memory system to QoS demands.
3.4 Hardware Implementation
The implementation of the proposed SARA framework in-
cludes three parts: the computation of NPI value, the trans-
lation of NPI value to a priority level, and the priority-based
arbitration in the memory system.
To calculate the NPI, a divider is needed at the perfor-
mance meter for each DMA. For the translation of the NPI,
a mapping function can be stored in a look-up table at each
core. Each priority level is assigned a table entry, and
this entry stores the lowest NPI value allowed at that pri-
ority level. For example, if priority = p when NPI ∈ [u, v),
the value u will be stored at the entry for p on the look-up
table. Note that v will be the lower bound of the NPI for
the priority level p − 1. Comparators are needed to access
table entries in parallel. If the current NPI value is not lower
than the stored lower bound of NPI value, the corresponding
priority level will be asserted. When multiple priority levels
are asserted, the lowest level will be adopted.
Table 1: Simulation settings.
Test Cases
  Case A                     all cores active, with DRAM @ 1866MHz
  Case B                     inactive cores: GPS, camera, rotator and JPEG,
                             with DRAM @ 1700MHz
Memory Controller
  Total entries              42
  Transaction queues         5
DRAM
  Volume                     2GB
  Max I/O bus freq.          1866MHz
  CL-tRCD-tRP (cycles)       36-34-34
  tWTR-tRTP-tWR (cycles)     19-14-34
  tRRD-tFAW (cycles)         19-75
  Channels-Ranks-Banks       2-2-8
Table 2: Summary of heterogeneous cores and types of
target performance.
  Name               Performance type     Name       Performance type
  GPU                frame rate           Display    buffer occupancy
  DSP                latency              GPS        processing time
  Image Processor    frame rate           WiFi       bandwidth
  Video Codec        frame rate           USB        bandwidth
  Rotator            frame rate           Modem      processing time
  JPEG               frame rate           Audio      latency
  Camera             buffer occupancy
Suppose each priority level is encoded in three bits. A look-up
table then requires $2^3 = 8$ entries, and each entry is
a register for the NPI value. A comparator is paired with
each table entry. In total, the implementation only costs the
storage of eight registers and eight comparators per core.
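The look-up table translation described above can be mimicked in software as follows. The threshold values are purely hypothetical (the paper does not list them), and the list comprehension stands in for the eight parallel comparators; entry p holds the lowest NPI allowed at priority level p, and the lowest asserted level is adopted.

```python
# Hypothetical lower bounds: index p stores the lowest NPI allowed at priority p.
HYPOTHETICAL_BOUNDS = [2.0, 1.5, 1.25, 1.0, 0.875, 0.75, 0.5, 0.0]


def npi_to_priority(npi, lower_bounds=HYPOTHETICAL_BOUNDS):
    """Translate an NPI value into a priority level via the look-up table."""
    # Each "comparator" asserts its level when the NPI is not below the bound.
    asserted = [p for p, bound in enumerate(lower_bounds) if npi >= bound]
    if not asserted:
        return len(lower_bounds) - 1  # assumption: fall back to the top priority
    return min(asserted)              # the lowest asserted level is adopted


for value in (3.0, 1.0, 0.6):
    print(value, npi_to_priority(value))
```

Because the bounds are stored per core, each core can shape its own mapping from health to urgency without any change to the arbiters.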
In the memory system, performing the priority-based ar-
bitration requires a 3-bit comparator to arbitrate among
transactions with different priority levels. Since most ex-
isting QoS-aware schedulers already provide hardware sup-
port for priorities, our framework can be integrated into the
memory system without raising complexity.
4. EVALUATION
In this section, the proposed SARA framework will be
tested to demonstrate its effectiveness in providing target
performance to heterogeneous cores. Two test cases based
on the camcorder dataflow (Fig. 2) will be used for demonstration.
Further, we will show that row-buffer hits optimization
can be performed efficiently within the SARA framework without
performance degradations.
The proposed framework is modeled as in Fig. 3, where
memory traffic from every DMA is generated based on a
next-generation MPSoC [1]. DRAMSim2 [12] with LPDDR4
timing model is used for cycle-accurate simulation of DRAM.
Table 1 shows the simulation settings. Table 2 lists the sim-
ulated cores and the types of target performance.
The target performance for each core is set according to
the camcorder dataflow (Fig. 2) which runs at 30fps. For
instance, the frame rotator writes and reads 1080p YUV420
images at 30fps, which requires 89MB/s for each DMA and
178MB/s in total.
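The rotator figures can be reproduced with a short calculation. The sketch assumes 1080p means 1920x1080 pixels, 8-bit YUV420 samples (1.5 bytes per pixel), and that MB denotes 2^20 bytes; these assumptions are not stated in the text but yield the quoted numbers. The function name is hypothetical.

```python
def yuv420_stream_rate(width, height, fps):
    """Bytes per second for a stream of 8-bit YUV420 frames (1.5 bytes/pixel)."""
    frame_bytes = width * height * 3 // 2
    return frame_bytes * fps


rate = yuv420_stream_rate(1920, 1080, 30)  # one DMA (read or write)
print(round(rate / 2**20))                 # per-DMA bandwidth in MB/s
print(round(2 * rate / 2**20))             # read plus write DMA in MB/s
```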
4.1 Delivering Target Performance
To begin with, we test the SARA framework in delivering
target performance to heterogeneous cores. For comparison,
four arbitration policies are used in the memory controller
and on-chip network arbiters, including first-come-first-serve
(FCFS), round-robin (RR), a frame-rate-based QoS policy
[3] and the priority-based QoS policy (Policy 1). The FCFS policy
serves all the transactions according to the arrival order.
The round-robin policy separates transactions into different
queues and serves them in a round-robin fashion. In
the memory controller, we have five transaction queues
respectively designated to the CPU, the GPU, the DSP, media
cores and system cores. The round-robin policy also applies
to on-chip network arbiters, as input queues are served
in turn. The frame-rate-based QoS policy prioritizes media
cores when they are missing real-time deadlines, but
otherwise, the policy provides best-effort service to latency-sensitive
cores. Furthermore, the priority-based QoS policy
compares priority levels for arbitration and uses round-robin
as the tiebreaker.
Figure 5: NPI value of critical cores during one frame period (33ms) for test case A with different arbitration policies.
[Panels: (a) FCFS policy, (b) round-robin policy, (c) frame-rate-based QoS policy [3], (d) priority-based QoS policy. Each panel plots NPI (axis marks at 0.1, 1 and 10) against time (0 to 30 ms) for the image processor, rotator, video codec, display, camera, USB, GPS and WiFi.]
Figure 6: NPI value of critical cores during one frame period (33ms) for test case B with different arbitration policies.
[Panels: (a) FCFS policy, (b) round-robin policy, (c) frame-rate-based QoS policy [3], (d) priority-based QoS policy. Each panel plots NPI (axis marks at 0.1, 1 and 10) against time (0 to 30 ms) for the image processor, video codec, display, USB, DSP and WiFi.]
The NPI values of critical cores during a frame period are shown
in Fig. 5 when test case A is applied. As explained in Section 3.2,
the NPI metric reflects performance: a higher value
indicates better performance. When the NPI value drops below
1, the target performance is not achieved.
Without reordering memory requests, FCFS policy ends
up spending most of the time serving cores consuming high
bandwidth. That easily leads to the starvation of latency-
sensitive cores. As shown in Fig. 5(a), the NPI of the GPS
drops below 1 because the GPS is overwhelmed by other sys-
tem cores sharing the same interconnect, such as the USB.
For media cores, the video codec, the rotator and the image
processor have all the frame data available at the beginning
of a frame period and thus create bursty traffic, whereas
the camera and the display generate and consume data at
constant rates determined by the image sensor and the
LCD panel. In Fig. 5(a), media cores with bursty traffic
obtain most of the bandwidth in the beginning, resulting
in high NPI value. On the other hand, the display fails to
achieve the target performance. The display’s NPI drops as
low as 0.13 which means only 13% of the target performance
is achieved.
When round-robin policy is applied, the competition among
media cores becomes more intense since they share the same
transaction queue in the memory controller. In Fig. 5(b),
the display and the camera both fail due to the interference
from other media cores. Less than 10% of their target per-
formance is achieved in the worst case. In the meantime, all
the system cores meet their target performance because they
avoid the interference from media cores by using a separate
transaction queue.
The frame-rate-based QoS policy helps all media cores
achieve NPI value above 1 in Fig. 5(c). However, all system
cores fail due to the absence of adaptations for the cores
with different QoS targets other than frame rates.
In Fig. 5(d), all the cores reach their target performance
when QoS-aware scheduling is performed, because priority-based
adaptations help arbiters serve the cores with urgent
needs. Note that the NPI values of the other cores, such as the
GPU, are not shown because no failure is observed from these
cores.
The results for test case B are shown in Fig. 6. Similar to
Fig. 5, the latency-sensitive DSP suffers when FCFS policy
is adopted (Fig. 6(a)). When round-robin policy is applied
(Fig. 6(b)), the DSP suffers less since it has its own trans-
action queue, while the display fails due to the increased
interference from other media cores sharing the same trans-
action queue. Again, the frame-rate-based QoS policy fails
to serve non-media cores. Finally, the dynamic priorities
help the memory system deliver target performance to all
cores (Fig. 6(d)).
Next, we take the image processor from test case A as an
example to examine the priority-based adaptation in a sin-
gle core. Fig. 7 shows the distributions of the image proces-
sor’s priority levels during one frame period, while DRAM
frequency decreases from 1700MHz to 1300MHz. Each hor-
izontal bar is designated to a certain DRAM frequency. In
a single bar, each block represents the percentage of time
during which a certain priority level is adopted. Different
shades of blue represent different priority levels, with higher
priority levels in darker shades. As shown in Fig. 7, when
DRAM frequency is set to 1700MHz, for 90% of the time the
image processor adopts a priority of 0. As frequency
decreases, fewer memory requests can be processed by DRAM.
More memory interference and competition occur as a
result. To maintain target bandwidth, the self-adaptation
leads to a gradual increase in priority levels, which can be ob-
served through the increasing area of blocks in dark shades.
When DRAM frequency is lowered to 1300MHz, the image
processor has the priority of 7 for 60% of the time. In addi-
tion, as frequency decreases, the average bandwidth of the
image processor remains above target bandwidth thanks to
the priority-based adaptation.
Figure 7: Distributions of the image processor’s priority
levels during one frame period (33ms) with respect to
different DRAM frequencies.
[Stacked horizontal bars, one per DRAM frequency (1300, 1400, 1500,
1600 and 1700 MHz), showing the percentage of time (0% to 100%) spent
at each priority level from 0 to 7.]
4.2 Row-buffer Hits Optimization
As explained in Section 3.3, row-buffer hits optimization
helps improve available DRAM bandwidth. With the knowl-
edge of heterogeneous cores’ urgency levels, the memory con-
troller in the SARA framework is capable of optimizing row-
buffer hits without degrading system performance.
For comparison, we include another scheduling policy
named first-ready first-come-first-serve (FR-FCFS), which
prioritizes transactions going to open rows whenever
possible, and otherwise schedules transactions based on FCFS.
The FR-FCFS policy is expected to achieve the most row-buffer
hits and the highest DRAM bandwidth. Fig. 8 shows the
average DRAM bandwidth during one frame period when
test case A is applied. Five memory scheduling policies are
tested: RR, FCFS, QoS (Policy 1), QoS-RB (Policy 2)
and FR-FCFS. Fig. 9 shows the NPI of critical cores as
QoS-RB and FR-FCFS are adopted. As expected, the FR-FCFS
policy achieves the highest bandwidth, but at the expense of
performance degradations for the GPS and the display.
The bandwidth of QoS-RB is slightly lower (by 1%)
than that of FR-FCFS, but much higher than that of the other policies.
Specifically, the average DRAM bandwidth obtained by the QoS-RB
policy is 24%, 12% and 10% higher than that of the RR, FCFS and
QoS policies, respectively. In the meantime, no performance
degradations are caused to heterogeneous cores.
Figure 8: Summary of average bandwidth when different
scheduling policies are applied.
[Bar chart: DRAM bandwidth (GB/s, axis from 14 to 19) for the FR-FCFS,
QoS-RB, QoS, FCFS and RR policies.]
Figure 9: NPI value for test case A with respect to FR-FCFS
and QoS-RB scheduling policies.
[Two panels, "QoS w/ row-buffer optimization" and "First-Ready
First-Come-First-Serve", each plotting NPI (axis marks at 0.1, 1 and 10)
against time (0 to 30 ms) for the image processor, rotator, video codec,
display, camera, USB, GPS and WiFi.]
5. CONCLUSIONS
In this work, we proposed the self-aware resource alloca-
tion (SARA) framework for memory management in het-
erogeneous systems. Lightweight performance meters are
distributed in each core to monitor end-to-end QoS with
low cost. The priority-based adaptation allows cores to
customize their target performance and adjust their pri-
ority levels according to the observed performance. The
memory system with non-partitionable resources responds
to QoS demands by performing priority-based management
which does not require complicated computations. Experimental
results show that with the priority-based adaptation
and management, the SARA framework helps all the heterogeneous
cores achieve their target performance. By comparison,
without using priorities, the performance of critical cores
can drop below 10% of the target.
6. REFERENCES
[1] Qualcomm. Snapdragon 820.
https://www.qualcomm.com/products/snapdragon/processors/820,
2015.
[2] NVIDIA. Tegra x1.
http://www.nvidia.com/object/tegra-x1-processor.html, 2015.
[3] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver. A QoS-aware
memory controller for dynamically balancing GPU and CPU
bandwidth use in an MPSoC. In ACM/IEEE DAC, 2012.
[4] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh,
and O. Mutlu. Staged memory scheduling: Achieving high
performance and scalability in heterogeneous systems. In ACM
ISCA, 2012.
[5] Y. Song, K. Samadi, and B. Lin. Single-tier virtual queuing:
An efficacious memory controller architecture for MPSoCs with
multiple realtime cores. In ACM/IEEE DAC, 2016.
[6] B. Grot, S. W. Keckler, and O. Mutlu. Preemptive virtual
clock: A flexible, efficient, and cost-effective QoS scheme for
networks-on-chip. In ACM/IEEE MICRO, 2009.
[7] P.-H. Wang, C.-H. Li, and C.-L. Yang. Latency
sensitivity-based cache partitioning for heterogeneous
multi-core architecture. In ACM/IEEE DAC, 2016.
[8] A. Sharifi, S. Srikantaiah, A. K. Mishra, M. Kandemir, and
C. R. Das. METE: Meeting end-to-end QoS in multicores
through system-wide resource management. SIGMETRICS
Perform. Eval. Rev., 39(1):13–24, June 2011.
[9] H. Hoffmann, J. Holt, G. Kurian, E. Lau, M. Maggio, J. E.
Miller, S. M. Neuman, M. Sinangil, Y. Sinangil, A. Agarwal,
A. P. Chandrakasan, and S. Devadas. Self-aware computing in
the Angstrom processor. In ACM/IEEE DAC, 2012.
[10] J. A. Colmenares, G. Eads, S. Hofmeyr, S. Bird, M. Moretó,
D. Chou, B. Gluzman, E. Roman, D. B. Bartolini, N. Mor,
K. Asanović, and J. D. Kubiatowicz. Tessellation: Refactoring
the OS around explicit resource containers with continuous
adaptation. In ACM/IEEE DAC, 2013.
[11] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache,
DRAM, Disk. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 2007.
[12] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel,
and B. Jacob. DRAMSim: A memory-system simulator. In
SIGARCH Computer Architecture News, 2005.
