In modern heterogeneous MPSoCs, the management of shared memory resources is crucial in delivering end-to-end QoS. Previous frameworks have either focused on singular QoS targets or the allocation of partitionable resources among CPU applications at relatively slow timescales. However, heterogeneous MPSoCs typically require instant response from the memory system where most resources cannot be partitioned. Moreover, the health of different cores in a heterogeneous MPSoC is often measured by diverse performance objectives. In this work, we propose a Self-Aware Resource Allocation (SARA) framework for heterogeneous MPSoCs. Priority-based adaptation allows cores to use different target performance and self-monitor their own intrinsic health. In response, the system allocates non-partitionable resources based on priorities. The proposed framework meets a diverse range of QoS demands from heterogeneous cores.
INTRODUCTION
Modern heterogeneous MPSoCs [1, 2] have been widely deployed in mobile devices thanks to their energy efficiency. These MPSoCs typically integrate a diverse collection of cores. Fig. 1 depicts an example of a heterogeneous MPSoC. Besides general-purpose cores like the CPU for running applications, most heterogeneous cores are dedicated to certain functions, such as the GPU, the DSP and the display. These cores have diverse notions of Quality-of-Service (QoS). For example, the GPU measures target real-time performance in terms of frame rate; the DSP demands the memory latency to remain below a certain limit; and the display requires sufficient bandwidth to refresh frames at a constant rate.
To save cost and energy, heterogeneous cores commonly share resources. Among these, sharing the memory system (including the on-chip network and the memory controller) is the most challenging, because memory performance often has a direct and substantial impact on system performance. As data is shared through memory, competing memory requests from different cores interfere with each other, and this interference can cause the memory system to miss the target performance of some cores. Fig. 2 depicts a camcorder application, which represents a typical use case in that it involves many cores at the same time. With ineffective memory scheduling, a real-time core (e.g., the display) may not achieve its target real-time performance due to inadequate memory bandwidth. Moreover, as latency-sensitive cores such as the DSP share memory with other cores, they can be easily overwhelmed by real-time cores consuming high bandwidth.
QoS-aware management for specific types of memory resources has been well studied in previous work [3, 4, 5, 6, 7]. In [3], a QoS-aware scheduling policy was proposed for CPU-GPU systems. The concept of frame progress was introduced for monitoring GPU performance. Although the policy can be extended to include more media cores, it cannot be applied to real-time cores whose target QoS cannot be assessed in terms of frame rate. Moreover, holistic memory management frameworks for CPU-centric homogeneous systems have also been explored recently [8, 9, 10]. This line of work typically constructs a management model based on control theory to partition computing and memory resources. These frameworks accept flexible QoS targets, as clients are allowed to define their own target performance. Nonetheless, such approaches operate at a relatively slow timescale (e.g., on the order of milliseconds) due to their computational complexity.
In comparison, real-time cores in heterogeneous MPSoCs often demand much more instant response from the memory system. Besides, communication between heterogeneous cores is mainly conducted through shared memory as shown in Fig. 2 , because multimedia data is generally too large to fit in caches. Therefore, DRAM plays a more crucial role in heterogeneous systems. However, previous frameworks cannot handle DRAM effectively because its bandwidth is not partitionable. Specifically, available DRAM bandwidth relies on the memory access pattern, as higher spatial locality results in fewer redundant precharge operations and better memory efficiency.
So far, there has not been a QoS-aware resource management model for heterogeneous MPSoCs which is capable of allocating non-partitionable resources to fleeting QoS demands. In this work, we propose the Self-Aware Resource Allocation (SARA) framework as a solution. The contributions of our work can be summarized as follows.
• We propose a QoS-aware holistic resource management framework for heterogeneous systems. The SARA model accepts diverse notions of QoS and monitors performance distributively with lightweight meters to guarantee end-to-end QoS.
• We introduce priority-based self-adaptations for the management of non-partitionable resources, such as DRAM and on-chip network, which constitute most of the shared resources in heterogeneous MPSoCs.
• We evaluate the proposed framework using memory traffic of next-generation MPSoCs and show that the proposed SARA model delivers target performance to all cores. In contrast, the performance of critical cores can fall below 10% of their targets without the SARA framework. Further, memory system optimization is performed without QoS degradations.
The rest of this paper is organized as follows: Section 2 briefly reviews related work. Section 3 describes the proposed SARA framework. Experimental results and conclusions follow in Sections 4 and 5.
RELATED WORK
Most previous work on QoS-aware resource management in heterogeneous MPSoCs focused on a single layer of the memory system. In [3], a novel scheduling policy was introduced to dynamically balance bandwidth between the CPU and the GPU based on the frame progress of real-time workloads. To achieve QoS-aware memory scheduling, the staged memory scheduler [4] was presented as the first QoS-aware scheduler for CPU-GPU systems. Further, the single-tier virtual queuing memory controller [5] was proposed to overcome the limitation of two-tier schedulers in QoS-aware scheduling. Besides memory scheduling, QoS-aware cache management [7] and on-chip network design [6] have also been well explored in recent years. Nonetheless, these works cannot guarantee end-to-end QoS because they only deal with certain parts of the memory system. For example, the QoS provided in the memory controller could be degraded by the interconnect if it does not apply the same QoS policy. In addition, implementing a centralized QoS monitor in the memory system can be prohibitive since it needs to collect runtime information from all cores. More limiting still, these works assume specific notions of QoS, which is not applicable to modern heterogeneous MPSoCs where the health of different cores is often evaluated by diverse performance objectives.
METE [8] is a multi-level framework for end-to-end resource management based on control theory. It utilizes runtime information to predict application behaviors. Application controllers calculate the amounts of resources required to achieve target application performance. A global resource broker determines the final resource partitions for applications. SEEC [9] is a self-aware computing framework designed for a many-core processor. It follows the observe-decide-act control loop for resource allocation. The performance of CPU applications is observed by the decision engine, which decides resource partitions using available actions defined by system designers. ARCC [10] is a self-computing framework implemented in the Tessellation many-core OS. It performs two-level scheduling: first the resource allocation broker distributes global resources, and then scheduling policies are customized separately at user level.
The aforementioned frameworks were intended for CPU-centric multi-core systems. These frameworks aim at allocating partitionable resources, such as CPU cores and cache ways, to applications at the software/OS level. They are not suitable for heterogeneous MPSoCs for the following reasons. First, complicated control models may not be fast enough for heterogeneous cores (e.g., these software/OS-level approaches operate at millisecond timescales). For example, the DSP sets limits on memory latency at the nanosecond level, but prior frameworks need more time to adapt through control theory computations in the OS. Second, prior work assumes all memory resources are partitionable. However, DRAM bandwidth cannot be simply partitioned like cache ways. In DRAM, the data storage of a memory bank is organized into rows and columns. To access a column, the row where this column is located is loaded into the row-buffer (i.e., a row activation operation) after the other rows are closed (i.e., a precharge operation) [11]. These row activation and precharge operations incur a time penalty without contributing to actual data transfer, which makes DRAM bandwidth variable and unpredictable.
SELF-AWARE RESOURCE ALLOCATION FRAMEWORK
The proposed architecture of the SARA framework is shown in Fig. 3. The resource management model consists of three stages: distributed monitoring, priority-based runtime adaptation and system response. In the rest of this section, we go through the SARA framework stage by stage.
Distributed Self-Monitoring
In the first stage, each core self-monitors its own performance. The distributed monitoring relieves the memory system from the burden of monitoring heterogeneous cores with various notions of QoS. Self-monitoring also provides more accurate feedback on the end-to-end QoS compared with centralized monitoring in the memory system. In addition, implementing lightweight performance meters is good for scalability, because a new core can be added or modified without updating the rest of the system. Every core customizes its own internal performance meter to measure its own performance or progression against a given target, and the measurement gets normalized into a fractional number called a Normalized Performance Indicator (NPI), which is used as an indicator of the core's intrinsic health. In the DSP, the performance meter monitors the average latency of its transactions, while in the display the meter counts the occupancy level in the read buffer. The deviation from the target performance (e.g., latency, occupancy level, etc) produces the NPI metric. In our framework, each independent DMA (Direct Memory Access) unit is equipped with a performance meter. Note that there are usually multiple DMAs in a single core. For simplicity, we only show one DMA per core in Fig. 3. 
Distributed Priority-Based Adaptation
In the second stage, each core adapts the relative priority of its transactions based on its NPI value. The NPI value delivered by the performance meter is translated into a relative priority level which is attached to memory transactions from the same DMA. The priority level is evaluated within on-chip network arbiters and the memory controller as the transaction travels to DRAM. Priority-based arbitration allows the memory system to provide QoS without specifying the heterogeneous QoS for all cores and DMAs. As with performance meters, the formulation of the NPI metric and the adaptation of priority can be implemented differently from core to core, depending on the local target performance. Fig. 4 shows three examples of priority-based adaptation in different cores.
As for the DSP, the target performance is to have the average memory latency lower than the maximum latency limit. The average latency is measured and compared with a pre-set limit to produce the NPI value (see Eqn. 1), which remains above or equal to 1 when the target performance is achieved. This NPI value is then translated to a relative priority level (Fig. 4(a) ). The priority level increases along with average latency.
$$\mathrm{NPI}_{\mathrm{DSP}} = \frac{\text{maximum latency limit}}{\text{average latency}} \qquad (1)$$
Similarly, cores requesting bandwidth produce NPI metrics by computing the ratio between the average and the target bandwidth. However, frame rate differs from bandwidth, because frame size can vary and thus a constant frame rate can lead to variable bandwidth. Hence frame progress [3, 5] is used instead to produce NPI metrics for frame-rate-based cores. Taking the GPU as an example, the target is to let the frame progress reach 100% as the current frame period comes to an end. The GPU's NPI value is produced at any time by comparing the frame progress with reference progresses which grow proportionally with frame time. The NPI value is then translated to a relative priority level of GPU transactions. Fig. 4(b) shows the reference progresses achieving 1, 0.75 and 0.5 times the average data rate of target performance.
In the display, the LCD panel reads data from a read buffer at a constant frame rate, while the display controller DMA tries to refill this buffer from DRAM so that it never becomes empty. Its health (see Eqn. 3) relies on keeping the refill rate ($R_{\mathrm{refill}}$) no lower than the read data rate ($R_{\mathrm{read}}$), and can be indicated by the variation of the buffer occupancy level ($\Delta\mathrm{occupancy}$). Compared with an initial level (e.g., 50%), the lower the occupancy level of this buffer gets, the worse the NPI value becomes, which is in turn translated to a higher priority level (Fig. 4(c)).
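As an illustration of how these indicators might be computed, the following minimal Python sketch implements the latency ratio of Eqn. 1 and the bandwidth ratio described above. The function names are hypothetical. The frame-progress form is an assumption: it divides the measured progress by the full-rate reference progress (the elapsed fraction of the frame period), which is one plausible reading of "comparing the frame progress with reference progresses" and is not taken from the paper's equations.

```python
def npi_latency(max_latency_limit: float, average_latency: float) -> float:
    """Eqn. 1: NPI of a latency-sensitive core such as the DSP."""
    if average_latency <= 0:
        raise ValueError("average_latency must be positive")
    return max_latency_limit / average_latency


def npi_bandwidth(average_bandwidth: float, target_bandwidth: float) -> float:
    """NPI of a bandwidth-driven core: achieved over target bandwidth."""
    if target_bandwidth <= 0:
        raise ValueError("target_bandwidth must be positive")
    return average_bandwidth / target_bandwidth


def npi_frame_progress(frame_progress: float, elapsed_fraction: float) -> float:
    """Assumed form: progress so far over the full-rate reference progress.

    Both arguments are fractions of one frame, in the range (0, 1].
    """
    if elapsed_fraction <= 0:
        raise ValueError("elapsed_fraction must be positive")
    return frame_progress / elapsed_fraction
```

In every case a value of 1 or more means the core is meeting its target, and a value below 1 means it is falling behind.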
Intuitively, one might be concerned that every core would intentionally raise its priority to the maximum level to obtain as many resources as possible. However, this situation should not happen because the priority level is only maximized when the actual performance is far below the target. The system designer is responsible for making sure that cores have realistic performance targets and enough resources to satisfy all possible combinations of QoS demands. Once the system is fabricated in hardware, heterogeneous cores cannot change their target performance arbitrarily, especially because most of them are fixed-function IP blocks with invariable QoS targets and little programmability.
In our evaluations, the priority levels are quantized into $2^k$ levels, which can be encoded using k bits. We found that k = 3 bits provides sufficient granularity in priority levels to produce satisfactory results (i.e., the priority levels range from 0 to 7).
Distributed System Response
As transactions travel through the memory system, the system responds to QoS demands by providing resource management based on their priority levels. The priority-based management is performed correspondingly in different parts of the memory system. In on-chip network routers, transactions with higher priorities are preferentially selected during switch allocation. In the memory controller, when a priority-based scheduler arbitrates among transactions going to available memory banks, the ones with higher priorities have more chances to be served. An example of such memory scheduling policies is the priority-based round-robin shown in Policy 1. To avoid starvation of transactions with low priorities, the scheduler also needs to consider the aging factor during arbitration. In our evaluations, the scheduler periodically clears the backlog of transactions that have waited for at least T cycles (e.g., T = 10000 cycles).
• Policy 1: Suppose PA and PB are the priorities of transactions A and B. If PA > PB, choose A; if PA < PB, choose B; otherwise choose between A and B in a round-robin manner.
Priorities notify the system whether the cores have urgent QoS demands. This gives the memory system an opportunity to optimize memory performance without undermining the QoS. Specifically, when transactions are of low urgency, the system can improve memory performance metrics such as the row-buffer hit rate, instead of focusing on serving QoS demands.
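Policy 1 and the aging rule can be sketched in a few lines of Python. This is a minimal illustration, not the evaluated scheduler: the `Transaction` record, the `pick_policy1` function and its parameters are hypothetical names, and the aging rule is simplified to "serve the oldest transaction that has waited at least `aging_limit` cycles first", whereas the text describes a periodic clearing of the backlog.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    source: int    # index of the queue (core or DMA) that issued the request
    priority: int  # 0..7, attached by the issuing DMA
    arrival: int   # cycle at which the request entered its queue


def pick_policy1(candidates, last_served, num_sources, now, aging_limit=10_000):
    """Return the transaction to serve next, or None if there is none."""
    if not candidates:
        return None
    # Simplified aging: requests that waited at least aging_limit cycles go first.
    aged = [t for t in candidates if now - t.arrival >= aging_limit]
    if aged:
        return min(aged, key=lambda t: t.arrival)
    # Policy 1: the highest priority wins.
    top = max(t.priority for t in candidates)
    tied = [t for t in candidates if t.priority == top]
    # Round-robin tiebreak: the first source after the one served last.
    return min(tied, key=lambda t: (t.source - last_served - 1) % num_sources)
```

The caller is expected to record the `source` of the returned transaction and pass it as `last_served` on the next arbitration, so that equal-priority sources take turns.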
Row-buffer hits refer to memory accesses served from the same active row-buffer before it is precharged. More row-buffer hits mean that less time and power are wasted on row activation and precharge operations. Thus increasing row-buffer hits helps lower memory latency and improve total DRAM bandwidth.
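To make the effect of access ordering concrete, the short Python sketch below counts row-buffer hits for a sequence of accesses. It assumes a simple open-row model in which each bank keeps its most recently accessed row open; the function name and the `(bank, row)` tuple encoding are illustrative choices, not part of the framework.

```python
def count_row_buffer_hits(accesses):
    """Count hits for a sequence of (bank, row) accesses under an open-row model."""
    open_rows = {}  # bank -> row currently held in that bank's row-buffer
    hits = 0
    for bank, row in accesses:
        if open_rows.get(bank) == row:
            hits += 1  # row already open: no precharge or activation needed
        else:
            open_rows[bank] = row  # close the old row (if any), activate the new one
    return hits


# The same four requests, in two different orders.
interleaved = [(0, 1), (0, 2), (0, 1), (0, 2)]
grouped = [(0, 1), (0, 1), (0, 2), (0, 2)]
```

Here `count_row_buffer_hits(interleaved)` yields 0 hits while `count_row_buffer_hits(grouped)` yields 2, which is why reordering requests toward open rows raises effective bandwidth.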
To increase row-buffer hits, the memory controller reorders transactions to favor the ones hitting open rows. This may degrade the QoS when highly urgent transactions are postponed due to row-buffer hit optimization. Yet, with priorities, the memory controller is aware of the urgency levels of transactions and is able to avoid delaying urgent transactions during optimization. Policy 2 shows an extension of Policy 1 that increases row-buffer hits without QoS degradations. The parameter δ is an adjustable threshold to balance row-buffer hit optimization and QoS-aware scheduling. When the priority level is lower than δ, the scheduler focuses on row-buffer hits; otherwise the QoS comes first. A higher δ value favors DRAM bandwidth more, but also potentially causes more disturbance to the QoS. We found δ = 6 to be a good setting to achieve high DRAM bandwidth without causing QoS degradations.
• Policy 2: Suppose transaction A is going to an active row-buffer and B is not. If PA, PB < δ or PA = PB, choose A. Otherwise, perform priority-based round-robin.
The priority-based resource allocation is able to handle non-partitionable resources with little computation in comparison with previous management models [8, 9, 10]. This facilitates instant response from the memory system to QoS demands.
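A minimal Python sketch of the two-candidate decision in Policy 2 follows. The function name, the `"A"`/`"B"` return convention and the `rr_turn` argument (whose turn it is in the round-robin) are hypothetical. One assumption goes beyond the stated policy: when both candidates hit an open row, or neither does, the sketch falls back to priority-based round-robin.

```python
def pick_policy2(p_a, p_b, a_hits_open_row, b_hits_open_row, delta=6, rr_turn="A"):
    """Choose between transactions A and B; returns "A" or "B"."""
    if a_hits_open_row != b_hits_open_row:
        # Exactly one candidate targets an open row.
        if a_hits_open_row:
            hit, hit_p, other_p = "A", p_a, p_b
        else:
            hit, hit_p, other_p = "B", p_b, p_a
        # Favor the row hit when neither request is urgent or priorities are equal.
        if (hit_p < delta and other_p < delta) or hit_p == other_p:
            return hit
    # Otherwise: priority-based round-robin (Policy 1).
    if p_a > p_b:
        return "A"
    if p_a < p_b:
        return "B"
    return rr_turn
```

With the default δ = 6, a row-hitting request at priority 2 is served before a row-missing request at priority 5, but not before one at priority 7.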
Hardware Implementation
The implementation of the proposed SARA framework includes three parts: the computation of NPI value, the translation of NPI value to a priority level, and the priority-based arbitration in the memory system.
To calculate the NPI, a divider is needed at the performance meter for each DMA. For the translation of the NPI, a mapping function can be stored in a look-up table at each core. Each priority level is assigned a table entry, and this entry stores the lowest NPI value allowed at that priority level. For example, if priority = p when NPI ∈ [u, v), the value u will be stored at the entry for p in the look-up table. Note that v will be the lower bound of the NPI for the priority level p − 1. Comparators are needed to access table entries in parallel. If the current NPI value is not lower than the stored lower bound, the corresponding priority level is asserted. When multiple priority levels are asserted, the lowest level is adopted.
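The look-up-table translation can be mirrored in software as follows. This Python sketch is illustrative only: the function name is hypothetical, and the threshold values in `LOWER_BOUNDS` are made-up numbers chosen to show the mechanism, not the settings used in the evaluation. The list comprehension plays the role of the parallel comparators, and `min` selects the lowest asserted level.

```python
# Hypothetical thresholds: LOWER_BOUNDS[p] is the lowest NPI allowed at priority p.
# Bounds decrease as the priority level grows (worse health, higher priority).
LOWER_BOUNDS = [1.5, 1.3, 1.1, 1.0, 0.9, 0.7, 0.5, 0.0]


def npi_to_priority(npi, lower_bounds=LOWER_BOUNDS):
    """Translate an NPI value into a priority level (0 = least urgent)."""
    # One comparison per table entry: level p is asserted when npi >= its bound.
    asserted = [p for p, bound in enumerate(lower_bounds) if npi >= bound]
    # Adopt the lowest asserted level; fall back to the top level if none is.
    return min(asserted, default=len(lower_bounds) - 1)
```

With these example thresholds, an NPI of 2.0 maps to priority 0, an NPI of exactly 1.0 maps to priority 3, and an NPI of 0.2 maps to priority 7.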
Suppose each priority level is encoded in three bits. In the memory system, performing the priority-based arbitration then requires a 3-bit comparator to arbitrate among transactions with different priority levels. Since most existing QoS-aware schedulers already provide hardware support for priorities, our framework can be integrated into the memory system without raising complexity.
EVALUATION
In this section, the proposed SARA framework will be tested to demonstrate its effectiveness in providing target performance to heterogeneous cores. Two test cases based on the camcorder dataflow (Fig. 2) will be used for demonstration. Further, we will show row-buffer hits optimization can be performed efficiently within SARA framework without performance degradations.
The proposed framework is modeled as in Fig. 3 , where memory traffic from every DMA is generated based on a next-generation MPSoC [1] . DRAMSim2 [12] with LPDDR4 timing model is used for cycle-accurate simulation of DRAM. Table 1 shows the simulation settings. Table 2 lists the simulated cores and the types of target performance.
The target performance for each core is set according to the camcorder dataflow (Fig. 2) which runs at 30fps. For instance, the frame rotator writes and reads 1080p YUV420 images at 30fps, which requires 89MB/s for each DMA and 178MB/s in total.
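The rotator figures can be reproduced with a short calculation. The Python sketch below rests on three assumptions that the text does not spell out: 1080p means 1920 × 1080 pixels, YUV420 uses 1.5 bytes per pixel (8-bit samples), and "MB" denotes 2^20 bytes. The function name is illustrative.

```python
def stream_bandwidth_mb(width, height, bytes_per_pixel, fps):
    """Bandwidth of one uncompressed video stream, in units of 2**20 bytes/s."""
    return width * height * bytes_per_pixel * fps / 2**20


# Assumed parameters: 1920x1080 frame, YUV420 at 1.5 bytes/pixel, 30 frames/s.
per_dma = stream_bandwidth_mb(1920, 1080, 1.5, 30)  # one read or one write DMA
total = 2 * per_dma                                 # read plus write
print(round(per_dma), round(total))
```

Under these assumptions the result rounds to 89 and 178, matching the per-DMA and total bandwidth quoted above.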
Delivering Target Performance
To begin with, we test the SARA framework in delivering target performance to heterogeneous cores. For comparison, four arbitration policies are used in the memory controller and on-chip network arbiters: first-come-first-serve (FCFS), round-robin (RR), a frame-rate-based QoS policy [3] and the priority-based QoS policy (Policy 1). The FCFS policy serves all the transactions according to their arrival order. The round-robin policy separates transactions into different queues and serves them in a round-robin fashion. In the memory controller, we have five transaction queues designated respectively to the CPU, the GPU, the DSP, media cores and system cores. The round-robin policy also applies to on-chip network arbiters, as input queues are served in turn. The frame-rate-based QoS policy prioritizes media cores when they are missing real-time deadlines, but otherwise provides best-effort service to latency-sensitive cores. Finally, the priority-based QoS policy compares priority levels for arbitration and uses round-robin as the tiebreaker.
The NPI values of critical cores during a frame period are shown in Fig. 5 when test case A is applied. As explained in Section 3.2, the NPI metric reflects performance, with a higher value indicating better performance. When the NPI value drops below 1, the target performance is not achieved.
Without reordering memory requests, the FCFS policy ends up spending most of the time serving cores that consume high bandwidth. This easily leads to the starvation of latency-sensitive cores. As shown in Fig. 5(a), the NPI of the GPS drops below 1 because the GPS is overwhelmed by other system cores sharing the same interconnect, such as the USB. Among media cores, the video codec, the rotator and the image processor have all the frame data available at the beginning of a frame period and thus create bursty traffic, whereas the camera and the display generate and consume data at constant rates determined by the image sensor and the LCD panel. In Fig. 5(a), media cores with bursty traffic obtain most of the bandwidth in the beginning, resulting in high NPI values. On the other hand, the display fails to achieve the target performance. The display's NPI drops as low as 0.13, which means only 13% of the target performance is achieved.
When round-robin policy is applied, the competition among media cores becomes more intense since they share the same transaction queue in the memory controller. In Fig. 5(b) , the display and the camera both fail due to the interference from other media cores. Less than 10% of their target performance is achieved in the worst case. In the meantime, all the system cores meet their target performance because they avoid the interference from media cores by using a separate transaction queue. The frame-rate-based QoS policy helps all media cores achieve NPI value above 1 in Fig. 5(c) . However, all system cores fail due to the absence of adaptations for the cores with different QoS targets other than frame rates.
In Fig. 5(d), all the cores reach their target performance when QoS-aware scheduling is performed, because priority-based adaptations help arbiters serve the cores in urgent need. Note that the NPI values of the other cores, such as the GPU, are not shown because no failure is observed from these cores.
The results for test case B are shown in Fig. 6. Similar to Fig. 5, the latency-sensitive DSP suffers when the FCFS policy is adopted (Fig. 6(a)). When the round-robin policy is applied (Fig. 6(b)), the DSP suffers less since it has its own transaction queue, while the display fails due to the increased interference from other media cores sharing the same transaction queue. Again, the frame-rate-based QoS policy fails to serve non-media cores. Finally, the dynamic priorities help the memory system deliver target performance to all cores (Fig. 6(d)).
Next, we take the image processor from test case A as an example to examine the priority-based adaptation in a single core. Fig. 7 shows the distributions of the image processor's priority levels during one frame period, while the DRAM frequency decreases from 1700MHz to 1300MHz. Each horizontal bar corresponds to a certain DRAM frequency. In a single bar, each block represents the percentage of time during which a certain priority level is adopted. Different shades of blue represent different priority levels, with higher priority levels in darker shades. As shown in Fig. 7, when the DRAM frequency is set to 1700MHz, the image processor stays at priority 0 for 90% of the time. As the frequency decreases, fewer memory requests can be processed by DRAM, and more memory interference and competition occur as a result. To maintain target bandwidth, the self-adaptation leads to a gradual increase in priority levels, which can be observed through the increasing area of blocks in dark shades. When the DRAM frequency is lowered to 1300MHz, the image processor has priority 7 for 60% of the time. In addition, as the frequency decreases, the average bandwidth of the image processor remains above the target bandwidth thanks to the priority-based adaptation.
Figure 7: Distributions of the image processor's priority levels during one frame period (33ms) with respect to different DRAM frequencies.
Row-buffer Hits Optimization
As explained in Section 3.3, row-buffer hits optimization helps improve available DRAM bandwidth. With the knowledge of heterogeneous cores' urgency levels, the memory controller in the SARA framework is capable of optimizing row-buffer hits without degrading system performance.
For comparison, we also evaluate a scheduling policy named first-ready first-come-first-serve (FR-FCFS), which prioritizes transactions going to open rows whenever possible, and otherwise schedules transactions based on FCFS. The FR-FCFS policy is expected to achieve the most row-buffer hits and the highest DRAM bandwidth. Fig. 8 shows the average DRAM bandwidth during one frame period when test case A is applied. Five memory scheduling policies are tested: RR, FCFS, QoS (Policy 1), QoS-RB (Policy 2) and FR-FCFS. Fig. 9 shows the NPI of critical cores when QoS-RB and FR-FCFS are adopted. As expected, the FR-FCFS policy achieves the highest bandwidth, but at the expense of performance degradations in the GPS and the display. The bandwidth achieved by QoS-RB is slightly lower (by 1%) than that of FR-FCFS, but much higher than that of the other policies. Specifically, the average DRAM bandwidth obtained by the QoS-RB policy is 24%, 12% and 10% higher than that of the RR, FCFS and QoS policies respectively. In the meantime, no performance degradations are caused to heterogeneous cores.
CONCLUSIONS
In this work, we proposed the self-aware resource allocation (SARA) framework for memory management in heterogeneous systems. Lightweight performance meters are distributed in each core to monitor end-to-end QoS at low cost. The priority-based adaptation allows cores to customize their target performance and adjust their priority levels according to the observed performance. The memory system with non-partitionable resources responds to QoS demands by performing priority-based management which does not require complicated computations. Experimental results show that with the priority-based adaptation and management, the SARA framework helps all the heterogeneous cores achieve their target performance. By comparison, without using priorities, the performance of critical cores can drop below 10% of the target.
