Energy is a scarce resource in real-time embedded systems because most of them run on batteries. Hence, designers should ensure that the energy constraints are satisfied in addition to the deadline constraints. This requires considering how interference on shared, low-level hardware resources such as the cache affects the worst-case energy consumption of the tasks. Toward this aim, this article proposes a fine-grained approach to analyze bank-level interference (bank conflict and bus access interference) on real-time multicore systems, which can reasonably estimate runtime interference in the shared cache and yield tighter worst-case energy consumption bounds. In addition, we develop a bank-to-core mapping algorithm for reducing bank-level interference and improving the worst-case energy consumption. The experimental results demonstrate that our approach improves the tightness of the worst-case energy consumption by 14.25% on average compared to the upper-bound delay approach. The bank-to-core mapping reduces the worst-case energy consumption by 7.23%.
Introduction
Real-time embedded systems are becoming widespread, ranging from sensor networks, Internet of Things (IoT) systems, 1, 2 and surveillance systems to satellite subsystems. For real-time embedded systems, energy consumption is an important design issue, since most of them operate on batteries or draw energy from limited sources. Several authors have argued that when energy plays an important role, it should also become a key factor in scheduling decisions. [3] [4] [5] That is, in addition to ensuring the deadline constraints, designers must also consider whether there is enough energy available in the system for the task to complete execution. As a consequence, besides bounding the worst-case execution time (WCET) of a task, designers need to analyze the worst-case energy consumption (WCEC) of the task to avoid potential system failures due to inadequate energy supply at runtime.
Currently, real-time systems are increasingly moving toward multicore architectures. To mitigate the high latency of the off-chip memory, multicore architectures are usually equipped with on-chip caches. The caches can significantly improve performance, but their energy consumption is a concern. Several studies 6, 7 report that the cache accounts for up to 50% of the energy consumption of the overall chip due to its large on-chip area and high access frequency. Clearly, the caches are good candidates for energy optimization. Various techniques have been proposed over the years to reduce the energy consumption of the caches; however, in many of these works, 8, 9 the cache energy models are not tailored to the worst case, and bank-level interference is not considered.
In a multicore architecture, the shared cache consists of multiple banks, and cache requests to different banks can be serviced in parallel. However, a bank can only handle one cache request at a time. When two or more cache requests try to access the same bank at the same time, a bank conflict occurs. Bank conflicts complicate system behavior, making WCET analysis difficult and wasting a significant amount of energy. Therefore, their influence on WCET and WCEC has to be taken into account to ensure system safety. To the best of our knowledge, only the works of Paolieri et al. 10 and Yoon et al. 11 considered bank conflict in WCET estimation. However, both employ the upper-bound delay (UBD) approach to estimate the interference delay: a potential maximum delay (i.e. the UBD) that each cache request may suffer is bounded, and this delay is then added to each request during WCET analysis. In practice, not every request suffers from bank conflict, and even when bank conflicts occur among a group of requests, the bank conflict delay suffered by each request is different. The UBD method therefore not only yields a pessimistic WCET estimate, but also a conservative over-approximation of the WCEC of the task.
In this article, we investigate the impact of bank-level interference on WCEC for real-time multicore systems where a shared cache with multiple banks is used to improve performance. We assume that the target real-time embedded system is a hard real-time system where the deadline constraint of each hard real-time task (HRT) must be met. We make the following major contributions.
1. We model the WCEC of the shared cache from the perspective of HRTs and analyze the WCEC of the HRT.
2. We present a fine-grained approach to analyze bank-level interference based on request timing, which tightens the WCEC estimate. In our approach, we assume that access to the shared cache is granted using an Interference-Aware Bus Arbiter (IABA). 10
3. We apply bank-to-core mapping to optimize interference delay and develop an algorithm for finding the best bank-to-core mapping according to the queue of cores, such that the impact of bank-level interference on the WCEC is minimized.
The rest of this article is organized as follows. Section ''Related work'' reviews the related work on energy-optimization techniques for caches and bank-level interference analysis. Section ''System model'' introduces the system model, task model, and cache energy model. In section ''Bank-level interference analysis and WCEC computation,'' we analyze the delay of bank-level interference suffered by a HRT. In section ''Bank mapping optimization for WCEC reduction,'' we design an algorithm for finding the best bank-to-core mapping. Section ''Evaluation'' presents experimental results. Finally, we conclude this article in section ''Conclusion and future work.''
Related work
In the existing works, various techniques have been proposed to reduce inter-core interference and optimize the energy consumption of the cache. Most of these techniques aim to improve the average case energy consumption of the cache. Furthermore, they mainly focus on the effect of cache storage interference on energy consumption. Here, we analyze some of the works that we have separated into two different categories: (1) works that focus on cache reconfiguration and (2) works that focus on cache partitioning.
Reconfigurable cache
Several reconfigurable caches 7,13 have been proposed for performance improvement, energy saving, and contention reduction. Zhang et al. 7 proposed a highly reconfigurable cache architecture, where cache characteristics (such as cache way, block size, and associativity) could be tuned via hardware configuration registers. This cache architecture can achieve up to 40% energy saving, but it cannot guarantee strict cache isolation among real-time applications. In Hajimiri et al., 14 an inter-task dynamic cache reconfiguration (DCR) technique was proposed to reduce contention and optimize the energy consumption of the cache in real-time systems. In Mittal et al., 15 a multicore cache energy saving technique using dynamic cache reconfiguration was proposed, which saves cache energy by periodically allocating a suitable amount of cache space to each running program.
Cache partitioning
Cache partitioning technique partitions the shared cache into separate regions and designates one or a few regions to individual cores, which can fully eliminate the cache storage interference. Many research works have been done to reduce the interference and optimize energy consumption by cache partitioning. Qureshi and Patt 16 presented a low-overhead cache partitioning technique based on online monitoring and cache utilization of each application. By leveraging configurable cache architecture, authors 17 proposed a technique to eliminate inter-task cache storage interference and optimize cache energy. Suhendra and Mitra 18 proposed the use of shared cache in a predictable manner through a combination of locking and partitioning mechanisms, and explored possible design choices and evaluated their effects on the worst-case application performance.
However, cache partitioning only eliminates cache storage interference and cannot avoid bank conflict. Paolieri et al. 10 partitioned the shared cache using bank-level partitioning or column-level partitioning. In bank-level partitioning, each task is assigned a private bank, and this partitioning requires as many banks as the number of tasks in the system. In column-level partitioning, the shared cache is partitioned into columns, and a subset of the columns is allocated exclusively to each task, but tasks still experience bank conflict when accessing the same bank at the same time. In this case, they bound the UBD of the bank conflict to compute interference delays. Yoon et al. 11 proposed harmonic round-robin bus arbitration and a bank-column cache partitioning scheme, in which bank conflict can be limited to one bus round by optimizing the allocation of bus slots among cores. These approaches take the effect of bank conflict delay into consideration by adding a potential maximum delay to the time that a request accesses the L2 cache. In fact, not all requests suffer from inter-core interference. Even when inter-core interference occurs among a group of requests, the interference delay suffered by each request differs depending on the interference state. Approaches based on the maximum delay therefore lead to significant overestimation of the WCET, which has a negative impact on the schedulability analysis, performance checking, and WCEC of the task.
Unlike the existing works, we investigate the impact of the bank-level interference. Such work is fundamental in establishing tighter WCEC bounds and providing the safety of energy.
System model
In this section, we introduce the basics of the system model, which is composed of architecture model and task model, and then we present the cache energy model from the perspective of HRTs.
Architecture model
Our architecture model assumes a real-time multicore architecture, shown in Figure 1, which consists of $N_{core}$ homogeneous cores, $C = \{C_1, C_2, \ldots, C_{N_{core}}\}$. Each core has its own private L1 instruction cache (IL1) and L1 data cache (DL1). All the cores share an L2 combined cache $B$, which is partitioned into $N_{bank}$ banks $\{B_1, B_2, \ldots, B_{N_{bank}}\}$, and each bank is subdivided into $N_{column}$ columns. That is, the shared L2 cache has $N_{bank} \cdot N_{column}$ columns in total. Bank access latency is $L_M$ cycles (the same for read/write operations and for all banks). The real-time shared bus connecting the cores and the shared L2 cache adopts the IABA. 10 The IABA is composed of one inter-core bus arbiter (XCBA), which schedules requests from different cores in a round-robin fashion, and several intra-core bus arbiters (ICBAs), one per core, each of which schedules requests from the same core under a first in, first out (FIFO) policy. Bus access latency is $L_B$ cycles.
In this multicore architecture, when tasks running on different cores send requests to access the shared bus and shared cache, each request is handled by the ICBA of its core, which selects the next request to be sent to the XCBA. The XCBA is then responsible for deciding which of the requests from different cores accesses the bus, so as to avoid inter-core bus conflict and bank conflict.
Real-time task model
This article considers a periodic task model with a real-time task set $T = \{HRT_1, HRT_2, \ldots, HRT_{N_{task}}\}$ comprising $N_{task}$ ($\le N_{core}$) HRTs. Each $HRT_i$ has a deadline $D_i$ and a period $P_i$. The task set hyperperiod, denoted $H$, is the least common multiple of the periods of all HRTs in $T$. We assume the task-to-core mapping is given at design time and task migration is not allowed.
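The hyperperiod definition above can be illustrated with a minimal sketch. The function name `hyperperiod` and the sample periods are assumptions made for illustration only; they do not come from the experimental task sets.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all task periods (in cycles or time units)."""
    # lcm(a, b) = a * b / gcd(a, b), folded over the whole list
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


# Hypothetical periods of three HRTs
H = hyperperiod([10, 20, 25])
print(H)
```

Each $HRT_i$ then releases $H / P_i$ job instances within one hyperperiod.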
In bank-column cache partitioning, each HRT is assigned a subset of the columns of the shared L2 cache; the columns allocated to a HRT are not accessed by any other HRT and remain the same throughout system execution. This column-level cache partitioning avoids cache storage interference, so we mainly focus on bank conflict in this article. We use $M$ to denote the set of all potential bank-to-core mappings. The $j$th mapping in $M$ is denoted by $Map_j$. The WCET of a task in a multicore architecture can be divided into two parts: 10 fixed execution time (also referred to as single-core bound) and interference delay. The former is the maximum time that a task could take to execute the instructions on its critical path, which is analyzed in isolation and not affected by the other tasks. The latter is the sum of the delays incurred by its cache accesses over the same path. For the multicore architecture adopting IABA, interference delay consists of the conflict delay (i.e. bus access delay and bank conflict delay) of cache accesses in the XCBA and the bus waiting time of cache accesses in the ICBA. Let $DX_i$ be the total conflict delay of $HRT_i$ in the XCBA and $DI_i$ be the total bus waiting time of $HRT_i$ in the ICBA. Then, the WCET of $HRT_i$ under bank-to-core mapping $Map_j$ can be expressed as follows

$$WCET_i(Map_j) = WCET_i^{fixed} + DX_i(Map_j) + DI_i(Map_j) \tag{1}$$
where $WCET_i^{fixed}$ is the fixed execution time of $HRT_i$, which can be computed by well-known techniques in WCET analysis, 19 and is the same for all job instances of $HRT_i$. In contrast, the delays $DX_i(Map_j)$ and $DI_i(Map_j)$ are not the same for different job instances of $HRT_i$.
Let $QR_i$ be the sequence of requests issued by $HRT_i$; the L2 cache hits and misses in $QR_i$ can be profiled using a static timing analysis tool, such as RapiTime 20 or Chronos. 21 Let $QR_i^{L2hit}$ ($\subseteq QR_i$) and $QR_i^{L2miss}$ ($\subseteq QR_i$) be the sequences of L2 cache hits and misses, respectively. Let $dxcba_j$ represent the conflict delay suffered by request $q_j$ ($\in QR_i^{L2hit}$) in the XCBA, and let $dicba_j$ denote the waiting time of request $q_j$ in the ICBA. The total conflict delay ($DX_i$) and total waiting time ($DI_i$) can be expressed as follows

$$DX_i = \sum_{q_j \in QR_i^{L2hit}} dxcba_j$$

$$DI_i = \sum_{q_j \in QR_i^{L2hit}} dicba_j$$
Cache energy model
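The WCEC of the L2 cache under bank-to-core mapping $Map_j$ can be expressed as follows

$$WCEC^{L2}(Map_j) = \sum_{HRT_i \in T} E_i^{L2}(Map_j)$$

where $E_i^{L2}(Map_j)$ is the L2 cache energy consumed by $HRT_i$ when the bank-to-core mapping is $Map_j$. The energy dissipation of the L2 cache comprises dynamic energy and static energy, 6

$$E_i^{L2}(Map_j) = E_i^{dyn} + E_i^{sta}(Map_j) \tag{4}$$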
In equation (4), $E_i^{dyn}$ and $E_i^{sta}(Map_j)$ are the dynamic energy and static energy of the L2 cache consumed by $HRT_i$, respectively. The dynamic energy dissipation $E_i^{dyn}$ originates from cache hits and cache misses

$$E_i^{dyn} = N_i^{hit} \cdot E_{hit} + N_i^{miss} \cdot E_{miss}$$

where $N_i^{hit}$ and $N_i^{miss}$ are the numbers of L2 cache hits and misses of $HRT_i$, respectively. $E_{hit}$ represents the cache access energy of an L2 cache hit. $E_{miss}$ denotes the energy dissipation of an L2 cache miss and is calculated as

$$E_{miss} = E_{memaccess} + E_{blockfill}$$

where $E_{memaccess}$ is the energy dissipation for accessing the off-chip memory and $E_{blockfill}$ is the energy dissipation for filling the fetched data into the L2 cache. The static energy of the L2 cache consumed by $HRT_i$ can be calculated as

$$E_i^{sta}(Map_j) = P_i^{sta} \cdot WCET_i(Map_j)$$

In equations (9) and (10), $P_i^{sta}$ is the static cache power caused by $HRT_i$. We use $S_i$ to represent the number of columns demanded by $HRT_i$. Let $P_{sta}(N_{bank} \cdot N_{column})$ be the total static power of the L2 cache when its capacity is $N_{bank} \cdot N_{column}$ columns. Each HRT exclusively uses its allocated columns, and the unused columns can be turned off to reduce power consumption. Thus, the static power consumed by $HRT_i$ can be defined as

$$P_i^{sta} = \frac{S_i}{N_{bank} \cdot N_{column}} \cdot P_{sta}(N_{bank} \cdot N_{column})$$

The value of $P_{sta}(N_{bank} \cdot N_{column})$ for a given cache capacity $N_{bank} \cdot N_{column}$, as well as $E_{hit}$ and $E_{blockfill}$, can be obtained using simulation tools like CACTI. 22 The value of $E_{memaccess}$ can be obtained from the memory specification. 7
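The energy model of this section can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes that a miss costs the off-chip access plus the block fill, that static power scales linearly with the number of allocated columns, and that static energy accrues over the task's WCET. The function name `l2_energy` and all numeric values are hypothetical; real parameters would come from CACTI and the memory specification.

```python
def l2_energy(n_hit, n_miss, e_hit, e_memaccess, e_blockfill,
              p_sta_total, s_i, n_bank, n_column, wcet):
    """L2 energy of one HRT: dynamic part plus static part.

    Energies are in nJ, p_sta_total in nJ per cycle, wcet in cycles.
    """
    e_miss = e_memaccess + e_blockfill               # energy of one L2 miss
    e_dyn = n_hit * e_hit + n_miss * e_miss          # dynamic energy
    p_sta_i = p_sta_total * s_i / (n_bank * n_column)  # share of static power (assumed linear)
    e_sta = p_sta_i * wcet                           # static energy over the WCET
    return e_dyn + e_sta


# Hypothetical numbers: 100 hits, 10 misses, 8 of 32 columns, WCET of 1000 cycles
energy = l2_energy(n_hit=100, n_miss=10, e_hit=0.5, e_memaccess=5.0,
                   e_blockfill=1.0, p_sta_total=3.2, s_i=8,
                   n_bank=4, n_column=8, wcet=1000)
print(energy)
```

Because the static term grows with the WCET, any reduction in interference delay lowers the energy bound even when the hit and miss counts stay the same.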
Bank-level interference analysis and WCEC computation
In this section, we first give an example to explain how our interference analysis works. Next, we provide analytical formulas for calculating the conflict delay and waiting time of a request. Finally, we present a fine-grained approach to compute interference delay based on request timing, which is the basis of bank-to-core mapping.
Example for interference analysis
In this example, as shown in Figure 2, we assume a four-core system with $L_B$ of 2 cycles and $L_M$ of 4 cycles. Cores $C_1$, $C_2$, and $C_3$ share bank $B_1$, and core $C_4$ exclusively uses bank $B_2$. Each HRT has only two cache requests, which are issued at cycles 1 and 9, respectively. In the UBD method, the UBD is defined by $\max((N_{core} - 1) \cdot L_M, (N_{core} - 1) \cdot L_B)$ in IABA; therefore, the conflict delay that each HRT suffers is $((4 - 1) \cdot L_M) \cdot 2 = 24$ cycles, where 2 is the number of cache requests issued by each HRT. However, at cycle 1, the first request issued by $HRT_1$ can access the bus immediately since $C_1$ is the first core to be served, and no other request is accessing $B_1$, so the first request of $HRT_1$ can access $B_1$ at cycle 3. In other words, the first request issued by $HRT_1$ suffers neither bus access delay nor bank conflict delay. The first request from $HRT_2$ can access the bus at cycle 5 due to the bus access conflict and bank conflict with the first request of $HRT_1$; its bus access delay and bank conflict delay are 2 cycles each. Similarly, the conflict delay suffered by the first request of $HRT_3$ is 8 cycles, of which the bus access delay is 6 cycles and the bank conflict delay is 2 cycles. The first request of $HRT_4$, however, accesses a different bank; therefore, it only suffers from the bus access conflict, and its bus access delay is 10 cycles.
The second round of the XCBA starts at cycle 17, since all requests in the first round of the XCBA have finished by this time. The second request of $HRT_1$ is granted access to the bus at cycle 17; hence, its bus waiting time in the ICBA is $17 - 9 = 8$ cycles, and its bus access delay and bank conflict delay in the XCBA are 0 cycles. For the second request issued by $HRT_2$, the first request of $HRT_2$ has not completed at cycle 9, so the bank conflict delay suffered by the first request overlaps in time with the bus waiting time suffered by the second request. In this case, the non-overlapping bus waiting time suffered by the second request is $17 - 11 = 6$ cycles, since the first request of $HRT_2$ completes at cycle 11, and the bus access delay and bank conflict delay suffered by the second request of $HRT_2$ in the second round of the XCBA are 2 cycles each. So, the interference delay of the second request of $HRT_2$ is $6 + 2 + 2 = 10$ cycles. Similarly, the interference delays of the second requests of $HRT_3$ and $HRT_4$ are 10 cycles each. Based on the above analysis, we can conclude that the total interference delays suffered by the four HRTs are 8, 14, 18, and 20 cycles, respectively. Clearly, these interference delays are less than the 24-cycle estimate of the UBD method.
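The example can be replayed with a small simulation. The sketch below is an illustration, not the authors' tool: it assumes that the bus becomes free $L_B$ cycles after each grant, that a request holds its bank until $L_B + L_M$ cycles after its grant, and that a new XCBA round starts when the last request of the previous round finishes. The function names `xcba_round` and `total_interference` are hypothetical.

```python
L_B = 2   # bus access latency (cycles), from the example
L_M = 4   # bank access latency (cycles), from the example


def xcba_round(ready, bank_of, start):
    """Serve at most one request per core, in round-robin order.

    Returns the per-core conflict delay (bus + bank), the per-core finish
    time (None if the core had no ready request) and the end of the round.
    """
    n = len(ready)
    delay = [0] * n
    finish = [None] * n
    bus_free = start                      # earliest cycle the bus can be granted
    for i in range(n):
        if not ready[i]:
            continue
        dbus = bus_free - start           # bus access delay
        at_bank = start + dbus + L_B      # cycle the request would reach its bank
        busy_until = max((finish[k] for k in range(i)
                          if finish[k] is not None and bank_of[k] == bank_of[i]),
                         default=at_bank)
        dbank = max(0, busy_until - at_bank)   # bank conflict delay
        grant = start + dbus + dbank
        bus_free = grant + L_B
        finish[i] = grant + L_B + L_M
        delay[i] = dbus + dbank
    done = [f for f in finish if f is not None]
    return delay, finish, max(done, default=start)


def total_interference(arrivals, bank_of):
    """arrivals[i] is the FIFO list of request arrival cycles of core i."""
    n = len(arrivals)
    nxt = [0] * n              # index of the next pending request per core
    last_finish = [0] * n      # finish time of each core's previous request
    total = [0] * n
    start = min(a[0] for a in arrivals if a)
    while any(nxt[i] < len(arrivals[i]) for i in range(n)):
        ready = [nxt[i] < len(arrivals[i]) and arrivals[i][nxt[i]] <= start
                 for i in range(n)]
        if not any(ready):     # idle bus: jump to the next arrival
            start = min(arrivals[i][nxt[i]] for i in range(n)
                        if nxt[i] < len(arrivals[i]))
            continue
        delay, finish, end = xcba_round(ready, bank_of, start)
        for i in range(n):
            if ready[i]:
                arrival = arrivals[i][nxt[i]]
                # non-overlapping waiting time in the ICBA
                wait = start - max(arrival, last_finish[i])
                total[i] += wait + delay[i]
                last_finish[i] = finish[i]
                nxt[i] += 1
        start = end
    return total


arrivals = [[1, 9] for _ in range(4)]   # each HRT issues requests at cycles 1 and 9
bank_of = [1, 1, 1, 2]                  # C1..C3 share B1, C4 uses B2
totals = total_interference(arrivals, bank_of)
ubd = max((4 - 1) * L_M, (4 - 1) * L_B) * 2
print(totals, ubd)   # [8, 14, 18, 20] 24
```

Under these assumptions the simulation reproduces the per-HRT totals of 8, 14, 18, and 20 cycles from the example, against the 24 cycles charged to every HRT by the UBD method.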
Analyzing conflict delay and waiting time
From the above example, we can see that it is necessary to analyze the conflict delay in the XCBA and the waiting time in the ICBAs to estimate the WCEC of HRTs accurately. Suppose that a request $rq_j$ from core $C_j$ ($\in C$) tries to access bank $B_k$ and arrives at the bus at cycle $Tarr_j$. If $Tarr_j$ is earlier than the start time $XCBA_{sta}$ of the current round of the XCBA, $rq_j$ has to stall in the ICBA until it is forwarded to the XCBA (which sends it to the bus) at cycle $XCBA_{sta}$. In the XCBA, $rq_j$ may encounter bus access interference and bank conflict, which depend on the requests that precede $rq_j$ in the current round of the XCBA. Let the request $rq_p$ from core $C_p$ ($\in C$) be the request immediately preceding $rq_j$ in the current round of the XCBA and $Tacc_p$ be the time at which $rq_p$ is granted access to the bus. The bus access delay that $rq_j$ suffers can be computed by the following expression

$$dbus_j = \begin{cases} Tacc_p + L_B - XCBA_{sta}, & \text{if } rq_p \text{ exists} \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

Let request $rq_k$ from core $C_k$ ($\in C$) be the previous request to access bank $B_k$ before $rq_j$ in the current round of the XCBA, and let $Tacc_k$ be the time at which $rq_k$ is granted access to the bus. The finish time of the access of $rq_k$ to bank $B_k$ is $Tacc_k + L_M + L_B$, and the start time of the access of $rq_j$ to bank $B_k$ is $XCBA_{sta} + dbus_j + L_B$. If $XCBA_{sta} + dbus_j + L_B$ is earlier than $Tacc_k + L_M + L_B$, the request $rq_j$ suffers from bank conflict, and the bank conflict delay is $(Tacc_k + L_M + L_B) - (XCBA_{sta} + dbus_j + L_B)$; otherwise, $rq_j$ does not suffer from bank conflict, that is, the bank conflict delay is 0. So, the bank conflict delay that $rq_j$ suffers can be computed by the following expression

$$dbank_j = \max\bigl(0,\; (Tacc_k + L_M + L_B) - (XCBA_{sta} + dbus_j + L_B)\bigr) \tag{13}$$

Based on equations (12) and (13), the total conflict delay that $rq_j$ suffers in the XCBA and the time at which $rq_j$ is granted access to the bus can be expressed, respectively, as

$$dxcba_j = dbus_j + dbank_j \tag{14}$$

$$Tacc_j = XCBA_{sta} + dxcba_j \tag{15}$$
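As a quick check against the four-core example, consider the first request of $HRT_3$. This is a worked sketch under the assumptions that the first XCBA round starts at cycle 1 ($XCBA_{sta} = 1$), that the bus becomes free $L_B$ cycles after the previous grant, and that $L_B = 2$ and $L_M = 4$. The predecessor of this request, both on the bus and on bank $B_1$, is the first request of $HRT_2$, which is granted the bus at cycle 5:

```latex
\begin{align*}
dbus_3  &= Tacc_2 + L_B - XCBA_{sta} = 5 + 2 - 1 = 6,\\
dbank_3 &= \max\bigl(0,\; (Tacc_2 + L_M + L_B) - (XCBA_{sta} + dbus_3 + L_B)\bigr)
         = \max(0,\; 11 - 9) = 2,\\
dxcba_3 &= dbus_3 + dbank_3 = 8 .
\end{align*}
```

These values match the 6-cycle bus access delay and 2-cycle bank conflict delay reported for $HRT_3$ in the example.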
Algorithm 1 outlines the calculation of the conflict delay suffered by a request in one XCBA round. The algorithm takes as input the current request of each core to access the XCBA, the start time of the current XCBA round, and a bank-to-core mapping. T_arr[i] holds the time at which request rq[i] is ready to access the bus. used[i] indicates whether the request can be handled in the current XCBA round; if the request is handled, used[i] is set to true. T_pre keeps track of the start time of the previous request's bank access, which is used for computing the bus access delay incurred by the request. In line 1, T_pre is initialized with the start time of the current round of the XCBA, because the first request in each XCBA round does not suffer from bus access delay. Line 5 checks whether a request can be handled in the current XCBA round. The bus access delay of the request is computed in line 6. The bank conflict delay of the request is initialized to 0 in line 8, and the bank conflict delay suffered by the request is then computed in lines 9-16. Line 10 uses a procedure IsSameBank() that determines whether two requests access the same bank based on BtoCmapping[][] and their own addresses. The time at which a request finishes its bank access is computed in line 18. The finish time of the current round of the XCBA is computed in line 19.
As discussed earlier, each ICBA schedules the requests from its core under a FIFO policy to access the XCBA. The delay (the bus waiting time) suffered by a request in the ICBA is the interval between the time the request reaches the ICBA and the time it is selected to be sent to the XCBA. Suppose that rq[j-1] and rq[j] are two requests from the same core, where rq[j-1] is granted access to the bus in the previous XCBA round and rq[j] is granted access to the bus in the current XCBA round, and $XCBA_{sta}$ is the start time of the current round of the XCBA. Since rq[j] is granted access to the bus in the current XCBA round, $Tarr_j$ is less than $XCBA_{sta}$. If $Tarr_j$ is later than $T_{acc[j-1]}$, where $T_{acc[j-1]}$ is the finish time of rq[j-1], the waiting time suffered by rq[j] in the ICBA is $XCBA_{sta} - Tarr_j$. Otherwise, the service of rq[j-1] overlaps in time with the waiting time of rq[j], and the waiting time suffered by rq[j] is $XCBA_{sta} - T_{acc[j-1]}$. The non-overlapping waiting time suffered by rq[j] can be computed by

$$dicba_j = XCBA_{sta} - \max(Tarr_j,\; T_{acc[j-1]}) \tag{16}$$
Algorithm 1: Calculating the conflict delay within one XCBA round.
Input: rq[i], XCBA_sta, BtoCmapping[][]
Output: Bus access delay suffered by request rq[i] (dbus[i]), bank conflict delay suffered by rq[i] (dbank[i]), the finish time of rq[i] to access the bus (T_acc[i]), the finish time of the current XCBA round (XCBA_fin)
1: T_pre = XCBA_sta;
2: XCBA_fin = XCBA_sta;
3: for i = 1; i <= N_core; i++ do
4:   used[i] = false;
5:   if T_arr[i] == XCBA_sta then
6:     dbus[i] = T_pre - XCBA_sta;
7:     used[i] = true;
8:     dbank[i] = 0;
9:     for k = i - 1; k >= 1; k-- do
10:      if used[k] == true and IsSameBank(rq[i], rq[k]) then
11:        if dbus[k] + dbank[k] + L_M > dbus[i] then

Based on the above analysis, we develop an algorithm to compute the total interference delays suffered by each HRT in the XCBA and ICBAs. Algorithm 2 presents the details of the algorithm. ispop[k] indicates whether the request can be fetched from $QR_i$, and C_fin[k] is the finish time of request rq[j]. In line 3, the sequence of L2 cache hits, $QR_i^{L2hit}$, is obtained from $QR_i$, which is determined by the static timing analysis tool. In line 9, we pop the request from $QR_i^{L2hit}$ and update $QR_i^{L2hit}$. According to equation (16), the total waiting time of a HRT in the ICBA is computed in line 16. In line 18, Algorithm 1 is called to compute the interference delay in one schedule round. The total interference delay is computed in line 19. The start time of the current round is computed in line 24.
Computing the WCEC of HRTs
Algorithm 3 estimates the WCEC of HRTs. Lines 2-19 analyze all job instances of each HRT in one hyper-period. For each job instance of a HRT, we first call Algorithm 2 to estimate the total conflict delay and waiting time suffered by it (line 5), and then compute its WCET estimate based on equation (1). Line 7 checks whether this job instance meets its timing constraint. If it does not, the job instance is not schedulable, and we set the WCEC of the job to infinity. Otherwise, we compute the energy consumption of this job instance in lines 10-14. The total energy consumption of the HRTs is computed in line 16.
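The structure of this per-job analysis can be sketched as follows. It is a simplified illustration, not the paper's Algorithm 3: the interference analysis and the energy model are passed in as callables (`interference` and `job_energy`, both hypothetical names), the task fields are assumptions, and every task is assumed to release its first job at time 0.

```python
from math import inf


def wcec_over_hyperperiod(tasks, hyperperiod, interference, job_energy):
    """Sum the worst-case energy of every job released in one hyperperiod.

    interference(task, release) -> (dx, di): XCBA conflict delay and ICBA wait.
    job_energy(task, wcet) -> energy of one job with the given WCET.
    """
    total = 0.0
    for task in tasks:
        for release in range(0, hyperperiod, task["period"]):
            dx, di = interference(task, release)
            wcet = task["wcet_fixed"] + dx + di      # fixed time plus interference
            if wcet > task["deadline"]:
                total += inf                         # unschedulable job
            else:
                total += job_energy(task, wcet)
    return total


# Hypothetical task set with hyperperiod 20
tasks = [
    {"period": 10, "deadline": 10, "wcet_fixed": 2},
    {"period": 20, "deadline": 20, "wcet_fixed": 2},
]
feasible = wcec_over_hyperperiod(tasks, 20, lambda t, r: (1, 1), lambda t, w: 2.0 * w)
infeasible = wcec_over_hyperperiod(tasks, 20, lambda t, r: (10, 10), lambda t, w: 2.0 * w)
print(feasible, infeasible)
```

Returning infinity for a task set with any unschedulable job lets a later search over mappings discard such mappings by simple comparison.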
Bank mapping optimization for WCEC reduction

Problem formulation
In this section, we present our bank-to-core mapping algorithm to optimize bank-level interference and improve the WCEC. This optimization problem can be formally defined as

$$\min_{Map_j \in M} \; \sum_{HRT_i \in T} \; \sum_{k=1}^{H / P_i} E_{ik}^{L2}(Map_j)$$

where $H$ denotes the hyper-period of all HRTs, and $E_{ik}^{L2}(Map_j)$ is the L2 cache energy consumed by the $k$th job instance of $HRT_i$ within one hyper-period when the bank-to-core mapping is $Map_j$.
The optimization problem is subject to the following constraints, where $x_{ik}$ denotes whether $B_k$ ($\in B$) has columns mapped to $HRT_i$ ($\in T$). If $B_k$ has columns mapped to $HRT_i$, $x_{ik} = 1$; otherwise, $x_{ik} = 0$. $ncol_{ik}$ represents the number of columns of $B_k$ mapped to $HRT_i$. If $x_{ik} = 1$, $ncol_{ik} > 0$; otherwise, $ncol_{ik} = 0$.

Algorithm 3
Input: N_core, N_task, H, BtoCmapping[][], QR_i, D_i, P_i (HRT_i in T)
Output: The WCEC of all HRTs, Total energy
1: Obtain N_i^L2hit and N_i^L2miss from QR_i (HRT_i in T);
2: for (k = 0; k < H; k++) do
3:   for (i = 1; i <= N_task; i++) do
4:     if k mod (P_i) == 0 then
5:       Call algorithm 2 to compute the interference delay DX[i], DI[i] suffered by current job instance of HRT_i;
7:       if WCET[i] > D_i then
8:         WCEC[i] = Infinity;
9:       else
10:        Obtain N_i^L2hit and N_i^L2miss from QR_i;

Let $S_i$ denote the number of cache columns required by $HRT_i$; the number of cache columns allocated to $HRT_i$ must be equal to $S_i$ for each bank-to-core mapping. Thus, the following constraint must be satisfied

$$\sum_{B_k \in B} ncol_{ik} = S_i, \quad \forall HRT_i \in T$$
The number of cache columns allocated to all HRTs must be less than or equal to the total number of cache columns in the multicore system. Thus

$$\sum_{HRT_i \in T} S_i \le N_{bank} \cdot N_{column}$$

In bank-column cache partitioning, a column cannot be shared between any two HRTs. In other words, the number of columns of one bank allocated to HRTs must be less than or equal to the capacity of one bank, that is

$$\sum_{HRT_i \in T} ncol_{ik} \le N_{column}, \quad \forall B_k \in B$$

For each $Map_j$ ($\in M$), all HRTs must complete before their deadlines. Therefore, for each job instance of $HRT_i$, the following constraint must be met

$$WCET_{ik}(Map_j) \le D_i$$

where $WCET_{ik}(Map_j)$ represents the WCET of the $k$th job instance of $HRT_i$ within one hyper-period when the bank-to-core mapping is $Map_j$. According to equation (1), we can estimate $WCET_{ik}(Map_j)$ by the following equation

$$WCET_{ik}(Map_j) = WCET_i^{fixed} + DX_{ik}(Map_j) + DI_{ik}(Map_j) \tag{23}$$

In equation (23), $DX_{ik}(Map_j)$ and $DI_{ik}(Map_j)$ are the total delays of the $k$th job instance of $HRT_i$ in the XCBA and ICBA, respectively.
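The constraints of this subsection can be checked mechanically for a candidate mapping. The sketch below is an illustration under assumptions: the mapping is stored as a matrix `ncol[i][k]` (columns of bank k given to task i), the per-job WCETs are assumed to be already computed, and the function name `mapping_is_feasible` is hypothetical.

```python
def mapping_is_feasible(ncol, demand, n_column, wcet, deadline):
    """Check a bank-to-core mapping against the four constraints.

    ncol[i][k]  : columns of bank k allocated to task i
    demand[i]   : columns required by task i (S_i)
    wcet[i]     : list of WCETs of the job instances of task i
    deadline[i] : deadline of task i (D_i)
    """
    n_bank = len(ncol[0])
    # every task receives exactly the columns it demands
    if any(sum(row) != s for row, s in zip(ncol, demand)):
        return False
    # the total demand fits into the whole L2 cache
    if sum(demand) > n_bank * n_column:
        return False
    # no bank hands out more columns than it has
    if any(sum(row[k] for row in ncol) > n_column for k in range(n_bank)):
        return False
    # every job instance meets its deadline
    return all(w <= d for jobs, d in zip(wcet, deadline) for w in jobs)


# Hypothetical instance: 3 tasks, 2 banks of 4 columns each
demand = [4, 2, 2]
wcet = [[5, 6], [7], [3]]
deadline = [10, 10, 10]
ok = mapping_is_feasible([[4, 0], [0, 2], [0, 2]], demand, 4, wcet, deadline)
overfull = mapping_is_feasible([[4, 0], [2, 0], [0, 2]], demand, 4, wcet, deadline)
print(ok, overfull)
```

In the second call bank 0 would have to supply six columns, which violates the bank capacity constraint.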
Algorithm for bank-to-core mapping
We now present the bank-to-core mapping for optimizing the WCEC of HRTs. Intuitively, bank conflict can be fully eliminated by exclusively mapping a task's instructions and data to a specific bank. But doing so requires as many banks as the number of HRTs in the system. When the number of banks is less than the number of HRTs, a method is needed that efficiently utilizes the shared cache space while minimizing the energy consumption. In this article, we optimize the bank-to-core mapping for WCEC reduction. Bank-to-core mapping can be divided into three cases.
Case 1: $\sum_{\forall C_i \in C} \lceil S_i / N_{column} \rceil \le N_{bank}$. In this case, we exclusively allocate $\lceil S_i / N_{column} \rceil$ banks to the core $C_i$.
Case 2: $\sum_{\forall C_i \in C} \lceil S_i / N_{column} \rceil > N_{bank}$ and $\sum_{\forall C_i \in C} S_i > (N_{bank} - 1) \cdot N_{column}$. The process of making the bank-to-core mapping can be described as follows. (1) We make the bank-to-core mapping according to a core queue: we first allocate columns for the first core in the core queue, then for the second core, and so on. (2) We first allocate the columns of bank $B_1$ to cores, then the columns of bank $B_2$, and so on.
Case 3: … ⌊(N_bank · N_column …⌋ columns for the rest banks that do not take part in bank mapping, and then apply the method of Case 2 to make the bank-to-core mapping.
For Case 1, we eliminate the bank conflict, and the WCEC of the system is minimized. For Case 2 and Case 3, we develop an algorithm to find the bank-to-core mapping with minimal WCEC when the number of columns allocated to $HRT_i$ ($\in T$) is $S_i$. Algorithm 4 shows the details of the bank-to-core mapping, which is based on a recursive strategy. In Algorithm 4, line 1 initializes the WCEC of the system, MinWCEC, to infinity and initializes the decision variables used[] to false for the recursive calls. Line 2 computes the number of columns on each bank that may be assigned to HRTs; these columns keep working until the WCET of the HRT, while the other columns on each bank can be shut down to save energy. A recursive function FindBestMapping() is defined in lines 3-34 to search for the optimal bank-to-core mapping in the solution space. Lines 6-20 generate the bank-to-core mapping BtoCmapping[][] based on the core queue c_seq[]. Then, based on the bank mapping BtoCmapping[][], we call Algorithm 3 to calculate the interference delays of each HRT in line 21. The best bank mappings are saved in lines 22-25. The recursion of the algorithm is carried out in lines 26-33. In line 28, a core queue is generated during the recursive walk, and the generated core queue is stored in the array c_seq[].
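The queue-based search can be sketched as follows. This is a simplified illustration and not the paper's Algorithm 4: it enumerates core queues with `itertools.permutations` instead of explicit recursion, it assumes the per-bank column budget is the total demand divided evenly over the banks and rounded up, and it replaces the WCEC evaluation with a caller-supplied `cost` function. The names `fill_banks`, `best_mapping`, and `shared_pairs` are hypothetical, and `shared_pairs` is only a toy proxy for bank conflict.

```python
from itertools import permutations
from math import ceil, inf


def fill_banks(core_order, demand, n_bank, per_bank):
    """Allocate columns bank by bank, following the core queue."""
    mapping = {c: [0] * n_bank for c in core_order}
    bank, free = 0, per_bank
    for c in core_order:
        need = demand[c]
        while need > 0:
            if free == 0:            # current bank is full: move to the next one
                bank += 1
                free = per_bank
            take = min(need, free)
            mapping[c][bank] += take
            need -= take
            free -= take
    return mapping


def best_mapping(demand, n_bank, n_column, cost):
    """Try every core queue and keep the mapping with the lowest cost."""
    per_bank = ceil(sum(demand.values()) / n_bank)   # assumed per-bank budget
    if per_bank > n_column:
        raise ValueError("demand exceeds cache capacity")
    best, best_cost = None, inf
    for order in permutations(demand):
        mapping = fill_banks(order, demand, n_bank, per_bank)
        c = cost(mapping)
        if c < best_cost:
            best, best_cost = mapping, c
    return best, best_cost


def shared_pairs(mapping):
    """Toy cost: number of core pairs that share a bank."""
    n_bank = len(next(iter(mapping.values())))
    total = 0
    for b in range(n_bank):
        users = sum(1 for cols in mapping.values() if cols[b] > 0)
        total += users * (users - 1) // 2
    return total


# Hypothetical demands (columns per core), 2 banks of 4 columns each
demand = {"C1": 4, "C2": 2, "C3": 2}
mapping, cost_value = best_mapping(demand, n_bank=2, n_column=4, cost=shared_pairs)
print(mapping, cost_value)
```

The number of core queues grows factorially with the number of cores, so exhaustive enumeration of this kind is only practical for small core counts such as the six-core target used in the evaluation.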
Evaluation
In this section, we evaluate the effectiveness of interference analysis and bank-to-core mapping approach on energy saving. Before the results are presented, we first introduce the experiment setup.
Experimental setup
We assume our target architecture has six cores, each with an in-order, five-stage pipeline. The CPU clock speed is 500 MHz. The instruction fetch queue size is 4, the fetch width is 2, and the instruction window size is 8. The private L1 instruction and data caches are set to 128 B (1 bank, 2-way associativity, 16-byte line, and 1-cycle access latency). The L2 cache is shared among all cores and is 8 KB with 4 banks, 4-way associativity, 32-byte line, and 4-cycle access latency ($L_M$). Each bank is 2 KB and comprises 8 columns, each of which is 128 B. The real-time bus applies the IABA policy, and its access latency ($L_B$) is 2 cycles per request. The main memory is set to 4 MKB, 8 banks, and 30-cycle access latency. The energy parameters of the L2 cache are generated by CACTI, 23 an integrated cache leakage power model developed by HP.
The WCET analysis of the tasks is built on top of the open-source timing analysis tool Chronos. 21 Chronos is originally a single-core WCET analysis tool, and we extended it by adding support for the IABA model. The task sets used in our experiment are shown in Table 1. All tasks are from the Malardalen WCET benchmarks, 23 which are compiled with a GCC cross-compiler for a MIPS-like instruction set. 24 The mapping of tasks to cores is given in the third column of Table 1. To obtain the demanded column amount for each task, we use Chronos to measure the WCET of each task while varying the L2 cache size from 1 to 32 columns. The resulting demanded column amount for each task is listed in the fourth column of Table 1.
Experimental results
Based on the above experimental setup, we conduct four experiments. In the first experiment, the bank-to-core mapping is given in advance, and we show the impact of the bank-level interference on WCEC. The second experiment evaluates the impact of the bank-to-core mapping on WCEC saving. The third experiment compares our interference analysis approach with the UBD approach. The final experiment investigates the effect of the timing constraint on our approach.
Algorithm 4: bank mapping for optimizing WCEC
Input: $N_{core}$, $QR_i$, $D_i$, $P_i$ ($HRT_i \in T$)
Output: The minimal WCEC of the system MinWCEC, corresponding best bank-to-core mapping BestMap[][]
1: MinWCEC = Infinity, used[j] = false ($1 \le j \le N_{core}$);
2: $NT_{column} = \lceil \sum_{\forall HRT_i \in T} S_i \bmod N_{bank} \rceil$;
3: function FindBestMapping(N)
4:   if $N > N_{core}$ then
5:     ncol = $NT_{column}$; nbank = 1;
6:     for each core $C_i$ in c_seq[] do
7:       if $S_i \ge ncol$ then
8:         while $S_i \ge ncol$ do
9:           BtoCmapping[i][nbank] = ncol;
10: ...

Impact of the interference on WCEC. To quantify the impact of interference on WCEC, we assume that the bank-to-core mapping is given as shown in Table 2 and that the deadline of $HRT_i$ is equal to its period; we then compute the WCEC of each task with and without the interference delay and compare the two. Figure 3 shows the comparison of WCEC for these two scenarios, where the WCEC of each task without interference delay is normalized to 100%. We can see that the WCEC of all tasks in both task sets increases, to different degrees. For example, the WCEC of cnt increases by 11.3%, and the WCEC of prime increases by 11.7%. The only exception is fibcall, whose WCEC increases by only 4.3%. This is because fibcall is a compute-intensive application with few L2 cache accesses. To sum up, the WCEC of the tasks in the two task sets increases by 10.27% and 11.38% on average, respectively. This shows that ignoring the impact of interference on WCEC may lead to unsafe results when the analysis is relied on to guarantee system behavior within a given energy budget.
Impact of bank-to-core mapping on WCEC. The choice of bank-to-core mapping affects the WCEC of HRTs. Using task set 2 in Table 1 as an example, Figure 4 shows the total WCEC of HRTs across the solution space of bank-to-core mappings. The solution space contains 720 mappings, and the total WCEC of HRTs varies from 49076.36 to 52904.4 nJ. The difference in L2 cache energy consumption between the best and the worst bank mapping is 3828.04 nJ; that is, the bank-to-core mapping yields a 7.23% WCEC reduction. This is mainly because tasks suffer different interference delays under different bank-to-core mappings, and these interference delays affect the static energy consumption of the L2 cache.
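The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that each mapping corresponds to one ordering of the six cores, which is consistent with the core-queue search described earlier and with the reported solution-space size; the variable names are illustrative only.

```python
import math

n_cores = 6
# Assumption: one candidate mapping per core ordering, i.e. 6! orderings.
n_mappings = math.factorial(n_cores)

best_nj = 49076.36   # total WCEC under the best mapping (nJ), from the text
worst_nj = 52904.4   # total WCEC under the worst mapping (nJ), from the text

saving_nj = worst_nj - best_nj
saving_pct = 100 * saving_nj / worst_nj  # reduction relative to the worst mapping

print(n_mappings, round(saving_nj, 2), saving_pct)
```

The reduction is expressed relative to the worst mapping, which is the reading under which the reported difference and percentage agree.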
Table 2. A bank-to-core mapping.

Our interference analysis approach versus UBD approach. In this experiment, we focus on the effectiveness of the interference analysis approach; the bank-to-core mapping is based on Table 2. We compare our approach with the UBD approach, where the UBD can be expressed as $UBD = (N_{core} - 1) \cdot \max(L_B, L_M)$. Figure 5 shows the WCEC of our approach normalized with respect to UBD for the two task sets. One of the bank-to-core mappings with the minimum WCEC is shown in Table 3. As can be seen, for most of the HRTs in the two task sets, our approach significantly improves the WCEC of each HRT. For example, our approach achieves up to 19.72% WCEC reduction for cnt, 22.64% for bsort100, and 21% for matmult over UBD. In summary, our approach achieves on average 14.4% and 14.1% WCEC reduction for the two task sets, respectively, compared to the UBD approach. Taken together, the two task sets reach a 14.25% WCEC reduction on average. In addition, to assess the tightness of the proposed interference analysis approach, we compare the estimated WCEC with the WCEC observed through simulation. We have extended the SimpleScalar toolset 24 to facilitate our experimental evaluation. Figure 6 compares our approach with the simulation results. In Figure 6, we can see that for a number of benchmarks, such as fir, the proposed approach obtains a very tight WCEC, which is within 3.56% of the observed simulation result. On average, the WCEC estimated by our approach is 5.31% higher than the WCEC observed through simulation.
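As a worked instance of the UBD expression, the sketch below plugs in the parameters from the experimental setup (six cores, a 2-cycle bus latency, a 4-cycle L2 latency, and a 500 MHz clock). The function name `ubd_cycles` is hypothetical, and reading the bound as a per-access delay is an assumption rather than something stated in the text.

```python
def ubd_cycles(n_core: int, l_bus: int, l_mem: int) -> int:
    """Upper-bound delay: every other core is assumed to be served first,
    each for the longer of the bus latency and the L2 access latency."""
    return (n_core - 1) * max(l_bus, l_mem)


cycles = ubd_cycles(n_core=6, l_bus=2, l_mem=4)

clock_hz = 500e6                      # 500 MHz core clock from the setup
delay_ns = cycles * 1e9 / clock_hz    # convert cycles to nanoseconds

print(cycles, delay_ns)
```

Because this bound charges the full delay regardless of which banks the other cores actually touch, it is pessimistic whenever the mapping keeps cores on separate banks, which is where a bank-level analysis can tighten the estimate.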
Deadline effect. The final experiment shows the effect of the deadline on WCEC. Using task set 2 in Table 1 as an example, we vary the deadline of each $HRT_i$ from $1.0 \times period_i$ ms to $0.82 \times period_i$ ms in steps of $0.03 \times period_i$ ms (there is no solution for deadlines shorter than $0.82 \times period_i$ ms). Figure 7 shows the results for both our approach and the UBD approach. We can observe that our approach finds efficient solutions and consistently outperforms the UBD approach at all deadline levels.
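The deadline sweep can be enumerated as below. This is only an illustration of the stated range and step: the helper name `deadline_factors` and the 10 ms period are hypothetical and are not taken from Table 1.

```python
from typing import List


def deadline_factors(start: float = 1.0, stop: float = 0.82, step: float = 0.03) -> List[float]:
    """Scaling factors applied to each period, from start down to stop inclusive."""
    n_steps = round((start - stop) / step)
    # Round to two decimals to avoid floating-point drift in the factors.
    return [round(start - k * step, 2) for k in range(n_steps + 1)]


factors = deadline_factors()

period_ms = 10.0  # hypothetical period, for illustration only
deadlines_ms = [round(f * period_ms, 2) for f in factors]

print(factors)
print(deadlines_ms)
```

The sweep therefore covers seven deadline levels per task, the tightest being 82% of the period.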
Conclusion and future work
In this article, we have presented an analysis of the bank-level interference (bank conflict and bus access interference) on a multicore platform with a shared cache. Our analysis approach provides a tighter bound on the interference delay, which is crucial for WCEC. Experimental results show that our approach improves the tightness of WCEC by 14.25% on average compared to the UBD approach. Moreover, to reduce the negative impact of bank-level interference and improve WCEC, we propose a bank-to-core mapping and develop an algorithm to find it; the experimental results indicate that bank-to-core mapping yields significant benefits, with a 7.23% WCEC reduction. As multicore architectures are already ubiquitous, interference in shared resources must be mitigated. We believe that our analysis and bank mapping can be effectively used for designing predictable real-time multicore systems for processing various complex jobs, e.g., performance optimization, 12,25,26 learning & classification, 27,28 and content searching. [29] [30] [31] [32] We believe that plenty of future work exists in this field. We plan to (1) extend our techniques to off-chip memory so as to address system-wide energy consumption and (2) explore the effect of hardware prefetchers on cache interference delay.
