Abstract: This paper addresses the problem of balancing the on-chip packet latencies in a chip multi-processor (CMP) that is simultaneously executing multiple applications. Specifically, it presents a balanced application-to-core mapping algorithm that aims to minimize the maximum on-chip packet latency of all running applications. The paper starts by formulating the balanced mapping problem for CMPs and proving its NP-completeness. Next, it presents an efficient heuristic algorithm for solving this problem, which exploits the characteristics of on-chip cache and memory accesses in CMPs and takes into account the workload variations among applications. Simulation results on the PARSEC benchmark suite show that the proposed algorithm lowers the maximum average packet latency of all applications by 11 percent while cutting the standard deviation of on-chip packet latencies by 99 percent. This is achieved with very little overhead in terms of the overall packet latency and power consumption averaged over all packets.
INTRODUCTION
With tens to possibly hundreds of cores integrated in current and future multiprocessor systems-on-chips (MPSoCs) and chip-multiprocessors (CMPs) [1], [2], [3], networks-on-chips (NoCs) have been proposed as the primary shared media for providing high-performance and scalable communication between cores on the chip [4]. As a single application is unlikely to use up all the computing resources on a many-core chip, multiple applications can usually run concurrently on the system. However, due to the planar layout of the cores (e.g., tile-based 2D mesh topology), on-chip access latencies to cache and memory controllers are not necessarily the same for packets initiated from different source locations. It is important to account for this on-chip delay characteristic when mapping applications onto cores to optimize the system performance.
While the issue of application mapping has been receiving increased attention in many-core chip designs, the problem of mapping multiple applications onto CMPs presents several new challenges. Mapping techniques proposed thus far are mainly for MPSoCs [5] , [6] , [7] , [8] , [9] , [10] . These techniques for MPSoCs, unfortunately, cannot be applied directly to CMPs due to their inherent differences. In MPSoCs, the shared cache/memory is clustered into some of the tiles while other tiles contain heterogeneous IP blocks with specific functionalities. In contrast, in typical CMPs, the shared cache is distributed to all tiles, each of which contains a homogeneous general-purpose processor core and only some of the tiles have a memory controller. The task of application mapping in an MPSoC consists of assigning caches, IP cores, and other customized blocks to tiles, whereas the task of application mapping in a CMP consists of assigning the running threads to the fixed and homogeneous physical cores. Consequently, the latency model on which the mapping algorithms are based for MPSoCs no longer holds for CMPs.
Moreover, when mapping multiple applications to CMPs, not only should the overall on-chip latency be reduced as in the single application case, the mapping process should also balance the average packet latencies experienced by different applications. That is, each application should expect minimized on-chip network latency which ishardware and software overheads can be greatly mitigated or even entirely avoided.
Although balancing the on-chip latencies is a desirable and necessary objective, its realization in multi-application mapping faces several major challenges. First, balancing packet latencies among applications may conflict with minimizing overall packet latency of all threads. In other words, mapping methods which have minimization of the overall latency as the sole objective are potentially counter-optimal in terms of latency-balancing across applications, as explained in Section 3. A desired mapping algorithm should achieve both minimized overall latency as well as latency balancing. Second, it is observed that certain core locations have low access latency for one type of traffic (e.g., cache traffic) but have high access latency for another type of traffic (e.g., memory controller traffic, particularly under different memory controller placements). This increases the difficulty of balancing the packet latency of traffic that is initiated even from the same core. Third, significant variations may exist among applications that are being executed concurrently (e.g., different numbers of threads, traffic load rates, memory-to-cache request ratios, etc.), which further complicates the design of an effective latency-balancing algorithm. Fourth, if dynamic application mapping is needed, such as when certain applications are finished much earlier than others, mapping algorithms need to be sufficiently fast so as to accomplish dynamic mapping of new threads onto tiles for utilizing idle cores.
This paper addresses the issue of balancing on-chip packet latencies in multi-application mapping on CMPs. In doing so, we adopt an objective function in the min-max form (i.e., minimizing the maximum of the average packet latencies of all applications). We demonstrate that the min-max form is superior to other objective functions, such as minimizing the standard deviation or maximizing the minimum-to-maximum ratio, since it not only reflects the balance between latencies of different applications but also takes into consideration the overall performance of all the threads. Based on this balancing metric, we formulate the On-chip latency Balancing Mapping (OBM) problem mathematically. It aims at minimizing the maximum of average packet latencies experienced by all applications. This OBM problem, however, is proved to be NP-complete even with cache traffic alone, which precludes a polynomial-time optimal solution unless P = NP.
To solve the OBM problem efficiently, we propose a two-step heuristic, named hOBM. It utilizes the traffic characteristics of NoC-based CMPs and also takes into account the variations among applications. Based on the observation that cache traffic typically dominates on-chip communication, the first step, application-level assignment, assigns tiles to applications such that each application has almost the same average on-chip cache access latency. The second step, fine-tuning, refines the mapping result by swapping tile-to-thread mapping across applications to further minimize the maximum latency of applications. During both steps, important variation information is utilized to generate appropriate orders of applications at different stages, thus increasing the effectiveness of the mapping process and results. The proposed mapping algorithm is evaluated with traces gathered from full-system simulation, and is assessed extensively from various aspects such as average and maximum packet latency, power consumption, algorithm runtime, dynamic application mapping scenarios, different memory controller placements, and scalability.
The main contributions of this work are the following:
We identify and demonstrate the counter-optimality of balancing on-chip latency in traditional mapping methods that aim at minimizing overall average latency;
We formulate the multi-application mapping problem for CMPs which targets both balancing and performance, and prove its NP-completeness;
We propose an efficient heuristic which is applicable to both static and dynamic mapping with awareness of application variations.
BACKGROUND AND MOTIVATION

Related Work
As the number of cores continues to grow with increasing non-uniformity of on-chip latencies within the same chip, the importance of application mapping has been rising rapidly and gaining increasing attention [16]. Prior art on application mapping mostly targets NoC-based MPSoCs. Hu and Marculescu address energy consumption in mapping tasks for tile-based MPSoC architectures [5]. Murali and De Micheli focus on overall latency minimization under minimum routing and traffic splitting for SoCs [6]. Jang and Pan propose mapping solutions for various chip layouts [7]. Singh et al. focus on accelerating algorithms for run-time mapping [8]. Zhu et al. propose mapping algorithms for high-radix NoC topologies [9], and Kang et al. consider the situation of mixed-critical tasks in MPSoCs [10]. These techniques assume MPSoC systems, which have different characteristics from the CMP systems considered in this paper. A few existing works consider application mapping for NoC-based CMPs. Chen et al. present a set of comprehensive mechanisms that optimize the mapping of one application onto CMPs [17]. Das et al. introduce a memory controller traffic-aware application mapping method [18]. In contrast, our work aims to address the multi-application mapping problem in CMPs to optimize the overall NoC performance while simultaneously balancing NoC latencies among applications.
Many techniques have been proposed to provide quality-of-service support for various system components including cache, memory, and on-chip networks [11], [12], [13], [15], [19], [20], [21]. This body of research has very different objectives from NoC latency balancing and is, therefore, orthogonal and complementary to our work. As mentioned in Section 1, the fairness in NoC packet latencies for multiple applications has also become an important aspect of the quality of service at both the user level and the system level. In fact, it is possible to integrate the NoC latency-balancing approach developed in this work with previous mechanisms to further improve the quality of service, which can be investigated in the future.
Many-Core Chip Multiprocessor Architecture
Fig. 1 shows a typical structure of a 64-tile CMP with a mesh-based NoC. Each tile comprises a processing core, a private L1 cache, and a slice (bank) of the shared L2 cache. The shared L2 cache is distributed among all the tiles on the chip. In most commercial CMPs, when a data block is fetched from memory, the L2 cache bank in which to place the block is determined by hashing on the lower-order bits of the data address [22], [23], [24]. Routers are interconnected to form a mesh network, and tiles are connected to routers via network interfaces (NIs). A typical placement of memory controllers attaches one memory controller to each of the four corner tiles (shown as four shaded tiles in Fig. 1) in addition to the regular core/cache structure. There are several other popular placements of on-chip memory controllers (as discussed in Section 3.2).
The basic data access procedure in a NoC-based CMP is as follows. When a processing core has a data request, no network packet is needed if the request hits its private L1 cache. Otherwise, one of the two types of traffic may be generated depending on where the data block is located, namely cache traffic (i.e., the data block is on chip) or memory controller traffic (i.e., the data block needs to be fetched from the off-chip main memory). The cache traffic includes (i) packets initiated at the requesting core destined for an L2 cache bank, (ii) the checking/forwarding packets from the L2 cache bank to other private L1 caches, and (iii) the reply packets from the L2 cache bank to the requesting processor core. In all of these cases, either the source tile or the destination tile is an L2 cache bank. Note that no packet is generated if the destination tile is the same as the source tile. For memory controller traffic, a requesting packet is generated and then forwarded to one of the tiles with memory controllers through the on-chip network (e.g., the four gray tiles in the corners of Fig. 1 ). The packet forwarding typically follows the proximity principle [25] , i.e., the packet is sent to the nearest memory controller tile. Data are then fetched from the main memory and returned to the memory controller after a fixed number of cycles.
Uneven Packet Latencies in Cache and Memory Controller Traffic
As more and more cores are integrated on a chip, the non-uniformity of on-chip packet latencies among different tiles continues to increase, not only within each of the above two traffic types but also between the two traffic types. This is the fundamental cause of the imbalanced latencies among concurrently running applications. To enable further study, this section presents the packet latency models that analyze the phenomena mathematically.
We first introduce the tile numbering rule used throughout this paper. The number of a tile k is determined by
where $i_k$ and $j_k$ are the row number and column number of the tile, respectively, and $n$ is the number of tiles in a row. For example, in Fig. 1 (where $n = 8$), the tile located at the fourth row (from the top), fifth column (from the left) is numbered 29.
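The numbering rule can be illustrated with a short sketch. It assumes row-major numbering with rows and columns counted from 1, a convention chosen here only because it reproduces the tile-29 example; the function names are illustrative.

```python
def tile_number(row, col, n):
    """Tile number for a 1-based (row, col) position on an n-by-n mesh.

    Assumption: row-major numbering starting at 1 in the top-left corner.
    """
    return (row - 1) * n + col


def tile_position(k, n):
    """Inverse of tile_number: 1-based (row, col) of tile k."""
    return (k - 1) // n + 1, (k - 1) % n + 1


# Fourth row, fifth column of the 8x8 mesh in Fig. 1.
print(tile_number(4, 5, 8))   # 29
print(tile_position(29, 8))   # (4, 5)
```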
We calculate the on-chip latency $T(k, k')$ of a packet generated at the $k$th tile and heading for the $k'$th tile on a mesh network as follows, based on [26]:
$$T(k, k') = H(k, k') \cdot (l_r + l_w + l_c) + l_s, \qquad (2)$$
where $H(k, k')$ is the number of hops through which the packet travels. Note that $l_r$, $l_w$, and $l_c$ are the per-hop latencies for the router, wire, and in-network contention, respectively. The serialization latency, $l_s$, is determined by the ratio of the packet length to the channel bandwidth, which is fixed with a given packet format and NoC structure. To avoid deadlocks, dimension-order routing (e.g., XY routing) is adopted to minimize design effort and implementation cost [26].
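A minimal sketch of this latency model follows. It assumes that tiles are given as (row, column) coordinates, that XY routing on a mesh makes the hop count equal to the Manhattan distance, and that the per-hop terms add linearly while serialization is paid once per packet. The function names are illustrative and not taken from [26].

```python
def hops(src, dst):
    """Hop count between two (row, col) tiles under XY routing on a mesh.

    Assumption: minimal dimension-order routing, so the count equals the
    Manhattan distance between the two tiles.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def packet_latency(src, dst, l_r, l_w, l_c, l_s):
    """Per-hop router, wire and contention latency, plus serialization."""
    return hops(src, dst) * (l_r + l_w + l_c) + l_s


# Corner to opposite corner of a 4x4 mesh: 6 hops * (3 + 1 + 0) + 1.
print(packet_latency((0, 0), (3, 3), l_r=3, l_w=1, l_c=0, l_s=1))  # 25
```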
As mentioned, the hashing for the shared L2 cache banks uses the cache index in a physical address, as shown in Fig. 2. Take the 64-core CMP in Fig. 1 as an example. The shared L2 cache is separated into 64 pieces and distributed to each of the tiles. If the size of one data block (i.e., one cache line) in the L2 cache is 32 B, the lowest 5 bits (Bit 0 to Bit 4) are reserved for the block offset. The next lowest 6 bits, Bit 5 to Bit 10, are the cache index used to hash and decide in which of the 64 tiles the block is located. Hence, any consecutive chunk of 64 cache blocks (2 KB in total) is uniformly distributed across all the L2 cache banks. For a typical application running on a CMP, it is reasonable to assume that the destination tile in the cache traffic has statistically the same probability of being any tile on the CMP (including the source tile). Therefore, for a CMP with $N = n^2$ tiles, the average number of hops $H^C_k$ of all cache traffic packets generated at the $k$th tile depends only on its location, where $H^C_k$ is calculated by
$$H^C_k = \frac{1}{N} \sum_{i=1}^{N} H(k, i). \qquad (3)$$
With Equation (2), the average cache traffic latency for packets generated at the $k$th tile, $T^C_k$, is calculated by
The value of $H^C_k$ is smaller for the tiles in the chip center and larger for the tiles in the corners. For example, on the CMP shown in Fig. 1 chip perimeter, as shown in Fig. 3a, where darker areas indicate tiles with larger cache request packet latencies.
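The center-versus-corner trend can be checked numerically. The sketch below assumes uniformly distributed destinations (including the source tile), Manhattan hop counts under XY routing, and 0-based (row, column) coordinates; the function name is illustrative.

```python
def avg_cache_hops(n):
    """Average hop count of cache traffic for every tile of an n-by-n mesh.

    Assumption: every tile, including the source itself, is an equally
    likely destination, and hop counts are Manhattan distances.
    Returns a dict keyed by 0-based (row, col).
    """
    total_tiles = n * n
    table = {}
    for r in range(n):
        for c in range(n):
            total = sum(abs(r - r2) + abs(c - c2)
                        for r2 in range(n) for c2 in range(n))
            table[(r, c)] = total / total_tiles
    return table


h = avg_cache_hops(8)
print(h[(0, 0)], h[(3, 3)])  # 7.0 4.0 (corner versus near-center tile)
```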
For memory controller traffic, the average number of hops, $H^M_k$, is determined by the memory access behavior as well as the memory controller placement. For the popular four-corner memory controller configuration shown in Fig. 1, the chip is divided into four quadrants relative to the center of the chip. All the memory request packets generated by the tiles in one quadrant are sent to the memory controller in that quadrant. Precisely, the average number of hops for a memory controller request packet generated at the $k$th tile can be calculated by
The average memory request latency for packets generated at the $k$th tile, $T^M_k$, is then calculated with Equation (2). As shown in Fig. 3b, with this four-corner memory controller placement, tiles close to the corners have smaller average on-chip latency of memory controller traffic than tiles close to the center. This aspect is different from that of cache traffic, which further complicates the problem of balancing on-chip latency.
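For the four-corner placement, the opposite trend of memory controller traffic can be sketched as follows. The sketch assumes 0-based (row, column) coordinates and that each request travels to the nearest corner, which coincides with the quadrant rule when $n$ is even; it is an illustration under these assumptions, not the exact expression used in the paper.

```python
def mem_hops_four_corner(row, col, n):
    """Hops from a 0-based (row, col) tile to the nearest corner tile.

    Assumption: one memory controller sits on each corner of the n-by-n
    mesh and requests follow the proximity principle.
    """
    return min(row, n - 1 - row) + min(col, n - 1 - col)


n = 8
for row in range(n):
    # Corner entries are 0, the four central tiles are 6.
    print(" ".join(str(mem_hops_four_corner(row, col, n)) for col in range(n)))
```

Under these assumptions a corner tile needs 0 hops to reach its memory controller, whereas the same tile has the largest average cache hop count.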
CHALLENGES
Difficulty in Utilizing Existing Mapping Algorithms
As mentioned, on-chip latency balancing is an important design requirement in CMPs to guarantee quality of service in the case of multiple users, provide uniform on-chip access to the cache and memory system, and eliminate the overhead of hardware support for latency balancing. However, traditional application mapping algorithms, which target minimizing the overall packet latency of all the threads, are potentially counter-optimal in terms of balancing latencies. The primary reason is that, to be most productive towards minimizing the overall packet latency, these algorithms map threads with higher data access rates to tiles with smaller average on-chip latencies, while threads with lower NoC traffic rates are mapped to large-latency tiles. Consequently, the latencies of low traffic-load applications are greatly increased, leading to significant imbalance in per-application average packet latency, or APL for short.
To quantify the imbalance, we apply a mapping algorithm that minimizes the overall packet latency, referred to as Global, to PARSEC 2.0 benchmarks [27] (details of the simulation setup are given in Section 6). Global optimally solves the problem of finding the minimum overall packet latency for all threads, as explained in Section 5.1. Five different configurations (i.e., sets) of applications are tested on an 8×8 mesh network, denoted as C1, C2, C4, C5, and C9.$^1$ C1 and C2 contain four 16-thread (or 16T for short) applications, C4 and C5 have sixteen 4T applications, and C9 has two 4T, one 24T, and one 32T application. Besides Global, we also evaluate Random, the average of a large number (on the order of $10^4$) of random mappings. The Random result represents the expected result achieved by any mapping method.
We compare the mapping results from three aspects, namely (i) the overall average latency of all threads, or g-APL, which is calculated by the sum of all packet latencies divided by the total communication volume, (ii) the maximum APL of all applications, or max-APL (the APL of each application is calculated first, and the largest APL among all applications is the max-APL), and (iii) the standard deviation of the APLs of all applications, or dev-APL. Larger max-APLs or dev-APLs indicate more severe imbalance among the applications. The results of C1, C2, C4, C5, and C9 are listed in Table 1 . Although Global reduces g-APL by 7.32 percent on average compared to Random, the max-APL is increased by 15.62 percent and the dev-APL is about three to seven times that of the random average result. This highlights that Global improves the overall performance at the cost of making the APLs of one or more applications dramatically larger.
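The three metrics can be computed as in the sketch below. The input layout (one list of per-thread `(request_rate, average_latency)` pairs per application) and the use of the population standard deviation for dev-APL are assumptions made for illustration; the text does not state which form of the standard deviation is used.

```python
from statistics import pstdev


def latency_metrics(apps):
    """Return g-APL, max-APL and dev-APL for a list of applications.

    apps: one list per application of (request_rate, avg_latency) pairs,
          one pair per thread (hypothetical input layout).
    """
    apls, total_latency, total_volume = [], 0.0, 0.0
    for threads in apps:
        latency = sum(rate * t for rate, t in threads)
        volume = sum(rate for rate, _ in threads)
        apls.append(latency / volume)      # APL of this application
        total_latency += latency
        total_volume += volume
    return {
        "g_apl": total_latency / total_volume,  # over all threads
        "max_apl": max(apls),
        "dev_apl": pstdev(apls),                # assumption: population form
    }


example = [[(1.0, 10.0), (1.0, 12.0)], [(2.0, 14.0)]]
print(latency_metrics(example))
# {'g_apl': 12.5, 'max_apl': 14.0, 'dev_apl': 1.5}
```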
We show two of the mapping results of Global in Fig. 4 to further elucidate the imbalance issue. All the applications have a small percentage of memory requests (around 10 percent as shown in Fig. 5 ), making cache accesses the dominant factor of on-chip traffic. Application 1 in both C1 and C9 has the lightest cache traffic. As depicted in Fig. 4 , the threads of Application 1 are assigned with tiles close to corners whose cache access on-chip latencies (T These mapping and APL results demonstrate that a mapping algorithm that solely aims at reducing g-APL may intensify imbalance in packet latencies between different applications and, thus, cannot be utilized directly.
Variations in Applications
Yet another challenge in providing balanced mapping results is the potentially large variations among applications that are being executed at the same time. There are several sources of these variations. First, applications may vary in their levels of parallelism, i.e., the number of threads they have. Second, they differ in the average cache and memory access rates (hence, the resulting traffic loads). The average access rate of the threads in an application can be several times larger or smaller than that of another application, as shown in Fig. 5. Third, although cache traffic accounts for the majority of the on-chip traffic, applications may have quite different percentages of memory controller traffic. For example, the ratio of memory controller access rate to cache access rate, referred to as the memory-to-cache ratio hereinafter, of the applications in the PARSEC 2.0 benchmarks can range from 0.108 to 0.607. Fourth, besides the four-corner memory controller placement, several other memory controller placements have also been studied (e.g., [28]), as shown in Fig. 6, which changes the memory controller traffic behavior considerably. All the above factors significantly impact APLs and need to be considered in order to achieve balanced mapping among applications. Furthermore, differences in the runtime of applications place additional requirements on the mapping algorithms. Some applications may finish earlier than others, resulting in some of the tiles on chip becoming idle. In order to utilize these idle tiles, new applications may be introduced to be executed on these tiles, thus requiring mapping algorithms to have a sufficiently low time complexity so as to allow dynamic mapping of these new threads.
1. The numbering of the configurations is nonconsecutive here in order to be consistent with the numbering in Section 7.
MAPPING FOR ON-CHIP LATENCY BALANCING
Selecting Metrics for Latency Balancing
An ideal multi-application mapping algorithm minimizes the imbalance in APLs of different applications while keeping the overall APL low. To design such an algorithm, we need to find an appropriate metric that quantifies the degree of balance. Besides max-APL, two other popular metrics of balance are the standard deviation of APLs of applications (dev-APL) and the ratio of minimum to maximum of the APLs (min-to-max ratio) [29]. However, because dev-APL and the min-to-max ratio only gauge the relative differences among applications, they both suffer from one weakness if used as the objective function. That is, optimizations based on these two objectives cannot ensure overall NoC performance (in terms of minimizing packet latency) and may result in a solution that makes the APLs of the applications close to each other but larger than necessary.
We use an example to illustrate the potential problems with using dev-APL or min-to-max ratio as the objective function. Assume there are four 4T applications, totaling 16 threads to be mapped onto the 16 tiles of a 4-by-4 mesh network. Suppose the four threads of each application have L2 cache access rates of 0.1, 0.2, 0.3, and 0.4, respectively. For simplicity, suppose all the applications require zero memory accesses. Assume a router latency per hop of $l_r = 3$, wire latency per hop of $l_w = 1$, and serialization latency of $l_s = 1$. An optimal mapping solution is easily found as shown in Fig. 7a, which achieves the overall minimum APL as well as exactly equal APLs (10.3375 cycles) among the four applications. However, if we choose the dev-APL or min-to-max ratio as the objective function, we find that Fig. 7b is also one of the 'optimal' mapping results since it has zero dev-APL and a min-to-max ratio equal to one, both optimal values for these objective functions. However, in this case, although all the applications have the same APL, they experience large latencies (11.5375 cycles). Therefore, although dev-APL and min-to-max ratio are good metrics for optimizing balance, neither of them is suitable as the objective function for the mapping algorithm to achieve balanced APL while also minimizing overall packet latency.
Fig. 6. Different memory controller configurations on an 8×8 mesh network [28]. Shaded tiles represent a memory controller co-located with the core/cache structure.
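The two APL values in this example can be reproduced with a short calculation. The sketch rests on several assumptions that are not spelled out in the text: contention latency is zero, a request whose destination is the local tile generates no packet and hence no latency, and each application receives one center tile, two edge tiles, and one corner tile of the 4-by-4 mesh (the layouts of Fig. 7 are not reproduced here). Under these assumptions the best and worst thread orderings give the two reported values.

```python
def tile_cache_latency(n, l_r, l_w, l_s, l_c=0.0):
    """Average cache latency of every tile on an n-by-n mesh (row-major)."""
    total_tiles = n * n
    latencies = []
    for r in range(n):
        for c in range(n):
            total_hops = sum(abs(r - r2) + abs(c - c2)
                             for r2 in range(n) for c2 in range(n))
            # Assumption: the N - 1 remote destinations each pay l_s,
            # the local destination generates no packet at all.
            latencies.append((total_hops * (l_r + l_w + l_c)
                              + (total_tiles - 1) * l_s) / total_tiles)
    return latencies


def apl(rates, latencies):
    """Rate-weighted average packet latency of one application."""
    return sum(c * t for c, t in zip(rates, latencies)) / sum(rates)


# A 4x4 mesh has three distinct tile latencies: center, edge, corner.
center, edge, corner = sorted(set(tile_cache_latency(4, l_r=3, l_w=1, l_s=1)))
rates = [0.4, 0.3, 0.2, 0.1]
best = apl(rates, [center, edge, edge, corner])    # busiest thread in the center
worst = apl(rates, [corner, edge, edge, center])   # busiest thread in the corner
print(round(best, 4), round(worst, 4))  # 10.3375 11.5375
```

Both orderings give every application the same APL, so dev-APL is zero in each case; only max-APL distinguishes them.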
To avoid the drawbacks of dev-APL and min-to-max ratio as objective functions, we adopt max-APL, which uses the maximum APL of all the applications as the metric. By minimizing max-APL, the mapping method takes into consideration both the overall NoC performance and the balance among individual applications, as it prevents any of the applications from having a significantly large latency.
Problem Statement
We first derive the mathematical expression of the APL of one application. For simplicity, we assume one physical tile can run no more than one thread at the same time, which means the number of threads is equal to or less than $N$. If the number of threads $N'$ is less than $N$, we can add one application of $N - N'$ pseudo threads with zero communication rate to make $N$ threads in total. Given an $N$-tile NoC-based CMP and a set of applications $\{a_i\}$, $1 \le i \le A$, with $N$ threads in total, a mapping solution is a permutation of $\{1, \ldots, N\}$, i.e., $p(j) = k$, denoting mapping of the $j$th thread onto the $k$th tile. There are two parameters regarding each thread, namely the shared cache request rate, $c_j$, and the memory controller request rate, $m_j$. We index the threads in the following way: The threads of the $i$th application, $a_i$, are indexed from $N_{i-1} + 1$ to $N_i$, and
Note that the numbers of threads of each application are not necessarily the same. With $T^C_k$ and $T^M_k$ as defined in Section 2.3, the APL of application $a_i$ with mapping solution $p(j)$ is calculated by
$$d_i = \frac{\sum_{j=N_{i-1}+1}^{N_i} \left( c_j \cdot T^C_{p(j)} + m_j \cdot T^M_{p(j)} \right)}{\sum_{j=N_{i-1}+1}^{N_i} \left( c_j + m_j \right)}, \qquad (6)$$
where $c_j \cdot T^C_{p(j)}$ is the total latency of cache request packets when thread $j$ is mapped to tile $p(j)$, and similarly $m_j \cdot T^M_{p(j)}$ is the total latency of memory request packets. Therefore, the goal of mapping to achieve latency balancing is to minimize the max-APL $d_{max}$, which is the maximum APL of all applications.
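The objective can be evaluated for a given mapping as sketched below. The sketch assumes that an application's APL is its total request latency divided by its total request volume, that threads and tiles are indexed from 0, and that an application consisting only of zero-rate pseudo threads contributes an APL of zero. All names are illustrative.

```python
def max_apl(apps, c, m, p, t_cache, t_mem):
    """Return (max-APL, list of per-application APLs) for mapping p.

    apps:    list of lists of thread indices, one list per application
    c, m:    cache and memory controller request rate of each thread
    p:       p[j] is the tile that thread j is mapped to
    t_cache: average cache latency of each tile
    t_mem:   average memory controller latency of each tile
    """
    def apl(threads):
        latency = sum(c[j] * t_cache[p[j]] + m[j] * t_mem[p[j]] for j in threads)
        volume = sum(c[j] + m[j] for j in threads)
        # Assumption: pseudo threads with zero rates yield an APL of zero.
        return latency / volume if volume else 0.0

    apls = [apl(threads) for threads in apps]
    return max(apls), apls


worst, per_app = max_apl(
    apps=[[0, 1], [2, 3]],
    c=[1, 3, 2, 2], m=[1, 1, 0, 2],
    p=[0, 1, 2, 3],
    t_cache=[10, 12, 14, 16], t_mem=[20, 18, 16, 14],
)
print(per_app[0], worst)  # 14.0 and 88/6 (about 14.67)
```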
Formally, we formulate the On-chip latency Balancing Mapping problem as follows:
Given: 
where $d_i$ is the APL of the $i$th application defined in (6).
NP-Completeness of OBM
Part 2. For a known NPC problem $G$, prove $G \le_P$ DOBM. We adopt the well-known set-partition NPC problem as $G$, which is stated as follows: Given a set of numbers $S = \{s_k\}$, $k \in \{1, 2, \ldots, N\}$, do there exist two sets $A_1$ and $A_2$ of equal size, satisfying
Assume we have a subroutine $D$ that solves DOBM, i.e., $D$ returns whether there exists a mapping such that the APLs of all applications are no larger than $t$. In order to solve the above problem $G$, we set up a DOBM problem of the following form. Build an $N$-tile chip such that the set of APLs of the L2 cache access of each tile is equal to $S$, i.e., $\forall k \in \{1, 2, \ldots, N\}$, $T^C_k = s_k$. There are a total of two applications of equal size, $a_1$ and $a_2$, making
In this given setup, the APLs of $a_1$ and $a_2$ are calculated as
We then call the subroutine $D$ to find if there exists a mapping $j \to p(j)$ such that the APL of each application is no larger than $t$, where
Note that $t$ is constant for a given chip layout.
$G$ holds if and only if $D$ holds. The solutions to the two subsets are
Subroutine $D$ is called once, thus proving $G \le_P$ DOBM.
Therefore, the NP-completeness of DOBM is proved, and equivalently the OBM problem is NPC.
PROPOSED APPROACH
The NP-completeness of the OBM problem precludes a polynomial-time optimal solution. Prior art on NoC mapping problems has tried general neighborhood search algorithms such as simulated annealing [31] and genetic algorithms [7]. These algorithms, however, are too time-consuming to reach a satisfactory solution.
In this section, we present an efficient heuristic to solve the OBM problem. The algorithm not only utilizes the traffic characteristics of NoC-based CMPs but takes into account the variations among applications to increase mapping effectiveness. The proposed algorithm consists of two steps, namely application-level assignment, which assigns tiles to applications to balance cache traffic latencies, and fine-tuning, which refines the mapping result to further minimize max-APL by swapping tile-to-thread mapping across applications.
Subproblem: Single Application Mapping (SAM)
Before presenting the algorithm to solve OBM, we first introduce its sub-procedure, namely single application mapping. Given $N_a$ tiles and an application $a$ with $N_a$ threads, the SAM sub-procedure derives an optimal tile-to-thread mapping so that the APL of $a$ is minimized. We formulate the SAM problem as follows.
Given:
1) number of tiles (threads) $N_a$;
2) L2 cache communication rates $C = \{c_j\}$ and memory controller communication rates $M = \{m_j\}$; and
3) tile APLs $\{T^C_k\}$ and $\{T^M_k\}$, denoting the average packet latency from the $k$th tile to the distributed L2 cache and to the memory controller, respectively;
Find: thread-to-tile mapping $p_a(j) = k$, where $j, k \in \{1, 2, \ldots, N_a\}$; and
Minimize: the APL of application $a$:
Note that the Global algorithm mentioned in Section 3.1 is a special form of SAM. Global minimizes the APL of all the $N$ threads on the chip, i.e., the g-APL. If we consider that only one application with $N$ threads is running on the CMP, i.e., $N_a = N$, minimizing the APL of $a$ is equivalent to minimizing the g-APL of all the threads.
As discussed in Section 2.3, the APL of thread $j$ assigned to tile $k$ depends on the communication rates $c_j$ and $m_j$ and the tile APLs $T^C_k$ and $T^M_k$. In the calculation of $T^C_k$ given by (4), $l_r$, $l_w$, and $l_s$ are fixed with the NoC design. The in-network contention latency, $l_c$, is approximated as a constant in the proposed problem solution for the following reasons. First, on-chip networks typically have wide link width (e.g., 128- or 256-bit) with multiple virtual channels per link [32], making the in-network contention latency relatively small (less than one cycle per hop, on average, for injection rates up to as high as 0.15 packets per cycle). Second, due to the backpressure resulting from flow control mechanisms (e.g., credit-based flow control), the majority of the packets are queued in the source nodes when the traffic load is high, and the contention in the network is often limited. With $l_r$, $l_w$, $l_s$, and $l_c$ all constant values, the APL of thread $j$ assigned to tile $k$ is determined by $c_j$, $m_j$ and $H^C_k$, $H^M_k$, independent of the mapping results of other threads. In other words, the cost (latency), $cost_{jk}$, of assigning thread $j$ to a certain tile $k$ is fixed once $p(j) = k$, regardless of which tiles other threads are mapped to.
Given the cost function $cost_{jk}$, the SAM problem is hence an instance of the combinatorial assignment problem, solvable in polynomial time. One efficient solution to such assignment problems is the Hungarian algorithm, which has cubic time complexity [33]. The detailed SAM solution is shown in Algorithm 1.
Algorithm 1. HungarianSAM
Input: An application $a$, its number of threads (or tiles) $N_a$, tile latency arrays $\{T^C_k\}$ and $\{T^M_k\}$
The overall complexity of Algorithm 1 is $O(N_a^3)$ because the first step of generating the cost matrix has $O(N_a^2)$ complexity and the second step of calling the Hungarian algorithm has $O(N_a^3)$ complexity.
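Since the body of Algorithm 1 is not reproduced above, the following is a hedged sketch of the two steps just described: building the $N_a \times N_a$ cost matrix and solving the assignment problem. It uses a textbook potential-based $O(n^3)$ formulation of the Hungarian algorithm, which may differ in detail from the variant in [33]. The function names, the 0-based indexing, and the cost definition $cost_{jk} = c_j T^C_k + m_j T^M_k$ are assumptions for illustration.

```python
def hungarian(cost):
    """Minimum-cost assignment for a square matrix; returns row -> column."""
    n = len(cost)
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)   # row and column potentials
    p, way = [0] * (n + 1), [0] * (n + 1)     # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (n + 1), [False] * (n + 1)
        while True:                            # grow an augmenting path
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):             # update potentials
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                              # flip matches along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def hungarian_sam(c, m, t_cache, t_mem):
    """Map the threads of one application onto its tiles with minimum APL."""
    n = len(c)
    # Step 1: cost of placing thread j on tile k (assumed cost definition).
    cost = [[c[j] * t_cache[k] + m[j] * t_mem[k] for k in range(n)]
            for j in range(n)]
    # Step 2: solve the assignment problem.
    mapping = hungarian(cost)
    total = sum(cost[j][mapping[j]] for j in range(n))
    return mapping, total / sum(ci + mi for ci, mi in zip(c, m))
```

Because the denominator of the APL does not depend on the mapping, minimizing the total cost also minimizes the APL. Where SciPy is available, `scipy.optimize.linear_sum_assignment` solves the same assignment problem and could replace the hand-written solver.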
Variation-Aware Heuristic Algorithm for OBM
With the HungarianSAM solution, we develop a complete heuristic OBM (hOBM) algorithm as follows.
The first step is to perform application-level assignment based on L2 cache traffic characteristics as previously mentioned. To implement this, all the tiles are sorted according to their L2 cache access latencies (i.e., $\{T^C_k\}$). We then assign a set of tiles to each application in such a way that tiles with large cache latencies and tiles with small cache latencies are equally distributed among different applications. Specifically, to assign tiles for an application $a$ with $N_a$ threads, the sorted tile list is divided into $N_a$ sections with an equal number of tiles, and then the median tile from each section is selected for application $a$, as shown in Fig. 8. The assignment is then followed by calling HungarianSAM to map the $N_a$ threads of application $a$ to these selected tiles to achieve minimum APL for this application. Every tile is eventually assigned to an application in this manner.
A crucial factor in the first step is the order of applications to be assigned. As applications may exhibit large variations, the effectiveness of the first step can be greatly influenced by the application assignment order. Take the mapping of six threads onto six tiles as an example. Assume the six tiles have been sorted according to their cache access latencies, denoted as $l_1$ to $l_6$. One 4T application $a_1$ and two 1T applications $a_2$ and $a_3$, with six threads in total, are to be mapped onto the six tiles. If $a_1$ is assigned tiles first and $a_2$ is assigned next, as shown in Fig. 9a, then when $a_3$ is assigned last, the tile list has only one tile left to choose from. This leads $a_3$ to have a large APL and eventually results in a large max-APL and, therefore, severe imbalance among the three applications. In general, with multiple applications, it is easy to end up with a more imbalanced solution when the available tile list is shorter in later assignment phases.
To solve this problem, we assign tiles to smaller applications first, i.e., the applications with fewer threads. This maximizes the length of the remaining tile list and mitigates the impact caused by application variation. Return to the same example of the six threads. The better solution depicted in Fig. 9b assigns tiles to the smaller applications $a_2$ and $a_3$ first. At the last step, since four tiles still remain, $a_1$ gets a more averaged packet latency and the max-APL of the three applications is, hence, much lower than in Fig. 9a. In conclusion, the application-level assignment should follow the ascending order of application sizes in terms of the number of threads.
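The effect of the assignment order in the six-tile example can be reproduced with a small sketch. The latency values, the helper names, and the use of the mean cache latency of an application's tiles as a stand-in for its APL are all assumptions made for illustration; the actual values behind Fig. 9 are not given in the text.

```python
def pick_section_medians(tiles, n):
    """Lower-median tile of each of n contiguous sections of a sorted list."""
    size = len(tiles)
    picks = []
    for s in range(n):
        chunk = tiles[s * size // n:(s + 1) * size // n]
        picks.append(chunk[(len(chunk) - 1) // 2])
    return picks


def assign(apps, latencies, ascending=True):
    """Assign tiles application by application; return mean latency per app."""
    remaining = sorted(latencies)
    order = sorted(apps.items(), key=lambda kv: kv[1], reverse=not ascending)
    mean_latency = {}
    for name, n_threads in order:
        picks = pick_section_medians(remaining, n_threads)
        for p in picks:
            remaining.remove(p)  # latencies are distinct in this example
        mean_latency[name] = sum(picks) / n_threads
    return mean_latency


apps = {"a1": 4, "a2": 1, "a3": 1}      # one 4T and two 1T applications
latencies = [10, 12, 14, 16, 18, 20]    # hypothetical l_1 .. l_6, in cycles

desc = assign(apps, latencies, ascending=False)
asc = assign(apps, latencies, ascending=True)
print(desc, max(desc.values()))  # {'a1': 14.0, 'a2': 14.0, 'a3': 20.0} 20.0
print(asc, max(asc.values()))    # {'a2': 14.0, 'a3': 16.0, 'a1': 15.0} 16.0
```

With these made-up latencies, descending order leaves the last 1T application with the single slowest tile, while ascending order keeps the worst per-application latency noticeably lower.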
The second step of the proposed hOBM algorithm is to perform fine-tuning by swapping certain thread-to-tile mappings across applications. This swapping is conducted based on two observations. First, we observe that some applications are more memory-intensive than others, such as raytrace and swaptions in the PARSEC 2.0 benchmark suite, as shown in Fig. 5. Second, we observe that the threads in the same application can also have quite different memory access rates. Take the bodytrack benchmark (body tracking of a person) as an example. When it is parallelized into 16 threads, the L2 cache miss rate of each thread can range from a minimum of 0.859 MPKI (misses per kilo instructions) to a maximum of 2.35 MPKI. These observations on inter- and intra-application variations prompt us to perform the following swaps to further optimize the mapping result.
After the first step, every thread of each application has been mapped onto a tile. In the second step, the threads of all the applications are first sorted in descending order of the memory-to-cache ratio of each thread to obtain a sorted list $\{t_m\}$ (if the memory-to-cache ratios of two threads are very close, they are ordered based on cache access rate). The rationale is to adjust the tile mapping for threads that have relatively high memory controller traffic but were not mapped onto tiles with smaller memory access latencies. To implement the adjustment, for each thread $t_i$ in the first half of $\{t_m\}$ (i.e., those with higher memory-to-cache ratios), we find all the tiles that have smaller memory controller latencies than the current tile where $t_i$ is mapped, and greedily choose to swap $t_i$ to the one of those tiles that yields the largest latency reduction for the two threads (i.e., thread $t_i$ and the thread on the other tile before the swap). Finally, after all the swapping is done, the algorithm calls HungarianSAM once more for each application to reduce its APL, thereby possibly further reducing the overall max-APL. The pseudo code is shown in Algorithm 2.
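A rough sketch of one greedy swapping pass is given below. It is an illustration under stated assumptions rather than the paper's Algorithm 2: the per-thread latency cost is taken to be cache rate times cache latency plus memory rate times memory latency, every tile is assumed to be occupied, a swap is applied only when it yields a positive reduction, and the tie-breaking rule for near-equal ratios is omitted. The names `fine_tune`, `cache_rate`, `mem_rate`, `tc`, and `tm` are hypothetical.

```python
def fine_tune(mapping, cache_rate, mem_rate, tc, tm):
    """Return a new thread->tile mapping after one greedy swap pass."""
    mapping = dict(mapping)
    tile_owner = {tile: t for t, tile in mapping.items()}

    def cost(t, tile):
        # Assumed latency contribution of thread t when placed on tile.
        return cache_rate[t] * tc[tile] + mem_rate[t] * tm[tile]

    # Threads with the highest memory-to-cache ratio come first.
    order = sorted(mapping, key=lambda t: mem_rate[t] / cache_rate[t],
                   reverse=True)
    for t in order[:len(order) // 2]:
        cur = mapping[t]
        best_gain, best_tile = 0.0, None
        for tile, other in tile_owner.items():
            if tm[tile] >= tm[cur]:
                continue  # only tiles closer to a memory controller
            gain = (cost(t, cur) + cost(other, tile)) \
                - (cost(t, tile) + cost(other, cur))
            if gain > best_gain:
                best_gain, best_tile = gain, tile
        if best_tile is not None:
            other = tile_owner[best_tile]
            mapping[t], mapping[other] = best_tile, cur
            tile_owner[best_tile], tile_owner[cur] = t, other
    return mapping


# Two hypothetical threads on two tiles; tile 0 is nearer to memory.
tc = {0: 10, 1: 10}
tm = {0: 5, 1: 20}
result = fine_tune({"x": 1, "y": 0},
                   cache_rate={"x": 1.0, "y": 1.0},
                   mem_rate={"x": 1.0, "y": 0.1},
                   tc=tc, tm=tm)
print(result)  # {'x': 0, 'y': 1}
```

In the example, the memory-intensive thread `x` moves to the tile with the lower memory latency because the swap lowers the combined cost of the two threads involved. A per-application reassignment would then follow, as the text describes.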
Time Complexity
The overall time complexity of the proposed hOBM algorithm is $O(N^3)$, as each of the two steps takes $O(N^3)$ time.
Step 1. Sorting tiles and applications takes $O(N \log N)$ time. There are $A$ applications, each requiring a one-time assignment. In each assignment, selecting $\Delta N_i$ tiles has $O(\Delta N$
Dynamic Application Mapping
As mentioned in Section 3, a desired variation-aware application mapping method should also be able to perform runtime mapping of new threads when applications are dynamically added or removed (completed) in the CMPs. Owing to its low computational complexity, the proposed hOBM is applicable in these scenarios, as application changes happen at a much coarser time granularity. We collect the statistics of $\{c_j\}$ and $\{m_j\}$ of the new applications in a certain interval at runtime, and then solve the OBM problem to determine the new mapping solution, which is used until the next application change occurs on the chip.

Algorithm 2 (partial listing):
  ...
    Call ALGORITHM 1 to assign these $\Delta N_i$ tiles to the threads of $\hat{a}_i$ so that the APL of $\hat{a}_i$ is minimized;
    Remove the assigned tiles from the list $\{l_k\}$;
  end
  /* Step 2. Fine-tuning by swapping based on memory-to-cache ratio. */
  Sort the threads in descending order of memory-to-cache ratios to get the sorted list $\{t_m\}$;
  for $t_m$ from $t_1$ to $t_{\lfloor N/2 \rfloor}$ do
    The memory access latency of current tile $l_{p(m)}$ is $T$
    ...
      Swap $p(n_{\max})$ and $p(m)$;
    end
  end
  for $a_i$ from $a_1$ to $a_A$ do
    Call ALGORITHM 1 to remap the current $\Delta N_i$ threads of $a_i$ to minimize its APL.
  end
EVALUATION SETUP
We evaluate the effectiveness of mapping algorithms by utilizing traces gathered from running multi-threaded PARSEC 2.0 benchmarks [27] on full-system simulation using Simics [34]. The GEMS [35] and GARNET [36] simulators are integrated with Simics for detailed timing of the memory system and the on-chip network, respectively. The NoC power model DSENT [37] is adopted for power estimation under 45 nm technology and a 1 V power supply.
Key parameters are listed in Table 2. We assume a canonical credit-based wormhole router with a three-stage pipeline and look-ahead routing optimization. With 128-bit link width, short 16-bit packets are single-flit while long packets carrying 512-bit data plus a head flit have five flits.
We compare the following five algorithms:
1) global optimization (Global), which minimizes the overall average latency (g-APL) of all the threads;
2) Monte Carlo method (MC) for the OBM problem, which selects the mapping with the minimum max-APL from a large number ($\geq 10^4$) of random mappings;
3) simulated annealing-based algorithm for the OBM problem (SA), in which a random move is defined as swapping the mapping of two randomly selected threads;
4) the heuristic for the OBM problem using descending order in application assignment (hOBM_desc); and
5) the proposed heuristic for the OBM problem using ascending order in application assignment (hOBM).

Note that hOBM_desc is introduced to demonstrate the importance of the awareness of application variation, and therefore we only include its results in the comparisons in Sections 7.1 and 7.2. The remaining sections analyze the results of the other four mapping algorithms.
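To make the MC baseline concrete, the following sketch draws random thread-to-tile mappings and keeps the one with the smallest max-APL. The latency model (rate-weighted average of cache and memory latencies per application), the function names, and the toy inputs are assumptions for illustration; they are not taken from the evaluated implementation.

```python
import random


def max_apl(mapping, apps, c, m, tc, tm):
    """Largest per-application average packet latency (assumed model)."""
    worst = 0.0
    for threads in apps:
        total = sum(c[t] * tc[mapping[t]] + m[t] * tm[mapping[t]]
                    for t in threads)
        weight = sum(c[t] + m[t] for t in threads)
        worst = max(worst, total / weight)
    return worst


def monte_carlo(apps, c, m, tc, tm, samples=10_000, seed=0):
    """Best of `samples` random mappings; mapping[t] is the tile of thread t."""
    rng = random.Random(seed)
    tiles = list(range(len(tc)))
    best_map, best_val = None, float("inf")
    for _ in range(samples):
        rng.shuffle(tiles)
        val = max_apl(tiles, apps, c, m, tc, tm)
        if val < best_val:
            best_map, best_val = list(tiles), val
    return best_map, best_val


# Toy instance: two 2-thread applications on four tiles (made-up numbers).
apps = [[0, 1], [2, 3]]
c = [1.0, 2.0, 3.0, 4.0]       # cache request rates
m = [0.1, 0.2, 0.3, 0.4]       # memory request rates
tc = [10.0, 12.0, 14.0, 16.0]  # per-tile cache latency
tm = [20.0, 15.0, 15.0, 20.0]  # per-tile memory latency
best_map, best_val = monte_carlo(apps, c, m, tc, tm, samples=1000)
print(best_map, round(best_val, 2))
```

The SA baseline differs only in how candidates are generated: instead of independent random mappings, it perturbs the current mapping by swapping two randomly selected threads and accepts or rejects the move according to an annealing schedule.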
As different applications have various intensities of network load (i.e., the sum of shared cache requests and memory controller requests), we construct ten different configurations with varying loads and application sizes in the evaluation, as shown in Table 3 .
EVALUATION RESULTS
Impact of Application-Level Assignment Order
As discussed in Section 5.2, the order of applications in the application-level assignment step makes a significant difference. Fig. 10 compares the APL results after the application-level assignment step, with descending (hOBM_desc) and ascending (hOBM) orders in terms of the number of threads in an application. The four applications of C9 in Fig. 10a have four threads, four threads, 20 threads, and 32 threads, respectively, and the five applications of C10 in Fig. 10b have one thread, one thread, two threads, 20 threads, and 40 threads, respectively. The max-APL of the descending-order assignment of C9 shown in Fig. 10a is 25.26 cycles, which is 12.6 percent higher than the ascending-order assignment result, and the max-APL of C10 increases to 28.56 cycles, which is 27.2 percent higher than the ascending-order assignment result in Fig. 10b. This confirms that the application-level assignment should follow the ascending order of the number of threads of applications to improve the mapping results in the first step of hOBM.
Max-APL Comparison
Fig. 11a shows the hOBM mapping results of C1. Smaller application numbers indicate lower overall cache access rates. Application 1 (A1), which requires the lightest on-chip traffic, is no longer placed in the four corners of the chip, whereas in the mapping results shown in Fig. 4, the conventional Global assigns corner tiles to the threads of A1. Fig. 11b compares the APLs of the four applications in C1.
The imbalance between applications is almost negligible with the proposed hOBM mapping algorithm. It reduces the max-APL to 22.31 cycles, an 11.29 percent decrease, making the APLs of the four applications nearly the same.

Fig. 12 compares the max-APL results of the five mapping methods applied to all 10 configurations. The proposed hOBM achieves the best latency balancing for different applications, reducing the max-APL by 15.45 percent on average compared to Global for the ten configurations. MC and SA achieve results close to hOBM (still 2.6 and 1.7 percent higher max-APLs than hOBM, respectively), but they require long runtimes, as do all search-based algorithms. The hOBM_desc algorithm has results similar to hOBM for C1-C6, where the applications have the same sizes, but it has significantly worse balance when applications have different sizes, as reflected in the 15.32 percent increase in max-APL compared to hOBM for C7-C10. This demonstrates the importance of ordering in the application assignment step. The evaluation analysis in the following sections no longer includes hOBM_desc and only compares the other four mapping schemes.

Fig. 12 also shows that the imbalance between applications becomes even more severe under Global with higher numbers of smaller applications: the Global max-APL of mapping 4T applications in C4 to C6 increases, on average, by 13.60 percent compared to that of mapping 16T applications in C1 to C3. This is because applications with smaller sizes have a high probability of being assigned all high-latency tiles or all low-latency tiles.
Standard Deviation
Although the standard deviation of APLs (dev-APL) is unsuitable as the objective function for latency balancing mapping algorithms, as mentioned in Section 4.1, it is still a direct and well-acknowledged indicator for measuring the variation among multiple values. Table 4 lists the dev-APL of the four mapping methods Global, MC, SA, and hOBM for the ten different configurations. Global has the largest dev-APL among the four mapping algorithms. Both MC and SA achieve a moderate reduction in dev-APL compared to Global. The proposed hOBM algorithm reduces dev-APL significantly, by 99.75, 92.91, and 80.60 percent compared to Global, MC, and SA, respectively, demonstrating its advantage in balancing latencies among multiple applications.
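For reference, the two balance metrics used in this comparison can be computed from a list of per-application APLs as sketched below. Whether the reported dev-APL is a population or a sample standard deviation is not stated in the text, so the population form is assumed here, and the input APL values are hypothetical.

```python
from statistics import pstdev


def balance_metrics(apls):
    """max-APL and dev-APL (population standard deviation, an assumption)."""
    return {"max_apl": max(apls), "dev_apl": pstdev(apls)}


def percent_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# Hypothetical per-application APLs (cycles) for two mappings.
unbalanced = [19.0, 21.0, 23.0, 25.0]
balanced = [22.0, 22.0, 22.5, 22.5]

u, b = balance_metrics(unbalanced), balance_metrics(balanced)
print(u["max_apl"], b["max_apl"])                           # 25.0 22.5
print(percent_reduction(u["max_apl"], b["max_apl"]))        # 10.0
```

The same `percent_reduction` helper applies to dev-APL, which is how relative figures such as those quoted from Table 4 would be derived from the raw deviations.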
Performance and Power
As mentioned above, hOBM is a performance-aware latency balancing mapping algorithm. Although balanced mapping may increase the overall on-chip packet latency, the proposed problem formulation uses max-APL as the metric, which allows it to achieve balancing with much less performance loss than other criteria such as standard deviation. Fig. 13 plots the overall average APLs (g-APLs) of the four mapping methods. As expected, Global has the minimal g-APL since its sole objective is to minimize g-APL. The performance losses of the other three algorithms are all within 9 percent because their optimization objective, as defined in the proposed problem formulation, is to minimize max-APL. Among them, hOBM only slightly increases g-APL compared to Global, by up to 6.02 percent, which is less than SA (7.91 percent increase) and MC (8.84 percent increase). This shows that the benefits of balanced latency in the proposed hOBM do not introduce large penalties in overall packet latency.

In addition to the NoC performance, we also evaluate the NoC power consumption of the proposed algorithms. While the static power is approximately the same for different schemes, the dynamic NoC power depends on the total number of packets injected into the network per unit time and the average number of hops that packets travel, both of which are affected by the application mapping results. Fig. 14 depicts the dynamic power comparison results. As can be observed, the proposed hOBM algorithm has almost negligible power overhead (2.59 percent on average) compared to Global, which is also the lowest overhead among the three max-APL minimization algorithms MC, SA, and hOBM. This highlights that the latency balancing feature of the proposed mapping does not significantly penalize NoC power.
Algorithm Runtime Comparison
Search-based algorithms such as simulated annealing typically trade off runtime against solution quality. Fig. 15 plots the max-APL results of SA when it is allowed to run for different CPU times. The runtime is normalized to that of hOBM and is plotted on a logarithmic scale. To reduce the impact of randomness in SA, we show the average max-APL results of the configurations C1-C10. As can be seen from the figure, hOBM outperforms SA even when SA's runtime is 100 times longer than that of hOBM. In addition, while the max-APL difference between SA and hOBM is not very large, we have seen from Table 4 that the dev-APLs of the two methods differ by around 7 times on average when SA is allowed to have the same runtime as hOBM.
DISCUSSION
Dynamic Application Mapping
The low computational complexity of the proposed hOBM allows the algorithm to be applied to dynamic mapping scenarios. We evaluate the mapping algorithms when some of the applications finish earlier and new applications are available to be mapped onto those idle tiles (assuming the unfinished applications/threads do not change their mapped tiles; the proposed algorithm is also applicable when thread migration is used, but the evaluation of the overall costs and gains with thread migration is out of the scope of this paper). In Fig. 16, the dots labeled as "Original" denote the original static mapping results of C1 and C4. C1_new represents the case when two out of four applications in C1 finish and are replaced by two new applications, and C4_new represents the case when eight out of 16 applications in C4 finish and are replaced by eight new 4T applications. The three max-APL minimization algorithms, MC, SA, and hOBM, are allowed to have the same runtime. It can be seen that the max-APL results of the four algorithms after new applications are added are slightly increased, by 2.67, 2.09, 1.92, and 0.77 percent, respectively. This is because, in the updated configurations with new applications mapped, the mapping of the existing unchanged applications is likely no longer optimal. Nevertheless, the proposed hOBM achieves the minimum max-APL increase among the four algorithms, showing its capability in balancing latency in the case of dynamic application mapping.
Impact of Memory Controller Placement
We also evaluate the proposed solution under different memory controller placements. As shown previously, there are various memory controller placements, resulting in different on-chip memory request latencies. The max-APLs achieved by the four mapping algorithms are compared in Fig. 17. "Four corners" represents the default case adopted in the previous evaluations, where four memory controllers are placed in the four corners. The other six placements all have 16 memory controllers each, as shown in Fig. 6. The APLs of the six new configurations are reduced compared to "Four corners" as a result of the increased number of memory controllers. With the memory traffic-aware swapping in the proposed algorithm, hOBM is able to maintain a good balance of APLs among applications, and performs consistently better than the other algorithms across different memory controller placements.
Scalability
As the network size increases, the imbalance between applications becomes increasingly aggravated under Global, which only targets minimizing the overall APL. For a 16×16 mesh network that runs all the applications in C1 to C4 together (i.e., 16 applications, each having 16 threads), Global results in a max-APL of 55.70 cycles, which is a 35.1 percent increase compared to its g-APL result. This is more severe than the 26.8 percent increase in the 8×8 network. In contrast, the proposed hOBM provides very good scalability and is suitable for large network sizes, as its time complexity is only $O(N^3)$. Other search-based algorithms such as MC and SA need much more runtime to provide satisfactory results on larger networks with an exponentially growing search space. For the abovementioned example where sixteen 16T applications are mapped onto a 16×16 network, hOBM achieves a max-APL of 45.81 cycles, which is 17.8 percent lower than Global, 9.54 percent lower than MC, and 7.25 percent lower than SA. These results highlight the need for latency balance-aware mapping algorithms and the importance of this work for future large on-chip networks.
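As a quick consistency check on the 16×16 figures quoted above, the relative reduction follows directly from the two reported max-APL values. The g-APL value in the second line is not reported in the text; it is only what the stated 35.1 percent gap would imply.

```latex
\[
\frac{55.70 - 45.81}{55.70} = \frac{9.89}{55.70} \approx 0.178
\quad\Rightarrow\quad \text{17.8 percent lower max-APL than Global}
\]
\[
\text{g-APL}_{\text{Global}} \approx \frac{55.70}{1 + 0.351} \approx 41.2\ \text{cycles (implied, not reported)}
\]
```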
Impact on Application Execution Time
We conduct full-system simulation with the PARSEC benchmark suite and investigate the impact of hOBM on the application execution time. As hOBM balances the average packet latency, some applications experience higher APLs while others experience lower APLs. We demonstrate the usefulness of the balanced APLs by showing that the changes in APLs have a sizable influence on application speedup. Fig. 18 plots the speedup of hOBM over Global for each application in C1 and C9. Take C1, whose APL results are shown in Fig. 11, as an example. Out of the four applications of C1, hOBM reduces the APL of A1 by 9.87 percent, which results in a 5.49 percent increase in speedup, whereas the APL of A4 is increased by 8.62 percent in hOBM, which translates to a 4.43 percent decrease in speedup. This indicates that the reduced gap in APLs helps to reduce the gap in application performance. Meanwhile, the average speedup of hOBM over Global for the four applications in C1 is 1.01 (and 0.994 for C9), indicating that the reduced gap in hOBM does not come at a significant cost in average speedup.
CONCLUSION
This paper addresses the important issue of balancing on-chip network latency in multi-application mapping for chip multiprocessors. We formulate the problem of on-chip latency balanced mapping for multiple concurrently running applications. After proving the NP-completeness of the OBM problem, we propose an efficient heuristic-based algorithm that leverages the characteristics of shared cache and memory controller traffic as well as variations among applications, while taking into consideration the overall NoC performance. Simulation results show that the proposed algorithm can achieve an average reduction of 11.29 percent in maximum average packet latency and 340 times in standard deviation, with only 2.60 percent more in power consumption. This demonstrates the viability of exploiting thread-to-tile mapping to balance the on-chip latencies among different applications while incurring little overhead in the NoC performance.
Di Zhu received the BS degree in electrical engineering from Tsinghua University in 2011. She is currently working toward the PhD degree in electrical engineering at the University of Southern California, Los Angeles, CA. She received a Provost Fellowship from USC in Fall 2011, and has been working with Prof. Massoud Pedram in the SPORT lab ever since. She became a 2015-2016 MHI PhD scholar in Fall 2015. Her research interests include system-level design for many-core processors, on-chip networks, hybrid electrical energy storage systems, and dynamic power management.
Lizhong Chen received the BS degree in electrical engineering from Zhejiang University in 2009, and the MS degree in electrical engineering and the PhD degree in computer engineering from USC in 2011 and 2014, respectively. He is an assistant professor in the School of Electrical Engineering and Computer Science, Oregon State University. His research interests are in the areas of architecture, application and emerging technology of computing systems, including embedded and mobile devices, many-core processors and GPUs, data centers, and high-performance computing systems. He is a member of the IEEE.
Siyu Yue received the BS degree in electrical engineering from Tsinghua University in 2011 and the MS degree in electrical engineering from the University of Southern California. He was with Prof. Massoud Pedram in the SPORT Lab from Fall 2011 to Spring 2014. His research interests include system-level low-power design, electrical energy storage systems, dynamic power management for electric vehicles, and smart grid.
Timothy M. Pinkston received the BSEE degree from The Ohio State University in 1985 and the MSEE and PhD degrees from Stanford University in 1986 and 1993, respectively. He is currently a professor in the Ming Hsieh Department of Electrical Engineering and the vice dean in Faculty Affairs, Viterbi School of Engineering, University of Southern California. His research interests include interconnection networks and communication architectures for parallel processing systems, in particular multicore and multiprocessor computers. His professional service includes serving on the editorial board of the IEEE Transactions on Parallel and Distributed Systems (TPDS), and serving on technical committees for many conferences and workshops in the field, including ISCA, HPCA, ICPP, IPDPS, NOCS, HiPC, and ICPADS. He is a fellow of the IEEE.
Massoud Pedram received the PhD degree in electrical engineering and computer sciences from the University of California, Berkeley in 1991. He is the Stephen and Etta Varra professor in the Ming Hsieh Department of Electrical Engineering, University of Southern California. He holds 10 U.S. patents and has published four books, 13 book chapters, and more than 140 archival and 380 conference papers. His research ranges from low-power electronics, energy-efficient processing, and cloud computing to photovoltaic cell power generation, energy storage, and power conversion, and from RT-level optimization of VLSI circuits to synthesis and physical design of quantum circuits. For this research, he and his students have received seven conference and two IEEE Transactions Best Paper Awards. He received the 1996 Presidential Early Career Award for Scientists and Engineers, and he is a fellow of the IEEE and an ACM distinguished scientist.
