Abstract: Modern processors spend a significant amount of time and energy moving data. As core counts increase, the relative importance of this latency and energy expenditure will only grow. Inter-core communication traffic generated when executing a multithreaded application is one such source of latency and energy expenditure. This traffic is influenced by the mapping of threads and data onto multicore systems. This paper investigates the impact of thread and data mapping on traffic in a chip-multiprocessor, and explores the potential for traffic reduction through thread and data mapping. Based on the analysis and estimation of the lowest traffic, we propose a thread and data mapping mechanism that approaches the lowest traffic. The mapping takes both the correlation among threads and the affinity of data with individual threads into account, and results in significant traffic reduction and energy savings.
INTRODUCTION
MULTICORE systems have emerged as a mainstream platform. A typical architecture includes multiple tiles, each comprising one processor core with its private caches, a slice of the shared last-level cache (LLC), and a part of the on-chip interconnect fabric. Accessing a data element that is cached on a remote tile requires communication through the on-chip interconnect. While much effort has been spent on making the interconnect carry this traffic faster and in a more energy-efficient manner [1], [2], less attention has been paid to the nature of the traffic, including whether some of it is avoidable. In particular, one can imagine that a more intelligent placement of threads and data can reduce the amount of data moved from a remote node as well as the distance such data need to travel.
Some prior works explored data placement optimization without considering thread placement [3], [4], [5], [6], [7], [8], while others optimized thread mapping to reduce traffic [9], [10] but ignored the affinity between threads and data. In fact, data and thread placements are closely intertwined. Different threads access a given data page with different frequencies. Put differently, threads have different affinities with a data page. Similarly, different threads have different communication affinities with another thread. Communication ultimately manifests as data and coherence traffic between the nodes, with the home node of the data acting as the intermediary. Fig. 1 shows the varying affinity visually in one example application.
To a large extent, a typical system is agnostic to such varying affinity relationships and does not try to reduce unnecessary traffic by optimizing the placement of either the threads or the data. Indeed, a rule such as cache-line-interleaved round-robin mapping is, from the perspective of traffic reduction, a completely arbitrary approach. In this context, keeping one aspect (say, thread mapping) unchanged from the default solution while changing the other will necessarily limit the potential of placement optimization. To minimize unnecessary traffic, we have to co-optimize data and thread placement.
To understand the potential of traffic reduction, we start with a statistical analysis of the impact of different thread and data placements on global traffic in a chip-multiprocessor. We find that with varying placements of threads and data pages, the total traffic follows a Pearson I, Pearson IV, or Gaussian distribution, which allows us to make some statistical estimations. For example, not surprisingly, the traffic resulting from the (rather random) round-robin baseline placement cannot be expected to be any better than the mean of the distribution. Furthermore, there is a large headroom between an arbitrary mapping (e.g., round-robin) and the best possible one. The optimal mapping is achievable if and only if both thread mapping and data mapping are optimized.
While such statistical analysis can provide some insights into the problem of minimizing traffic, its predictions have inherently large uncertainties since the optimal settings lie at the tail of the statistical distribution. Furthermore, it does not provide a road map for finding the optimal mapping. Consequently, we also propose a heuristic-based algorithm that optimizes data and thread placement and significantly reduces traffic, by more than the statistical method predicts.
IMPACT OF MAPPING ON TRAFFIC
In a conventional chip-multiprocessor with multiple tiles, a shared LLC is typically assumed to be address-interleaved in a round-robin fashion among the tiles. Such address mapping has the benefits of statistical load balance and simple implementation, but comes at the cost of locality. In fact, a more intelligent page-based mapping could be supported in a straightforward manner, because every access already consults the translation lookaside buffer (TLB). The TLB can provide home node information to realize a locality-aware data mapping.
When running a multithreaded application on a chip-multiprocessor, the operating system (OS) typically assigns threads to cores naively, in sequential order. This is simple to implement, but oblivious to the communication requirements of different threads. In a multithreaded application, threads access a given piece of data with different patterns and frequencies. They also communicate with each other with different intensities. Thus their relative location is an important contributing factor to the distances messages have to travel. Given the flexibility to map pages and threads to nodes, the number of possible configurations is astronomically large. To get some sense of how traffic is affected by a particular thread and data page mapping configuration, we sample randomized mapping configurations and estimate the traffic as follows:
1) We start with the default mapping configuration and measure detailed traffic. We record the number of data and metadata messages of each thread to the LLC, broken down by pages.
2) We then randomly assign the threads to nodes (making sure every node has only one thread) and randomly assign a node for every page.
3) Based on the access matrix collected in step 1, we recalculate the total amount of traffic (measured in flit-hops).
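The sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool: the mesh is assumed to be a square 2D mesh with Manhattan hop counts, `access[t][p]` is assumed to hold the flits exchanged between thread `t` and page `p`, and only the hop distance between a thread's tile and the page's home tile is charged. All function names are hypothetical.

```python
import random

def manhattan(a, b):
    """Hop count between two tiles of a 2D mesh (assumes dimension-ordered routing)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_traffic(access, thread_node, page_node):
    """Flit-hops: flits of each (thread, page) pair times the distance to the page's home."""
    total = 0
    for t, row in enumerate(access):
        for p, flits in enumerate(row):
            if flits:
                total += flits * manhattan(thread_node[t], page_node[p])
    return total

def random_mapping(n_threads, n_pages, nodes, rng):
    """Step 2: one thread per node (sampling without replacement), any node per page."""
    thread_node = rng.sample(nodes, n_threads)
    page_node = [rng.choice(nodes) for _ in range(n_pages)]
    return thread_node, page_node

def monte_carlo(access, mesh_dim, samples, seed=0):
    """Step 3 repeated: recompute traffic for many random configurations."""
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(mesh_dim) for y in range(mesh_dim)]
    n_threads, n_pages = len(access), len(access[0])
    results = []
    for _ in range(samples):
        threads, pages = random_mapping(n_threads, n_pages, nodes, rng)
        results.append(total_traffic(access, threads, pages))
    return results
```

The mean and standard deviation of the returned list give the reference points against which a specific mapping can be compared.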
Result of Monte Carlo Analysis
After performing this Monte Carlo analysis with $10^8$ samples per application, we can analyze the resulting traffic in terms of its probability distribution. We find that the traffic distribution for many applications can be approximated by a (variant of the) Gaussian distribution. Fig. 2a shows one such example for the application cholesky.
From a probability standpoint, if we map data pages and threads in the default round-robin fashion, we should not expect the mapping to produce traffic significantly lower than the mean with high probability. For Gaussian distributions, there is only about a 0.13 percent chance for a randomly selected configuration to be three standard deviations ($\sigma$) better than the mean (see footnote 1). If we take all applications as a whole, we can almost guarantee that the default configuration will produce traffic close to the mean. In our experiments with 22 applications, the traffic produced by the default mapping is on average $0.5\sigma$ (or 1.5 percent) worse than the mean. Clearly, the default scheme does not try to improve traffic and is thus a mediocre solution in that respect.
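As a quick numerical check of the tail probabilities discussed here, the sketch below evaluates the one-sided Gaussian tail at $k$ standard deviations and the distribution-free Chebyshev bound mentioned in footnote 1. The function names are illustrative only.

```python
import math

def gaussian_lower_tail(k):
    """P(Z <= -k) for a standard normal variable Z."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))

def chebyshev_bound(k):
    """Upper bound on P(|X - mean| >= k * sigma) for any finite-variance distribution."""
    return 1.0 / (k * k)

if __name__ == "__main__":
    print(f"Gaussian, 3 sigma better than mean: {gaussian_lower_tail(3):.5f}")
    print(f"Chebyshev bound at 10 sigma:        {chebyshev_bound(10):.5f}")
```

The Chebyshev bound is two-sided, so it is a (loose) upper bound on the one-sided probability of interest.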
We now look at the minimum traffic actually observed in the Monte Carlo sampling. The value ranges between $-3.0\sigma$ and $-7.0\sigma$ away from the mean. Depending on the magnitude of $\sigma$, this minimum is between 1.3 and 33 percent (11 percent on average) lower than the mean (recall that the mean itself is slightly lower than the traffic of the default configuration).
To illustrate the point that thread and data mapping need to be considered in conjunction, we perform two more Monte Carlo experiments in which we fix one aspect of the mapping (thread or data, respectively) to be the same as in the baseline. In these cases, the distribution is tighter around the mean, and the sample minimum is thus closer to the mean, as shown in Fig. 3. Depending on the application, the minimum is between $-2.1\sigma$ and $-4.8\sigma$ away from the mean when thread mapping is fixed and between $-1.7\sigma$ and $-3.7\sigma$ away from the mean when page mapping is fixed. The minimum traffic (the largest deviation from the mean) is achievable only when both thread and data mapping are optimized. With only thread mapping, the data accessed by the same thread may be placed far apart instead of close together, and the traffic cannot be reduced wherever the thread is mapped. Similarly, with only data mapping, threads that communicate frequently may be assigned to distant cores, and the communication traffic cannot be low. We also find that data mapping alone accounts for much of the overall traffic reduction. This is because the population space of data mapping is much larger than that of thread mapping, so there is much more room for traffic reduction through data mapping than through thread mapping.
In a very rough sense, this result can be interpreted as follows:
1) If we can only change one mapping at a time, changing data mapping probably has more potential for traffic reduction than changing thread mapping.
2) Changing both gives more potential than either one alone, with almost additive benefits.
Estimation of Minimum Traffic
A straightforward estimation of the minimum value in a large population or sample can be made as follows. The probability that the minimum value is greater than $X$ in $N$ independently selected samples is $(1 - \mathrm{CDF}(X))^N$, in which $\mathrm{CDF}$ is the cumulative probability distribution function. Then, the CDF of the sample minimum is $1 - (1 - \mathrm{CDF}(X))^N$. The challenge is to obtain an accurate CDF, since we use it on the fringe of its intended use range.
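For illustration, the expression for the sample-minimum CDF can be evaluated numerically as sketched below. A standard normal CDF is assumed here purely as an example (the paper fits Pearson-family distributions), and `x` is expressed in standard deviations from the mean. Because $\mathrm{CDF}(X)$ is tiny in the tail and $N$ is huge, the sketch uses `log1p` and `expm1` to avoid losing precision.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, accurate in the far lower tail via erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def min_cdf(x, n, cdf=normal_cdf):
    """P(minimum of n i.i.d. samples <= x) = 1 - (1 - CDF(x))**n."""
    p = cdf(x)
    if p >= 1.0:
        return 1.0
    # 1 - exp(n * log(1 - p)), computed without catastrophic cancellation
    return -math.expm1(n * math.log1p(-p))

if __name__ == "__main__":
    n = 1e8
    for x in (-6.5, -6.0, -5.5, -5.0):
        print(f"P(min <= {x} sigma) = {min_cdf(x, n):.4f}")
```

Swapping in a fitted Pearson CDF for `normal_cdf` would give the corresponding estimate for a non-Gaussian application.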
Since we are estimating minimum values (or extremes), using a normal distribution for the configurations can lead to significant errors, as it assumes that higher-order statistical characteristics such as skewness (standardized third moment) and excess kurtosis (derived from the standardized fourth moment) are zero. We use the R statistical computing environment to fit the distribution. The results show that Pearson type I (also known as the beta distribution), type IV (an asymmetric generalization of Student's t-distribution), and type 0 (equal to the Gaussian distribution) match the probability distribution well for many applications. Fig. 2b shows the cumulative probability distribution of the minimum traffic in a sample of $10^8$ for cholesky. As the figure shows, with a sample size of $10^8$, we are virtually certain to see the sample minimum in the range of $[-6.2\sigma, -5.2\sigma]$ from the mean. For applications with a Gaussian distribution, the observed sample minimum lies in the range $[-5.6\sigma, -5.4\sigma]$. For other distributions, we can similarly estimate the sample minimum; the prediction agrees with the theoretical calculation. If we consider the entire population space, the size is enormous, around $10^N$ where $N$ is in the thousands. Fig. 4 shows the estimated minimum with increasing $N$. Not surprisingly, a larger population results in a smaller minimum. At this level, we are no longer certain of the error margin. But with a relatively conservative estimate of $20\sigma$ below the mean, the minimum configuration will still result in about 30 percent traffic reduction.
To recap, the default mapping for data and threads is clearly suboptimal from a traffic standpoint. Statistical analysis provides some evidence of significant opportunities for traffic reduction. However, statistical analysis is limited in its power to predict extremes and does not provide a way to realize that potential. We next study some heuristics for affinity-aware data and thread mapping.
AFFINITY-AWARE MAPPING HEURISTICS
The goal of the heuristics is to find a placement for threads and pages such that those with higher affinities are placed closer. In this paper, we explore a feedback-driven approach. We first perform an offline profiling of an application with a training input, collecting thread-to-page access frequencies. Based on this profile, we first place threads onto the tiles and then place data pages based on frequency and positions of the threads that access them. Simulated annealing is applied in this process to further fine-tune and produce the profile-time static placement. Finally, at run time, threads are mapped according to the static placement, while data pages are first analyzed for their access frequency pattern, then matched to a page with most similar pattern in the profile, and moved to its location according to the static placement.
Fig. 3. Sampled minimum traffic when only considering threads mapping, data mapping, and both. The minimum traffic is expressed as multiples of standard deviation away from the mean. The larger the deviation, the smaller the traffic.
1. Even if we assume the distribution is not Gaussian, Chebyshev's inequality guarantees that there is less than 1 percent probability for the randomized sample to be $10\sigma$ better than the mean.
Profiling Access Affinity
We obtain access affinity between threads and data pages with a training input. We form an access matrix $A$, where $A_{i,j}$ is the traffic induced by thread $T_i$ accessing page $P_j$ in the LLC. For each page, the IDs of the m threads producing the most traffic accessing the page and their relative weights in traffic are recorded. This information serves as a "signature" and will be used to match a new page with unknown characteristics at production run in order to place the new page (Section 3.3).
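A minimal sketch of how such signatures might be derived from the access matrix is shown below. The layout `access[i][j]` (traffic of thread `i` to page `j`), the tie-breaking by thread ID, and the function name are assumptions made for illustration.

```python
def page_signatures(access, m):
    """For each page, return up to m (thread_id, relative_weight) pairs,
    ordered from the heaviest to the lightest traffic contributor."""
    n_threads = len(access)
    n_pages = len(access[0])
    signatures = []
    for p in range(n_pages):
        column = [(access[t][p], t) for t in range(n_threads)]
        total = sum(volume for volume, _ in column)
        # Heaviest first; ties broken by the lower thread ID (an arbitrary choice).
        top = sorted(column, key=lambda vt: (-vt[0], vt[1]))[:m]
        signatures.append(
            [(t, volume / total) for volume, t in top if volume > 0]
        )
    return signatures
```

For example, with `access = [[4, 0], [1, 2], [3, 0]]` and `m = 2`, page 0 is dominated by threads 0 and 2, while page 1 is private to thread 1.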
Mapping Algorithm

Threads Mapping
We start threads mapping by finding a "seed" thread for initial placement and progressively place other threads based on their affinity with threads already mapped. If over half of the remaining pages in the processed access matrix are broadly shared and evenly accessed by every thread, the thread with the highest total traffic is selected as the seed thread. Otherwise, threads differ in their affinity with data pages and in their communication affinity with one another, and the thread with the most partner threads is selected as the seed thread.
After placing the seed thread at the center of the topology, we iterate over the remaining unmapped threads, each time selecting those with high correlation to threads already placed. The selected threads are placed in unoccupied neighboring cores. This process continues until all threads are mapped. In this way, threads with high correlation are mapped close together and their communication traffic is reduced.
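The greedy placement can be sketched as below. This is one possible reading of the description, not the authors' exact procedure: `corr[i][j]` is a hypothetical symmetric thread-correlation matrix, the next thread is the unmapped one with the strongest correlation to any mapped thread, and it is placed on the free tile nearest to that partner (an adjacent tile when one is available). The seed thread is assumed to have been chosen beforehand.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def map_threads(corr, seed, mesh_dim):
    """Greedy thread placement on a mesh_dim x mesh_dim mesh.
    Requires len(corr) <= mesh_dim ** 2. Returns {thread_id: (x, y)}."""
    free = {(x, y) for x in range(mesh_dim) for y in range(mesh_dim)}
    center = (mesh_dim // 2, mesh_dim // 2)
    placement = {seed: center}
    free.remove(center)
    unmapped = set(range(len(corr))) - {seed}
    while unmapped:
        # Strongest (unmapped thread, mapped partner) pair; lower IDs win ties.
        thread, partner = max(
            ((u, m) for u in unmapped for m in placement),
            key=lambda um: (corr[um[0]][um[1]], -um[0], -um[1]),
        )
        # Free tile closest to the partner; coordinates break ties.
        tile = min(free, key=lambda c: (manhattan(c, placement[partner]), c))
        placement[thread] = tile
        free.remove(tile)
        unmapped.remove(thread)
    return placement
```

Ranking candidates by their total correlation to all mapped threads, rather than to the single strongest partner, would be an equally plausible variant.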
Data Pages Mapping
Data pages are mapped based on the access patterns of threads and their locations. Pages with (near) private access patterns are, of course, mapped to the (dominant) owner. Other pages are placed at the node where the overall traffic (in flit-hops) from all threads is minimized, as follows. We use a matrix to track the total communication from different threads to a particular page. This matrix reflects the topology. For instance, an $n \times n$ matrix $A$ represents an $n \times n$ mesh network-on-chip. Suppose thread $T$ maps to $(i, j)$ in the topology; then the total communication volume (in flits) of thread $T$ to page $P$ is recorded in element $A_{i,j}$ of $P$'s matrix. The "center of mass" of this matrix is the optimal position for page $P$ (see footnote 2).
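One way to compute such a position is sketched below. It rests on an assumption: the "center of mass" is taken to be the per-axis weighted median of the volume matrix, which is consistent with the tie condition in footnote 2 (equal sums on both sides) and which minimizes the weighted Manhattan distance on a mesh. Ties are resolved toward the lower index, an arbitrary choice.

```python
def weighted_median_index(weights):
    """Smallest index at which the cumulative weight reaches half of the total."""
    total = sum(weights)
    running = 0
    for i, w in enumerate(weights):
        running += w
        if 2 * running >= total:
            return i
    return 0

def page_home(volume):
    """volume[i][j] = flits from the thread on tile (i, j) to this page.
    Returns the tile (i, j) minimizing total flit-hops under Manhattan distance."""
    row_sums = [sum(row) for row in volume]
    col_sums = [sum(col) for col in zip(*volume)]
    return weighted_median_index(row_sums), weighted_median_index(col_sums)
```

The two axes can be handled independently because Manhattan distance is the sum of the row distance and the column distance.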
Load Balancing
To mitigate the performance issues resulting from an uneven distribution of data pages, a balancing procedure is employed.
First, the number of data pages mapped to each node is tallied. Next, we iterate over nodes with excessive loads to divert some pages. Finally, we try to re-map nearby pages to the still underloaded nodes in essentially the same way.
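The diversion step might look like the sketch below. The text does not specify the load threshold or which pages are diverted, so both are assumptions here: `capacity` is a hypothetical per-node page limit, and pages are moved in index order to the nearest node that still has room. A fuller version would pick the pages whose move costs the least extra traffic.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def balance_pages(page_node, nodes, capacity):
    """Move pages off overloaded nodes to the nearest node with spare capacity.
    page_node[p] is the home tile of page p; returns a new list."""
    load = {n: 0 for n in nodes}
    for n in page_node:
        load[n] += 1                      # tally pages per node
    result = list(page_node)
    for p, n in enumerate(page_node):
        if load[n] <= capacity:
            continue                      # this node is not overloaded
        candidates = [c for c in nodes if load[c] < capacity]
        if not candidates:
            break                         # nowhere left to divert pages
        target = min(candidates, key=lambda c: (manhattan(c, n), c))
        result[p] = target
        load[n] -= 1
        load[target] += 1
    return result
```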
Tuning with Simulated Annealing
As a complement to the above heuristics, simulated annealing is applied to tune the final result. Since our mapping is essentially a two-step problem (first mapping the threads, then the data pages), our annealing process is also a two-level nested one. After swapping two threads, we first heuristically adjust the mapping of the pages affected by the two threads. We then go through one complete round of page annealing. We use the best score seen in the process to decide whether to keep the thread configuration in the end.
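A simplified sketch of the nested structure follows. It is not the authors' implementation: the heuristic page adjustment after a thread swap is omitted, the cost is the flit-hop sum under Manhattan distance, and the temperature schedule (`temp`, `cooling`) and step count are arbitrary illustrative parameters.

```python
import math
import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def traffic(access, thread_node, page_node):
    """Total flit-hops for a given thread and page placement."""
    return sum(
        flits * manhattan(thread_node[t], page_node[p])
        for t, row in enumerate(access)
        for p, flits in enumerate(row)
    )

def anneal_pages(access, thread_node, page_node, nodes, temp, rng):
    """Inner level: one round proposing a new home for every page.
    Returns the best (cost, page placement) seen during the round."""
    pages = list(page_node)
    cost = traffic(access, thread_node, pages)
    best_cost, best_pages = cost, list(pages)
    for p in range(len(pages)):
        old = pages[p]
        pages[p] = rng.choice(nodes)
        new_cost = traffic(access, thread_node, pages)
        delta = new_cost - cost
        # Metropolis rule: always accept improvements, sometimes accept regressions.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best_cost, best_pages = cost, list(pages)
        else:
            pages[p] = old
    return best_cost, best_pages

def anneal(access, thread_node, page_node, nodes,
           steps=200, temp=10.0, cooling=0.98, seed=0):
    """Outer level: swap two threads, run one page round, keep the swap
    only if the best score seen beats the current score."""
    rng = random.Random(seed)
    threads, pages = list(thread_node), list(page_node)
    cost = traffic(access, threads, pages)
    for _ in range(steps):
        i, j = rng.sample(range(len(threads)), 2)
        trial = list(threads)
        trial[i], trial[j] = trial[j], trial[i]
        trial_cost, trial_pages = anneal_pages(access, trial, pages, nodes, temp, rng)
        if trial_cost < cost:
            threads, pages, cost = trial, trial_pages, trial_cost
        temp *= cooling
    return threads, pages, cost
```

Recomputing the full cost for every proposal keeps the sketch short; an incremental update of only the affected terms would be the natural optimization.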
Online Mapping at Application Running
The steps above are performed offline, before the application runs. The mapping algorithm can be integrated into the operating system task scheduler. In a production run with regular inputs, the operating system scheduler can allocate threads to cores based on the offline mapping result.
Initially, pages are placed by the first-touch rule. Both page table entries and TLB entries are extended to record the mapping of pages. We then use a short phase of online profiling to track the m most recently accessing (MRA) threads for each page, as an approximation of the m threads generating the most traffic to the page. Recall that, for each page at profile time, the m threads producing the most traffic to the page were recorded when postprocessing the offline access affinity data. After the short online profiling phase finishes, we compare the online m MRA threads with the m threads recorded for each page at profile time. Each new page in the production run can be matched to a profile-time page whose m threads are the same. The new page is then remapped to the tile assigned to the matching profile-time page by the offline mapping result, and the mapping information in both the page table and the TLB is updated. (The data migration details are outside the scope of this paper.) Afterwards, the placement of the page is not changed again.
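The matching step can be illustrated with a lookup keyed by the set of thread IDs, as sketched below. The data layout and names are hypothetical, and keeping the first-touch location when no profile-time page matches is an assumption; the text does not say what happens in that case.

```python
def build_profile_index(profile_threads, static_home):
    """profile_threads[k]: the m thread IDs recorded for profile-time page k.
    static_home[k]: the tile chosen for that page by the offline mapping.
    If several profile pages share a thread set, the first one wins."""
    index = {}
    for threads, home in zip(profile_threads, static_home):
        index.setdefault(frozenset(threads), home)
    return index

def remap_page(mra_threads, index, first_touch_home):
    """Return the offline tile for a page whose online MRA thread set matches
    a profile-time page; otherwise keep the first-touch location."""
    return index.get(frozenset(mra_threads), first_touch_home)
```

For instance, a page whose MRA threads are `{2, 5}` would be moved to the tile that the offline result assigned to a profiled page dominated by threads 2 and 5.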
PERFORMANCE ANALYSIS
We simulate Reactive NUCA [5], an intelligent block placement strategy, on a 16-way chip multiprocessor with private L1 caches and a 16-way banked shared L2 cache as the baseline of our evaluation. We evaluate our mapping (Affi-Map) against other state-of-the-art mapping methods, including cluster-based annealing (threads mapping, referred to as CA-Map) [11] and dynamic directories (data mapping, referred to as Dyna-Dir) [4], and compare with the estimated minimum traffic (Esti-Min) from Section 2. Fig. 5 shows the evaluation result of normalized traffic. Affi-Map consistently reduces total traffic for all of the applications in our evaluation (the average traffic is reduced to 0.81x of baseline), which is better than CA-Map and Dyna-Dir (on average 1.51x and 0.97x of baseline, respectively). The estimated minimum traffic is on average 0.60x of baseline, and Affi-Map comes on average at 1.35x of the estimated minimum traffic.
2. It is possible that the center's x (or y) position falls exactly on the boundary between two cells, that is, the sum of columns to the left is exactly the same as that to the right. In that case there is a tie.
CONCLUSION
This work presents an analysis of a static thread and data mapping mechanism that reduces total traffic in network-on-chip based multicore systems. We study the influence of different mapping configurations on traffic, and find that the traffic under different mapping configurations follows a Pearson I, Pearson IV, or Gaussian distribution for many applications. With the correlation among threads and their affinity with data taken into account, our mapping mechanism assigns highly correlated threads to neighboring cores and places data pages close to the threads accessing them without violating distribution balance. The proposed mechanism can be an effective way of reducing total traffic and network energy consumption.
