Abstract-Several
I. Introduction
Programming distributed memory architectures to maximize performance is difficult. One of the main reasons is the low abstraction level of many primitives provided by the message passing paradigm. Basic send/receive routines are indeed found in almost every distributed library or language (e.g. MPI [1]). When writing a parallel program, the programmer needs to take care not only of its semantic correctness but also of the positioning of the communication routines within the code. This problem has been investigated in the past and the common practice, in MPI, is to schedule nonblocking communication operations as early as possible in order to maximize communication/computation overlap [2].
However, there are many architectural aspects which may play a role in determining a good program point to place send or receive calls. For example, the request management overhead of asynchronous routines, which are commonly used to hide communication costs, may penalize performance for small message sizes. Runtime systems for distributed memory libraries (e.g. MPI [1] and UPC [3]) usually employ optimizations to hide many architectural details from the user code. For example, long messages may be split into smaller chunks to enable pipelining [4]. Conversely, when too many short messages are sent, the runtime system may try to coalesce information into larger messages, reducing the injection rate [5], [6]. Optimizations done at runtime are highly effective since the system is fully aware of the underlying architecture. However, most of the decisions have already been made by the programmer in the source code and therefore, at this stage, it is often too late to overcome performance bugs. For these reasons, production codes are usually hand-tuned for particular target architectures.
This research has been partially funded by the Austrian Research Promotion Agency under contract nr. 824925 (OpenCore) and under contract nr. 834307 (AutoCore).
In this paper we study the impact of CPU cache on MPI communication routines. Indeed, in order to hide network latency, MPI libraries aggressively use buffering. For example, when small messages are sent, the MPI library does not wait for the receiver process to be ready to receive; instead, it buffers the message data on the receiver side (often called eager send). MPI uses main memory not only for buffering but also to enable efficient communication between MPI processes allocated on the same multi-core machine. Intra-node communication is performed by means of shared memory (SM) transfer layers which are provided by all major MPI implementations [7], [8]. Because buffering is implemented using the main memory, it is subject to cache hierarchies, which motivates our study.
We measure, with a synthetic benchmark, the differences in execution time for point-to-point operations performed when the data being sent is fully loaded into the CPU cache or not. We repeat the experiment with multiple configurations, i.e., intra-node and inter-node. In the same way we measure the impact of those point-to-point communication routines on the application cache by accessing application data, previously loaded into the cache, right after the communication is performed. From the gathered data we derive a set of rules and guidelines which can be utilized to transform the input program for improved cache utilization and thus performance. To the best of our knowledge, this aspect has been largely neglected until now. Work in literature focuses on quantifying the impact of local memory on communications [9]. Those works are principally concerned with non-regular data types, which involve expensive packing/unpacking operations, and with optimizing the way the MPI library handles them; our work instead focuses on contiguous data and on how the impact of communication routines can be exploited, by a programmer or a compiler, to optimize the input code.
Experiments show that a send, or receive, operation can be up to 25% faster if the data is already in the CPU cache. Furthermore, cache pollution generated by communication routines can negatively impact application performance if the routines are not carefully placed. Indeed, send and receive operations can invalidate the content of the message buffers if they have been preloaded into the cache. Based on the guidelines derived from our benchmark data, we propose a communication/cache-aware code transformation which, when manually applied to a 3-point stencil code, improves code performance by up to 40% for specific message sizes. Our transformation always shows a positive effect on performance for messages which are smaller than the last level cache size.
The contributions of the paper are multiple:
• It presents a benchmark to measure the performance of send/receive operations with different configurations of the data cache;
• It derives, from gathered data, a set of optimization guidelines which can be used to tune an input program for improved cache behaviour;
• It demonstrates the efficacy of the derived optimization strategies by applying them to a 3-point stencil code.
II. Analyzing MPI Cache Behaviour
In order to highlight the effects of CPU caches on MPI communication routines we wrote a synthetic benchmark. The main goal of the MPI cache benchmark is to capture differences in execution time between communication routines with multiple configurations of the CPU cache and, additionally, to measure their impact on the application cache. In doing so, we also collect the value of several performance counters using the PAPI library [10]. Many benchmarking suites for MPI exist in literature [11], [12]. Coddington et al. wrote a survey of benchmarking tools for MPI's point-to-point communication [13]. However, none of those is designed to capture the cache behaviour of MPI routines. Some of the tools, e.g. MPIBench [12] and SKaMPI [11], provide options to pre-load messages into the cache before performing the communication, but they do not provide a way to precisely capture the level of cache pollution caused by MPI communication routines. The benchmark code which has been developed for this purpose follows the guidelines for reproducibility of measurements described in [14]; the code is publicly available at [15]. Besides the execution time, the benchmark records the values of multiple PAPI performance counters, which will be used to understand low-level implementation details of the underlying MPI library.
The benchmark is split into two scenarios, SCN1 and SCN2, which are further described in this section. Because of space limitations, we refer the reader to [15] for further details.
A. Scenario 1 - SCN1
SCN1 studies the behaviour of single MPI send/receive routines. With this benchmark we are interested in capturing the behaviour, in terms of performance, of two basic MPI routines, i.e. MPI_Send and MPI_Recv, considering different states of the data cache. We therefore perform a ping-pong operation with three different initial cache states. In the first case, INV, we make sure all the content in the cache is wiped out and none of the data elements being sent or received are present in any of the CPU caches. The second cache configuration, EXCL, pre-loads the entire message data into the cache right before the communication is performed. Data elements are only read, which means the corresponding cache lines are in the "exclusive" state. In the last cache configuration, MOD, cache lines are preloaded in the "modified" state.
B. Scenario 2 - SCN2
In the second scenario, referred to as SCN2, we want to capture the level of cache pollution caused by send/receive communication routines. This is obtained by measuring the time, together with other performance counters, required to traverse the array containing the message data previously exchanged in the ping-pong operation. This is again done considering multiple configurations of the cache. In INV, we start by cleaning the caches, we then perform the message exchange and, upon completion of the send/receive, the data is traversed and the measurement is performed. In the second configuration, PRE, we pre-load the message data into the cache before performing the message exchange. It is worth noting that, in both cases, the code for which we perform the measurements does not contain any communication statements. The obtained data is compared with the values measured while traversing the message buffer without previously performing any communication. Also in this case we consider two cache configurations, i.e., the cache is invalidated before the traversal (BASE_INV) or the message data is preloaded into it (BASE_PRE).
We repeat the experiment with two different process allocations in order to test intra-node and inter-node point-to-point communications. This is obtained by allocating the two MPI processes on the same multi-processor machine or on different computing nodes, respectively. In both cases, the use of affinity settings ensures the MPI processes are bound to a specific core of distinct CPUs. This is done in order to take full advantage of the CPU cache and avoid conflicts which arise when multiple processes share the same last level cache.
C. Hardware Platforms
We evaluated the code on two computing platforms summarized in Table I. The LEO3 cluster system consists of 162 compute nodes (with a total of 1944 cores). All nodes are connected through an Infiniband 4x QDR high speed interconnect. Each node contains two Intel Xeon CPUs based on the Nehalem architecture, where Hyper Threading (HT), or 2-fold SMT, has been disabled from the BIOS. The Vienna Supercomputing Cluster 2 (VSC2) is an HPC system which consists of 1,314 nodes, with 2 AMD Opteron processors each, for a total of 21,024 CPU cores. The CPU cache layout for the two systems is also summarized in Table I. These are both production clusters and the measurements have been taken while the clusters were fully operational; therefore we expect some noise to show up in the measurements. In order to reduce it we repeat each measurement 100 times and take the median.
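The repeat-and-take-the-median policy described above can be sketched as follows. This is only an illustration of the policy, not the benchmark code available at [15]: the function names are hypothetical, and the wall-clock timer stands in for the CPU cycle counts used in the paper.

```python
import statistics
import time


def median_of_repetitions(measure, repetitions=100):
    """Call `measure` `repetitions` times and return the median sample.

    `measure` is any zero-argument callable returning one numeric sample
    (hypothetical interface; the paper reports CPU clock cycles).
    """
    if repetitions < 1:
        raise ValueError("at least one repetition is required")
    samples = [measure() for _ in range(repetitions)]
    return statistics.median(samples)


def timed(operation):
    """Wrap `operation` so that each call returns its elapsed time in seconds."""
    def measure():
        start = time.perf_counter()
        operation()
        return time.perf_counter() - start
    return measure


# A single noisy sample (500) does not move the median, unlike the mean.
noisy = iter([10, 10, 11, 10, 500])
robust = median_of_repetitions(lambda: next(noisy), repetitions=5)
```

The median is preferred over the mean here because an occasional disturbance on a busy production cluster produces a few very large samples, which shift the mean but leave the median almost unchanged.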
D. MPI Communication Protocols
The cache benchmark treats the underlying MPI library as a black box. This allows us to make considerations which are not biased towards a particular feature of an MPI implementation. However, MPI libraries are very complex and, in order to correctly interpret the gathered data, implementation details cannot be completely neglected. Indeed, every MPI library exposes several "knobs" which can be used to tune the performance of a particular application on the underlying target platform [17], [18]. One of the most relevant thresholds for point-to-point communication is the so-called "eager limit". The eager protocol is not standardized by the MPI specification; however, it is an implementation technique utilized by all MPI implementations. Every message exchanged between peer processes is subject to this protocol. MPI libraries typically use (at least) two algorithms, eager and rendezvous. When the size of the transmitted message is smaller than the specified threshold value, the message (together with an MPI header) is eagerly sent to the receiver. For larger messages the rendezvous protocol is utilized instead, i.e. the sender process sends a ready-to-send message (RTS) to the receiver and blocks waiting for the acknowledgment, the clear-to-send (CTS), from the matching receive. The eager protocol is useful when latency is important because it avoids the RTS/CTS round-trip overhead. However, it requires additional buffering at the receiver side. Rendezvous protocols are typically used when resource consumption is critical.
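The choice between the two protocols can be summarized by a small decision function. The sketch below is a simplified model and not the logic of any particular MPI library: the names are hypothetical, the MPI header is ignored, and treating a message of exactly the threshold size as eager is an assumption.

```python
from enum import Enum

KIB = 1024


class Protocol(Enum):
    EAGER = "eager"
    RENDEZVOUS = "rendezvous"


def select_protocol(message_bytes, eager_limit_bytes):
    """Pick the protocol for one message (simplified model).

    Assumption: a message whose size equals the limit is still sent eagerly;
    real libraries also account for the size of the MPI header.
    """
    if message_bytes < 0 or eager_limit_bytes < 0:
        raise ValueError("sizes must be non-negative")
    if message_bytes <= eager_limit_bytes:
        return Protocol.EAGER
    return Protocol.RENDEZVOUS


def handshake_messages(protocol):
    """Control messages exchanged before the payload moves: RTS and CTS."""
    return 2 if protocol is Protocol.RENDEZVOUS else 0


# Illustrative threshold only; the actual value is library and network specific.
small = select_protocol(8 * KIB, eager_limit_bytes=12 * KIB)
large = select_protocol(64 * KIB, eager_limit_bytes=12 * KIB)
```

In this model the eager path pays no handshake but relies on receiver-side buffering, while the rendezvous path pays the RTS/CTS round trip before any payload is transferred.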
For example, the Open MPI library [7] uses multiple protocols, detailed in [19]. In the case of eager send, the behaviour is the same as described before. The rendezvous protocol, however, enables better latency hiding. When the communication is performed over RDMA-enabled networks (such as an OpenFabrics-based network, e.g. InfiniBand) the protocol is divided into three phases. In the first phase the RTS message is sent to the receiver; while the sender is waiting for the CTS message, it starts "registering" the rest of the large message with the OpenFabrics network stack. Since the registration is slow, the process is pipelined so that registration latency is hidden.
In shared memory, the rendezvous protocol can use several implementation mechanisms which have been presented over the last decade because of the increasing relevance of multi-core systems. Most shared-memory message passing implementations, such as the Nemesis [20] device in MPICH2 and the SM component in Open MPI, depend on a double buffering memory scheme. An extra memory buffer is pre-allocated as an exchange zone between processes. Communication between the processes is performed using the so-called copy-in/copy-out semantics (CICO). The sender process copies from the message buffer into the shared memory and, in the same way, the receiver reads it out and copies it into the receive buffer. In order to reduce latency, the copy happens in a pipelined way.
[Figure: plots over message size (in bytes), from 64 B to 32 MiB]
However, approaches exist, such as KNEM [16], which, via a kernel extension, allow the direct copy from the sender to the receiver buffer. This mechanism has the advantage of eliminating the additional memory copy and therefore reduces both latency and cache pollution. We perform our measurements using the default settings provided by the chosen MPI library. We use Open MPI with the default eager limit, which is 12 KiB for communication over Infiniband and 4 KiB for intra-node communication, on both systems. In the LEO3 cluster we used the default shared memory mechanism provided by the Open MPI library, which is based on CICO. On the VSC2 cluster shared memory communications are performed using the KNEM kernel extension.
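The copy-in/copy-out scheme can be illustrated with a single-process sketch in which sender and receiver alternate on a fixed-size exchange buffer. This is a model of the data movement only: the function name is hypothetical, the two roles run sequentially instead of concurrently in separate processes, and the 32 KiB default mirrors the Open MPI shared buffer size reported later in this paper.

```python
def cico_transfer(send_buffer, shared_size=32 * 1024):
    """Move `send_buffer` to a new receive buffer through a shared exchange zone.

    Returns the received bytes and the number of memory copies performed.
    Assumption: sender and receiver alternate strictly, chunk by chunk.
    """
    if shared_size < 1:
        raise ValueError("the shared buffer must hold at least one byte")
    shared = bytearray(shared_size)            # pre-allocated exchange zone
    recv_buffer = bytearray(len(send_buffer))  # receiver's private buffer
    copies = 0
    for offset in range(0, len(send_buffer), shared_size):
        chunk = send_buffer[offset:offset + shared_size]
        n = len(chunk)
        shared[:n] = chunk                              # copy-in (sender side)
        recv_buffer[offset:offset + n] = shared[:n]     # copy-out (receiver side)
        copies += 2
    return bytes(recv_buffer), copies


payload = bytes(range(256)) * 300              # 76,800 bytes, three chunks
received, copy_count = cico_transfer(payload)
```

Every chunk is copied twice, once into and once out of the exchange zone, which is the extra copy that a direct-copy mechanism such as KNEM avoids.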
III. Benchmark results
In this section we show the data gathered by running our cache benchmark on the cluster architectures listed in Table I. Because of space limitations, we show values of the PAPI performance counters only for the LEO3 architecture. Figures 1 and 2 depict the values obtained by the two benchmark scenarios (SCN1 and SCN2) using inter-node communication over Infiniband.
A. Inter-node Communication - Infiniband
1) SCN1: Figure 1 shows several performance counters associated with the MPI_Send operation, in the first row, and MPI_Recv, in the second row, using the three cache configurations: INV, EXCL and MOD. The first column shows the execution time which, in order to be as precise as possible, is expressed in number of CPU clock cycles. Differences in execution time are barely noticeable. However, we can see that having data preloaded into the cache (as in EXCL and MOD) reduces the number of L2 data cache misses (PAPI_L2_DCM counter in the second column) up to the eager limit; this is visible especially at the receiver side, where buffering happens. Indeed, the two routines have a reduced execution time when the message data is preloaded into the cache, with the reduction peaking at around 20% for messages of 8 KiB. After the eager threshold is exceeded the L2 cache still behaves better; however, we notice an increase of L3 cache misses (PAPI_L3_TCM hardware counter) which is similar for the EXCL and MOD cache states. While the reduction in L2 cache misses is constant for increasing message sizes, L3 cache misses grow proportionally with the message size.
[Figure 2: LEO3 Inter-node - SCN2 - Cache Pollution. Series: BASE_INV, BASE_PRE, INV, PRE; x-axis: message size (in bytes)]
[Figure panel: MEM_LOAD_RETIRED.L3_MISS; el: 4K, cache: 12M; x-axis: message size (in bytes)]
To better understand the reason for this we show another performance counter, in the last column of Figure 1, which depicts the number of snoop invalidation requests addressing the CPU. It can be noticed that during the rendezvous protocol the number of invalidation requests increases considerably if the message data is preloaded into the cache. This is more marked for the receiver, as the NIC driver updates the message buffer in main memory and therefore any dirty copies in the cache need to be invalidated.
2) SCN2: The measurements for the second scenario, SCN2, are depicted in Figure 2. As previously stated, this benchmark measures the performance resulting from accessing the message buffer right after it has been sent or received. We keep the performance values for BASE_INV and BASE_PRE as an upper and a lower bound for the performance we expect from this scenario. The number of L3 cache misses is of particular interest. In the case of the sender process, we notice that accessing the data after the send operation (INV) causes the same number of misses measured for BASE_INV. This means the send operation does not pollute the application cache. However, this is not true for messages which are smaller than the eager limit. In that case there are no L3 cache misses for both the INV and PRE configurations.
Major differences between sender and receiver happen beyond the eager threshold. In PRE, while at the sender side the number of cache misses is comparable with the one measured for the BASE_PRE configuration, the receiver behaviour is instead similar to the BASE_INV case. Indeed, the receive operation invalidates the entire L3 cache (as suggested by the memory bus snoop operations shown in Figure 1) and accessing the received elements costs as many memory operations as accessing them from a completely invalid cache (BASE_INV). Additionally, loading the data after the receive routine causes more misses than the BASE_INV configuration (which should be the performance upper bound). Unfortunately we could not find a reasonable explanation for this. The increased number of L3 cache misses also has a significant impact on the execution time, which for INV and PRE is slightly higher than for BASE_INV. In our opinion, this is a consequence of the memory pinning operation performed by the MPI library. It is also worth noting that the same kind of behaviour has been observed at the sender side when the data is preloaded in a "modified" state. In that case, the send operation invalidates all the preloaded cache lines and therefore accessing the buffer data after the communication routine is slower.
[Figure 4: LEO3 Intra-node - SCN2 - Cache Pollution. Series: BASE_INV, BASE_PRE, INV, PRE; x-axis: message size (in bytes)]
B. Intra-node Communication - SM
In Figures 3 and 4 , the data obtained for shared memory configuration for the LEO3 cluster is shown.
1) SCN1: Figure 3 depicts the measurements for SCN1. In this case we observe overall a much higher number of cache misses, since the actual data exchange between the two MPI processes happens in shared memory. However, for the sender process, we see only small differences among the three configurations. We show the value of the MEM_LOAD_RETIRED.L3_MISS performance counter, which confirms the advantage, i.e. a reduced number of memory load misses, of having the message buffer available in the cache. At the receiver side instead, we observe a smaller number of both L2 and L3 cache misses for messages up to the last level cache size. Overall, the performance of MPI routines is improved when data is preloaded into the cache and the gain reaches its peak, around 25%, before the cache size is exceeded. As already stated, on this machine shared memory communication is performed using a CICO mechanism. Because this transfer between sender and receiver is done using a shared buffer, which for the Open MPI library has a size of 32 KiB, only a portion of the data cache gets polluted during the transfer.
2) SCN2: This is visible in Figure 4. Unlike what we observed for inter-node communications, in shared memory the message buffer is fully loaded into the cache for both the INV and PRE configurations. However, while the number of L3 cache misses for PRE, BASE_PRE and INV is almost the same up to 4 MiB, at 8 MiB we start seeing a gap between the three configurations. The amount of cache pollution is higher at the sender side, since the difference in cache misses between PRE and BASE_PRE is noticeably higher than at the receiver side. This is unexpected, since the data transfer from the user buffer to the shared memory segment should be implemented using non-temporal move instructions (e.g. MOVNTDQ), which prevent the target address from being loaded into the CPU cache. However, this penalty occurs only for message sizes which are larger than half of the last level cache size.
IV. Considerations and Optimization Guidelines
From the output of the MPI cache benchmark we derive, in this section, a set of intuitive rules to find a good placement for send/receive communication statements which better exploits the properties of the CPU caches. We divide our considerations into three subsections applying to specific ranges of the transmitted data, i.e., (i) from 1 byte up to the eager limit, (ii) from the eager threshold up to the last level cache size and (iii) beyond the available cache size.
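The three ranges can be written down as a small classification helper. The sketch below is illustrative: the function and label names are hypothetical, the inclusive treatment of both boundaries is an assumption, and the 12 KiB and 12 MiB values in the usage lines are example parameters, not measured properties.

```python
KIB = 1024
MIB = 1024 * KIB


def message_range(message_bytes, eager_limit_bytes, llc_bytes):
    """Classify a message size into one of the three ranges (i), (ii), (iii).

    Assumption: sizes equal to a boundary fall into the lower range.
    """
    if message_bytes < 1:
        raise ValueError("message size must be at least 1 byte")
    if eager_limit_bytes > llc_bytes:
        raise ValueError("eager limit is expected to be below the cache size")
    if message_bytes <= eager_limit_bytes:
        return "up_to_eager_limit"        # range (i)
    if message_bytes <= llc_bytes:
        return "eager_limit_to_llc"       # range (ii)
    return "beyond_llc"                   # range (iii)


# Example parameters only: 12 KiB eager limit, 12 MiB last level cache.
ranges = [message_range(size, 12 * KIB, 12 * MIB)
          for size in (256, 1 * MIB, 32 * MIB)]
```

A tuning tool or compiler pass could use such a helper to decide which of the placement guidelines applies to a given communication statement.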
A. From 1 Byte to the Eager Threshold
When the eager protocol is utilized, messages are transferred to the NIC using a memcpy() operation, which has the side effect of loading the content of the send buffer into the CPU cache. Therefore, if the transmitted data is accessed right after the send operation, the data will still be available in one of the CPU caches. Additionally, the memcpy() routine also benefits from having the source and target buffers preloaded into the cache. However, the input program could present dependencies which do not allow this transformation to be applied. In such situations, the sent/received data should be accessed immediately after the communication routines or, at the latest, before the message buffer content is evicted from the data caches.
RULE 1: For messages up to the eager limit, it is always preferable to perform the communication when the message data is cached. Received data should be immediately accessed.
B. From the Eager Limit to the Last Level Cache Size
We now consider the second message range, from the eager limit up to the cache size. In this range intra-node and inter-node communication differ and we treat them separately.
1) Inter-node communication:
As far as the communication statement is concerned, we observe an increase in the number of L3 cache misses which is proportional to the message size (Figure 1). However, the overall number of cache misses is so small that the execution time is not affected by it. More interesting considerations can be made for Figure 2. At the sender side, we noticed no cache pollution caused by the send operation. Therefore we expect no changes in the application performance from changing the placement of send statements.
However, things change dramatically at the receiver side. The receive operation invalidates all the preloaded cache lines if the message data was preloaded into the cache. Additionally, because of the memory pinning utilized by the rendezvous protocol in Open MPI, loading the received data right after the communication statement has a negative impact on performance. A similar behaviour was also observed for the sender process when the data is preloaded in a "modified" state, as discussed in Section III-B.
2) Intra-node communication:
For shared memory communication we notice a reduction of L2 and L3 cache misses (Figure 3) which is proportional to the size of the message being transferred. This has positive effects on the execution time, which reaches a maximum improvement of around 25%, both for sender and receiver processes, for 8 MiB messages. For shared memory communications, both the send and the receive routines populate the cache with the content of the message buffer and, if the data is preloaded before the communication routine, the cache lines will not be invalidated. However, when the CICO mechanism is utilized, cache pollution may occur for large messages.
RULE 3: Access the message data after the communication statements, if the data is not already loaded into the cache, when the message size is smaller than LAST_LEVEL_CACHE_SIZE/2 bytes. If the data is already in the cache, perform all the computation before invoking any communication routine.
C. Beyond the Last Level Cache Size
Beyond the last level cache size the behaviour of our benchmarks tends to converge, therefore no meaningful optimization rule can be defined. However, large messages can be divided into smaller chunks using a well-known MPI code transformation referred to in literature as software pipelining or message strip mining [21]. If the splitting size is chosen accordingly, the cache effects can be enabled. However, this aspect is orthogonal to the argumentation of this paper and we plan to explore it in future work.
[...] of a 2-dimensional matrix which is updated by the following stencil computation. It is worth noting that while the receive operation must be performed before the last iteration of the i loop, the send operation has no dependencies and can therefore be issued at any program point before the swap procedure. We derive two versions of the stencil code, depicted respectively in Listings 2 and 3.
Based on our observations, the code has bad cache behaviour, as the array elements being sent, which are in a "modified" state, are accessed right after the communication statement and therefore after being evicted from the CPU cache (when the rendezvous protocol is utilized). In order to optimize this aspect we can rewrite the code by moving the communication right after the first iteration of the loop is performed. This has two advantages: (i) it makes sure the matrix rows which are going to be sent/received are freshly loaded into the cache; (ii) it avoids accessing the received data right after the communication routine. The transformed code is depicted in Listing 2; we refer to this code version as OPT1.
The OPT1 code version can, however, be further improved for the receive operation. As a matter of fact, the received data is not immediately consumed but accessed only in the last iteration of the stencil loop. This may not be optimal for messages which are smaller than the eager limit. To optimize this aspect we can derive a second code version which utilizes the received data immediately after it becomes available in the receive buffer. This is obtained by reversing the order of execution of the stencil code. We traverse the 2-dimensional matrix from ROW-3 backwards until the first row. In this way we make sure the send data is already in the cache. We then perform the communication and subsequently complete the stencil by updating the last row. We refer to this code version as OPT2; the code is depicted in Listing 3. It is worth noting that in this version the receive buffer may not be in the cache before the message exchange if the entire problem does not fit into the cache.
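Reversing the traversal order is legal only if the updated rows do not depend on each other within one sweep. The sketch below illustrates this with a hypothetical 3-point stencil that reads from an old matrix and writes into a new one (consistent with the swap step mentioned above); it is not the code of the paper's listings, and the update formula and function names are assumptions made for illustration. Communication is omitted.

```python
def stencil_sweep(old, row_order):
    """Apply one sweep of a hypothetical 3-point vertical stencil.

    Each interior element becomes the sum of itself and its upper and lower
    neighbours, read from `old`; results are written to a separate matrix,
    so rows can be processed in any order.
    """
    cols = len(old[0])
    new = [row[:] for row in old]          # boundary rows are kept unchanged
    for i in row_order:
        for j in range(cols):
            new[i][j] = old[i - 1][j] + old[i][j] + old[i + 1][j]
    return new


def forward_rows(rows):
    """Interior rows from top to bottom."""
    return range(1, rows - 1)


def backward_rows(rows):
    """Interior rows from bottom to top, as in a reversed traversal."""
    return range(rows - 2, 0, -1)


grid = [[4 * i + j for j in range(4)] for i in range(6)]
forward_result = stencil_sweep(grid, forward_rows(len(grid)))
backward_result = stencil_sweep(grid, backward_rows(len(grid)))
```

Because both traversals produce the same matrix, the order can be chosen purely for its cache behaviour, for example to touch the rows involved in communication immediately before or after the message exchange.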
A. Evaluation
We evaluated the three versions of the stencil code on the two clusters described in Table I. Each version has been executed multiple times with different problem sizes using two different process allocations, i.e. intra-node and inter-node. We ran the stencil code using two MPI processes to correlate the outcome with the results gathered by the cache benchmark. We measured the execution time of each code version and used the median obtained from 100 repetitions of the program. Figure 5 shows, for the LEO3 cluster, the execution time of code versions OPT1 and OPT2 relative to the baseline solution (which corresponds to 1). The x axis refers to the size of the message (in bytes) being exchanged by the stencil computation in every iteration. As expected, the OPT2 version has better performance for small message sizes, reaching, for shared memory, a performance improvement of around 20% for 256-byte messages. For larger messages OPT1 has better performance, reducing the execution time of the stencil code by 40%. For even larger messages, the advantage becomes smaller as the communication/computation ratio diminishes. Figure 6 shows the results for the VSC2 cluster for both intra- and inter-node communications. Also on this machine, OPT2 has an advantage over the original stencil code for very small message sizes. However, for larger messages this version is noticeably slower. The OPT1 version, on the contrary, is faster for both inter- and intra-node communication. However, the measured performance improvement is modest, i.e. around 10%. We believe the bad performance of the OPT2 version is due to the reversed access of array elements, which may prevent the CPU prefetcher from correctly determining the data access pattern.
Overall, the tuned code is faster on both architectures. We demonstrate how our simple guidelines, derived from the LEO3 cluster, are portable to different architectures. However, the experiment also shows how sensitive the performance can be to peculiarities of the underlying hardware.
VI. Conclusions and Future Work
In this paper we studied, using a synthetic benchmark, the impact of CPU caches on MPI communication statements and conversely the effects of MPI routines on the application cache. We described interesting findings regarding implementation details of the MPI library and we derived three simple optimization guidelines which can be used to tune MPI programs for improved cache utilization.
We applied our optimization rules to a simple stencil code, showing a performance improvement of up to 40%. To some extent, we demonstrated that performance is portable among different architectures. However, experiments showed that the details of the underlying CPU architecture may favour different styles of optimization. Therefore we expect, in the future, that by combining the semantics of MPI communication routines and the knowledge of the underlying architecture, the code transformations proposed here can be automatically applied by an MPI-aware compiler.
