Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.
Introduction
One of the most important obstacles to achieving the next three orders of magnitude performance increase in large-scale computing systems is power. Current multipetascale systems require several megawatts of power, and future exascale systems are projected to consume one to two hundred megawatts if this issue is not addressed (Kogge et al., 2008). The need to significantly decrease power usage and drastically increase energy efficiency has become pervasive in the high performance computing community. Chip designers and manufacturers, such as Intel, AMD, IBM, and Nvidia, are pursuing a myriad of strategies aimed at reducing power consumption while still trying to increase, or at least preserve, overall performance characteristics. One strategy pursued by nearly all chip makers is the use of simpler cores that forgo more advanced features, such as out-of-order execution and sophisticated branch prediction. The industry is trending towards a socket with two distinct core types: a small number of complex cores optimized for single-thread performance and a larger number of simple cores designed to optimize throughput. Such a design also requires an intelligent scheduling system to determine how best to balance overall application performance with energy consumption.
One important aspect of the trend towards combining cores with drastically different processing capability is the ability of future processors to adequately perform computations associated with network protocols, providing a new perspective when considering offload versus on-load approaches. In the past, proponents of an on-load approach have argued that an embedded processor on a network interface will always be significantly less capable than a single core of a host processor. Further, the cost of designing and manufacturing more sophisticated network interface hardware is often prohibitive compared to simply dedicating those functions to a host core. As the number of host cores continues to increase, the impact of dedicating a small number of host cores to network protocol processing becomes less costly. However, this argument was based largely on the assumption that host cores would continue to be much more powerful than the embedded cores typically found on network interface cards and that there would be several such host cores to dedicate. This assumption may not hold for more energy-efficient hybrid-core processors, such as the Intel Xeon Phi and ARM Cortex-A9.
MPI's strict ordering requirements and unexpected message semantics require the capability to quickly walk an ordered list with out-of-order removal semantics. For implementations that support offload, this operation is typically done on a network interface and is not performed by the host cores. For host-based implementations, this search operation is done by an application thread, either via the main application thread through an explicit MPI library function call or via a separate progress thread created by the MPI library specifically to support asynchronous processing of incoming messages. In this paper, we compare throughput-oriented cores to single-thread optimized cores in terms of the ability to perform MPI match processing. The intent of this study is to gain insight into the ability of throughput-oriented cores to adequately perform MPI matching and to better understand how MPI implementations on future hybrid-core processors should allocate computing resources to try to optimize performance-critical MPI operations.
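The matching operation described above can be illustrated with a minimal sketch. It assumes a simplified posted receive queue held as a plain Python list, with matching on communicator, source, and tag; the names `PostedRecv`, `matches`, and `match_incoming` are hypothetical and do not correspond to any particular MPI implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

ANY_SOURCE = -1  # illustrative stand-in for MPI_ANY_SOURCE
ANY_TAG = -1     # illustrative stand-in for MPI_ANY_TAG


@dataclass
class PostedRecv:
    source: int
    tag: int
    comm: int


def matches(recv: PostedRecv, source: int, tag: int, comm: int) -> bool:
    """True when the communicator is equal and source/tag are equal or wildcards."""
    return (recv.comm == comm
            and recv.source in (ANY_SOURCE, source)
            and recv.tag in (ANY_TAG, tag))


def match_incoming(posted: List[PostedRecv], source: int, tag: int,
                   comm: int) -> Optional[PostedRecv]:
    """Walk the list in posting order; remove and return the first match."""
    for i, recv in enumerate(posted):
        if matches(recv, source, tag, comm):
            # The match may sit anywhere in the list, so removal is out of order.
            return posted.pop(i)
    return None  # no match: a real library would queue the message as unexpected


posted = [PostedRecv(1, 7, 0), PostedRecv(ANY_SOURCE, 9, 0), PostedRecv(2, 9, 0)]
hit = match_incoming(posted, source=2, tag=9, comm=0)
```

In this example the wildcard receive is selected because it was posted before the exact-source receive, and the two remaining entries keep their relative order. A production implementation would typically use linked lists rather than a Python list, which is where the fragmented, noncontiguous memory accesses come from.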
This paper is organized as follows: Section 2 provides a general background, and Section 3 discusses related work. Results of benchmarking MPI message rate performance on a variety of architectures are shown in Section 4. Finally, we present conclusions from our work in Section 5.
Background
There are three commonly used high-level measures of network performance: bandwidth, latency, and message rate. An application's sensitivity to these parameters depends heavily on the communication patterns used. Sending large messages tends to increase dependence on bandwidth, while sending small- to medium-sized messages increases sensitivity to latency and/or message rate. Over the past decade, the growth in node compute capability has far outpaced the growth of off-node network bandwidth. As a consequence, many applications are looking to refactor their communication to reduce their sensitivity to peak bandwidth. The result will be applications that require higher message rates for efficient scaling.
The historical characteristics of HPC networks have driven many scientific applications to use distinct computation and communication stages. Such a paradigm increases the dependence on high peak bandwidths and leads to a significant amount of application runtime being spent in the communication library (Pedretti et al., 2011). This bulk synchronous model leads to inefficient use of the system interconnect, as the network sits essentially idle through the application's compute phase and is heavily oversubscribed during the communication phase. The result is a requirement for high peak bandwidths to move through the communication phase as quickly as possible. One way to mitigate the need for excessively high peak bandwidth is to refactor applications to enable a more uniform temporal distribution of communication. This can be done by blending the compute and communicate stages, sending smaller, more frequent messages throughout the computation instead of batching them for a single communication step. This will result in an increasing number of smaller messages, leading to higher message rate requirements.
This intermingling of computation and communication may also be a natural consequence of code restructuring designed to take advantage of modern many-core architectures. The need to exploit higher levels of available parallelism by utilizing larger numbers of less powerful cores could lead to more fine-grained communications, as programmers try to make better use of computation/communication overlap to improve the latency tolerance of their applications. Providing effective support for overlap in this way will require high message rate and a good message progression model.
Even as message rate requirements are increasing, the move to lower-performance cores may make reaching even today's message rates difficult. Most of today's HPC interconnects use an on-load model with respect to MPI matching and queue traversal, which are key bottlenecks in achieving high MPI message rates (Underwood and Brightwell, 2004). MPI matching performance, in turn, relies heavily on a core's ability to effectively manipulate the posted receives and unexpected message queues, which can become fragmented in memory over time, leading to irregular, noncontiguous memory accesses. Modern processors can overcome this issue with high clock frequencies and aggressive use of out-of-order pipelines. The lightweight cores in modern many-core architectures have largely removed or greatly simplified these features to improve energy efficiency. The reduced capabilities of these cores may make it difficult to provide sufficient MPI message rates.
This paper explores the MPI message rates of several modern processors, including both heavyweight and lightweight processors, operating at nominal performance for each processor type. It also includes an analysis of a heavyweight and a lightweight processor operating at approximately the same clock frequency to get a better understanding of the impact of out-of-order pipelines on the matching problem.
Related work
Previous work related to measuring MPI message rate performance has been aimed at developing a suitable micro-benchmark that provides accurate and realistic results. The Ohio State University (OSU) benchmark suite (The Ohio State University, 2014) was the first benchmark suite to measure MPI message rate performance, and subsequent work from Sandia (Barrett and Hemmert, 2009) provided a more sophisticated micro-benchmark intended to better reflect real application behavior. Like the OSU micro-benchmark, the Sandia micro-benchmark measures unidirectional message rate, but it also measures bidirectional message rate, message rate with a configurable number of pre-posted receives (through communication with multiple peers), and a model that combines communication and computation to simulate the impact of cache effects on message rate performance. Both of these benchmarks have been used to complement the traditional latency and bandwidth performance results in evaluating high-performance interconnects.
In Chen et al. (2012), interconnect designers from IBM describe a mode of operation in which an application running on a BlueGene/Q architecture can utilize one of the many hardware threads present in a node to accelerate communication. As a potential sentinel for future architectures, the BlueGene/Q processor is of interest since its four-way hardware threaded architecture cannot be fully utilized by many applications; the otherwise underutilized hardware resources can therefore be used for MPI threads when applications rely heavily on efficient communication. Such an approach may be appealing to consider more generally. However, the custom design of BlueGene/Q hardware and software is a difficult model to apply across numerous processor designs or indeed processor vendors. This motivates the need to consider whether weaker-performing, throughput-optimized cores will provide sufficient MPI matching performance or, alternatively, whether the use of such cores may in the end become a bottleneck for the complex communication scenarios frequently seen in large-scale parallel production codes.
In this paper, we use both the Sandia and OSU message rate micro-benchmarks to measure and compare the performance of cores that have vastly different processing capabilities. The OSU benchmark has been modified to allow a specified number of either posted receives or unexpected messages to be inserted at the front of the appropriate queue (details are provided in Section 4.1). These entries are never matched, allowing us to study the effects of traversing multiple list entries before finding a match.
There is a growing trend towards dedicating host processor cores to network processing functions, both for general-purpose networking and for high performance networking. For example, the IsoStack (Shalev et al., 2010) project uses dedicated host cores to improve the TCP/IP network stack. Similarly, the Core Specialization (Pritchard et al., 2012) capability from Cray provides a mechanism for dedicating a host processor core to running an MPI progress thread. This paper is an extended version of previous work (Barrett et al., 2013) analyzing MPI message rate performance of hybrid-core processors.
Results
Two benchmarks are used to examine message rate: the OSU MPI Multiple Bandwidth/Message Rate test (with modifications as described) and the Sandia Micro-Benchmark (SMB) Message Rate test. The OSU benchmark provides the best-case single-direction message rate, while the Sandia benchmark utilizes communication patterns that are closer to those generally seen in scientific applications. We examine two classes of architectures: traditional high single-thread performance designs and low-power throughput-oriented cores. Intel's Sandy Bridge, Ivy Bridge, and Haswell processors, as well as IBM's POWER7+, are examples of single-thread performance cores. The Intel Xeon Phi co-processor and the ARM Cortex-A9 are used as examples of throughput cores.
Sandy Bridge experiments were run on a machine with dual-socket Intel Xeon E5-2670 processors. Each processor contains eight cores running at 2.6 GHz with hyperthreading enabled, 20 MB of cache, and four DDR3-1333 memory channels with 64 GB of main memory. The machine was running Red Hat Enterprise Linux Workstation 6.3.
For the Ivy Bridge results, we performed benchmarking on a dual-socket node with Intel Xeon E5-2695v2 processors, each of which comprises 12 cores running at 2.4 GHz, also with hyperthreading enabled. The cores in each socket share a 30 MB last-level cache and connect to 64 GB of DDR3-1866 memory. The node was configured using a standard installation of Red Hat Enterprise Linux 6.3.
Benchmarking activities for the Haswell processor were performed on an Intel Core i7-4770S quad-core processor clocked at 3.10 GHz. The 8 MB last-level (L3) cache is shared by the cores and is backed by 16 GB of DDR3 memory. These experiments were performed using the Fedora 19 operating system.
Experiments on the Intel Xeon Phi co-processor card were performed on a B1-stepping co-processor with 57 cores, each being four-way SMT-enabled (giving a total of 228 hardware threads) and clocked at approximately 1.1 GHz. The card provides 8 GB of GDDR5 memory, with approximately 7 GB available to applications after accounting for operating system and RAM-disk file system allocations. The Xeon Phi runs an adapted version of the Linux 2.6.38 kernel designed to support large-scale SMP processing as well as changes to the ISA required for the Knights Corner processor.
The ARM results were run on a PandaBoard-ES, which contains a dual core 1.2 GHz Cortex-A9 and 1 GB of low power DDR2 memory. The machine was running Ubuntu 12.04 with hardware floating point support.
Finally, the IBM POWER7+ system comprises dual-socket, eight-core processors running at 3.5 GHz. Each core supports four-way SMT, allowing 32 threads to be executed per socket (providing a total of 64 hardware threads per node). Each node is configured with 64 GB of DDR3 memory and has the Fedora 19 Linux operating system installed.
For all experiments on the Sandy Bridge, Ivy Bridge, Haswell, ARM, and POWER processors, MPICH 3.0.4 (Gropp et al., 1996) was compiled with the system version of GCC and --enable-fast was used. Intel's Xeon Phi Composer XE compiler toolkit (version 13.1 (Update 2)) was used with -O3 -mmic compiler flags for runs on the Knights Corner co-processor. Processes were pinned to adjacent cores (but not on the same core for machines with hyper-threading enabled or the Xeon Phi co-processor). All tests were linked against Intel MPI 4.1.1 and used the shared memory transport for communication. Shared memory presents an ideal case for examining message rate, as the overhead of transferring data and headers is minimal and artifacts due to PCI buses or network adapters are removed.
Queue depth
Because low-power cores have limited out-of-order performance and a memory architecture that is generally weak for HPC applications, we initially expected them to exhibit poor MPI match list walk capabilities. To examine this hypothesis, we modified the OSU message rate benchmark to vary the length of either the unexpected message queue or the posted receive queue traversed during the benchmark. For the unexpected queue, the sender in the message rate benchmark begins the performance loop by sending N messages using a tag different from that used in the performance loop. Matching receives have not been posted, so the unexpected message queue that must be traversed for every receive operation grows by N entries. For the posted receive queue, we instead post N receives on a tag different from that used in the performance loop, and that list must be traversed for every incoming message. Figure 1 demonstrates the effect of additional queue entries on peak unidirectional message rate, with Figure 1(a) demonstrating the effect of increasing the posted receive queue length and Figure 1(b) demonstrating the effect of increasing the unexpected message queue length. In both situations, the effect on all six hardware platforms is noticeable with only a small number (less than 32) of entries in the queue. This suggests that for applications which are message-rate bound, estimates of future requirements based on peak message rate may be overstated for offload designs which are able to sustain high message rates deeper into a queue traversal.
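The posted receive variant of this modification can be modeled with a short sketch. It is only a toy model of the queue, not the benchmark itself: it assumes tags are the only match criterion and that one receive is posted per incoming message, and the function name and the tag values 99 and 1 are hypothetical.

```python
def entries_inspected_per_message(depth: int, messages: int,
                                  filler_tag: int = 99,
                                  loop_tag: int = 1) -> float:
    """Average number of posted-receive entries inspected per incoming message."""
    # N receives on a different tag sit at the front of the queue and never match.
    queue = [filler_tag] * depth
    inspected = 0
    for _ in range(messages):
        queue.append(loop_tag)            # receive posted by the timed loop
        for i, tag in enumerate(queue):   # incoming message walks from the head
            inspected += 1
            if tag == loop_tag:
                queue.pop(i)              # matched entry is removed
                break
    return inspected / messages


shallow = entries_inspected_per_message(depth=0, messages=100)
deep = entries_inspected_per_message(depth=32, messages=100)
```

Under these assumptions every incoming message inspects N + 1 entries, so the work per message grows linearly with the number of pre-inserted entries while the filler entries themselves are never consumed.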
Normalizing the data to peak message rate results in Figure 2. The data suggest that, despite the drastic difference in peak message rate, the fall-off in message rate does not differ as severely between processor designs. Somewhat surprisingly, the drop-off for the Cortex-A9 is the slowest of the six processors (while it also has the lowest peak performance). Figure 3 attempts to demonstrate the cost per item of list walk using equation (1), where N is the length of the list traversed:
Previous work suggests that the MPICH shared memory transport is largely limited by receive performance (Buntinas et al., 2007), which supports this methodology. While the cost of walking a single item in the list is relatively constant for the single-thread performance processors out to 1024 items, the Xeon Phi and ARM results show an increase in cost relatively early in the list size, suggesting that the smaller caches may impact message rate on those processors.
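As a rough illustration of the two derived quantities discussed here, the sketch below normalizes measured rates to their peak and estimates a per-entry walk cost. The cost formula is an assumption made for this example (the extra time per message relative to an empty queue, divided by the number of entries walked) and is not necessarily the paper's equation (1); the rates used are made-up values, not measurements.

```python
from typing import Dict


def normalize_to_peak(rates: Dict[int, float]) -> Dict[int, float]:
    """Scale each message rate (messages/s) by the highest rate observed."""
    peak = max(rates.values())
    return {depth: rate / peak for depth, rate in rates.items()}


def per_entry_cost_ns(rates: Dict[int, float],
                      baseline_depth: int = 0) -> Dict[int, float]:
    """Assumed estimate: added time per message divided by added list entries."""
    base_time = 1.0 / rates[baseline_depth]   # seconds per message, empty queue
    return {
        depth: (1.0 / rate - base_time) / (depth - baseline_depth) * 1e9
        for depth, rate in rates.items()
        if depth != baseline_depth
    }


# Hypothetical rates keyed by queue depth N, chosen only to show the arithmetic.
example_rates = {0: 4_000_000.0, 100: 1_000_000.0}
normalized = normalize_to_peak(example_rates)
cost = per_entry_cost_ns(example_rates)
```

With these made-up numbers, the time per message rises from 250 ns to 1000 ns over 100 entries, which gives an estimated 7.5 ns per entry. A per-entry cost that stays flat as N grows would correspond to the behavior reported for the single-thread performance processors.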
Frequency comparison
Figure 4(a) and (b) show the impact that clock frequency has on message rate. For both the Sandy Bridge and Xeon Phi, reducing clock frequency reduces message rate. Reductions in clock frequency were accomplished with the cpufreq Linux kernel module by explicitly setting the frequency via the user-space frequency governor. An interesting comparison can be made between the Sandy Bridge 1.2 GHz data in Figure 4(b) and the 1.1 GHz Xeon Phi data in Figure 4(b). The two architectures are at similar clock frequencies, yet the Sandy Bridge has an approximately 4× performance advantage over the Xeon Phi at smaller queue depths. This highlights the architectural advantages of the complex out-of-order Sandy Bridge core over the in-order Xeon Phi core. Figure 4(a) also illustrates the potential impact of the trend towards reducing individual core frequency in future generation processors. While some of the performance degradation can be addressed through architectural improvements, it can no longer be assured that message rate will improve in subsequent processor generations.
Peer count effects
Figure 5 examines the effect of peer count on message rate. As with the frequency scaling results, only Sandy Bridge and Xeon Phi results are shown. For the pair-based tests, ranks are paired off and communicate bidirectionally. The Sandy Bridge results show high initial message rates that tail off rapidly, while the Xeon Phi shows a relatively steady performance curve as the number of communicating pairs is increased. We believe that this is due to memory bandwidth constraints in the Sandy Bridge processor relative to its peak performance.
The pre-post tests from Figure 5 post a large number of receives, equally distributed by the number of peers in the job, then send an equal number of messages to all peers in the job. The messages are all pre-posted and use an un-timed barrier to ensure that all messages are expected on arrival. The test mimics communication patterns seen on many halo-exchange applications, in which the long time between communication due to computation replaces the barrier synchronization. Both processor designs show a significant performance hit as the number of peers is increased, suggesting the cost of synchronization of shared memory queues and the longer queues impact performance significantly. 
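One way to see why longer pre-posted queues can hurt is a small model of the search depth per incoming message. The sketch below is an illustration under stated assumptions, not the Sandia benchmark: receives are assumed to be posted grouped by peer, messages are assumed to arrive round-robin across peers, and the source rank is the only match criterion. The function name is hypothetical.

```python
from typing import List


def mean_search_depth(total_recvs: int, peers: int) -> float:
    """Average posted-queue entries inspected per message in the toy model."""
    per_peer = total_recvs // peers
    # Assumed posting order: all receives for peer 0, then peer 1, and so on.
    queue: List[int] = [p for p in range(peers) for _ in range(per_peer)]
    inspected = 0
    messages = 0
    for _ in range(per_peer):            # assumed arrival order: round-robin
        for source in range(peers):
            for i, posted_source in enumerate(queue):
                inspected += 1
                if posted_source == source:
                    queue.pop(i)         # matched receive leaves the queue
                    break
            messages += 1
    return inspected / messages


one_peer = mean_search_depth(32, 1)
two_peers = mean_search_depth(32, 2)
eight_peers = mean_search_depth(32, 8)
```

With the same 32 receives, the model inspects 1 entry per message for a single peer, 4.75 on average for two peers, and 6.25 for eight peers. Real arrival orders and the cost of synchronizing shared memory queues are not captured here, so this shows only the queue length component of the effect.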
Conclusions
The current trend in high performance computing (HPC) is to combine high single-thread performance processors (Xeon or POWER) with high-throughput cores (GPUs, Xeon Phi, etc.) that provide high FLOP counts with low energy usage. Future trends for HPC systems point to either a more tightly integrated heterogeneous system of 'big core/little core' processors or a homogeneous system of throughput cores. Both options will result in a node with much higher computational capability compared to a single-thread performance processor. Network performance is expected to grow at a slower rate than the increase in computational performance seen by either processor design. The growing processor/network performance gap will result in applications having to adjust to maintain scalability. This adjustment will likely increase the number of applications that require high message rates.
Figure 5. Effect of peer count on message rate.
This paper analyzes the message rates delivered by current high-throughput or low-power general-purpose processors (the Xeon Phi and ARM) and compares them to the current high single-thread performance processors. The single-thread performance processors exhibit peak message rates an order of magnitude higher than the Xeon Phi and ARM processors. The per-element list walk time is also significantly lower on the single-thread performance processors, suggesting that there is both a much higher constant overhead to message processing on the Xeon Phi and ARM and a higher per-element cost. Clocking the Sandy Bridge processors to within 10% of the clock rate of the Xeon Phi still yields a message rate four to five times higher than that of the Xeon Phi, suggesting that more than clock rate is responsible for the performance difference.
The one bright spot of the Xeon Phi is that it is capable of maintaining performance as more and more processes are running the MPI benchmark, suggesting that the performance drop-off as more application processes are making MPI calls may be minimal. Even if we assume that the Xeon Phi can maintain peak per-core message rate with all cores communicating, the performance difference between throughput and single-thread nodes is similar. However, the throughput core nodes have significantly higher compute capabilities, suggesting that applications restructuring to use higher message counts to mitigate lowering network bandwidth availability might result in unexpected performance difficulties on future machines.
The poor message rate of throughput or energy-efficient cores suggests that an offload approach to MPI is necessary for large-scale HPC systems. The high single-thread performance cores, such as the Intel Xeon processor line, provide sufficient message rate capabilities to provide the offload for throughput cores. In systems without a big-core/little-core approach, hardware offload network adapters may be necessary to sustain message performance.
communication subsystem for discrete event simulators while at the Information Sciences Institute.
Ron Brightwell received his BS in Mathematics in 1991 and his MS in Computer Science in 1994 from Mississippi State University. He joined Sandia National Laboratories in 1995 and is currently Manager of the Scalable System Software Department. While at Sandia, he has designed and developed software for lightweight compute node operating systems and high performance networks on several large-scale massively parallel systems, including the Intel Paragon and TeraFLOPS, and the Cray T3 and XT series of machines. He has authored more than 75 peer-reviewed journal, conference, and workshop publications. His research interests include high performance, scalable communication interfaces and protocols for system area networks, operating systems for massively parallel processing machines, and parallel program performance analysis libraries and tools. He is a Senior Member of the IEEE and the ACM.
Ryan Grant is a post-doctoral appointee in the Scalable Software Systems group at Sandia National Laboratories in Albuquerque, NM, USA. He graduated with a PhD in Computer Engineering from Queen's University in Kingston, ON, Canada in 2012. His research interests are in high performance computing, with emphasis on high performance networking for exascale systems. He is an active member of the Portals Networking Interface design team, a high performance interconnect specification. He is an IEEE member and is actively involved in IEEE sponsored conference organization.
Simon Hammond is a member of the Scalable Computer Architectures Group at Sandia National Laboratories. His research focuses on the analysis of novel processor and supercomputer designs with activities including the porting and optimization of miniapplications, the development of efficient mathematical kernels for future processors, and the implementation of architectural simulators to enable before-installation predictions of performance to be obtained. He currently works with several Department of Energy Office of Science and National Nuclear Security Administration Co-design centers.
Scott Hemmert is a Principal Member of Technical Staff at Sandia National Laboratories, where he leads the advanced supercomputer interconnect research. He is also a member of the joint Sandia/Los Alamos National Laboratory Alliance for Computing at Extreme Scale (ACES) team, where he serves as project lead over hardware architecture for the Trinity supercomputer. His research interests include supercomputer interconnects and exascale architectures. Hemmert has a PhD in Electrical Engineering from Brigham Young University, UT, USA.
