Good performance monitoring is the basis of modern performance analysis tools for application optimization. We are providing a variety of such performance analysis tools for the new Blue Gene/L supercomputer. Those tools can be divided into two categories: single-node performance tools and multinode performance tools. From a single-node perspective, we provide standard interfaces and libraries, such as PAPI and libHPM, that provide access to the hardware performance counters for applications running on the Blue Gene/L compute nodes. From a multinode perspective, we focus on tools that analyze Message Passing Interface (MPI) behavior. Those tools work by first collecting message-passing trace data when a program runs. The trace data is then used by graphical interface tools that analyze the behavior of applications. Using the current prototype tools, we demonstrate their usefulness and applicability with case studies of application optimization.
Introduction
The Blue Gene/L (BG/L) supercomputer is a new massively parallel system being developed by IBM in partnership with Lawrence Livermore National Laboratory (LLNL). BG/L uses system-on-a-chip (SoC) integration [1] and a highly scalable architecture [2] to assemble 65,536 dual-processor compute nodes. When operating at its target frequency of 700 MHz, BG/L will deliver 180 or 360 teraflops of peak computing power, depending on its mode of operation.
Each BG/L compute node can address only its local memory, making message passing the natural programming model for the machine. This paper discusses ongoing work on performance analysis tools to support the analysis of the execution of programs on BG/L. We are currently developing and porting such tools, and, at the same time, helping application programmers to port their applications to BG/L. This paper is organized as follows: We first present a discussion of BG/L hardware, followed by a description of the implementation of the Message Passing Interface (MPI) communication library for this machine. The performance analysis tools on which we are working are introduced, followed by descriptions of the experiences and lessons learned after using our tools in a set of experiments with microbenchmarks and real applications. Related work is briefly described, and conclusions drawn.
A short discussion of Blue Gene/L hardware
The Blue Gene/L hardware [2] and system software [3, 4] have been extensively described elsewhere. In this section, we remind the reader of the hardware features most relevant to the discussion to follow.
Blue Gene/L processors: The 65,536 compute nodes of BG/L are based on a custom SoC design that integrates embedded low-power processors, high-performance network interfaces, and embedded memory. The low-power characteristics of this architecture permit very dense packaging. One air-cooled BG/L rack contains 1,024 compute nodes (2,048 processors) with a peak performance of 5.7 teraflops.
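The peak figures quoted so far can be checked with a few lines of arithmetic. The sketch below assumes that each processor completes four floating-point operations per cycle (two fused multiply-adds on the two-element FPU); that factor is not stated in this section and is an assumption of the example, as are all of the names used.

```python
# Back-of-the-envelope check of the peak-performance figures quoted in the text.
# ASSUMPTION: 4 floating-point operations per cycle per processor
# (two fused multiply-adds per cycle); this factor is not given in the text.
FLOPS_PER_CYCLE = 4
CLOCK_HZ = 700e6            # target frequency from the text
NODES = 65_536              # compute nodes in the full machine
NODES_PER_RACK = 1_024      # compute nodes in one rack


def peak_teraflops(nodes, computing_processors_per_node):
    """Peak rate in teraflops for the given number of nodes."""
    return nodes * computing_processors_per_node * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12


one_cpu_mode = peak_teraflops(NODES, 1)      # one processor per node computes
two_cpu_mode = peak_teraflops(NODES, 2)      # both processors per node compute
one_rack = peak_teraflops(NODES_PER_RACK, 2)

print(f"{one_cpu_mode:.1f} TF, {two_cpu_mode:.1f} TF, {one_rack:.2f} TF per rack")
```

Under this assumption the results land within rounding of the 180/360-teraflop and 5.7-teraflop figures given in the text.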
The BG/L chip incorporates two standard 32-bit embedded IBM PowerPC 440 (PPC440) processors with private L1 instruction and data caches, a small (2-KB) L2 cache and prefetch buffer, and 4 MB of embedded dynamic random access memory (DRAM), which can be partitioned between shared L3 cache and directly addressable memory. A compute node also incorporates 512 MB of double-data-rate (DDR) memory.
Cache coherency: The standard PPC440 cores are not designed to support multiprocessor architectures: The L1
©Copyright 2005 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
errors, as detected by the CRC. Irreversible packet losses are considered catastrophic and stop the machine. The communication library considers the machine to be completely reliable.
Network ordering semantics: Adaptively routed network packets may arrive out of order, forcing the message layer to reorder them before delivery. Packet reordering is expensive because it involves memory copies and requires packets to carry additional sequence and offset information. On the other hand, deterministic routing leads to more network congestion and increased message latency, even on lightly used networks.
CPU/network interface: The torus network is mapped into user-space memory. Packets are read and written with the special 16-byte single-instruction multiple-data (SIMD) load-and-store instructions of the custom FPUs. These require that memory accesses be aligned to a 16-byte boundary. The communication software does not have control over the alignment of user buffers. In addition, the sending and receiving buffer areas can be aligned at different boundaries, forcing packet realignment through memory-to-memory copies.
Hardware performance counters: The CPU core used in the system has no hardware performance analysis capabilities. Instead, performance counters have been implemented as a separate unit of the die, the universal performance counter (UPC) unit. Further, the double-hummer FPU has its own performance counters. As a consequence of the design, hardware performance counters are available for a large number of events, with the exception of events internal to the CPU cores.
The UPC unit consists of 16 control registers used to manage the behavior of 48 32-bit counter registers. In total, 311 UPC events are available, exposing the behavior of all aspects of the BG/L die outside the CPU cores. This includes the prefetch unit, the L3 cache controller, and the collective and the torus network controllers. Additionally, one control register in each of the double-hummer units manages two counter registers for events related to this unit. Finally, a 64-bit timestamp register is available. The timestamp register can be read by user-level code, while the UPC and FPU registers are available only in privileged mode.
The UPCs can be individually controlled to count the rising or falling edge of an event, or the duration (in CPU cycles) of an event state being either active or inactive. The UPC counters can individually be set to generate interrupts on user-selectable count thresholds. The FPU counters are divided into one floating-point arithmetic operation counter and one load/store counter. Each FPU counter is user-programmable to count the occurrence of a subset of operations, such as, for example, arithmetic trinary operations and quadword stores.
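The four counting behaviors described above (rising edge, falling edge, duration active, duration inactive) can be illustrated on a sampled signal. The following is a minimal software sketch, assuming one 0/1 sample per CPU cycle and a signal that is inactive before the first sample; the function and mode names are illustrative and do not correspond to any BG/L interface.

```python
def count_events(samples, mode):
    """Count over a sequence of 0/1 signal samples, one sample per cycle.

    ASSUMPTION: the signal is inactive (0) before the first sample.
    """
    previous = 0
    total = 0
    for current in samples:
        if mode == "rising_edge":
            total += int(previous == 0 and current == 1)
        elif mode == "falling_edge":
            total += int(previous == 1 and current == 0)
        elif mode == "duration_high":
            total += int(current == 1)   # cycles spent in the active state
        elif mode == "duration_low":
            total += int(current == 0)   # cycles spent in the inactive state
        else:
            raise ValueError(f"unknown mode: {mode}")
        previous = current
    return total


signal = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
counts = {mode: count_events(signal, mode)
          for mode in ("rising_edge", "falling_edge", "duration_high", "duration_low")}
print(counts)
```

The same event signal therefore yields four different counts, which is why a native event has to name both the event and the counting behavior.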
Architecture of Blue Gene/L MPI
The BG/L MPI is an optimized port of the MPICH2 [5] library, an MPI library designed with scalability and portability in mind. Figure 1 shows two components of the MPICH2 architecture: message passing and process management. MPI process management in BG/L is implemented using system software services. We present the architecture of the message-passing component as it is relevant to the performance analysis tools.
The upper layers of the message-passing functionality are implemented by MPICH2 code. MPICH2 provides the implementation of point-to-point messages, intrinsic and user-defined data types, communicators, and collective operations, and it interfaces with the lower layers of the implementation through the Abstract Device Interface Version 3 (ADI3) layer [6]. The ADI3 layer consists of a set of data structures and functions that have to be provided by the implementation. In BG/L, the ADI3 layer is implemented using the BG/L message layer, which in turn uses the BG/L packet layer.
ADI layer
The ADI layer is described in terms of MPI requests (messages) and functions to send, receive, and manipulate these requests. The BG/L implementation of ADI3 is called bgltorus. It implements MPI requests in terms of message-layer messages, assigning one message to every MPI request. Message-layer messages operate through callbacks. Messages corresponding to send requests are posted in a send queue. When a message transmission is finished, a callback is used to inform the sender. Correspondingly, there are callbacks on the receive side to signal the arrival of new messages. The callbacks match incoming message-layer messages to the list of MPI posted and unexpected requests. This implementation is the equivalent for BG/L to that usually implemented in CH3 over sockets in Transmission Control Protocol/Internet Protocol (TCP/IP) networks.
BG/L message layer
The BG/L message layer is an active message system [7] [8] [9] [10] that implements the transport of arbitrarily sized messages between compute nodes using the torus network. It can also broadcast data using special torus packets that are deposited on every node along the route they take. The message layer breaks messages into fixed-size packets and uses the packet layer to send and receive the individual packets. At the destination, the packets may arrive out of order, and the message layer is responsible for reassembling them into a message. The software structure of the message layer is shown in Figure 2.
The message layer addresses nodes using the equivalent of MPI_COMM_WORLD ranks. Internally, it translates these ranks into physical torus x, y, z coordinates, which are used by the packet layer. The mapping of ranks to torus coordinates is programmable by the user and can be used to optimize application performance by choosing a mapping that supports the logical communication topology of the application.
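To make the rank-to-coordinate translation concrete, the sketch below maps ranks onto a small torus and computes the hop distance between two nodes. The x-fastest ordering and the 4 x 4 x 2 partition shape are assumptions chosen for illustration; the text does not specify the default mapping, and the function names are hypothetical.

```python
def rank_to_xyz(rank, dims):
    """Map a rank to (x, y, z) with x varying fastest (an ASSUMED ordering)."""
    size_x, size_y, _ = dims
    x = rank % size_x
    y = (rank // size_x) % size_y
    z = rank // (size_x * size_y)
    return (x, y, z)


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes on a 3D torus with wraparound links."""
    return sum(min(abs(p - q), size - abs(p - q))
               for p, q, size in zip(a, b, dims))


dims = (4, 4, 2)                      # illustrative 32-node partition shape
origin = rank_to_xyz(0, dims)
corner = rank_to_xyz(31, dims)
print(rank_to_xyz(5, dims), corner, torus_hops(origin, corner, dims))
```

A user-supplied mapping would replace `rank_to_xyz` with a table lookup; ranks that exchange the most data would then be placed so that `torus_hops` between them is small.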
Message transmission in the message layer is implemented using one of multiple available The message layer is able to handle arbitrary collections of data, including noncontiguous data descriptors described by MPICH2 data loops. The message layer incorporates a number of complex data packetizers and unpacketizers that satisfy the multiple requirements of 16-byte aligned access to the torus, arbitrary data layouts, and zero-copy operations.
Packet layer
The packet layer is a very thin stateless layer of software that simplifies access to the BG/L network hardware. It provides functions to read and write the torus and collective hardware, as well as to poll the state of the network. Torus packets typically consist of 240 bytes of payload and 16 bytes of header information. Collective packets consist of 256 bytes of data and a separate 32-bit header. To help the message layer implement zero-copy messaging protocols, the packet layer provides convenience functions that allow software to "peek" at the header of an incoming packet without incurring the expense of unloading the whole packet from the network.
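The division of labor between the two layers can be sketched as follows: the message layer cuts a message into packet-sized pieces tagged with their offsets, and reassembles them at the destination regardless of arrival order. This is a simplified illustration that uses the 240-byte payload size quoted above; the function names and the (offset, payload) representation are assumptions, not the actual BG/L data structures.

```python
import random

PAYLOAD_BYTES = 240   # typical torus packet payload, from the text


def packetize(message):
    """Split a message into (offset, payload) pairs of at most PAYLOAD_BYTES."""
    return [(offset, message[offset:offset + PAYLOAD_BYTES])
            for offset in range(0, len(message), PAYLOAD_BYTES)]


def reassemble(packets, total_length):
    """Rebuild the message from packets that may arrive in any order."""
    buffer = bytearray(total_length)
    for offset, payload in packets:
        buffer[offset:offset + len(payload)] = payload
    return bytes(buffer)


message = bytes(range(256)) * 4            # 1,024-byte example message
packets = packetize(message)
random.Random(0).shuffle(packets)          # emulate out-of-order arrival
restored = reassemble(packets, len(message))
print(len(packets), restored == message)
```

Carrying the offset in every packet is what makes out-of-order delivery tolerable, at the cost of the extra header information mentioned earlier.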
PMI component
The Process Management Interface (PMI) component for process management is also implemented on top of the bgltorus in BG/L. In this case, the bgltorus component provides the capability to load the application from the input/output (I/O) nodes into the compute nodes using the Control and I/O (CIO) Protocol over the collective network.
Performance analysis tools for BG/L
Several parallel applications are currently being ported to BG/L; in the near future, the performance of these applications running on BG/L will require analysis [11] .
HPM
We ported libHPM [12, 13] to run on the BG/L system simulator (BGLsim; see the next section). This port was done by extending the library to use the BGLcounters application program interface (API), adding support for new hardware counters and derived metrics that are related to the BG/L architecture, such as the two-element vector FPU, and by exploiting the possibility of counting both at user mode and at supervisor mode during the same execution of the program.
BGLsim
We have used a pseudo cycle-accurate simulator based on BGLsim, an architecturally accurate complete system simulator for parallel machines [13] [14] [15]. BGLsim exposes all key features of the hardware, including processors, FPUs, caches, memory, interconnection, and other supporting devices. This approach allows the user to run complete and unmodified code, from simple self-contained executables to full Linux images. The simulator supports interaction mechanisms for inspecting detailed machine state, thus providing monitoring capabilities beyond what is possible with real hardware. BGLsim was developed primarily to support the development of system software and application code in advance of hardware availability. It can simulate multinode BG/L machines, but we restrict our discussion in this paper to the simulation of a single BG/L node system.
The BG/L pseudo cycle-accurate simulator [15] offers higher performance than traditional cycle-accurate simulators. Our model runs 100 to 1,000 times faster than a cycle-accurate simulator. The idea behind the pseudo cycle-accurate simulator is to attribute timestamps for all relevant processor resources (such as registers, internal pipelines, FPUs, memory subsystem, etc.); the model checks all of the operand dependencies, updating the corresponding timestamps. Although this is not 100% accurate because the queuing effects on memory buses are ignored, the obtained accuracy (error smaller than 15% compared with the hardware) is enough to validate most optimizations.
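The timestamp idea can be illustrated with a toy model in which every register carries the cycle at which its value becomes available, and an instruction issues only once all of its operands are ready. This is a sketch of the general technique under strong simplifications (in-order issue, one instruction per cycle, registers as the only tracked resource); the latencies and names are invented for the example and are not BGLsim parameters.

```python
from collections import defaultdict


def simulate(instructions, latency):
    """Return the issue cycle of each instruction.

    instructions: list of (opcode, destination_register, source_registers).
    latency: cycles from issue until the destination register is available.
    ASSUMPTIONS: in-order issue, at most one instruction per cycle.
    """
    ready = defaultdict(int)     # register -> cycle at which its value is available
    next_issue = 0               # earliest cycle at which the next instruction may issue
    issue_cycles = []
    for opcode, dest, sources in instructions:
        start = max([next_issue] + [ready[reg] for reg in sources])
        issue_cycles.append(start)
        ready[dest] = start + latency[opcode]   # update the resource timestamp
        next_issue = start + 1
    return issue_cycles


latency = {"load": 3, "fma": 5}              # illustrative values only
program = [
    ("load", "r1", []),
    ("load", "r2", []),
    ("fma", "r3", ["r1", "r2"]),
    ("fma", "r4", ["r3", "r1"]),
]
issue_cycles = simulate(program, latency)
print(issue_cycles)
```

The gaps between consecutive issue cycles are the pipeline bubbles that a scheduler would try to fill with independent instructions.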
BGLperfctr
The large number of available events in the BG/L CPU design and the rather complex mapping of events onto possible physical counters is handled through a user-level API, BGLperfctr. This API includes a set of predefined mnemonics for each available event and provides the user with an abstraction of 52 counters, unifying the UPC and FPU counters and extending them to 64-bit counters.
Since the system design is based on a single active thread per CPU, the bookkeeping of occupied versus free counter registers is handled entirely within this API. The setup of the counters is transaction-based in that the user registers a number of intended changes to the running register configuration through the API. If no collisions are detected, these changes are committed through a separate call to the library, which finally results in a kernel invocation in which the content in the affected control registers is modified and the counters start counting the desired events.
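The transaction-based setup can be pictured as a stage-then-commit pattern. The class below is a hypothetical model of that pattern only: its name, methods, and event strings are invented for illustration and are not the BGLperfctr interface, and it checks for collisions at commit time as a simplification.

```python
class CounterTransaction:
    """Hypothetical stage-then-commit model of a counter configuration."""

    def __init__(self, num_counters=52):      # 52 counters, as in the text
        self.num_counters = num_counters
        self.active = {}                      # counter index -> event being counted
        self.pending = []                     # staged (counter index, event) changes

    def stage(self, counter, event):
        """Register an intended change without touching the running configuration."""
        self.pending.append((counter, event))

    def commit(self):
        """Apply all staged changes, or none of them if any collision is found."""
        seen = set()
        for counter, _ in self.pending:
            out_of_range = not 0 <= counter < self.num_counters
            if out_of_range or counter in self.active or counter in seen:
                self.pending.clear()
                return False
            seen.add(counter)
        self.active.update(self.pending)
        self.pending.clear()
        return True


config = CounterTransaction()
config.stage(0, "EVENT_A")                    # placeholder event names
config.stage(1, "EVENT_B")
first = config.commit()                       # no collisions: both changes applied
config.stage(1, "EVENT_C")                    # counter 1 is already occupied
config.stage(2, "EVENT_D")
second = config.commit()                      # collision: nothing is applied
print(first, second, config.active)
```

Applying all changes or none keeps the running configuration consistent, which matters when the commit corresponds to a single kernel invocation.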
The API has full support for all capabilities of the UPC counters and offers simplicity to the end user, such as the ability to generate interrupts at an arbitrary count threshold and find an appropriate free counter for each user-selected event.
PAPI
As most APIs intend to expose a maximum of capabilities of the hardware counter design to the end user, BGLperfctr has the disadvantage of being system-specific. Writing application code that uses hardware counter information is generally a highly nonportable task. The high cost of maintaining such codes is addressed by PAPI, the Performance counter API [16]. This specification provides a platform-independent interface to control and read hardware counters on a large variety of platforms widespread in high-performance computing. The API provides mechanisms for portably naming commonly used events, setting up sets of events, and starting, stopping, and reading these events. BG/L provides a functional implementation of PAPI using the BGLperfctr abstraction of events as its implementation substrate.
An interesting aspect of the implementation of PAPI on BG/L, when compared with other platforms, is the large number of events unique to this particular system. All 319 events defined on BG/L can be programmed using PAPI through its so-called "native events" interface. In the BG/L PAPI implementation, a native event is described by its BGLperfctr mnemonic and a bit pattern describing the counting behavior requested (rising edge, falling edge, duration high, or duration low). To the extent possible, existing BG/L hardware events have been mapped to PAPI predefined event names. However, several events typically available on other platforms are not available in the BG/L hardware. These include events related to the CPU core internals, such as instructions completed, branch prediction information, and level 1 cache events. Although such events are technically possible to count in the BG/L simulator framework [15], the PAPI implementation of BG/L has not incorporated such events into its event map.
Paraver
Paraver [17] is a parallel program visualization and analysis tool that supports both shared and distributed memory applications. Paraver has three major components:
Tracing facility: For MPI, a library called MPItrace is used to collect traces of the application during execution. This library intercepts the calls to the MPI primitives and records events, generating a single file for every process involved in the application. In addition, this tool can collect hardware performance counters that appear as Paraver events in the trace.
Trace merger: The individual trace files are then merged into a single Paraver trace file using the mpi2prv tool.
Visualizer: Paraver traces are visualized using the Paraver tool, which allows the visualization of the information collected and derives new metrics from it.
The existing MPItrace package has been ported to the BG/L. The package uses the PAPI interface to obtain hardware counter values and emit them into the trace file. Until now, work has focused on the basic functionality of the tool and its use to understand the behavior of different Linpack and MPI library implementations.
Scalability of the tool is one of the areas to which major effort will be devoted in the future. As of this writing, we have been able to obtain and analyze traces on up to 1,024 processors.
BGLnodes
We are developing a simple tool to display how a scalar value varies over a BG/L partition. The tool takes as input a text file containing the coordinates of the BG/L nodes and the value to be represented for each one. Values are translated to colors, each color indicating an intensity of the value. In this way, the values of various performance counters or other metrics derived using Paraver can easily be represented in three dimensions, showing where the hot spots are in the BG/L partition.
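The translation from per-node values to colors can be sketched as below. The input layout (one "x y z value" record per line) and the linear blue-to-red ramp are assumptions made for the example; the text does not specify the file format or the color scale used by BGLnodes.

```python
def parse_nodes(text):
    """Parse lines of the ASSUMED form 'x y z value' into {(x, y, z): value}."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blank lines and comments
        x, y, z, value = line.split()
        nodes[(int(x), int(y), int(z))] = float(value)
    return nodes


def value_to_rgb(value, low, high):
    """Linear ramp from blue (low) to red (high); an illustrative color scale."""
    t = 0.0 if high == low else (value - low) / (high - low)
    return (round(255 * t), 0, round(255 * (1 - t)))


sample = """
# x y z value
0 0 0 10
1 0 0 20
0 1 0 40
"""
nodes = parse_nodes(sample)
low, high = min(nodes.values()), max(nodes.values())
colors = {coord: value_to_rgb(value, low, high) for coord, value in nodes.items()}
hot_spot = max(nodes, key=nodes.get)
print(colors, hot_spot)
```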
Experiences and lessons learned

BGLsim
In this subsection, we present two experiments carried out on the BGLsim simulator to analyze the performance of the BG/L processors. These experiments were executed in the simulator and on the first version of the hardware chip, running at 500 MHz.
IS benchmark case study
In this case study, we present a performance improvement made on the NASA Advanced Supercomputing (NAS) Integer Sort (IS) Benchmark [18] (serial version, class S) using our set of tools. The IS benchmark performs an integer sort. The goal here was to find a bottleneck in IS with enough resolution to enable optimizations that would lead to performance improvements.
To find the bottleneck in the program, we started by using the BG/L version of libHPM [13]. We reduced the number of iterations in the benchmark to one and instrumented the rank() function, which is called at each pass of the loop, in order to decompose its effects on the cycle count. We did this by inserting hpmStart() and hpmStop() calls around each of the rank() regions to identify which of them were the heaviest contributors. By doing this, we identified that two particular regions of the rank() function were responsible for 84% of the total number of cycles for that iteration. These regions were the "copy keys into work array," which is referred to as loop 1, and the "count key population," which is referred to as loop 2.
Loop 1 has a simple operation in the form
Loop 2, on the other hand, has a double-referenced operation in the form
For class S, NUM_KEYS is $2^{16}$, which means that there are $2^{16}$ loads and as many stores in loop 1, which is responsible for approximately 30% of the total number of cycles of the whole iteration. Loop 2 performs two loads and one store instruction and is responsible for approximately 53% of the total number of cycles. Knowing the total number of cycles required by one loop to execute, we can estimate the number of cycles consumed by each iteration. However, this is just an average value, and we know that actual iterations may vary from one another. To find out what was going on inside these loops and to be able to optimize them, we obtained an instruction-level trace of the execution that shows every instruction performed by the CPU along with its timestamp. From this trace, we obtained the actual number of cycles each iteration of the loop takes to run (Table 1).
It is clear that loop 1 has a very regular and predictable behavior, taking 17 cycles at each iteration, with the exception of the first iteration in each block of eight iterations, where it takes 44 or 19 cycles. This can easily be explained by the fact that the L1 cache line size for the BG/L machine is 32 bytes, which means that eight integers fit into one cache line. The first element to be loaded forces a miss in the cache and takes a longer time, while the others have a very predictable behavior. Note that even when a miss occurs, the number of cycles it takes is not always the same. This is due to the L2 prefetching that occurs after the second L1 cache miss. The 19 cycles are explained by a misprediction by the branch predictor; the last time the loop is executed, the branch is not taken, which increases the instruction fetch latency by two cycles.
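The cycle counts quoted for loop 1 give a simple estimate of its average cost per iteration. The sketch below assumes that every block of eight iterations pays exactly one 44-cycle miss followed by seven 17-cycle iterations; it ignores the shorter miss penalties and the 19-cycle mispredicted iteration mentioned above, so it is an upper-bound style approximation rather than a measured value.

```python
LINE_BYTES = 32                              # cache line size from the text
INT_BYTES = 4                                # eight integers per line, as stated
ITERS_PER_LINE = LINE_BYTES // INT_BYTES

HIT_CYCLES = 17                              # regular iteration, from the text
MISS_CYCLES = 44                             # first iteration of a block (ASSUMED constant)


def loop1_cycles(num_keys):
    """Estimated cycles for loop 1 when each line costs one miss plus seven hits."""
    lines = num_keys // ITERS_PER_LINE
    return lines * (MISS_CYCLES + (ITERS_PER_LINE - 1) * HIT_CYCLES)


num_keys = 2 ** 16                           # class S
total_cycles = loop1_cycles(num_keys)
average_cycles = total_cycles / num_keys
print(ITERS_PER_LINE, total_cycles, average_cycles)
```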
Loop 2 has such an unpredictable behavior because it makes a load using a random index to the key_buff_ptr array, which is stored in the key_buff2 array. The next index in the sequence (that is, the next value in the key_buff2 array), is likely to be in cache. The next element in the key_buff_ptr array, however, is not, due to the inherent randomness of the index array. This leads to a high cache-miss probability, which results in the noticeably higher cycle times loop 2 takes at each iteration.
It is possible to optimize this kind of memory access by using an explicit prefetching technique that consists of loading the next element of key_buff_ptr one iteration before it is going to be used, therefore hiding its load latency. We implemented this as
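The effect of touching the next element one iteration early can be illustrated with a crude software model. The sketch below is not the authors' listing: it models loop 2 as `key_buff_ptr[key_buff2[i]] += 1`, which is how the text describes it, and counts only the misses the loop would have to wait for. The fully associative LRU cache, its 64-line capacity, the key range, and all helper names are assumptions of the example.

```python
import random
from collections import OrderedDict

LINE_BYTES = 32   # L1 line size discussed in the text
INT_BYTES = 4     # eight integers per line, as stated in the text


class TinyCache:
    """Fully associative LRU cache of whole lines (a deliberately crude model)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()
        self.demand_misses = 0

    def access(self, array, index, demand=True):
        line = (array, index * INT_BYTES // LINE_BYTES)
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if demand:
            self.demand_misses += 1          # only misses the loop waits for are counted
        self.lines[line] = True
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)   # evict the least recently used line


def count_keys(key_buff2, max_key, cache, prefetch=False):
    """Loop 2 as described in the text, optionally touching the next element early."""
    key_buff_ptr = [0] * max_key
    n = len(key_buff2)
    for i in range(n):
        cache.access("key_buff2", i)
        if prefetch and i + 1 < n:
            cache.access("key_buff2", i + 1)
            # Early, non-blocking touch of the element used by the next iteration.
            cache.access("key_buff_ptr", key_buff2[i + 1], demand=False)
        cache.access("key_buff_ptr", key_buff2[i])
        key_buff_ptr[key_buff2[i]] += 1
    return key_buff_ptr


rng = random.Random(0)
keys = [rng.randrange(2048) for _ in range(1024)]     # illustrative sizes
plain_cache, prefetch_cache = TinyCache(64), TinyCache(64)
plain = count_keys(keys, 2048, plain_cache)
prefetched = count_keys(keys, 2048, prefetch_cache, prefetch=True)
print(plain_cache.demand_misses, prefetch_cache.demand_misses)
```

The model treats a prefetched line as free, which overstates the benefit; on real hardware the gain comes from overlapping the load latency with the work of the current iteration.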
The results of this optimization can be seen clearly in Table 2 and in Figure 3, where we observe that the higher cycle-count spikes have disappeared and that a lower baseline has been set. Furthermore, there was considerable improvement in the IS benchmark overall performance: The original main loop (one execution of
DGEMV case study
DGEMV is the name of a subroutine that performs the matrix-vector operation $y = \alpha A x + \beta y$, where $\alpha$ and $\beta$ are scalars, $x$ and $y$ are vectors, and $A$ is an $M \times N$ matrix. In this case study, we describe the optimization process of this dense linear algebra kernel using the BGLsim timing model as a performance-tuning tool. The following example shows the optimization process of a level 2 Basic Linear Algebra Subprograms (BLAS) kernel, which performs operations in the form $y = y + A^T x$, where $y$ and $x$ are vectors and $A^T$ is a matrix; $A^T$ is the transpose of $A$. The routines defined by BLAS are commonly called by a wide range of scientific software and have become a de facto standard for elementary linear algebra operations [19]. Therefore, a high-performance implementation of the BLAS kernels has been developed as part of the math library that will be delivered with the BG/L supercomputer.
As the main goal is to achieve the highest performance in a single-processor computation, some of the BLAS kernels are written in assembly language and then hand-tuned such that an efficient pipelined execution is created for the kernel. As we show here, the results produced by the BGLsim timing model helped us identify inefficient sequences of code that were leading to stalls in the pipeline and consequently degrading the execution performance. The simulator also gives us the total number of cycles needed for the execution of a piece of code under a specific workload. In the optimization process, we change the scheduling of the instructions on the basis of the information provided by the simulator timing model. After the changes, the new version is tested again, and a new output is generated. Therefore, each consecutive version is improved on the basis of the time information provided by the BGLsim.
The first version of the code was implemented and executed in the simulator. The output produced by the simulator gives pseudo cycle-accurate information on the instructions issued. The output of the execution of the code related to the inner loop of the DGEMV kernel is shown in Figure 4 . Every iteration of this loop computes the product
for every element of an $8 \times 2$ block of elements of the matrix A. As seen through the execution output, every iteration takes 12 cycles to complete. Therefore, the ratio of elements computed to CPU cycles is 4/3. Moreover, the output of the simulation tells us that every fused multiply-add (FMA) instruction is paired with a load instruction; consequently, both instructions are issued in the same cycle. However, many cycles are spent with just a load instruction being issued, which means that the computation pipeline is idle in that cycle.
Considering the results of the first version of the code, after collecting execution traces from the simulator with the cycles for each instruction, we improved the instruction scheduling to reduce the bubbles in the pipeline. This yielded Version 2, and we repeated the same process, testing different instruction schedules, until the final version was produced.
Figure 4
Cycles of the inner loop in the initial implementation of the DGEMV kernel.
The output of the inner loop of the latest version is shown in Figure 5. In this loop the product is computed for a $2 \times 14$ block of elements of the matrix A in each iteration, and the ratio of elements computed to CPU cycles is 28/15, which gives us a better utilization of the instruction pipeline. Moreover, more FMAs are interleaved with load instructions, which translates to a better use of processor resources, keeping the pipeline busy most of the time.
Figures 6(a) and 6(b) respectively show the running times for different versions of a DGEMV kernel as run on hardware and as predicted through BGLsim. The DGEMV kernel has been optimized for best performance considering that the data is already in the L1 cache. As the plot shows, the better the instruction scheduling, the faster the kernel execution for a given workload. For both experiments, similar performance trends are observed as the DGEMV is improved. Hence, we observe that by using the BGLsim timing model, we were able to generate a version of the DGEMV kernel optimized for execution by the hardware. The BGLsim timing model does particularly well when simulating straight-line code that accesses primarily the L1 cache, as represented by the code fragments in Figures 4 and 5. The maximum discrepancy between simulation and hardware results that we observe in Figures 6(a) and 6(b) is 7.5%. The accuracy of the pseudo cycle-accurate timing model depends on the access pattern and the number of misses at L1. That is the reason why the different experiments have different accuracy with respect to the hardware execution.
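The two throughput ratios, and the improvement they imply for the inner loop alone, follow directly from the block sizes and cycle counts quoted in the text. The short check below uses exact fractions; the variable names are illustrative, and the resulting factor applies to the inner loop only, not to the whole kernel.

```python
from fractions import Fraction

# Elements of A processed per CPU cycle in the inner loop (figures from the text).
initial_ratio = Fraction(8 * 2, 12)      # 8 x 2 block in 12 cycles
final_ratio = Fraction(2 * 14, 15)       # 2 x 14 block in 15 cycles

# Relative inner-loop throughput of the final version over the initial one.
improvement = final_ratio / initial_ratio

print(initial_ratio, final_ratio, improvement)
```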
Figure 5
Cycles of the inner loop in the optimized implementation of the DGEMV kernel.
Detecting message-passing overhead
One of the characteristics of the BG/L supercomputer is that it has a very fast network compared with the speed of the processor. During the project, we have developed some microbenchmarks to determine how much information the processor can deal with in and out of the network. Ideally, the processor should be able to manage six incoming and six outgoing links. In a machine running at 500 MHz, each link is able to sustain 110 MB/s, giving each node the capacity to move a total of 1,320 MB/s. We have instrumented one of these microbenchmarks with Paraver and hardware counters. This section presents the results of this study.
The microbenchmark first selects a node in the middle of the current BG/L partition and its closest neighbors. It then starts a set of iterations in which the number of senders and receivers is increased. In each communication phase, 30 messages of 1 MB of data each are sent. For the analysis, we collected the hardware performance counters which indicated that a link in each direction was available but there were no tokens available for the node to send. When this occurs, the specific link is full, and the destination node is not draining it at the proper speed to sustain the required bandwidth.
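The bandwidth figures above translate into a per-cycle load on the processor that is easy to compute. The sketch below assumes 1 MB = 10^6 bytes and an ideal, fully sustained link; both are assumptions of the example, and the names are illustrative.

```python
LINK_MB_PER_S = 110        # sustained bandwidth per link at 500 MHz, from the text
LINKS_IN = 6
LINKS_OUT = 6
CLOCK_HZ = 500e6           # clock frequency of the machine used in the experiment
BYTES_PER_MB = 1e6         # ASSUMPTION: decimal megabytes

# Aggregate traffic a node must handle when all twelve links are busy.
total_mb_per_s = (LINKS_IN + LINKS_OUT) * LINK_MB_PER_S

# Bytes the processor must move per CPU cycle to keep up with that traffic.
bytes_per_cycle = total_mb_per_s * BYTES_PER_MB / CLOCK_HZ

# Ideal time for one communication phase (30 messages of 1 MB) over one link.
phase_seconds = 30 * 1 / LINK_MB_PER_S

print(total_mb_per_s, bytes_per_cycle, round(phase_seconds, 3))
```

A node that cannot sustain this per-cycle rate while also running application code leaves packets in the network, which is what the missing-token counters detect.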
The microbenchmark was executed in a 32-node partition, in which any of the central nodes has up to five neighbors. The top plot of Figure 7(a) presents the behavior of the communication phases in which a single node (node 22; the node number is the middle term in the expression following "Thread") is receiving messages from one to five nodes simultaneously. The plot at the bottom shows the behavior of the counters indicating network congestion. These counters are incremented every cycle in which a network link is available but the hardware has no token to send data. Not having a token is usually caused by the fact that there are packets in transit, and the destination node is not able to collect them. As shown, the receiver (node 22) can deal with up to three incoming links without experiencing network congestion. As soon as a fourth sender becomes active, all senders start seeing a lack of tokens; this is because the receiver is not draining the links fast enough. Observe also that, as the counter value becomes higher, the execution time increases for these messages to be received. That is the effect of the sender being blocked while waiting for tokens.
Figure 7(b) shows the same information when, in addition to receiving messages, node 22 also sends messages to first one and then two destination nodes. Observe that when there is a single destination node to which node 22 is sending messages, node 22 is no longer able to deal with three incoming links. A lack of tokens appears first at nodes 6 and 18 and then later at nodes 23 and 26. Also observe that as soon as a node is busy receiving messages, it no longer has sending problems due to the lack of tokens. This is because it has to send more slowly. This happens to node 18 in Figure 7
Figure 7
Lack of sending tokens when (a) receiving from a different number of nodes; (b) receiving from many and sending to one and two nodes; and (c) receiving from many and sending to three and four nodes.
Observe that in this case, the detection of a lack of send tokens is reduced to nodes 23 and 26, which are the ones not receiving messages. Any node receiving messages is unable to deliver messages at enough speed to detect the lack of tokens.
Analysis of the behavior of the message layer
We used the Paraver tool to evaluate different implementation alternatives inside the MPI message layer. In this experiment, we compared the performance obtained using two possible implementations (developed as prototypes) of the low-level message layer on which the MPI implementation relies. The Linpack benchmark indirectly uses this portion of the message layer through the MPI library to implement a hand-coded version of the broadcast collective. This hand-coded version of broadcast performs better than the built-in MPI broadcast using any of the implementations. Currently, this broadcast is being implemented inside the MPI library.
The difference between the two implementations was in the way messages are sent:
One message at a time (first in, first out, FIFO, mode): Considering that each node has six connections to its neighbors, the first implementation of the message layer allowed sending up to six messages at a time (one in each different direction). That way, the send queues in the connection manager (see Figure 2) contain a single outgoing message for each direction.
Overlapping messages: We wanted to test whether allowing several outgoing messages for the same destination at the same time could improve communication performance, so we developed a version in which any outgoing message was immediately posted to be delivered to the network. In this case, packets are picked up in a round-robin fashion from all available messages in each direction.
After implementing both of these ways of dealing with messages, we evaluated the performance of the Linpack benchmark in 32 nodes. Table 3 shows the results of the comparison. We observed that the performance obtained in this application was slightly worse with overlapping messages. Using Paraver, we were able to look inside the application and detect which part of it performed worse and why. Figure 8 shows the behavior of one of the broadcasts that was hand-coded inside the Linpack benchmark using point-to-point communications. As can be observed, the transmission of the messages is different in the two versions of the message layer. The plot on the left in Figure 8 corresponds to the first version of the message layer, which sends a single outgoing message in each direction. The plot on the right corresponds to the alternative implementation, in which several messages are sent in a round-robin fashion to the same destination.
We can observe that in the plot on the left, the first message sent from nodes 3, 11, 19, and 27 reaches the destination earlier than that in the plot on the right. That is precisely because each single-link bandwidth is devoted to a single message. Instead, in the plot on the right, all outgoing messages from these nodes are sent in parallel, so the first and the last ones are complete at the destination nearly at the same time. Because of the way this broadcast is implemented, each destination node is going to retransmit the information to other nodes. The parallel implementation causes the retransmission of the first messages to be delayed because of the late arrival, and this causes the performance degradation. The actual degradation of the broadcast code was 30%.
In conclusion, time-sharing the links between MPI messages to the same destination results in all messages taking about the same time to arrive at their destination. Keeping a FIFO order in sending messages through the link also has the potential to keep the link fully used, but results in some messages arriving earlier than others. In a situation in which all of the messages are of similar length and most of them have to be retransmitted, the version that keeps the FIFO order has the potential for better performance. In this case, retransmissions will start earlier, increasing the number of simultaneously active links. This is visible in Figure 9, which presents, at the same timescale, a set of messages sent by node 3 and its retransmissions, clearly showing the benefits of having the link dedicated to a single message at a time.
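The difference in arrival times between the two policies can be made concrete with a small scheduling sketch. The code below is an illustration only and is not taken from the BG/L message layer. It assumes a single link that transmits one packet per time slot, messages of equal length, and hypothetical names (`completion_slots`, `"fifo"`, `"round_robin"`).

```python
from collections import deque


def completion_slots(num_messages, packets_per_message, mode):
    """Return the time slot in which each message finishes on one shared link.

    mode="fifo":        the link is devoted to one message until it completes.
    mode="round_robin": one packet is taken from each pending message in turn.
    The link sends exactly one packet per slot (an assumption of this sketch).
    """
    if mode not in ("fifo", "round_robin"):
        raise ValueError(f"unknown mode: {mode}")

    remaining = [packets_per_message] * num_messages
    finished_at = [None] * num_messages
    pending = deque(range(num_messages))
    slot = 0

    while pending:
        slot += 1
        msg = pending[0]
        remaining[msg] -= 1            # send one packet of the head message
        if remaining[msg] == 0:
            finished_at[msg] = slot
            pending.popleft()
        elif mode == "round_robin":
            pending.rotate(-1)         # move the head message to the back

    return finished_at
```

For four messages of 100 packets each, `completion_slots(4, 100, "fifo")` gives `[100, 200, 300, 400]`, while `completion_slots(4, 100, "round_robin")` gives `[397, 398, 399, 400]`. The last message completes at the same time in both cases, but under FIFO the first message is available for retransmission roughly four times earlier, which is the effect discussed above.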
The Paraver traces helped identify this issue and provided a good understanding of its detailed impact in this situation. Conceptually, in other situations with messages of different sizes, the time-sharing version might be advantageous, depending on what the application does with different messages.
From the analysis of the traces, we also inferred some suggestions to the application developer about the way the broadcast is implemented. It might be useful, for example, first to send and receive messages that have to be retransmitted, and only at the end send messages that constitute the leaves of the broadcast.
Figure 9
Comparison of specific messages sent from node 3 in broadcast. (The node number is the middle term in the expression following "THREAD.")
Another suggestion comes from the observation that the broadcast is actually decomposed in four subtrees, where the root of the broadcast pipelines the message to the four roots of each subtree. Unfortunately, two of those trees end up being assigned to the same physical processor. In this case, the cause is that the neighbors in the z+ and z− directions are the same nodes. This is due to the 4 × 4 × 2 topology in a partition containing 32 nodes. A more balanced topology would probably result in better performance.
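The coincidence of the two z neighbors follows from wrap-around addressing in a dimension of size two. The sketch below assumes wrap-around (torus) links in every dimension and a node identified by its (x, y, z) coordinates. The function name and the dictionary keys are hypothetical and are not part of any BG/L interface.

```python
def torus_neighbors(coord, dims):
    """Return the six directional neighbors of `coord` in a 3D torus.

    coord: (x, y, z) coordinates of the node.
    dims:  (X, Y, Z) extent of the partition, e.g. (4, 4, 2).
    Keys are "x+", "x-", "y+", "y-", "z+", "z-"; wrap-around links are
    assumed in every dimension.
    """
    neighbors = {}
    for axis, name in enumerate("xyz"):
        for sign, step in (("+", 1), ("-", -1)):
            shifted = list(coord)
            shifted[axis] = (coord[axis] + step) % dims[axis]
            neighbors[name + sign] = tuple(shifted)
    return neighbors
```

For a node at (1, 2, 0) in a (4, 4, 2) partition, both `"z+"` and `"z-"` map to (1, 2, 1), so the six directions reach only five distinct nodes. In a dimension of size three or more, the two directions would reach different nodes.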
A final observation is that the root of the broadcast pipelines the messages to each subtree root, but it finishes sending long before the end of the whole broadcast. This suggests that imbalanced tree approaches, where the root keeps transmitting during the whole operation, would potentially improve the utilization of the links.
We were also interested in a more detailed analysis of the FIFO version. Figure 10(a) shows a view in which the effective cost of each MPI_Waitany call is reported in MB/s. By effective cost, we mean the ratio between the number of bytes received by the call and the time taken for the wait to complete. For any MPI point-to-point call, this local bandwidth is a fair metric of how efficiently the call has handled its data. The view focuses on a few threads and a short period of time; light green represents 40 MB/s and dark blue 400 MB/s. Surprisingly, even though the message size is the same for all of them, different instances of MPI_Waitany take rather different amounts of time. Figure 10(b) shows a histogram of this effective cost for the whole trace. For each process (row), each column represents a range of 10 MB/s (up to a total of 1,000 MB/s). The color of each entry corresponds to the total number of times an MPI_Waitany call achieved that particular local effective bandwidth. As can be seen, there is a major mode around 100 MB/s that corresponds to the link bandwidth. A significant number of instances achieve less than the link bandwidth. Finally, it is interesting to see some instances achieving close to 1 GB/s. Nevertheless, averaging over the whole duration of a broadcast, each processor performs MPI_Waitany calls at the rate of 100 MB/s. This value is still far from the physical limits of the interconnect.
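The effective-cost metric and its histogram are simple to state in code. The following Python sketch is not the Paraver implementation. It assumes that each MPI_Waitany instance is available as a (bytes received, wait time in seconds) pair, that 1 MB is 10^6 bytes, and that values at or above the upper limit are counted in the last bin. All function names are hypothetical.

```python
def effective_bandwidth_mb_s(bytes_received, wait_seconds):
    """Effective cost of one call: bytes received divided by the wait time."""
    return bytes_received / wait_seconds / 1e6  # assumes 1 MB = 10**6 bytes


def bandwidth_histogram(samples, bin_width=10.0, max_mb_s=1000.0):
    """Count calls per bandwidth range for one process.

    samples: iterable of (bytes_received, wait_seconds) pairs.
    Returns a list of counts, one per bin of `bin_width` MB/s up to `max_mb_s`.
    """
    num_bins = int(max_mb_s / bin_width)
    counts = [0] * num_bins
    for nbytes, seconds in samples:
        bw = effective_bandwidth_mb_s(nbytes, seconds)
        index = min(int(bw // bin_width), num_bins - 1)  # clamp to last bin
        counts[index] += 1
    return counts


def aggregate_bandwidth_mb_s(samples):
    """Bandwidth over a whole sequence of calls: total bytes over total time."""
    total_bytes = sum(nbytes for nbytes, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    return total_bytes / total_seconds / 1e6
```

As an illustration with made-up numbers, two consecutive calls that each receive 1,000,000 bytes, the first waiting 19 ms and the second 1 ms, have individual effective bandwidths of about 53 MB/s and 1,000 MB/s, yet `aggregate_bandwidth_mb_s` over the pair is about 100 MB/s. This is how a histogram with both very low and very high entries can coexist with an average close to the link bandwidth.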
The interpretation for this behavior is related to the fact that the data is sent to and drained from the network interface by the main node processor through polling. The processor becomes the bottleneck because it is not capable of feeding and draining the six I/O links at their full speed. Additionally, there are issues related to how to proceed if the processor is simultaneously sending one message and receiving another. When, for example, the reception is finalized, should control be returned immediately to the user or should the transmission be finalized? In both cases, somebody (the local or the remote node) is going to be delayed. Independent direct memory access (DMA) engines would certainly help here.
The extreme variation in bandwidth achieved by some calls can be explained if we consider that several messages may be arriving at a node simultaneously. If the node cannot cope with it, the whole process is slowed down, and the first reception to finalize perceives a low bandwidth. By that time, it is quite possible that the next incoming message has almost been received, so when the next MPI_Waitany call is invoked, control returns very soon, resulting in a huge local bandwidth perceived by such a call.
Analysis of the behavior of Linpack broadcasts
Figure 11 shows a set of views of the communication phase in a Linpack version that performs the broadcasts directly through the MPI broadcast call. The run is for a problem size of 40K on a 32-processor system. The upper view displays the MPI call (yellow: broadcast; red: barrier; blue: send; white: receive). In the second view, we see those processors that are the root of a given broadcast (the different colors represent the different communicators). The third and fourth views are derived from the hardware counter information emitted into the trace at the entry and exit of each MPI call. The third view is an estimate of the number of active links. Here, the important issue is that during most of the broadcast time, most nodes show only a single active link. Only the root processors achieve two active links. During the second broadcast region, which performs a column broadcast in Linpack, some nodes achieve three simultaneous active links (red). The fourth view is the equivalent bandwidth going out of a node along all links during the whole call. In accordance with the third view, light green is the predominant color across the first broadcast and large portions of the second. This means that the bandwidth achieved is far from the peak.
It is possible to compute a histogram of the bandwidth used during the major broadcast with message sizes above 7 MB. From the analysis of that histogram, we can see that the root processor achieves an effective total bandwidth of the order of 117 MB/s, while most other processors show either 78 MB/s or 39 MB/s.
Another capability of the tools environment is the possibility of using the powerful metric derivation and analysis capabilities of Paraver to generate the ASCII data for a locally developed tool, BGLnodes (discussed above). This tool displays a single scalar value for each processor in the physical topology. Figures 12(a), 12(b), and 12(c) show the bandwidth obtained during the major broadcast in the x, y, and z dimensions, respectively.
Performance counter limitations
From the experience with the current chip implementation of performance counters on BG/L, some lessons can be learned. BG/L is a machine targeted to address the grand challenges in high-performance computing. These applications typically amount to a large number of floating-point operations. In this context, the capabilities of the double-hummer FPU performance counters are limited. There is one counter in the FPU capable of registering arithmetic events. This counter counts operations that belong to any of the following groups: additions and subtractions, multiplications and divisions, trinary operations, and Oedipus operations. The first three groups relate to single-pipe operations. The trinary operations are operations of the form a ± b · c. This corresponds to two classical floating-point operations. The Oedipus operations are trinary operations that use both functional pipes in the FPU, using up to six operands and producing two results per instruction. Parallel single- and dual-operator instructions (such as, for example, fpadd) do not map into any of the countable groups of events. Thus, even with repeated runs of the same code, it is not possible to count the complete number of floating-point instructions performed. For the same reason, it is not possible to compute a corresponding number of floating-point operations of an algorithm by using only the performance counters.
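The consequence for flop counting can be sketched as follows. The code below is an illustration under stated assumptions, not part of any BG/L library. The group names are hypothetical labels for the four countable groups, the per-group counts are assumed to come from repeated runs of the same code, and the weight of four operations per Oedipus instruction is our inference from its description as a trinary operation executed on both pipes. Because parallel instructions such as fpadd are not countable, the result is only a lower bound.

```python
# Floating-point operations credited to one event of each countable group.
# Group names are hypothetical labels, not hardware event identifiers.
FLOPS_PER_EVENT = {
    "add_sub": 1,   # single-pipe addition or subtraction
    "mult_div": 1,  # single-pipe multiplication or division
    "trinary": 2,   # a +/- b*c counts as two classical operations
    "oedipus": 4,   # assumption: a trinary operation on both pipes (2 x 2)
}


def flop_lower_bound(event_counts):
    """Return a lower bound on floating-point operations.

    event_counts: mapping from group name to the number of counted events.
    Instructions outside the four countable groups (for example, parallel
    adds such as fpadd) are invisible to the counter and are not included.
    """
    unknown = set(event_counts) - set(FLOPS_PER_EVENT)
    if unknown:
        raise ValueError(f"not a countable group: {sorted(unknown)}")
    return sum(FLOPS_PER_EVENT[group] * count
               for group, count in event_counts.items())
```

For example, 10 additions, 5 multiplications, 20 trinary operations, and 3 Oedipus operations yield a bound of 10 + 5 + 40 + 12 = 67 operations. Any parallel adds executed by the same code would increase the true total without changing this figure.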
End users doing advanced tuning of large applications would most likely gain from a CPU core implementation that incorporated a performance-counter infrastructure. The most noticeable events that are not possible to detect are issued loads and stores, L1 cache events, branch unit events (such as branches correctly predicted compared with mispredicted branches), and instruction issues. The impressive performance available in modern CPU design is highly dependent on the ability of the code developer and compiler to generate instruction sequences in which branch prediction is mostly correct and the instruction cache hit ratio is maximized. Without hardware performance counters capable of generating a view inside the units of the core that control these aspects of the CPU, the code developer has no accurate way to determine success in fully utilizing the inherent computational power of the platform.
Related work
Vampir [20] is a commercial product for performance analysis that allows tracing and analysis of MPI applications. Several execution environments such as ParaWise 1 [21] provide an interface for generating Vampir traces. Two research projects on performance analysis are Paradyn** [22], developed at the University of Wisconsin, and Aksum, part of the Askalon [23] project conducted at the University of Vienna. Both aim at the automatic detection of performance bottlenecks. Tuning and Analysis Utilities (TAU) [24] was developed at the University of Oregon. It is a set of tools for analyzing the performance of C, C++, Fortran, and Java** programs. The advantage offered by Paraver is a high level of flexibility in computing performance indices and statistics. This usually allows the exploration of metrics of interest and the influence of the parallelization choices on them.
Figure 11
Representation of the execution of the Linpack broadcasts. The upper view displays the MPI call (yellow: broadcast; red: barrier; blue: send; white: receive). In the second view, the different colors represent the different communicators. The third view is an estimate of the number of active links (blue: 1; white: 2; red: 3). The fourth view is the equivalent bandwidth going out of a node along all links during the whole call (gradient from green to blue; dark blue: 200 MB/s).
Conclusions
In this paper, we have presented a set of tools devoted to performance analysis of the Blue Gene/L supercomputer. The tools in this set range from a hardware simulator and low-level libraries to visualization and analysis tools. They are currently being ported and adapted to the BG/L environment, and should not be considered as finished work. We have made initial explorations of the possibilities this new architecture provides for performance analysis.
BGLsim is a pseudo cycle-accurate simulator that runs full-system simulations and provides monitoring capabilities beyond the level possible with real hardware.
LibHPM, BGLperfctr, and PAPI are user-level libraries capable of managing the hardware performance counters available in BG/L and extracting information during application runtime. The MPItrace library collects traces during execution for later visualization and analysis.
Paraver and BGLnodes are visualization tools that present the traces obtained (including performance counters) and allow the user to analyze in detail what is happening inside the application.
Finally, we have demonstrated the power of an environment for collecting information about the execution and using it to explain the performance obtained. We have also presented a set of experiences optimizing code using information obtained by simulation. We have used the specific hardware performance counters in the torus network to analyze the behavior of the communications and determine the limitations of the processor in each node when dealing with up to six incoming and outgoing links. We have also analyzed the implementation of the MPI message layer and a hand-coded broadcast in the Linpack application.
Figure 12
Linpack broadcast bandwidth in (a) x direction, (b) y direction, and (c) z direction. 
