Understanding the behavior of an application is rarely a trivial task, due to the complexity of the system in which the application is executed, and the complexity of the application itself. 
Introduction
Field-Programmable Gate Array (FPGA) reconfigurability allows us to gain performance by creating application-specific hardware. Some high-performance computing platforms now include FPGAs that can be plugged directly into an Intel FSB [12] socket or an AMD Opteron [2] socket using the HyperTransport protocol. Some systems rely only on FPGAs, taking advantage of their embedded cores (e.g. PowerPC) for general computation and specialized hardware engines for the more demanding computation. However, failing to have a complete understanding of the application may result in underachieved performance. To address this, FPGA reconfigurability may also be used to create a custom profiler that will provide the knowledge needed to configure the hardware system for a specific application. This paper's focus is a profiler for the TMD, a heterogeneous multi-core multi-FPGA system designed at the University of Toronto [9] that uses the message passing paradigm for communications. The main purpose of this profiler is to give insight into the communications occurring between nodes and the computation performed by each node. The presence of different cores in the system increases the difficulty of profiling, since the profiler is required to be compatible with all of them. Furthermore, the possibility of having hardware engines introduces the need to implement the profiler's functionality both in software and in hardware.
The remainder of this paper is organized as follows: Section 2 discusses related work, while Section 3 and Section 4 provide background information on the TMD system and the BEE2 platform. Section 5 then describes the profiler, and in Section 6 two case studies are used to test it. The profiler's overhead is discussed in Section 7. In Section 8 we present future ideas to improve this work and, finally, in Section 9 we conclude.
Related Work
A large amount of research has been done to create or improve profilers. Profiling data can be gathered from hardware performance counters, which collect low-level performance metrics (e.g. clock cycles, instruction counts, cache misses), or from the source code itself by instrumenting it. In 1979, Unix systems included a tool called prof that would list each function and give the number of calls and the time spent in it. Gprof [3] was released in 1982 and included a call graph analysis that would provide not only the time spent in each function but also how much time each function spent on behalf of each of its callers.
Performance Application Programming Interface (PAPI) [8] gives access to hardware performance counters on modern microprocessors allowing the monitoring of high-level hardware events. For parallel systems, such as multicore processors, profilers that can relate events over different processors and provide information on the interactions of different processing units during a parallel application execution are needed.
Reconfigurable systems can be more challenging to profile than typical clusters of computers. The profilers available tend to gather data from the perspective of a CPU. However, FPGA systems may not have any conventional processor and will most certainly contain hardware engines that require a more hardware-aware profiler. Techniques have been developed to extract data out of an FPGA with little impact on performance, such as Xilinx's ChipScope [11] and Altera's SignalTap [1]. However, these tools were designed for hardware debugging and are far from adequate to be used as profilers.
The Snooping Software Profiler (SnoopP) [6] is a non-intrusive, real-time profiling tool developed at the University of Toronto. The profiler is able to count how many cycles an application running on a soft-core processor spent in a specific program counter (PC) address region.
The MPI Parallel Environment [4] provides the ability to profile different sections of MPI [10] code, including communications performed between nodes, through instrumentation of the source code. However, it is an intrusive profiler that will add close to 5% overhead to a typical application [5] . It includes a tool called JumpShot [4] that visualizes the profiler logs, allowing a user to see what each process is doing over a common timeline. The logs created by the profiler presented in this paper are compatible with the logs created by the MPI Parallel Environment. Moreover, it uses the same Application Programming Interface (API) to instrument the source code. This was done because the MPI Parallel Environment is part of the MPI distribution MPICH [4] and will allow us to easily port profiled applications from a traditional cluster to the TMD. Also, JumpShot is an easy to use visualization tool that is able to open considerably large profiler logs.
The TMD
The TMD is a heterogeneous multi-core, multi-FPGA system developed at the University of Toronto that achieves high performance by taking advantage of hardware acceleration using FPGAs. The system can be fully customized to a specific application by adapting the on-chip network topology and choosing adequate computation elements for each task. The possibility of having heterogeneous cores requires a flexible communication protocol that is not dependent on any particular computation element, and that simultaneously allows the seamless addition, removal or exchange of cores to better fit a particular application. This flexibility exists in the system both at the hardware level, where a node can be physically removed without changing the rest of the system, and at the software level, where the source code of an application is independent of the number and nature of nodes.
TMD-MPI
The message passing paradigm has become the most used communication paradigm in high-performance computing. MPI, the best known message passing library, provides a language and architecture independent communication protocol that is highly scalable and portable. The API is organized in layers built on top of each other. While the lower layers are hardware dependent and need to be adapted for different architectures, the top layers are standard and allow the user code to be usable across different platforms.
Each process is represented by a unique integer, called the rank, spanning from 0 to size-1, where size is the total number of processes on the system. TMD-MPI [9] is a lightweight implementation of MPI for embedded processors in a heterogeneous multiprocessor system. It uses the Rendezvous protocol, which is a synchronous communication mode where a process will only initiate a transfer when the receiver acknowledges the operation. The transfer is initiated by the producer process sending a request-to-send packet composed of its rank, the destination rank, the size of the message and a 32-bit tag, which is a user-defined message identifier. The receiver process will then send an acknowledge packet, allowing the producer to start the data transfer.
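The handshake described above can be illustrated with a minimal sketch. It is not the TMD-MPI implementation: the names RequestToSend, rendezvous and posted_receives are hypothetical, and the packet layout is reduced to the four fields named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestToSend:
    src: int   # rank of the producer
    dest: int  # rank of the receiver
    size: int  # message size (here: number of payload items)
    tag: int   # 32-bit user-defined message identifier


def rendezvous(src, dest, payload, tag, posted_receives):
    """Return the ordered packets exchanged for one transfer attempt.

    posted_receives is a set of (src, tag) pairs the receiver is ready for
    (an assumption of this sketch, not a TMD-MPI data structure).
    """
    rts = RequestToSend(src, dest, len(payload), tag & 0xFFFFFFFF)
    trace = [("RTS", rts.src, rts.dest)]
    if (rts.src, rts.tag) not in posted_receives:
        # No acknowledge yet, so the producer may not start the data transfer.
        return trace
    trace.append(("ACK", rts.dest, rts.src))
    trace.append(("DATA", rts.src, rts.dest))
    return trace
```

Calling rendezvous(1, 0, [1, 2, 3], 7, {(1, 7)}) yields the request-to-send, acknowledge and data packets in that order, while an empty posted_receives set leaves the transfer pending after the request-to-send.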
Hardware
The nodes can be divided into two types: computation nodes, which are responsible for all computation, and network nodes, which route message packets outside of the FPGA. All nodes interact with the on-chip network through a Network Interface (NetIf) core that contains the routing tables. Communications between all hardware blocks are done through Xilinx's Fast Simplex Links (FSLs) [11], which are unidirectional FIFOs that provide isolation between the Computation Elements (CEs), allowing different clock domains, while providing buffering for ongoing messages. Fig. 1 shows all types of nodes that can exist in the TMD.
Computation nodes
Computation can be performed by a Microblaze, a PowerPC or a hardware engine. Because the source code can be kept constant from platform to platform, due to the TMD-MPI abstraction, a Microblaze or a PowerPC can be used as an intermediate stage when porting an application from a traditional cluster to the TMD. After the application is successfully ported in software, if the task running on one of these processors is critical it may then be optimized by substituting it with a hardware engine. Because all communications are done using the TMD-MPI protocol, a special core that encapsulates the TMD-MPI functionality, the TMD-Message Passing Engine (TMD-MPE) [9], was designed so all hardware engines can send and receive messages from other nodes in full-duplex mode. This significantly simplifies the design process of a hardware engine, as the designer will not need to be concerned with the details of the protocol such as unexpected messages and packetization. Although the TMD-MPE was mainly designed to be used by engines, it can also be used by a Microblaze or a PowerPC, reducing the memory footprint of their code and, for the latter, increasing communication speed by allowing Direct Memory Access (DMA).
Off-Chip Communications Nodes
Off-chip communications nodes do not perform any computation and are present on the system to route message packets outside to another FPGA. Their configuration depends on which platform the TMD is being implemented on. This paper will only consider the Berkeley Emulation Engine 2 (BEE2) boards for implementation (see Section 4).
There are two possible types of off-chip communications on the BEE2 board implementation: intra-board and inter-board. The intra-board communications refer to messages sent between nodes on different FPGAs of the same board and are done using direct connection wires between the FPGAs, through the Interchip core.
The inter-board communications refer to communications between nodes on different boards and are done using the X Attachment Unit Interface (XAUI) standard. As with all blocks, the interface between the on-chip network and XAUI is done using the FSL standard.
BEE2 Platform
The BEE2 platform contains five Xilinx Virtex II Pro 70 FPGAs per board with 4 GB per FPGA of high speed DDR DRAM. The FPGAs are connected in a star topology, with the Control FPGA being the central node, and a ring topology between the User FPGAs. Each User FPGA has four 10 Gbps links, configured as XAUI links in our applications, while the Control FPGA has two. The Control FPGA is also connected to common I/Os such as 10/100 Ethernet, USB 1.1, RS232 serial and DVI.
Profiler
The reprogrammable nature of FPGAs can be used in our favour to create a custom profiler for an application that minimally impacts system performance and therefore captures the real behavior of the system. This profiler retrieves data from all TMD-MPI communication calls and from computation states defined by the user, whether on a processor or a hardware engine, and allows the visualization of all ranks on a timeline. Hardware is profiled by sampling specific hardware events. Software is profiled by embedding code into the MPI functions or using additional profiling functions.
Hardware
The profiler hardware should occupy minimal FPGA real estate and should not affect timing by forcing the application to run at a slower clock rate. Unfortunately, these goals are at opposite ends of the spectrum and a compromise had to be made. Whenever there was a choice, we accepted extra hardware to prevent performance loss.
One of the most critical points of a profiler is where to store data retrieved from the system and how to redirect it to that location. Because the BEE2 board has a considerable amount of memory (4 GB of DDR per FPGA), each BEE2 board stores the profiling data of its own FPGAs on the Control FPGA until the end of the computation, when the data is sent to a remote workstation. Furthermore, the profiling data does not use any of the communication channels used by the application, but a special network created exclusively for the profiler. Fig. 2 shows the high-level profiler architecture between a User FPGA and a Control FPGA, and Fig. 3 shows the connections between the profiler and the computation nodes. Each FPGA contains a 64-bit clock cycle counter, shown in Fig. 2, synchronized between all FPGAs. This counter serves as a global timer and is used to obtain the timestamps of events of interest on the system. The Tracers in Fig. 3 are the connection point between the application and the profiler, and each node will have up to three Tracers, depending on whether the computation is being profiled or not. This number of Tracers is needed because of the parallel nature of hardware, where at any given moment a node may be receiving, sending and computing simultaneously. This may also be true for a PowerPC or Microblaze when using a TMD-MPE unit.
The profiling data of a given instance of a state is kept in the Tracer until that instance finishes. Only then is the data sent, as one packet through the profiler network. This is done to prevent inconsistencies in the profiler data by assuring that either all data referenced to an instance is saved in memory, or that instance is completely discarded.
The Tracers are non-intrusive with the exception of when they are used on a processor, since communication from the processor to the tracer is done through an FSL and General Purpose Input/Output (GPIO), delaying the main computation on the processor. However, when profiling only the computation, the overhead is not significant due to the computation time itself.
The communication Tracers are designed specially for the TMD-MPE. They record all headers, tags and timestamps of the start and end of the data transfers, request-to-send packets and acknowledge packets in their registers and FIFO, shown in Fig. 4 a). The computation Tracers, shown in Fig. 4 b), only register the timestamps of the start and end of user-defined states, except when used in conjunction with a processor (e.g. MicroBlaze, PowerPC) without TMD-MPE, where they will also register all MPI calls. For the latter case, the computation Tracers have one set of registers for the MPI calls, and one set of registers and two stacks for the user-defined states. The two sets of registers allow profiling of MPI calls using user-defined states and the two stacks allow nested user-defined states. Sampling of events of interest in the hardware engines is done by computation Tracers built specially for each specific hardware engine.
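As a rough software model of how a computation Tracer can support nested user-defined states while sending only complete records, consider the sketch below. It is illustrative only: the class name ComputationTracer and its methods are hypothetical, timestamps are passed in by the caller instead of being read from the global counter, and a single stack of (state, start) pairs stands in for the two hardware stacks mentioned in the text.

```python
class ComputationTracer:
    """Toy model: open states live on a stack; a record is sent only on close."""

    def __init__(self):
        self._open = []    # stack of (state_id, start_timestamp)
        self.packets = []  # complete records handed to the profiler network

    def state_start(self, state_id, now):
        # Nothing is sent yet; the start timestamp stays inside the tracer.
        self._open.append((state_id, now))

    def state_end(self, state_id, now):
        open_id, start = self._open.pop()
        if open_id != state_id:
            raise ValueError("states must be closed in LIFO order")
        # One packet per finished instance: state, start and end timestamps.
        self.packets.append((state_id, start, now))
```

With an outer state opened at cycle 10, an inner state spanning cycles 12 to 20 and the outer state closed at cycle 30, the inner record is emitted first and the outer one second, and no partial record ever leaves the tracer.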
All Tracers connect to the Gather unit, shown in Fig. 2 , that will serve as a multiplexor to send the data to the Control FPGA. On the Control FPGA side, the Collector core, shown in Fig. 2 , will receive data from the other FPGAs and from any core being profiled in the Control FPGA, and will write it to the DDR. The Collector is also responsible for time stamp synchronization between each board. Synchronization is done by all boards sending their time stamp counter periodically to the root board [7] . 
Software
The profiler library uses the same API as the MPI Parallel Environment of the MPICH distribution. Therefore, an instrumented source code, through embedded calls, can be easily ported from a traditional cluster to the TMD. The system outputs the profiler logs in text format, which are then converted to the CLOG2 format used by the MPI Parallel Environment. The CLOG2 files will then be converted into SLOG2 (Scalable LOGfile), a format optimized for visualization, and used by JumpShot.
Sample Applications
To demonstrate the usage of this profiler we used the MPI collective call Barrier and a heat equation application. Both applications were run on one BEE2 board with each User FPGA containing two PowerPC processors, each using a TMD-MPE with DMA. All computation was done by the PowerPCs configured as shown in Fig. 3 a).
Barrier
The barrier is a collective call in which no node will advance until all nodes have reached the barrier and it is used when the application requires synchronization points. This paper demonstrates two versions: a sequential barrier and a binary tree barrier. For the sequential barrier the root node will wait for all other nodes to report to it and will then reply to all nodes so they can advance with the computation. This method scales very poorly as there will be a large amount of contention while the nodes communicate with the root node. For the binary tree version a node will report to its parent when its children finish reporting to it, assuring that when the root node gets a message, all other nodes have reached the barrier. The root node will then signal its children and so on until the signal reaches the leaf nodes. Fig. 5 shows communication patterns for the two barrier implementations. The numbers indicate the rank of the nodes. Fig. 6 and Fig. 7 show the data retrieved from the profiler when running the two algorithms on an eight-node network. The horizontal axis shows the time elapsed, and the vertical axis shows the node ranks. Each rank is represented by two timelines, one for receives (RECV), and another for sends (SEND). Because the Barrier only synchronizes the nodes and does not transfer any data, only the request to send packets and acknowledge packets can be seen on the figures.
In the sequential algorithm, rank 0 can be seen receiving from all other ranks and then replying to each of them sequentially. During the execution, the non-root ranks are stalled, waiting for rank 0 as shown by the bars on the Receive timelines indicating the times when the nodes are waiting in the receive state.
In the binary tree algorithm the leaf nodes (4, 5, 6, 7) can be seen starting the barrier by communicating with their respective parents. When rank 0 finishes receiving messages from both its children, it replies to them and the tree is traversed downwards until the signal reaches all leaf nodes. While traversing the tree both upwards and downwards, communications are done simultaneously. During the execution of the barrier, up to two Send-Receive pairs can be seen occurring at the same time. The binary tree algorithm scales logarithmically and does not introduce considerably more overhead than the sequential version, making it faster for three or more processors, as expected.
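The two communication patterns can be compared with a small sketch that only enumerates the messages of each barrier. It assumes a heap-style numbering in which the parent of rank r is (r - 1) // 2; this numbering is an assumption of the sketch, chosen because it gives leaves 4 to 7 on eight nodes, and all function names are hypothetical.

```python
def sequential_barrier(n):
    """Messages (src, dest) when every rank reports to rank 0 and is released by it."""
    up = [(r, 0) for r in range(1, n)]
    down = [(0, r) for r in range(1, n)]
    return up + down


def tree_barrier(n):
    """Messages (src, dest) for a binary-tree barrier with heap-style parent links."""
    def parent(r):
        return (r - 1) // 2

    # Children have higher ranks than their parent, so descending order
    # guarantees a node reports only after its children have reported.
    up = [(r, parent(r)) for r in range(n - 1, 0, -1)]
    # The release travels from the root towards the leaves.
    down = [(parent(r), r) for r in range(1, n)]
    return up + down


def root_load(messages):
    """Number of messages that rank 0 has to send or receive."""
    return sum(1 for src, dest in messages if 0 in (src, dest))


def leaves(n):
    """Ranks without children under the heap-style numbering."""
    return [r for r in range(n) if 2 * r + 1 >= n]
```

For eight nodes both variants exchange the same number of messages, but every one of them passes through rank 0 in the sequential version, whereas the root of the tree handles only the messages of its two children.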
The Heat Equation Application
The heat equation is a partial differential equation that describes the temperature change over time, given an initial temperature distribution and boundary conditions. The heat equation application solves it numerically on a matrix of temperature values.
The matrix is divided into equal parts between all nodes and the algorithm used consists of the following steps:
Step 1. Receive data from root node
Step 2. Exchange rows with neighbors
Step 3. Perform computation
Step 4. If iteration number is less than N_STEPS go to Step 2, else go to Step 5
Step 5. Send data to root node
MPI supports both blocking and non-blocking calls. Non-blocking calls can be extremely useful when there is computation that carries no data dependence with the data to be transferred. If the system architecture allows, the computation can then be done simultaneously with the data transfer. Parallel systems like the TMD can take full advantage of such calls. Each node can have a TMD-MPE core that will be in charge of the communications, freeing the CE to perform whatever computation there may be.
We implemented two different versions of the heat equation calculation that differ only in the type of communication used in Step 2. When using blocking calls, the nodes will stall computation while rows are exchanged between neighbors, even though there are independent rows that could be calculated while waiting for the row exchange. The non-blocking calls, on the other hand, allow computation of all independent rows on each node after initiating the row transfers, improving the computation to communication ratio. Both versions of the application were run with a matrix of 32 by 512 and N_STEPS = 64. The profiler log for the version using the blocking calls can be seen in Fig. 8, and using the non-blocking calls in Fig. 9. For this example we also instrumented the source code to visualize the processor computations. Therefore, each rank is now represented by three timelines: receives (RECV), computation (COMP) and sends (SEND).
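The difference between the two versions can be sketched for a single node and a single iteration. The sketch assumes a four-neighbour averaging update with fixed end columns, which the text does not specify, and it replaces the actual MPI calls with a hypothetical receive_halos callback that stands in for waiting on the row transfers; all function names are illustrative.

```python
def update_row(above, row, below):
    """One averaging step for a row; the two end columns are kept fixed."""
    new = row[:]
    for j in range(1, len(row) - 1):
        new[j] = 0.25 * (above[j] + below[j] + row[j - 1] + row[j + 1])
    return new


def step_blocking(block, halo_top, halo_bottom):
    """Neighbour rows are already received; every row is updated afterwards."""
    padded = [halo_top] + block + [halo_bottom]
    return [update_row(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def step_overlapped(block, receive_halos):
    """Update interior rows first, then the edge rows once the halos arrive."""
    m = len(block)
    new = [None] * m
    for i in range(1, m - 1):
        # These rows depend only on local data, so they can be computed
        # while the row exchange is still in flight.
        new[i] = update_row(block[i - 1], block[i], block[i + 1])
    halo_top, halo_bottom = receive_halos()  # stands in for waiting on the transfer
    padded = [halo_top] + block + [halo_bottom]
    for i in (0, m - 1):
        # Only the first and last local rows need a neighbour's row.
        new[i] = update_row(padded[i], padded[i + 1], padded[i + 2])
    return new
```

Both functions produce the same rows; only the order of work changes, which is what lets the interior computation hide the communication time in the non-blocking version.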
The profiler log visualization of the blocking version shows a clear break of computation while the rows are exchanged between processors. Although the program is being run in parallel across multiple nodes, each node is running sequentially, such that receives, sends and computation never overlap. The non-blocking version, on the other hand, takes full advantage of the communication infrastructure, and while the TMD-MPE performs the communications, the PowerPC is computing all independent data. This can be seen on the profiler log visualization of the non-blocking version, where the majority of the ranks are receiving, sending and computing simultaneously.
Profiler Overhead
Table 1 shows the resource overhead of using the profiler for the example configuration, where the percentages in brackets are fractions of a Virtex II Pro 70 FPGA. Part of this overhead is due to the Tracers' buffers on the User FPGA and the Collector's buffer on the Control FPGA. The size of these buffers can be reduced to save resources; however, profiling events may then be lost, compromising part of the profiling data. Another option would be to use BRAMs for buffering, reducing the size of the cores. Unfortunately, BRAMs can be a scarce resource depending on the design, so we chose not to use them in the profiler. More specifically, each communication Tracer occupies 526 LUTs and 1000 flip-flops. A Tracer for a processor will occupy 1196 LUTs and 1521 flip-flops when profiling both communications and computation (no TMD-MPE), and 855 LUTs and 1200 flip-flops when profiling just computation. The overhead of a computation Tracer for a hardware engine will depend on the hardware engine itself, because each event being captured may require some custom logic to make the event available to the Tracers.
The overhead introduced by the software libraries to profile the source code of the Microblaze or the PowerPC consists of communications with an FSL and GPIO, which on average will take one hundred and fifty cycles per event.
As long as the programmer takes some care in choosing the length of the user states, this will not affect performance considerably. If the communications are being profiled through software, meaning the processors do not have the aid of a TMD-MPE unit, the system will suffer a performance decline, making the data retrieved from the profiler less helpful for improving performance. Profiling the TMD-MPE or any hardware engine will not affect system performance as long as enough hardware resources are available so the system frequency can be maintained.
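The effect of the state length on the software overhead can be estimated with a short calculation. The sketch below uses the average of 150 cycles per event quoted above and assumes that each user state produces two events, one at its start and one at its end; that count, like the function name, is an assumption of the sketch.

```python
CYCLES_PER_EVENT = 150  # average cost per event quoted in the text


def profiling_overhead(state_cycles, events_per_state=2):
    """Fraction of extra cycles added to one profiled user state.

    events_per_state=2 assumes one start and one end event per state.
    """
    return events_per_state * CYCLES_PER_EVENT / state_cycles
```

Under these assumptions a user state of 30,000 cycles carries roughly 1% overhead, while a state of only 3,000 cycles would carry about 10%, which is why short states should be instrumented sparingly.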
Future Work
With this working profiler, the next step will be to focus on more computationally demanding applications that contain more complex communication patterns (e.g. molecular dynamics). As the complexity of the applications increases, a tool that can perform a pre-analysis of the data and identify possible bottleneck areas can become extremely useful.
Improvements to the hardware blocks to reduce their footprint and increase functionality, such as tracking critical messages, are also planned.
Conclusion
Profiling of multiprocessor applications can be done using existing software packages. This paper addresses the need to do profiling when there is a heterogeneous mix of processing elements, including some processing elements that are implemented purely in hardware, such as would be found in most reconfigurable processing applications.
In this paper, we have presented an MPI-based profiler capable of being used on a system including FPGA-based hardware processing elements. The profiler is able to perform all the functionality of an already existing software tool, the MPI Parallel Environment, that is used for traditional software-programmed clusters. The profiler logs are also compatible so the JumpShot viewer can be used. When hardware resources are available, the hardware profiler can be less intrusive than the software version because the tracing can be done in parallel with the running application. The tool has been shown to work on several applications, demonstrating that the profiling of MPI-based applications including pure hardware processing elements can be done in a way that is compatible with existing practices for profiling MPI.
