Abstract-A new multicomputer performance monitoring system is described in this paper. Where possible, the system employs portable performance monitoring instrumentation technology and leverages previous work. Trace event acquisition is hardware assisted and based on the MultiKron, a single-chip measurement solution developed at the National Institute of Standards and Technology. The user interface is based on the Pablo Performance Analysis Environment, a visualization and sonification toolkit developed at the University of Illinois. The SPIscope is introduced as a component that bridges the gap between these emerging standard interfaces. The SPIscope provides a high-bandwidth path to a large secondary storage for recording performance data. Connectivity to the user's performance analysis workstation is via a TCP/IP LAN.
I. INTRODUCTION
Anyone who has worked with distributed memory parallel systems can attest to their sometimes mystifying and often disappointing performance. This is not surprising considering system complexity. Computer designers and application analysts must somehow optimally orchestrate the actions of architectural components possessing limited bandwidth and performance-crushing latencies. Intertwined with architectural considerations is a wide range of computer science issues such as algorithmic techniques, programming paradigms, advanced compilers, and operating systems. Little wonder that in the relatively uncharted area of parallel application development, tools generating key information, suitably displayed, are needed to help identify sources of errors and performance bottlenecks and to point to corrective action.
_____________________________________
This research is supported in part by National Science Foundation CISE Instrumentation Program grant CDA-9222917.
Unfortunately, the area of parallel performance monitoring has been plagued by the absence of standards and the lack of a clear consensus on user needs. With no standards, vendor investment is higher, with each company pursuing its own unique approach. The significant cost benefits of using commodity performance monitoring hardware and software are unavailable. Meanwhile, users are reluctant to invest in new tools that are not portable across platforms, and both user experience and demand remain limited.
A survey of performance monitoring systems reveals a diverse set of approaches, complicating the task of identifying potential standards [1]. Program state may be sampled statistically, or state changes may be signaled by event traces. Trace data must be moved, processed, and stored. When and where this is done is system dependent. System analysis may be on-line, where a portion of the event information is available during runtime, or off-line on a postmortem basis. Trace data may use various representation schemes and visualization tools. Fortunately, recent research has increasingly recognized the need to promote portable solutions.
Performance monitoring systems are further distinguished by their trace generation mechanisms. Traces may be formed by three monitoring schemes: (1) software, which develops probes in software and relies solely on existing system resources; (2) hardware, which probes physical signals with dedicated instrumentation and uses external hardware for processing and storage; or (3) hybrid, a technique that combines the software and hardware approaches. While software monitoring allows greater portability and requires no special purpose hardware, it is inherently more intrusive and likely to require substantial CPU cycles and memory, thus introducing a probe effect. This can lead to perturbation in program execution, even masking of error conditions. Hardware monitoring, on the other hand, is minimally intrusive, but the monitoring of physical signals makes it very difficult to relate captured data to source-level execution. Furthermore, without on-chip access, many of the system components such as caches and memory management units are simply not accessible to hardware monitors. In contrast, hybrid monitoring has the same high-level correlation of event data as software monitoring, but takes a proactive approach of providing special purpose hardware to receive event data through low-overhead writes [2][3][4][5]. Typically, hybrid monitors apply a high resolution global timestamp to event data and pass it along for external processing and storage. Capturing event traces involves a constant, relatively insignificant overhead.
Although the advantages of hybrid monitoring have been known for some time, hybrid schemes remain generally unavailable. One reason is a lack of architectural support for the monitoring hardware. Standard interfaces are needed to generate industry participation and allow instrumentation portability. This paper describes a performance monitoring system implemented for a developmental multicomputer (SuperMSPARC) at the NSF Engineering Research Center (ERC) for Computational Field Simulation. Using the hybrid approach, the design is based on a systems perspective that promotes standards by leveraging previous work and contributing a missing component, the SPIscope. The overall goal of this research is to promote performance monitoring as a tool for optimizing parallel architectures and algorithms for scientific computations. The research seeks to increase the momentum for improved performance monitoring tools by employing standards-promoting technology at the event acquisition, transport, storage and visualization levels in demonstration prototypes.
Previous work is leveraged at both the host (compute nodes) and user interfaces. VLSI technology from the National Institute of Standards and Technology (NIST) forms the core of the host interface [6], and the user interface is based on the Pablo Performance Analysis Environment [7], which was developed at the University of Illinois. The SPIscope bridges the gap between the NIST network interface specification and Pablo's self-defining trace format.
For several years, engineers at NIST have been working to integrate the hardware required for effective hybrid performance monitoring at a reasonable cost and size. Their recent effort, the MultiKron [8], is a single-chip measurement solution with a separate data collection network. The MultiKron design is in the public domain, and is suitable for incorporation on commercial processor boards. NIST promotes the development of standards based on the MultiKron by making both the chip and system design description available, through technology transfers, and by making indefinite loans. Further enhancements of this design continue with the soon-to-be-released MultiKron II. Several other universities have evaluated or have designed MultiKron-based hardware.
The MultiKron design has also influenced Intel's performance monitoring efforts for the Paragon.
Pablo is a visualization and sonification toolkit designed to be a de facto standard based on a philosophy of portability, scalability, and extensibility. Pablo allows the user to combine simple displays with minimal effort to get application specific information. Application independence of the data analysis component of Pablo is achieved through a self-defining trace data format.
The SPIscope, our contribution, manages data acquisition from the MultiKron collection net- host interface and SPIscope designs are described in Sections III and IV. The visualization layer, including probe format specification and conversion issues, is discussed in Section V, and the paper concludes with a case study in Section VI and a summary of key ideas, project status, and future plans in Section VII.
II. SYSTEM OVERVIEW
The SuperMSPARC multicomputer, along with its SPIscope attachments and performance analysis visualization layer, provides a platform for multidisciplinary research. The system shown in Fig. 1 is aimed at improving the performance of large-scale computational field simulation (CFS) problems, which are among the computational grand challenges put forth by our nation's High Performance Computing and Communications Program. The research areas include parallel algorithm development, portable programming environments, low-latency interconnection networks, active memory hierarchies, and advanced performance monitoring.
The monitoring environment on the SuperMSPARC is designed to provide feedback to the application designer and tie key performance events to particular places in the execution of the user's code. Using instrumented libraries, the SuperMSPARC probes are embedded in applications running in the Object-Oriented Fortran (OOF) environment developed at the ERC [9]. This environment provides extensions to Fortran for declaration of object classes and dynamic creation of object instances. The current instrumentation system monitors key events in this environment such as the creation and deletion of objects and exchange of messages between objects. In order to make the instrumentation system as flexible as possible, we have provided a probe dictionary to allow system builders to easily modify the probe information accepted by the instrumentation system.
[Fig. 1: system diagram; recoverable block labels: SuperMSPARC, SPIscope, User Workstation.]
III. HOST INTERFACE DESIGN
A function of the PAB is to provide an SBus interface to the MultiKron for each processor cluster. Only this interface need be changed for compatibility with other node architectures. Node event data are written to special PAB-monitored locations, where they are collected, formatted, time- PABs and its 100 nsec resolution is sufficient to allow a total ordering of events from all nodes. The only intrusiveness comes from the formatting and writing of the event data.
Since lower intrusiveness is the principal advantage of hybrid (versus software) monitoring, minimizing the overhead associated with writing probe data is important. In the traditional UNIX approach, the performance monitoring hardware would be accessed through system calls to a kernel-resident device driver. For the small trace sample sizes supported by the MultiKron (up to ten bytes of user data), the system call overhead would be an order of magnitude more costly than the actual writing of the probe data. The system call overhead can be eliminated by mapping the performance monitor device registers into the application's address space. This is the approach adopted in SuperMSPARC.
To quantify the effects of this decision, measurement of this intrusiveness has been done on two levels. First, the time to actually generate the individual probes was measured. The probe that contains the most user data, and therefore takes the longest to build and write, is the message-send probe. Measurements (taken with the hardware itself) indicate this probe takes an average of 2.02 msec. In contrast, generating the same probe with software takes an average of 100.6 msec. The time for software probes could have been optimized slightly if the generation of timestamps did not directly emulate those produced on the MultiKron. This emulation was done to allow the visualization tools to directly use either type of probes transparently, and does not account for a significant percentage of the time taken to generate software probes.
The second measure of intrusiveness is at the application level. An application programmer is primarily interested in the effect monitoring has on the overall execution time of the application being monitored. Several large computational fluid dynamics applications were run with and without probes being collected. The difference in the overall execution time of the applications was statistically insignificant. In all cases, the difference in the times was no greater than the run-to-run variation that the application already exhibited. In addition, for several applications, the average time with probes was slightly less than that for the application without probes. These anomalies are still under investigation.
The mapping of the performance monitor device registers into the application's space introduces a new design challenge: managing multiprocessor contention for the shared resource without system calls. One approach is to add support for a hardware-assisted semaphore. With this approach, reads and writes from a memory-mapped semaphore address would be used to implement mutually exclusive access to the performance monitoring hardware. Serialized access ensures that probe writes, which may require up to three bus transactions, are indivisible operations.
However, this technique suffers from inefficiencies resulting from context switches that occur while access is locked, and the complexities associated with deadlock prevention in the presence of preemption due to software signals.
We have adopted a hardware approach that addresses the drawbacks of user-level semaphores. An SRAM is used as a staging area for probe data. The PAB contains an address FIFO that is queried to obtain an identifier for a free SRAM entry. The first two probe words (32-bit quantities) are written to the specified SRAM locations. The final probe word is written directly to the probe FIFO. When the final probe word has been written, control logic on the PAB transfers the associated probe words in the SRAM to the probe FIFO. The entire probe is atomically transferred to the MultiKron and the SRAM identifier returned to the address FIFO. In this manner, writes of independent probe data may be coherently interleaved.
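The staging protocol described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the PAB logic itself: the class name `StagingBoard`, its method names, the slot count, and the way the final write identifies its SRAM entry (passed explicitly here) are all hypothetical.

```python
from collections import deque


class StagingBoard:
    """Toy model of the PAB staging path (names and sizes are assumptions)."""

    def __init__(self, slots=4):
        self.address_fifo = deque(range(slots))  # identifiers of free SRAM entries
        self.sram = {}                           # slot id -> words staged so far
        self.probe_fifo = []                     # complete probes, in commit order

    def acquire_slot(self):
        # Reading the address FIFO hands out one free SRAM entry.
        return self.address_fifo.popleft()

    def write_staged(self, slot, word):
        # The first probe words are parked in the SRAM entry for this slot.
        self.sram.setdefault(slot, []).append(word & 0xFFFFFFFF)

    def write_final(self, slot, word):
        # The final word commits the whole probe as one unit and
        # returns the slot identifier to the address FIFO.
        staged = self.sram.pop(slot, [])
        self.probe_fifo.append(tuple(staged) + (word & 0xFFFFFFFF,))
        self.address_fifo.append(slot)


# Two writers interleave their bus writes, yet each probe stays intact.
board = StagingBoard(slots=2)
a = board.acquire_slot()
b = board.acquire_slot()
board.write_staged(a, 0xA1)
board.write_staged(b, 0xB1)
board.write_staged(a, 0xA2)
board.write_staged(b, 0xB2)
board.write_final(b, 0xB3)
board.write_final(a, 0xA3)
```

In this model the probes appear in the probe FIFO in the order of their final writes, and no probe ever contains words from another writer, which is the property the hardware approach is meant to provide without locks.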
Direct access to the MultiKron registers is also provided to allow manipulation of the filter settings and resource counters. The MultiKron provides 16 filter levels and 16 resource counters. A 4-bit filter value is included in each probe. If filtering is enabled for the specified level, the probe is discarded by the MultiKron; this allows selective measurements from instrumented programs without recompilation. Each 32-bit resource counter can be individually configured to count clock cycles, external signal transitions, or software events. The counters represent a compromise between resolution and storage requirements that is particularly useful for frequent events.
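The filtering rule above (16 levels, a 4-bit filter value per probe, discard when filtering is enabled for that level) can be sketched as a 16-bit mask. The class and method names below are hypothetical, and the real MultiKron register layout is not described here, so the bit assignment is an assumption.

```python
class ProbeFilter:
    """Toy model of 16 filter levels held in one 16-bit mask (layout assumed)."""

    LEVELS = 16

    def __init__(self):
        self.mask = 0  # bit i set means filtering is enabled for level i

    def _check(self, level):
        if not 0 <= level < self.LEVELS:
            raise ValueError("filter level must fit in 4 bits")

    def enable(self, level):
        self._check(level)
        self.mask |= 1 << level

    def disable(self, level):
        self._check(level)
        self.mask &= ~(1 << level)

    def accepts(self, level):
        # A probe is discarded when filtering is enabled for its level.
        self._check(level)
        return not (self.mask >> level) & 1


# Turn off level 3 probes at run time; other levels are still recorded.
probe_filter = ProbeFilter()
probe_filter.enable(3)
kept = [level for level in (1, 3, 4) if probe_filter.accepts(level)]
```

Because the decision is taken from a register value rather than from the program text, the same instrumented binary can be measured at different levels of detail.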
The MultiKron trace sample format includes an 8-bit header, a 32-bit source identification field and a 48-bit user data field. Header sub-fields signal node id, sample type and error conditions. The source identification field comes from one of eight source identification registers. It was intended by the MultiKron's designers that the source register be selected by hardware signals identifying the processor issuing the probe. Ideally, the register would be updated at each context switch so that only six bytes of user data would be written per probe. Since processor identification signals are not available on the SBus, the source identification field is treated as user data (i.e., the PAB control logic transfers the four bytes of user data to a fixed source identification register, which is always selected). The three bus transfers required to write ten bytes of user data leave two bytes available for specifying the filter value, distinguishing between trace and resource samples, and identifying long probes. Long probes are related sequences of trace samples that accommodate more than ten bytes of user data.
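As a rough illustration of the field widths named above (8 + 32 + 48 = 88 bits, or 11 bytes), the following sketch packs and unpacks the three fields. The field order, the big-endian byte order, and the function names are assumptions made for the example; the actual MultiKron wire format is not specified in this section, and header sub-fields are not modeled.

```python
def pack_sample(header, source_id, user_data):
    """Pack the three named fields into 11 bytes (order and endianness assumed)."""
    if not 0 <= header < (1 << 8):
        raise ValueError("header must fit in 8 bits")
    if not 0 <= source_id < (1 << 32):
        raise ValueError("source_id must fit in 32 bits")
    if not 0 <= user_data < (1 << 48):
        raise ValueError("user_data must fit in 48 bits")
    value = (header << 80) | (source_id << 48) | user_data
    return value.to_bytes(11, "big")


def unpack_sample(raw):
    """Inverse of pack_sample: return (header, source_id, user_data)."""
    if len(raw) != 11:
        raise ValueError("expected exactly 11 bytes")
    value = int.from_bytes(raw, "big")
    return value >> 80, (value >> 48) & 0xFFFFFFFF, value & ((1 << 48) - 1)


raw = pack_sample(0x5A, 0xDEADBEEF, 0x0123456789AB)
fields = unpack_sample(raw)
```

On the SuperMSPARC the 32-bit source field carries user data as well, so the last two fields together hold the ten bytes of user data mentioned above.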
The MultiKron internally buffers samples in an 8-entry FIFO prior to output on the data collection network. Earlier performance instrumentation chips developed by NIST utilized a bytewide token-ring collection network; however, the MultiKron employs a simple output design in- 
IV. SPISCOPE DESIGN
The primary function of the ERC-developed SPIscope is to provide a high-bandwidth link to the large secondary storage needed to store event traces received from the SPInet. While concerns about the volume of performance data that will be generated by massively parallel systems are valid, capabilities for storing large traces without data loss are also important. Unabridged traces will remain viable because: (1) most debugging and performance optimization is performed on smaller-scale systems [10], (2) phase analysis techniques can be employed to partition the optimization process [11], and (3) some useful performance analysis tools are based on complete traces. Near-critical path analysis is an example of a tool that relies on unabridged traces [12]. Near-critical path analysis includes the synergistic effects of activities on the k longest execution paths to quantify the overall benefit associated with improving specific critical path activities.
Scalability is addressed using a distributed design that replicates the SPIscope for clusters of monitored nodes. A common timestamp clock and reset are required for all SPIscope units. SPIscope disk subsystems operate independently since the chronological order of the event data is established by the timestamp. Additional support for managing the massive amounts of data generated by large systems is provided by a probe filtering mechanism that eliminates event data at the source before it enters SPInet. Dynamic control of these filter settings increases the ability to finely focus the monitoring process.
The SPIscope provides improvements over earlier performance monitoring recorder technology developed in conjunction with the SUPRENUM multiprocessor [3]. The ZM4 hardware monitoring system uses PC/AT computers to record trace data and relies on the PC/AT bus to provide a standard input interface. The recording bandwidth of each unit is limited to 10,000 events per second by the data transfer rate of the PC/AT bus.
The SPIscope was designed in modular fashion to maximize off-the-shelf content and permit a phased implementation. SPIscope components include a VME chassis, five commercially available board-level products, an array of commercially available disk drives, and one custom interface, the SPIscope Control (SPIcontrol) board. Off-the-shelf boards include a Motorola MVME162 single-board computer (SBC), two dual-ported memory boards, and two SCSI-2 controllers.
The VME backplane provides multiple independent data paths for increased performance in multi-master systems. Both the VMEbus and the VSBbus are used by the SPIscope. The VSBbus is a local subsystem bus tailored for efficient memory block transfers. The SPIcontrol is a VSB master and the memories are VSB slaves. Performance data received from the SPInet is transferred from the SPIcontrol to the memories over the VSBbus. The SCSI controllers are VME64 masters and the memories are VME64 slaves. Performance data is transferred from the memories to the SCSI controllers using the VME64 block transfer protocol.
Software executing under the VxWorks operating system on the SBC controls the SPIscope operations. VxWorks provides high-performance real-time kernel facilities for multitasking, intertask synchronization, communications and interrupt handling as well as support for UNIX source-compatible sockets. The SBC hardware contains a MC68040 microprocessor, an integrated Ethernet interface, and four IndustryPack mezzanine bus ports. The IndustryPack ports provide an upgrade path for higher performance workstation interconnections and an independent bus for accessing performance data on-line without degrading the VME bandwidth available to the recording function.
The SPIcontrol manages the SPInet token ring; for example, logic is included to detect and regenerate lost tokens, validate received probes, and transmit MultiKron filter settings under SBC control. The SPIcontrol also contains logic to support a VSB DMA engine for transferring valid probes to memory. Enhanced versions of the SPIcontrol board could include logic for breakpoint detection, on-line compound event filtering, and on-line analysis.
A block diagram illustrating the SPIscope data paths is provided in Fig. 3. The maximum transfer rates of the data paths are summarized in Table I. Notice that the maximum rate to the disks varies, with transfers to the outer tracks offering higher performance. If only one drive is available on each SCSI bus, the disk transfers are clearly a bottleneck in matching the input rate from the data collection network, especially when overheads such as track-to-track seek times and rotational latencies are considered. To overcome this problem, probe data is striped across the array of eight drives in a controlled fashion, allowing all drives to simultaneously transfer data to and from memory with a minimum of head movement. The entire disk array is viewed as one logical device with consecutive logical disk blocks mapped in sequence to drives 0 through 7. Sustained disk array bandwidth is in excess of 20 MBytes/sec. Scalability to even higher bandwidths could be realized with multiple SPIscope recording units and a hierarchical collection network. Event clustering in which the peak event rate exceeds the above figures is accommodated by the SPInet token ring, since the PABs include a 1 KByte event data FIFO. Also, a single node cannot retain the token indefinitely because the SPInet bandwidth exceeds the peak probe generation rate (currently less than 6.5 MBytes/sec for the SuperMSPARC). However, should the network for any reason be unable to accept probes, a mechanism to throttle probe generation is included in the PAB. When the address FIFO is read preceding each event write, a flag indicates the fill state of the event data FIFO. As a user option, the PAB driver may elect to slow the application or allow it to continue unimpeded. In the latter case, any data overruns that may occur are tagged in the trace sample header.
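The striping rule stated above, with consecutive logical disk blocks mapped in sequence to drives 0 through 7, amounts to a round-robin mapping. The sketch below shows that arithmetic; the function name and the idea of returning a per-drive block index are illustrative assumptions, not a description of the SPIscope software.

```python
NUM_DRIVES = 8  # the array of eight drives described in the text


def locate_block(logical_block, num_drives=NUM_DRIVES):
    """Map a logical block number to (drive, block index on that drive)."""
    if logical_block < 0:
        raise ValueError("logical block numbers are non-negative")
    return logical_block % num_drives, logical_block // num_drives


# A run of 16 consecutive logical blocks touches every drive exactly twice,
# and each drive sees adjacent block indices (little head movement).
placement = [locate_block(n) for n in range(16)]
```

With this layout a long sequential trace write keeps all eight drives busy at once, which is how the aggregate rate can exceed what any single drive sustains.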
V. VISUALIZATION LAYER
Visualization is vitally important in demonstrating the benefits of performance monitoring.
The network interface between the SPIscope and the visualization layer is based on the client-server paradigm. A client process executing on the user's performance analysis workstation requests performance data from a server daemon resident on the SPIscope. The client process is also responsible for sorting the probes from different clusters to establish a total chronological ordering, and converting the MultiKron samples to the format required by the visualization layer. Our prototype system converts the performance data into Pablo's Self-Defining Data Format (SDDF). A probe dictionary drives the conversion. Similar to the Pablo SDDF, the probe dictionary allows the definition of user-specified probe formats. The probe dictionary does not inherently restrict the output format to SDDF. This flexibility allows consideration of other performance visualization systems in this layer.
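The sorting step performed by the client can be sketched as a k-way merge on the global timestamp. The example assumes that each cluster's trace is already in timestamp order when it arrives at the client, which the text does not state; the record layout `(timestamp, cluster, event)` and the function name are likewise hypothetical.

```python
import heapq


def merge_cluster_traces(*streams):
    """Merge per-cluster traces into one list ordered by timestamp.

    Each stream is assumed to be sorted by its first element already.
    """
    return list(heapq.merge(*streams, key=lambda record: record[0]))


# Timestamps are in ticks of the common clock shared by all SPIscope units.
cluster0 = [(100, "c0", "send"), (450, "c0", "recv")]
cluster1 = [(120, "c1", "create"), (300, "c1", "send")]
ordered = merge_cluster_traces(cluster0, cluster1)
```

A merge of this kind reads each stream once and never needs the whole trace in memory if the result is consumed lazily, which matters for large unabridged traces. If the per-cluster streams were not presorted, a full sort on the timestamp would be needed instead.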
The Pablo environment can be used to explore the performance data from a variety of perspectives. The Pablo toolkit supports construction of custom performance analysis environments, and encourages experimental exploration of the data. For example, we have developed the object tree display shown in Fig. 4. This display allows the user to visualize the hierarchy of object creations in an object-oriented parallel application. An interactive button associated with each object can be pressed to view profile information about operator execution time.
In addition to the Pablo displays, several other displays were implemented based on the SDDF data format. The availability of the SDDF library made the implementation of these displays 
VI. CASE STUDY: A CFD APPLICATION
UNCLEkad, a three-dimensional parallel Euler solver, was chosen for this case study [13]. The code solves the three-dimensional Euler equations numerically, using an implicit finite-volume scheme with local time stepping on stationary grids for steady state conditions. A divide-and-conquer method is used to parallelize this program. The grid is partitioned physically into 32 pieces and boundary information is communicated at each iteration by exchanging messages with neighboring nodes.
An original implementation of this application was traced using the SuperMSPARC. The first step in optimizing the application was to add user probes to indicate the particular iteration that each process was currently working on (numbers in parentheses in Fig. 5). With this information on the execution profile display, it is clear that process number 1 (first line on Node 2) is the bottleneck in the entire execution of the application. The display shows that this process has no idle time and that its neighboring process is constantly waiting on boundary information from it to begin its next iteration. This delay can be seen to ripple through all the other processes, causing a large amount of idle time in the entire application. Experimentation with placement of this process led us to conclude that this effect was due neither to operating system interference nor to interference from the main coordinating process in the application. Further investigation of the problem led us to suspect numerical instabilities in the application in the area of the grid assigned to the first process. It was discovered that this area of the grid contained flow directions that led to arithmetic underflow in the calculation of fluxes. To better control this, the grid for the problem was changed slightly to alter the direction of flow relative to the grid cells in this area and so reduce the areas where fluxes close to zero occurred. After this change, we observed a much more balanced distribution of the idle times. At this point we also discovered, by looking at the execution traces with the user probes, that the application occasionally became idle when it seemed that its boundary information should already be present. This was traced back to a synchronization problem at the very beginning of the simulation that caused one process to begin its computations before it had the appropriate neighboring information. This problem would likely have gone undetected without the capability to display the user probes along with the execution trace.
Traces were again collected for the application after the grid change and synchronization error were corrected. We again observed several places where iterations seemed to take much longer than average (as indicated by longer busy bars at certain points). Once this happened, again the application exhibited large amounts of idle time at neighboring nodes as this delay propagated across neighboring processes. It was again determined that the arithmetic underflow conditions were the cause. We used a function of the Sun Fortran compiler to flush these underflows to zero and immediately saw a much more balanced computation with little idle time. Unfortunately, the answers in this case were not within an acceptable range of accuracy. As a result, we moved from single to double precision and brought the answers back into agreement with the original simulation. The net gain in performance through optimization so far is to decrease the execution time of this application from 16 minutes to 4 minutes.
Traces from the current version of the application now indicate an increased percentage of time due to exchanging of messages (Fig. 6). The next step in the optimization process will be to investigate, using the near-critical path analysis tools, the effects of decreasing the communications overhead in the application. We are currently installing the gigabit Myrinet network in the SuperMSPARC and expect to see an improvement in this application as a result.
Three key features of SuperMSPARC made this application tuning process feasible and successful. First, the analysis of this application depends on the availability of a globally synchronized clock to be able to accurately determine the relationship between events on different processors.
Second, the ability to define user probes and easily incorporate them into the application proved invaluable. The display of the iteration number led directly to the discovery of the synchronization error. Without this user information, the behavior would have been interpreted as processes waiting for the correct neighboring values. Finally, the low intrusiveness of the hybrid probes made the insertion of these user probes feasible without either significantly affecting the total runtime of the application or perturbing the relationship between events on different processes in the application.
VII. CONCLUSION
This work provides a prototype of a hybrid performance monitoring system that leverages standards promoting technology for the acquisition, transport, storage and visualization of trace sec data processing and storage system, SPIscope, was designed and implemented using primarily off-the-shelf components and standard buses. A flexible data dictionary approach for event data processing was developed making it possible to systematically map coded event data to portable trace formats. And finally, a specific translation to SDDF trace files used by the Pablo graphical interface was implemented.
Project goals include increased user awareness, promotion of performance monitoring instrumentation standards, and contribution of components with utility throughout the parallel processing community. The high bandwidth and large storage capacity of the SPIscope possess immediate significance in terms of refining current limits on practical performance data resolution.
The system also provides a foundation for further research in the areas of on-line observation, dynamic rate control, and the natural integration of debugging facilities with performance monitoring tools.
The phase 1 design of the PAB, SPIcontrol, and SPInet hardware is complete, and the system became operational in the Fall of 1994. Base hardware costs for the SPIscope (excluding disks) are less than $20,000, and we are open to a variety of distribution methods. The second phase of the design will consist of SPIcontrol modifications to incorporate support for breakpoint detection and on-line transmission of selected performance monitoring data.
In keeping with the theme of standardized components for this system, we are beginning the design of probes that will be meaningful in the newly emerging standard message passing interface (MPI) [14]. We believe that MPI is likely to become accepted as the de facto message passing standard and support for this interface will provide an important level of standards-based support needed in this system. The groundwork laid in the development of a flexible probe dictionary for the OOF system can easily be used to advantage in implementing meaningful MPI probes. The message passing displays already implemented for the OOF programming environment will be applicable to the MPI environment as well. One of the major contributions of MPI is the use of communicator groups to ensure that messages from parallel libraries do not interfere with user-level messages. We plan to develop new displays that allow users to visualize these different communication spaces and communication access groups more easily.
Other ideas under consideration include replacing the SPInet with a more standard network and performance monitoring technology for distributed systems. We are particularly interested in the utility of lower-cost hybrid monitoring systems that use existing LANs and file servers for performance data collection.
Readers may use a graphical browser to obtain additional information using the Web at http://www.erc.msstate.edu/ca/html/supermsparc.html. Email may be directed to harden@erc.msstate.edu.
