We are exploring the development and application of information visualization techniques for the analysis of new massively parallel supercomputer architectures. Modern supercomputers typically comprise very large clusters of commodity SMPs interconnected by possibly dense and often non-standard networks. The scale, complexity, and inherent non-locality of the structure and dynamics of this hardware, and the operating systems and applications distributed over them, challenge traditional analysis methods. As part of the á la carte (A Los Alamos Computer Architecture Toolkit for Extreme-Scale Architecture Simulation) team at Los Alamos National Laboratory, who are simulating these new architectures, we are exploring advanced visualization techniques and creating tools to enhance analysis of these simulations with intuitive three-dimensional representations and interfaces. This work complements existing and emerging algorithmic analysis tools. In this paper, we give background on the problem domain, a description of a prototypical computer architecture of interest (on the order of 10,000 processors connected by a quaternary fat-tree communications network), and a presentation of three classes of visualizations that clearly display the switching fabric and the flow of information in the interconnecting network.
Introduction
The magnitude of the scientific computations targeted by the U.S. Department of Energy Accelerated Strategic Computing Initiative (ASCI) project requires unprecedented computational power. To facilitate these computations ASCI plans to deploy massive computing platforms, possibly consisting of tens of thousands of processors, intended to achieve one petaOP by 2008.
Better hardware design and lower development and deployment costs require performance evaluation, analysis, and modeling of parallel applications and architectures, and in particular, the ability to predict computational capability and capacity. Performance studies are routinely used to select the best architecture or platform for a given application, to select the best algorithm for solving a particular problem, and to study scalability with respect to problem and platform size. Evaluating and analyzing the performance is challenging primarily because of the large number of components making up such systems, and the complex interactions that occur between them.
The tools of the trade in performance modeling and analysis are typically categorized as algorithmic/analytical analysis, statistical analysis, analysis with queuing theory, and simulation. Depending on the problem, one or more of these methods will be more appropriate than others. Although significant results have been obtained in recent work for an important class of applications of interest to ASCI, 1 analytical modeling of systems and applications of this scale is not always possible. Queuing models generally lead to very complex nonlinear equations whose solutions are intractable. For systems of ASCI-proposed size and complexity, simulation remains the predictive tool of choice, although simulation may be considerably augmented by analytical and statistical analysis.
Three related targets for our simulation effort have been identified: simulation of ASCI-scale parallel systems using a realistic ASCI workload, simulation of ASCI-scale storage systems and I/O, and simulation of the high-performance ASCI wide-area network. All these aspects of ASCI system design are tractable by the approach we propose. However, given the scale of the effort required, a staged approach has been taken, initially concentrating primarily on the parallel systems and the particular types of applications that run on them.
In conjunction with the other methodologies, the simulation environment could be used for exploration of hardware/architecture design space, exploration of algorithm/implementation space both at the application level (e.g., data distribution and communication) and the system level (e.g., scheduling, routing, and load balancing), determining how application performance will scale with the number of processors or other components, and analysis of the tradeoffs between performance and cost.
Because of the sheer volume of relevant data generated by a simulation run, visualization is an important, potentially primary method of practical data abstraction and comprehension. [2] [3] [4] [5] The á la carte project at Los Alamos National Laboratory (LANL) 6 seeks to develop a simulation-based analysis tool for evaluating massively-parallel computing platforms including current and future ASCI-scale systems. 7, 8 The basic approach relies on an iterative development process for creating models of appropriate fidelities and integrating them into a portable and efficient parallel discrete-event simulation that is scalable to thousands of (simulated) computational nodes (Figure 1). Components may be processors, switches, network interfaces, or application workloads, for example. Studies of hardware architectures are made by running the simulation on a particular aggregate system composed of these components. The output of the simulation captures the behavior and performance of the components.
Target architecture and simulation environment
The computing platform simulated in this work is a cluster architecture closely modeling LANL's existing Q machine. 9 The basic unit of the cluster is four computational nodes connected with a network interface card (NIC) to a network switch. As the number of basic units increases, the interconnection network expands accordingly. Figure 2 illustrates the layout of such a network with 64 computational nodes and three layers of 16 switches each. Each network switch has eight duplex I/O ports; the ports may be linked to computational nodes or to other switches. The network is organized into layers of switches that connect only to the layers above and below: for each eight-port switch there are four upward connections and four downward connections, configured into a quaternary fat-tree network.
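The sizing of such a network can be sketched in a few lines. The following is a minimal illustration, assuming a full quaternary fat tree in which every layer holds one switch per four computational nodes, as in the 64-node example; the function name `fat_tree_size` is hypothetical and not part of the á la carte tools.

```python
def fat_tree_size(num_nodes: int, arity: int = 4) -> dict:
    """Layer and switch counts for a full fat tree with `num_nodes` leaves.

    Assumes every layer holds num_nodes / arity switches, as in the
    64-node, three-layer example described in the text.
    """
    layers = 0
    capacity = 1
    while capacity < num_nodes:
        capacity *= arity
        layers += 1
    if capacity != num_nodes or layers == 0:
        raise ValueError("num_nodes must be a positive power of the arity")
    switches_per_layer = num_nodes // arity
    return {
        "layers": layers,
        "switches_per_layer": switches_per_layer,
        "total_switches": layers * switches_per_layer,
    }


if __name__ == "__main__":
    print(fat_tree_size(64))    # the 64-node example network
    print(fat_tree_size(4096))  # a six-layer network
```

Under this assumption the 64-node network has three layers of 16 switches, and a 4096-node network has six layers of 1024 switches.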
Figure 1
Goals and applications of the á la carte project: The boxes represent the project goals and the 'clouds' represent applications for the simulation and analysis tool.
A low-fidelity network implementation represents the network as a circuit-switched fat-tree network. The switches, with four up ports and four down ports, are modeled at the packet level with a simple protocol that allocates a circuit through the network for each message.
Each message consists of one or more packets. Using a single circuit for an entire message may be adequate for many types of simulation studies in which the specific details of network traffic congestion are not important and rapid simulation is advantageous.
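As a rough illustration of what a per-message circuit involves, the sketch below counts the switches on a circuit between two nodes. It assumes minimal up/down routing, in which a message climbs only to the lowest layer whose subtree contains both endpoints and then descends; the text does not specify the routing used by the low-fidelity model, so this rule and the function names are assumptions.

```python
def turnaround_layer(src: int, dst: int, arity: int = 4) -> int:
    """Lowest switch layer (1-based) whose subtree contains both nodes.

    Assumption: nodes are numbered so that each group of `arity`
    consecutive nodes shares a first-layer switch.
    """
    if src == dst:
        raise ValueError("source and destination must differ")
    layer = 0
    while src != dst:
        src //= arity  # move both endpoints one layer up the tree
        dst //= arity
        layer += 1
    return layer


def circuit_switch_count(src: int, dst: int, arity: int = 4) -> int:
    """Switches on the circuit: up to the turnaround layer, then down."""
    return 2 * turnaround_layer(src, dst, arity) - 1


if __name__ == "__main__":
    print(circuit_switch_count(0, 3))   # same first-layer switch
    print(circuit_switch_count(0, 63))  # opposite ends of a 64-node network
```

Two nodes under the same first-layer switch use a single switch, while nodes at opposite ends of the 64-node network need a circuit through five.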
A medium-fidelity network model builds on the low-fidelity model by enhancing its accuracy and level of detail. This model is much closer to representing actual hardware and mimicking the behavior of network protocols in use on real systems. The temporal resolution of this model is sufficient for even the most detailed network studies. The primary requirement is the ability to model accurately the movement of packets in Quadrics 10 local area networks consisting of Elan network interface cards connected to Elite crossbar switches in a fat-tree network at nearly flit (16-bit unit) resolution. There is also a requirement to track accurately the movement of messages across the PCI bus between main memory and the NIC, account for their packetization, and clock the transfer of data across the network. Because contention may exist in the network, different parts of the packet may move at different speeds through the switches (i.e., buffering and delays may occur anywhere in the network). The medium-fidelity design relies on the tracking of the head and tail of the packet throughout its history, along with various flit-level tokens specified in the Elan protocol. In addition to tracking the movement of the head and the tail of packets through the NICs and switches, the existence of two virtual channels sharing bandwidth at switches is accounted for by simple division of bandwidth (ignoring the possibility of more complex schemes for resolving contention, such as age-based prioritization).
DaSSF, 8 developed at Dartmouth College, is the current choice of simulation environment. DaSSF is a parallel discrete event simulator that uses conservative simulation protocols to synchronize execution on multiple processors. DaSSF was designed to be a high-performance and highly scalable simulator; it is the only Scalable Simulation Framework API 11 implementation that can fully utilize clusters of shared-memory processors. It achieves high performance in part by using a custom threading mechanism that uses memory and performs context switching very efficiently. The initial á la carte implementation comprises low- and medium-fidelity models of a network and low-fidelity and direct-execution models of workload. This implementation supports studies of simulation performance and scaling, and also the properties of the simulated systems themselves. Since this initial system does not include a visualization component to aid in the analysis of the complex time-varying interactions of the logical components, we are studying visualization methods independent of these tools. Because the network connecting the processors in the simulated machine is large and complex, visualization efforts have focused on representations of spatiotemporal graphs representing this network. Note that this research is not focusing on parallel program visualization. The following section describes the approach to visualize the results of these machine simulations, the Flatland visualization tool used for the three-dimensional (3D) interactive environment, and the details of the models used to represent the machine switch fabric.
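For readers unfamiliar with discrete-event simulation, the core idea can be sketched as a time-ordered event queue. The following is a sequential toy written for illustration only; it does not use or reproduce the DaSSF or Scalable Simulation Framework API, and it omits the conservative synchronization that DaSSF uses to run on multiple processors. The class name `EventQueue` is hypothetical.

```python
import heapq
from itertools import count


class EventQueue:
    """Minimal sequential discrete-event loop (illustrative, not DaSSF)."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # tie-breaker keeps equal-time events in FIFO order
        self.now = 0.0

    def schedule(self, delay, action):
        """Schedule `action(sim)` to run `delay` time units from now."""
        if delay < 0:
            raise ValueError("events cannot be scheduled in the past")
        heapq.heappush(self._heap, (self.now + delay, next(self._tie), action))

    def run(self, until=float("inf")):
        """Process events in timestamp order up to time `until`."""
        while self._heap and self._heap[0][0] <= until:
            self.now, _, action = heapq.heappop(self._heap)
            action(self)


if __name__ == "__main__":
    sim = EventQueue()
    sim.schedule(2.0, lambda s: print("packet tail leaves switch at", s.now))
    sim.schedule(1.0, lambda s: print("packet head enters switch at", s.now))
    sim.run()
```

Events are processed strictly in timestamp order regardless of the order in which they were scheduled, which is the property a parallel simulator must preserve across processors.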
Visualization approach
The visualization research presented in this paper focuses on viewing the structure of the interconnection network, the flow of message traffic within the network, and displaying the performance of the simulated system. [12] [13] [14] Visualization will also aid in debugging the simulation itself, developing and evaluating the efficiency of load balancing on the simulated machine, and understanding synchronization between simulation timelines. [15] [16] [17] [18] Visualizing the simulated system allows end users to understand how varying workload or network interconnection architecture affects the overall performance of a hypothetical or novel architecture. Therefore, as a requirement, the visualization system must display the switch topology, the communication patterns in the network, levels of network usage, and the presence of bottlenecks. The approach to the creation of useful visualizations involves regular refinement of requirements and consultation with the potential system users, with an emphasis on innovative abstractions of the architecture and dynamics of the system.
Traditional tools for parallel program visualization may be divided into three categories: processor workload, communication, and synchronization. 19 Because of the concentration on the interconnection network our visualizations fall into the communication category. Typical abstractions such as inter-task communication time, quantity of data transferred, the communication queue system, and others 20 have been implemented in systems like ParaGraph, 21 VAMPIR, 22 and XPVM. 23, 24 These are all general tools for parallel program and network visualization, but, as observed by Stasko, 25 some situations (such as debugging) require application-specific views. We have, therefore, concentrated on building an architecture-specific visualization of our target machine.
All visualization development in this project utilizes Flatland, a tool developed at the University of New Mexico. Flatland is an immersive visualization development framework developed as part of the Homunculus Project. [26] [27] [28] It is used to facilitate rapid prototyping and research in scientific and information visualization, immersive environments and interfaces, and human factors engineering. The Flatland infrastructure supports the creation and management of an internal graph-based data structure containing OpenGL geometry and transformations, lighting, shadows, stereoscopic rendering, and spatialized sound. In addition, Flatland provides dynamically loaded application modules without mutual interference; management of novel input and output devices; navigability of the generated virtual spaces; and basic spatial reference objects such as a landscape, stars, or a sun. Complex virtual spaces are dynamically built at runtime from a user-specified configuration file. This tool is currently being used in a wide variety of application domains, including network intrusion detection, computational intelligence and robotics, facilities management, program visualization, distance health education, and integrated circuit analysis (see webpage 26 for examples).
One major tenet of this approach to visualization is that immersive 3D environments can offer unique advantages over two-dimensional graphics, or nonimmersive 3D graphics. By placing the user of these tools within the same context as the objects being viewed and allowing the user to navigate around and between them, choosing their own point of view and using motion parallax to comprehend the 3D relationships between objects, and allowing the objects to cast shadows and to have behavior and emit sounds, a qualitative improvement in comprehension of the data can be achieved. We acknowledge that this is not a widely accepted point of view, but the successful use of 3D, dynamics 29 and metaphorical worlds 30 for encoding abstract information is being reported in the literature. Initial empirical studies of our own have provided support for this hypothesis. 31, 32 In addition, it has been observed that a single representation often does not work for all purposes, even when the capabilities of zooming and selection/deselection of the types of information displayed are available.
Certainly not all types of information of potential interest could be displayed at once in any comprehensible visualization, particularly in forms appropriate or preferred for a particular purpose. This is simply because the number of network (graph) behaviors and activities of interest can be very large. Examples of characteristics of network visualization include:
Size of network: as visual density increases, the mapping to the logical network topology (hierarchical layers of switches) and corresponding data flow to representational components inevitably becomes hard to comprehend.
Types of data reduction: for example, instantaneous activity, rolling averages, or total number of events at each location.
Centricity: whether the desired view is switch-centric (emphasizing activity, e.g., congestion, at a switch), connection-centric (e.g., for identifying contention for connections), or message-centric (e.g., for identifying problems with the distribution of an application over compute nodes, or poor matches between network topology and communication patterns).
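The three data reductions named above can be sketched for a simple event trace. This is a minimal illustration that assumes the simulation output has been binned into integer time steps and reduced to `(step, switch_id)` pairs; that trace format and the function name `reduce_activity` are assumptions made for the example, not the project's actual data format.

```python
from collections import Counter, deque


def reduce_activity(events, num_steps, window=3):
    """Reduce (step, switch_id) events to three per-switch summaries.

    Returns (instantaneous, rolling, totals):
      instantaneous[t] counts events at step t,
      rolling[t] averages counts over the last `window` steps
                 (fewer at the start of the trace),
      totals counts all events per switch.
    """
    instantaneous = [Counter() for _ in range(num_steps)]
    for step, switch in events:
        instantaneous[step][switch] += 1

    totals = Counter()
    recent = deque(maxlen=window)  # sliding window of per-step counters
    rolling = []
    for counts in instantaneous:
        totals.update(counts)
        recent.append(counts)
        switches = set().union(*recent)
        rolling.append(
            {s: sum(c[s] for c in recent) / len(recent) for s in switches}
        )
    return instantaneous, rolling, totals


if __name__ == "__main__":
    trace = [(0, "s1"), (0, "s1"), (1, "s2"), (2, "s1")]
    inst, roll, tot = reduce_activity(trace, num_steps=3, window=2)
    print(inst[0], roll[1], tot)
```

Each summary answers a different question: the instantaneous counts drive an animation frame, the rolling average smooths short bursts, and the totals show where traffic concentrated over the whole run.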
The next section describes the details of three classes of cluster interconnection network visualization representations developed by the á la carte project team.
Visualization results
This section presents three general classes of visual representations of the quaternary fat-tree switch network and simulation results: (1) adjacency matrix-based methods, (2) planar compact methods, and (3) message-focused methods. For matrix-based methods, the nodes and NICs are aligned along the diagonal of a form of the interconnection adjacency matrix, and the switches are coded in a structured set of stacked layers filling out the interconnection pattern. The planar compact representations efficiently pack the switches onto a two-dimensional (2D) surface without explicit connection information. The message-focused methods explicitly represent only the network components along the path of a single message. Each class of representation displays the switches and simulation results (flow of messages) in complementary ways.
Matrix-based representation
Graphs may be represented explicitly using, for example, nodes as boxes and connections as lines (Figure 2). Unfortunately, as the number of nodes and connections grows the display becomes congested and elements of the display may become obscured, so the utility of this view diminishes with the number of elements represented. To address this we have developed a more abstract representational model of the switch network structure, motivated by the 2D structure of the adjacency matrix, in which switch layers are grouped together into aggregate visual objects. Figure 3 shows an example of the adjacency matrix for the 64-node network shown in Figure 2. All the computational nodes and interconnection switches are represented along each side of the matrix, outputs vertically and inputs horizontally. The upper-left 64 × 64 square represents the computational nodes, the two 64 × 16 (and 16 × 64) rectangles adjacent to that represent the space of the first-layer switches. Likewise the other four 'checkered' spaces represent the second- and third-layer switches. Note that the matrix is sparse and symmetric. The gray areas are regions of the matrix that have non-zero elements, indicating a physical connection. The black areas represent switches or groups of switches. The figure also illustrates a message being sent from node i to the nearby neighbor node j, which is under the same switch in the first layer. Figures 4A and B show views of a modification of this representation for the same 64-node system shown in Figures 2 and 3. Modifying the block structure shown in Figure 3 using the third dimension, the network is laid out on layers of slabs with 64 equally spaced pillars along the top diagonal representing the computational nodes and their corresponding NICs. This is the result of scaling the corresponding blocks of non-zero connection elements and stacking them according to layer.
Although each layer contains a total of 16 switches, their groupings systematically shift as a function of depth. The 16 top layer (red) slabs on which nodes stand code their 16 first-level switches, four nodes per switch (compare with Figure 1). There are 1/4 as many (green) slabs on the second layer as in the layer above but each is four times thicker to represent the containment of four switches each. Finally, for this example, the bottom or base layer contains one slab (blue) that contains a total of 16 layers (switches), corresponding to the top row of rectangles in Figure 2. The individual switches within a slab are coded as sub-layers, as can be seen in Figure 4B.
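The block structure of the adjacency matrix can be reproduced with a short sketch. The wiring between switch layers is not spelled out in the text, so the code below assumes the standard k-ary n-tree rule: a switch in one layer connects to the switches in the next layer whose base-four labels differ from its own only in the digit associated with that layer. The function name and index ordering (nodes first, then switch layers from first to last) are likewise assumptions.

```python
def fat_tree_adjacency(layers: int = 3, arity: int = 4):
    """Adjacency matrix (list of lists) for a full fat tree.

    Index order: computational nodes, then switch layers 1..layers.
    Inter-layer wiring follows the k-ary n-tree rule (an assumption).
    """
    nodes = arity ** layers
    per_layer = arity ** (layers - 1)
    size = nodes + layers * per_layer
    adj = [[0] * size for _ in range(size)]

    def switch_index(layer, label):
        # layer is 0-based here; layer 0 is the first (node-facing) layer
        return nodes + layer * per_layer + label

    def link(a, b):
        adj[a][b] = adj[b][a] = 1  # duplex link, so the matrix is symmetric

    # Each group of `arity` consecutive nodes shares one first-layer switch.
    for node in range(nodes):
        link(node, switch_index(0, node // arity))

    # Up-links: vary one base-`arity` digit of the switch label per layer.
    for layer in range(layers - 1):
        stride = arity ** layer
        for label in range(per_layer):
            digit = (label // stride) % arity
            base = label - digit * stride
            for d in range(arity):
                link(switch_index(layer, label),
                     switch_index(layer + 1, base + d * stride))
    return adj


if __name__ == "__main__":
    adj = fat_tree_adjacency()
    size = len(adj)
    nonzero = sum(map(sum, adj))
    print(size, nonzero, f"{nonzero / size**2:.1%} of entries are non-zero")
```

For the 64-node case this yields a 112 × 112 matrix in which only a few percent of the entries are non-zero, consistent with the observation that the matrix is sparse and symmetric. Under the assumed wiring, the second-layer switches fall into four groups of four and the third layer into a single group of 16, matching the slab grouping described above.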
When a processor sends a message to another processor through the switching fabric, a line or 'pipe' is shown leaving the processor and growing across the switching fabric until it comes to the point on the implied connectivity matrix in which the two processors' connection would normally be indicated in a standard adjacency matrix (Figure 5A). At that point it makes a 90° turn and continues to grow until it reaches the destination node. In other modes, the slabs or the sublayers of the switch slabs change color intensity as they are involved in more and more communication traffic, allowing the viewer to recognize the relative level of utilization of each switch or switch group.
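The geometry of such a pipe can be sketched in matrix coordinates. The following is an illustrative sketch only: it assumes that node k sits at the diagonal cell (k, k), that rows index the sending node and columns the receiving node, and that the pipe runs along the sender's row before turning; the function names are hypothetical.

```python
def pipe_path(src: int, dst: int):
    """Corner points (row, col) of the pipe for a message src -> dst.

    The pipe leaves the sender on the diagonal, runs along the sender's
    row to the matrix element (src, dst), turns 90 degrees, and runs
    along the receiver's column to the receiver on the diagonal.
    """
    if src == dst:
        raise ValueError("source and destination must differ")
    return [(src, src), (src, dst), (dst, dst)]


def pipe_length(src: int, dst: int) -> int:
    """Total drawn length of the pipe in matrix cells."""
    return 2 * abs(src - dst)


if __name__ == "__main__":
    print(pipe_path(5, 9))
    print(pipe_length(5, 9))
```

Because the drawn length depends only on the difference between the node indices, two node pairs with the same index separation produce pipes of the same length.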
The layered representation provides a display more compact than the direct or pure adjacency matrix representations, and has been observed in cases to scale to medium-sized configurations of 128 nodes or fewer. As the number of nodes approaches a thousand the layered representation becomes overly cluttered, indicating scaling problems. It does have the advantage of being able to show the activity in very large systems (e.g., 4096 computational nodes) if the pipes showing individual messages are suppressed. However, the connectivity matrix scales in size and complexity as $n_L$, the number of nodes for an $L$-layer network, which naturally limits its use at large sizes. With a 4096-processor simulation, individual processors, NICs, and even the first level of switches are smaller than a single pixel when the entire system is viewed at once, even when using a high-resolution (1600 × 1200) display. This is only a partial limitation since the two dominant user modes are macroscopic, attempting to understand the aggregate dynamics of the system that will be mostly differentiated at higher level switches, or microscopic, focusing on following individual messages through the system. Nevertheless, in this representation, full microscopic detail and system-wide macroscopic context cannot be apprehended simultaneously.
One other limitation of this method is that of implied distance or weight. Two computational nodes in the center of the representation are topologically the same distance from each other as two nodes at opposite ends of the representation. The visual distance between nodes implies a topological distance that may not be accurate, especially when compared to other node pairs, so potentially may be misleading.
Planar compact representations
To address the scaling issues of adjacency matrix methods, a general framework that expresses the layout of computational nodes and network switches in a more efficient 2D or 3D layout of a fat tree was developed. For this analysis, it is sufficient to consider only switches in the layout, as each quadruplet of nodes connects to only a single switch; the lowest-level switch in any layout can be replaced by that switch and its four connected nodes. Likewise, the eight duplex ports on each switch will not be considered because each switch can be expanded into its eight in ports and/or its eight out ports in a layout.
Because some of these layouts have a fractal-like nature, a brief diversion into the generating functions and the computation of the fractional dimensionality for these representations will be given. The current discussion will only consider 2D layouts whose aspect ratio does not vary with $L$, the number of layers in the switch network. Let $w_L$ be the width of the layout in pixels. The box-counting dimension $d$ of the layout is defined from $s_L \approx w_L^d$ as $L \to \infty$. 33 To compute this, the following limit is calculated:

$$d = \lim_{L \to \infty} \frac{\log s_L}{\log w_L}$$
As a general technique to represent the plethora of possible fat-tree layouts, consider a pair of generating functions $A_n$ and $B_n$ which map the integers 0, 1, 2, 3 to pixel coordinates; here $n$ is a non-negative integer specifying the scale of the mapping. It is very important that the ranges of the functions $A_n$ and $B_n$ are disjoint; otherwise, switches on different layers will overlap on the same pixel. Since the fat tree is quaternary, it is useful to represent switch IDs in base four, representing a number $x$ as $x = \sum_{n=1}^{L-1} 4^{n-1} x_n$, where the $x_n$ are its base-four digits. The pixel or cell coordinates of switch $x$ in layer $\ell$ of an $L$-layer switch network are calculated as follows:
The complete layout is generated as the union of all the switch pixel coordinates:
Note that although we have considered 2D layouts here, the formalism extends to the 3D case.
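The base-four decomposition of switch IDs used in this formalism can be sketched as follows. This is a minimal illustration of the digit expansion $x = \sum_{n} 4^{n-1} x_n$ only; it does not attempt to reproduce the generating functions $A_n$ and $B_n$, and the function name `base4_digits` is hypothetical.

```python
def base4_digits(x: int, num_digits: int) -> list:
    """Return [x_1, ..., x_num_digits], least significant digit first,
    so that x == sum(4**(n - 1) * x_n for n = 1..num_digits)."""
    if not 0 <= x < 4 ** num_digits:
        raise ValueError("x does not fit in the requested number of digits")
    digits = []
    for _ in range(num_digits):
        x, digit = divmod(x, 4)
        digits.append(digit)
    return digits


if __name__ == "__main__":
    # A switch ID in a four-layer network has L - 1 = 3 base-four digits.
    digits = base4_digits(27, 3)
    print(digits)
    print(sum(4 ** n * d for n, d in enumerate(digits)))  # reconstructs 27
```

Each digit can then be fed to a scale-dependent placement function, one digit per level of the recursive layout.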
Compact, self-similar, and fractal representations
As mentioned earlier, representations that essentially scale linearly with $n_L$ will fail in at least the most obvious way of not providing full detail in full context. While management of the level of detail would increase the limit of usability somewhat, without improving the compactness of the layout it would not allow the viewing of each individual node, NIC, and switch simultaneously at the targeted scales. The direct rectangular layout scales by $\sqrt{n_L}$ as can be seen in Figure 2. Thus 4096 processors, for example, only require an area on the order of 64 × 64 units to display. This seems very promising but by distributing the processors in a 64 × 64 array, the majority of the processors are in the middle of the array, and similarly, the switching layers laid out in 32 × 32 arrays tend to occlude each other. From this simple analysis it appears that laying out the NICs and switch layers in two dimensions is compact and scales well, but leads immediately to problems with occlusion. Motivated by the somewhat self-similar nature of the fat tree (Figure 2), two different versions of this class of compact representation will be presented, one inspired by a simple pair of fractal generators and another inspired by a generalization of the plane-filling recursive bronchi, 35 Lindenmayer Systems, 36 and the H-array radar antennae configuration, as shown in Figure 8. The similarities between these two representations led us to consider a more general representation of all layouts in two and three dimensions of fat trees.
Fractal representations of large hierarchies were presented by Koike. 34 The aim of that effort, however, was to fix the number of displayed nodes while traversing the tree, using the fractal self-similarity to present a uniform representation at any tree level. In this project, we are interested in using fractals and fractal-like representations to compress the representation into a small area, allowing either detailed, close-up views, or a comprehensive view of large structures that still reveals all the low-level nodes in a comprehensible fashion.
Fractal representation
An example of a fractal-based representation can be created by defining:
The function a(k) places the lower-layer switches on the sides of a square, while the function b(k) places the higher-layer ones on the corners of the same square. The coefficient $3^{n-1}$ in Eq. (4) ensures that subsequent squares are appropriately scaled to a larger size. Figure 6 shows the fractal for the first several values of $L$; the self-similarity between networks of different sizes is apparent. Figure 7 shows example first-to-second-level network connections for three values of $L$.
For this representation $w_1 = 1$, $w_L = 3 w_{L-1}$, and the solution to the recursion relation is $w_L = 3^{L-1}$, resulting in a box-counting dimension $d = \log 4/\log 3 \approx 1.26$.
This layout has the advantage that it scales well: the 6144 switches of a six-layer fat tree are represented in a 243 × 243 pixel (cell) area, for example. It has a disadvantage that the switches for different layers are interleaved, making it difficult to visually separate the activity in different network layers. On the other hand, animated versions of these layouts have been used to distinguish successfully the distribution of messages in two 4096-computational-node applications with different communication patterns.
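The width recursion and the box-counting dimension for the fractal layout can be checked numerically. The sketch below assumes that $s_L$ counts the switches in an $L$-layer network, that is $L \cdot 4^{L-1}$, which agrees with the 6144 switches quoted for six layers; the function names are hypothetical.

```python
import math


def fractal_width(layers: int) -> int:
    """Layout width from the recursion w_1 = 1, w_L = 3 * w_(L-1)."""
    width = 1
    for _ in range(layers - 1):
        width *= 3
    return width


def switch_count(layers: int) -> int:
    """Switches in an L-layer quaternary fat tree (assumed meaning of s_L)."""
    return layers * 4 ** (layers - 1)


def dimension_estimate(layers: int) -> float:
    """Finite-L estimate log(s_L) / log(w_L); requires layers >= 2."""
    return math.log(switch_count(layers)) / math.log(fractal_width(layers))


if __name__ == "__main__":
    print(fractal_width(6), switch_count(6))
    for layers in (6, 60, 600, 6000):
        print(layers, round(dimension_estimate(layers), 4))
    print("limit:", round(math.log(4) / math.log(3), 4))
```

The estimate approaches $\log 4/\log 3$ only slowly, because the factor $L$ in the switch count contributes a $\log L$ term that vanishes in the limit but is still noticeable at practical network sizes.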
'Fat H' representation
The H-tree is a form of a standard quad-tree in which a parent node is located in the center of the horizontal cross segment of the H, while its four child nodes are located at the tips of the vertical segments of the H. 35, 36 A multi-layer H-tree is formed by recursively scaling and translating H structures to the locations of the children nodes (Figure 8A).
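The recursive construction of a plain H-tree can be sketched as follows. The offsets and the halving of the H at each level are illustrative choices made for this example, not the dimensions used in the figures, and the function name `h_tree` is hypothetical.

```python
def h_tree(levels: int, center=(0.0, 0.0), half: float = 1.0):
    """Return (level, x, y) for every node of a quaternary H-tree.

    Level 1 is the root at `center`. Each parent sits at the middle of
    the cross bar of an H, with its four children at the tips of the
    vertical segments; the H is halved at each level (an assumption).
    """
    nodes = []

    def place(level, x, y, h):
        nodes.append((level, x, y))
        if level == levels:
            return
        for dx in (-h, h):        # ends of the horizontal cross bar
            for dy in (-h, h):    # tips of the vertical segments
                place(level + 1, x + dx, y + dy, h / 2)

    place(1, center[0], center[1], half)
    return nodes


if __name__ == "__main__":
    tree = h_tree(4)
    print(len(tree), "nodes")
    print([n for n in tree if n[0] == 2])
```

With these choices the leaves of a four-level tree fall on a regular 8 × 8 grid and no two nodes share a position.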
The fat H-tree representation (Figures 8B and 9) is an extension of the H-tree in which the single parent node is replaced by a set of nodes that are members of the same layer in the fat tree. In the graphics, these are arranged in a rectangular grid around the place in which the original H-tree parent would have resided.
We can address the problem of switch layers being interleaved in the fractal representation by using the fat H-tree representation, defining:
Thus the more central groups of switches are in higher layers. Figure 10 illustrates this. For this representation $w_1 = 1$, $w_L = 2 w_{L-1} + 2^{L-1}$, and the solution to the recursion relation is $w_L = L \cdot 2^{L-1}$ (for example, $w_2 = 4$ and $w_3 = 12$). This results in a box-counting dimension $d = 2$. Hence, this representation 'efficiently' fills 2D space with no obscuration, but is not technically fractal. This can be seen in Figure 10 in that the highest, central layer of switches occupies a smaller region of the diagram relative to the layers below as $L$ increases, and the layout is not self-similar. The connection pattern between switches offers a more straightforward arrangement in the fat H-tree representation (Figure 11) than for the more interleaved fractal representation (Figure 7).
Animating message flow in the fat H-tree
There are multiple ways to display the time-varying properties of the simulation in this representation. Figure 12 illustrates one of them, the switch utilization histogram, for two system sizes. The height and color of the rectangular pillar above each switch dynamically codes the utilization level. Since each switch has eight ports, four up tree and four down tree, the switch utilization level is proportional to the ratio of the number of activated ports to the total number of ports. The color changes in a continuous ramp from green to red according to the utilization level.
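The utilization coding described above can be sketched in a few lines. The text specifies only a continuous ramp from green to red; the linear interpolation in RGB used here is an assumption, as are the function names.

```python
def switch_utilization(active_ports: int, total_ports: int = 8) -> float:
    """Fraction of a switch's ports that are currently active."""
    if not 0 <= active_ports <= total_ports:
        raise ValueError("active_ports must lie between 0 and total_ports")
    return active_ports / total_ports


def utilization_color(utilization: float) -> tuple:
    """(r, g, b) on a linear ramp from green (0.0) to red (1.0)."""
    u = min(max(utilization, 0.0), 1.0)  # clamp to the valid range
    return (u, 1.0 - u, 0.0)


if __name__ == "__main__":
    for ports in (0, 4, 8):
        u = switch_utilization(ports)
        print(ports, u, utilization_color(u))
```

The same utilization value can drive both the pillar height and its color, so the two visual channels reinforce each other.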
For the fat H-tree visualization, two methods for displaying the flow of packets between switches were created. The first, shown in Figure 13A , is similar to the matrix-based method; connection pathways are shown as pipes in the plane of the representation. In this diagram, the outgoing data leave the switch in the horizontal direction, turn a right angle and enter the receiving switch in the vertical direction. Yellow connections represent the constructed path and the green connections indicate data packets being sent from one switch to the other. As in the matrix-based method, this coding of flow tends to suffer from pipe obscuration when paths overlap. This is remedied in the second method, again using the third dimension for connecting arcs representing the pathways ( Figure 13B ). The yellow arcs represent the constructed paths and the green ones represent information packets being sent through this path. The height of the arc is proportional to the distance that the message travels. Notice that this representation ameliorates the problem of perceived vs topological distance: connections between any two given levels are all approximately the same length, thus eliminating any perceived weighting between connections that is not reflected in the hardware configuration.
Preliminary message-focused representations
Based on the use of the previous two representations, a complementary representation of the simulation data was required that abstracted away much of the network topology. This abstraction allows a user to concentrate on characteristics of a single message or packet (such as collisions, wait times, etc.) without being overwhelmed by the static representations of the switch topology that otherwise fill much of the visual scene. The preliminary representation that addresses this requirement is called the Parallel Rail. The Parallel Rail representation does not encode the network topology. Rather, all packets flow from a source to destination rail along their own path, originating on the output rail (left horizontal cylinder in Figure 14A), moving up to the corresponding NIC (the first knob above the rail), out to all appropriate switches, to the NIC corresponding to the input node, then finally to the input rail.
Figure 8
(A) A four-level quaternary tree in the form of an H-tree. The root is the large square in the middle of the figure. (B) A fat quaternary tree in the form of a modified H-tree. The root node has been expanded to contain the 16 top-level elements, and the other layers have been similarly expanded.
Figure 9
A three-level fat H-tree, where the central 4 × 4 grid of squares represents the switches in level 3 (corresponding to the blue slab in Figure 4), the four 2 × 2 grids near the corners represent the level 2 switches (corresponding to the green slabs in Figure 4), and the single squares represent the first-level switches (corresponding to the red slabs in Figure 4). Note that the nodes and NICs are not explicitly shown in this class of representation.
The switch locations are shown positioned between the rails but are not explicitly drawn, thus reducing visual complexity. The switch layer is shown vertically, with higher-level switches toward the top. This arrangement emphasizes switch layers and de-emphasizes switch placement, although switch placement is also encoded in the horizontal position of the knob. All packets are gathered again at the input NIC level and then drop to the input rail. Switch level and switch placement, therefore, make up two of the three dimensions of this representation.
Time is the third spatial dimension (increasing down the rails, away from the viewer in Figure 14A). The animation can be run either forward or backward through time at any desired rate, with packets appearing at one end of the rails and disappearing as they reach the other end. The amount of time represented by the rails can be controlled. Figure 14A shows a single node sequentially sending messages to all of the processors, creating a scan; Figure 14B shows an expanded view of one message; and Figure 14C shows a compressed view of a large block of program execution time.
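The three axes of the Parallel Rail (switch placement, switch level, and time) can be summarized in a short sketch. The `Hop` record, the function `rail_position`, and the spacing parameters are hypothetical names introduced here for illustration; the actual data layout used by the tool is not described in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    """One step of a packet's route (hypothetical record layout)."""
    placement: int  # horizontal index of the NIC or switch within its layer
    level: int      # assumed encoding: 0 = NIC, 1 and up = switch level
    time: float     # simulation time at which the packet reaches this hop

def rail_position(hop, window_start, window_length, rail_length=10.0,
                  column_spacing=1.0, level_spacing=1.0):
    """Map a hop to (x, y, z), or None if it lies outside the time window.

    x encodes switch placement, y encodes switch level, and z encodes
    time along the rails. Changing window_length stretches or compresses
    the amount of time that the rails represent.
    """
    offset = hop.time - window_start
    if offset < 0 or offset > window_length:
        return None
    x = hop.placement * column_spacing
    y = hop.level * level_spacing
    z = rail_length * offset / window_length
    return (x, y, z)
```

In this sketch, doubling `window_length` halves the depth coordinate of every visible hop, which corresponds to moving from the expanded view of a single message toward the compressed view of a long stretch of execution time.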
Packet information can be seen in Figure 14B, where the beige pipe on the left represents the packet head, the next (blue) pipe the packet tail, the next (cyan) pipe the packet OK, and the final (purple) pipe the end-of-packet. In the views where time is more compressed, these details are less clear but still present, while higher-level patterns emerge. This representation, and others like it, are under development and show promise for revealing a great deal about the structure and behavior of the flow of data through the switch network, for example the variation in the paths taken by the packets of a single message.
Interface and control of the visualizations
The á la carte prototype visualization system currently provides the ability to import data from a simulation and display its dynamics both forward and backward in time, with control over the speed at which events are animated. It also allows the user to change various modes of the representations. In the matrix-based representation, the most useful options are the ability to turn the visibility of the NICs on and off; the choice to represent connections by a simple line following the surface, by a hollow pipe following the surface, or by a pipe in the shape of an arc, or to make connections invisible; and the ability to switch on and off the representation of switch utilization on the edges of the slabs. In the fat H-tree representation, the same time controls are available, the switch utilization heights and coloring may be switched on and off, and the connections may be invisible, straight lines on the surface, or arcs. In the Parallel Rail representation, the packet/message-level detail can be varied and the displayed time window can be resized. Interaction with a given application is handled in two ways: through a set of pulldown menus for each application, with the various options displayed in sub-menus, and through keyboard bindings.
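The options listed above can be collected into a small settings record, sketched below. The class and field names (`ViewOptions`, `Connection`, `playback_rate`, and so on) are hypothetical and do not reflect the tool's actual interface; only the set of connection modes offered for each representation is taken from the text.

```python
from dataclasses import dataclass
from enum import Enum

class Connection(Enum):
    INVISIBLE = "invisible"
    LINE = "line"  # simple line following the surface
    PIPE = "pipe"  # hollow pipe following the surface
    ARC = "arc"    # pipe in the shape of an arc

# Connection modes the text lists for each representation.
ALLOWED_CONNECTIONS = {
    "matrix": {Connection.INVISIBLE, Connection.LINE,
               Connection.PIPE, Connection.ARC},
    "fat_h_tree": {Connection.INVISIBLE, Connection.LINE, Connection.ARC},
}

@dataclass
class ViewOptions:
    representation: str
    connection: Connection = Connection.ARC
    show_utilization: bool = True
    show_nics: bool = True      # only meaningful for the matrix view
    playback_rate: float = 1.0  # assumed: negative values run time backward

    def __post_init__(self):
        # Reject a connection mode that the representation does not offer.
        allowed = ALLOWED_CONNECTIONS[self.representation]
        if self.connection not in allowed:
            raise ValueError(
                f"{self.connection.name} is not offered for {self.representation}")
```

A menu entry or key binding would then only need to replace one field of such a record and request a redraw.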
Discussion and conclusions
The visualization tool under development for this project consists of a set of graphical representations of the topology and simulated traffic in a Quadrics switch network for monitoring and analysis of production ASCI computer systems. By allowing an analyst to observe the abstract interactions and events within the simulation (or the real machine), a better understanding of the relationship between various message-passing events and states in the system was achieved.
In the interest of enhanced intuitive coupling between humans and data, consistent metaphorical mappings from the problem domain (supercomputer architecture and traffic patterns) to the representational domain (i.e., Flatland) were explored. In each case, the representation invoked a much larger set of relationships between the individual representational elements. In immersive 3D worlds, geometric abstractions are no longer sufficient to describe and explain the objects in the space. For example, layers are not merely rectangular boxes; they are layers or slabs with substance as well as volume. They sit atop each other, each resting on the next, not merely juxtaposed. The connections in all three representations are not made merely by connecting lines, but with pipes that can therefore carry something. While these metaphorical environments are relatively simple, they do allow, or require, higher orders of metaphorical understanding than in 2D and non-immersive 3D environments. One of the points of this approach is to try to use the metaphorical associations to increase cognition and intuitive coupling while reducing the need for decoding representations. By using real-world metaphors, we hope to improve the ease of use and usefulness of information visualization tools.
Given this, it is instructive to compare the advantages and disadvantages of some of the fat-tree representations we have created so far. Although no formal study has yet been performed, the representations developed here have been useful in helping both casual observers and researchers become intimate with the topology and the architecture of the machine under study, and to achieve a better intuitive understanding of its structure. Animating the output from the simulations has in several instances been useful for debugging the simulator itself. The simulations visualized thus far have been examples of uniform message traffic between processors, and of a distribution that approximates that typical of the kinds of calculations done on volumetric grids, where each cell in a grid communicates only with its nearest neighbors. The characteristic differences between these two distributions were made clearly visible in the matrix-based and compact, self-similar representations. The most notable differences were the localization of message traffic in the latter example and the attendant reduction in utilization of higher switch levels.
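The reduced use of higher switch levels under nearest-neighbor traffic can be reproduced with a toy calculation. The sketch below assumes an idealized quaternary tree in which a message climbs only as far as the lowest level whose subtree contains both endpoints, and it assumes that grid cells are assigned to nodes in row-major order. Neither the mapping nor the grid size comes from the simulations described here, and all function names are illustrative.

```python
from collections import Counter
from itertools import product

def top_level(src, dst, arity=4):
    """Lowest switch level whose subtree contains both src and dst."""
    level = 0
    while src != dst:
        src //= arity
        dst //= arity
        level += 1
    return level

def uniform_pairs(n_nodes):
    """Every ordered pair of distinct nodes (uniform traffic)."""
    return [(s, d) for s in range(n_nodes) for d in range(n_nodes) if s != d]

def grid_neighbor_pairs(side):
    """Ordered nearest-neighbor pairs on a side^3 grid, row-major node ids."""
    def node(x, y, z):
        return x + side * (y + side * z)
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1))
    pairs = []
    for x, y, z in product(range(side), repeat=3):
        for dx, dy, dz in steps:
            nx, ny, nz = x + dx, y + dy, z + dz
            if 0 <= nx < side and 0 <= ny < side and 0 <= nz < side:
                pairs.append((node(x, y, z), node(nx, ny, nz)))
    return pairs

def level_histogram(pairs):
    """Count messages by the highest switch level they must reach."""
    return Counter(top_level(s, d) for s, d in pairs)

uniform = level_histogram(uniform_pairs(64))
neighbor = level_histogram(grid_neighbor_pairs(4))
```

For 64 nodes (three switch levels), about three quarters of the uniform pairs must reach level 3, compared with one third of the nearest-neighbor pairs under this particular mapping. A different assignment of grid cells to nodes would change the exact proportions.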
Each new representation that was developed was designed to solve one or more problems with the previous representation. The direct representation (Figure 2) was literally a first step, a place to start, in which each node and switch was a square and each communication between nodes and switches was a line. This produced excessive visual clutter, and there was no accommodation for the display of finer details of the network topology. The matrix-based representation was an attempt to represent the hierarchical fat quad-tree topology in a variant of an adjacency matrix. This layout succeeded to some extent, although it did not scale well and the connections, confined approximately to a plane, intersected badly. The self-similar representations (fractal based and fat H-tree) took greater advantage of the topology of the network and used 2D space more efficiently, while forcing a more symmetrical (and, therefore, more accurate) representation of the topological space. The Parallel Rail representation greatly de-emphasized the topological relationships in order to focus the user's attention on packet structure, timing, and routing. It also included time as a major axis, forcing attention to this critical component of network performance. This focus and axis assignment made it easier to see the packet component structure and timing, while still showing a lightweight encoding of the network topology. The Parallel Rail representation also encoded more details of the communications, showing packet components and allowing visualization of the impact of factors such as switch contention on individual packets of the message.
All representations suffered from visual clutter, occlusion, and intersection of connections to varying degrees. Connections were represented as simple lines and cylinders hugging the surface of the switch structure, and as cylindrical arcs whose heights are proportional to their length. As can be seen in Figure 13B, the use of arcs of different heights does help to show more intuitively where the flow of data occurs in the switching fabric and, particularly in the matrix-based and H-tree cases, how expensive one communication is relative to another. Specifically, communication between lower levels yields tiny arcs while communication between higher levels yields larger arcs. Since the ultimate goal of the interconnection network is to facilitate communication between the computational nodes at the lowest level of the switching fabric, communication at the higher levels in the fabric implies communication down through all levels; thus the larger arcs at the higher levels qualitatively match the larger amounts of latency and contention in the whole system. Likewise, the Parallel Rail representation showed longer communication paths (through higher-level switches) as taller paths, since switch layer was encoded on the vertical axis. This once again encodes the higher cost of using higher-level switches, while abstracting away distracting horizontal topological detail.
All the representations inherit from the Flatland environment the ability to observe the representation from any viewpoint. Many issues with obscuration and level are solved by the ability to change quickly and easily the viewpoint and angle of view. Motion parallax produced by a smooth movement of the viewpoint causes otherwise obscured details to become visible, and also emphasizes the 3D relationships between components in the representation.
As the investigation, refinement, and use of these visualizations continues, more subtle features of these data sets are becoming apparent. The process of designing, building, and debugging the medium-fidelity simulation model has relied on this tool for providing the developers with clear, intuitive views into the dynamics of this complex switch network.
