Visualizing Parallel Computer System Performance by Malony, Allen D. & Reed, Daniel A.
-- UILU-ENG-88-1771 Report No. UIUCDCS-R-88-1465
TAPESTRY
Technical Report No. TTR88-8
Principal Investigators Roy Campbell
and Daniel Reed
Visualizing Parallel Computer System Performance
by
Allen D. Malony and Daniel A. Reed
September 31, 1988
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN • URBANA, ILLINOIS 61801-2987
https://ntrs.nasa.gov/search.jsp?R=19980007720 2020-06-16T00:23:42+00:00Z

Visualizing Parallel Computer System
Performance
Allen D. Malony * Daniel A. Reed †
Department of Computer Science
Center for Supercomputing Research and Development
University of Illinois
Urbana, Illinois 61801
Abstract
Parallel computer systems are among the most complex of man's creations,
making satisfactory performance characterisation difficult. Despite
this complexity, there are strong, indeed, almost irresistible, incentives
to quantify parallel system performance using a single metric. The
fallacy lies in succumbing to such temptations. A complete performance
characterisation requires not only an analysis of the system's constituent
levels, it also requires both static and dynamic characterisations. Static
or average behavior analysis may mask transients that dramatically alter
system performance.
Although the human visual system is remarkably adept at interpreting
and identifying anomalies in false color data, the importance of dynamic,
visual scientific data presentation has only recently been recognised.
Large, complex parallel systems pose equally vexing performance
interpretation problems. Data from hardware and software performance
monitors must be presented in ways that emphasise important events
while eliding irrelevant details. Design approaches and tools for
performance visualisation are the subject of this paper.
* Supported in part by the National Science Foundation under grants NSF MIP-8410110 and
NSF DCR 84-06916, by the Department of Energy under grant DOE DE-FG02-85ER25001,
and by the Air Force Office of Scientific Research under grant AFOSR-F49620-86-C-0136
(URI).
† Supported in part by the National Science Foundation under grants NSF ANTI
TAPESTRY 1-S-300S5, NSF DCR 84-17948, and NSF CCR86-57696, by the National
Aeronautics and Space Administration under NASA Contract Number NAG-1-613, and by the Air
Force Office of Scientific Research under grant AFOSR-F49620-86-C-0136 (URI).
The purpose of computing is insight, not numbers.
Richard Hamming
1 Introduction
The appearance of any new computer system raises many questions about its
performance, both in absolute terms and in comparison to other machines of
its class; parallel computer systems are no exception. Unfortunately, parallel
computer systems are among the most complex of man's creations, making
satisfactory performance characterisation difficult. Despite this complexity, there
are strong, indeed, almost irresistible, incentives to quantify parallel system
performance using a single metric. The fallacy lies in succumbing to such
temptations. Just as it now is widely recognised that human intelligence is not
subsumed by the spatial and verbal abilities measured by standard intelligence
tests, complete characterisation of parallel computer system performance en-
compasses more than operations executed per second.
Peak performance ratings in MIPS (millions of instructions per second) or
MFLOPS (millions of floating point operations per second) obscure the
importance of interacting performance levels and dynamic equilibrium. Repeated
studies have shown that a system's performance is maximized when the
components are balanced (i.e., there is no single system bottleneck) [5]. As an
example, optimizing the performance of message passing systems [17] requires
a judicious combination of node computation speed, message transmission la-
tency, and operating system software. High speed processors connected by high
latency communication links restrict the classes of algorithms that can be effi-
ciently supported.
A complete performance characterisation requires not only an analysis of
the system's constituent levels, it also requires both static and dynamic
characterisations. Static or average behavior analysis may mask transients that
dramatically alter system performance. By analogy, biological researchers have
long recognised the importance of both in vitro and in vivo measurements.
Laboratory measurements of isolated cells or biological molecules often differ from
similar measurements in natural environments.
The history of virtual memory research offers a classic example of transient
behavior and its importance. The slow drift model [4] predicted that program
reference locality changed slowly. Later, more detailed measurements showed
that reference localities change swiftly and catastrophically. Most page faults
and associated overhead occur in small time intervals, and a phase-transition
model more accurately reflects observed behavior.
Performance measurements of high-speed computing systems can quickly
generate vast quantities of numerical data. Indeed, recognition of the
importance of virtual memory phase transitions was hampered by the volume of data
generated during simulation and measurement; post-measurement data
compression yielded page fault rates, a static performance measure. However, phases
and transitions can be seen only by examining significant portions of the
reference trace; this is best done via dynamic graphic displays.
[Figure 1: Performance Levels (Hardware Design, System Software Design,
Algorithm Design, Application Design)]
Although the human visual system is remarkably adept at interpreting and
identifying anomalies in false color data, the importance of visual scientific data
presentation has only recently been recognised [7]. Large, complex parallel
systems pose equally vexing performance interpretation problems. Data from
hardware and software performance monitors must be presented in ways that
emphasise important events while eliding irrelevant details.
In collaboration with the Center for Supercomputing Research and
Development at the University of Illinois, we are developing a suite of performance
visualisation tools. These tools and our design approach are the subject of this
paper. In §2 we examine the importance of performance levels and formalise
the empirical performance evaluation process. In §3 we discuss HyperView, a
prototype that dynamically displays performance data obtained from hardware
measurement and simulation of message passing systems. Techniques for
visualising application performance are the subject of §4; linear programming
[6] [19] is used as a test problem. Finally, §5 summarises our experience and
development plans.
2 Experimental Performance Analysis
As Figure 1 illustrates, there are four levels in the hierarchy of performance
measurements. The answer to the oft-asked question, "How fast is it?"
depends on the intended use of the performance data. At the lowest level lies the
performance of the hardware design. Determining this performance provides
both a design validation and directives for system software design. Only by
understanding the strengths and weaknesses of the hardware can system
software designers develop an implementation and user interface that maximizes
the fraction of the raw hardware performance available to the end user. As
an example, consider a hypothetical hypercube operating system that provides
dynamic task migration to balance workloads. To meet these goals, it must be
possible to rapidly transmit small status messages. It is fruitless to design such
a system if the underlying hardware provides only high-latency message
transmission. Given some characterisation of the balance between processing power
and interprocessor communication resulting from the system software, users can
develop algorithms that are best suited to the parallel system. Finally, the best
mix of key algorithms will maximize the performance of user applications.
[Figure 2: Performance Analysis Phases (Specification; Instrumentation and
Data Collection; Data Reduction and Presentation)]
Regardless of the system level, performance characterisation requires
specification of the desired measurements, instrumentation and data collection
mechanisms, and data reduction and display; see Figure 2.¹ Although it is clear that
a parallel computer system is a gestalt whose performance is inextricably tied to
the performance of its constituent hardware and software levels, it is less clear
that performance instrumentation and data collection techniques for one level,
or even one system, are rarely applicable to other systems or other levels. As an
example, Table 1 shows a subset of the important performance measurements
for three levels -- hardware, system software, and algorithm -- and three systems
-- the Cray X/MP, the University of Illinois Cedar system [13], and the Intel
iPSC hypercube [16].
¹ By analogy with news reporting, "What do I want?" "How do I get it?" and "How can I
see it?"
Level      Cray X/MP             Illinois Cedar       Intel iPSC
Hardware   vector startup        network contention   processor speed
           memory conflicts      vector/cache         communication latency
                                 interaction          and bandwidth
Software   compiler              compiler             OS support
Algorithm  shared memory access  vectorization        communication pattern
Table 1: Performance Level Comparison
The diversity of underlying technology and system architecture makes it
impossible to develop a single set of performance instrumentation techniques.
Memory bank conflicts on the Cray X/MP have no analog on the distributed
memory Intel iPSC. Moreover, the event time scales differ by six orders of
magnitude. Similarly, the shared memory access patterns of Cedar application
algorithms may cause interconnection network conflicts, but these patterns are
not predictors of performance degradation due to network contention. Although
it is impossible to develop a single performance instrumentation mechanism
applicable to all levels, mechanisms for specification of noteworthy performance
events and their presentation are largely system independent.²
At all performance levels there exists a minimal set of required events (e.g.,
counts and times). Capture of these events should be enabled by signals to a
hardware monitor, operating system calls, or flags to a compiler preprocessor.
In addition to standard events, certain others must be enabled selectively, either
to minimise the performance perturbations of instrumentation or to reduce the
data volume to tractable levels. Ideally, a standard user interface should permit
event specification regardless of the event type or the performance level.
Despite the diverse instrumentation events of differing levels and systems,
the performance measures can be presented using a small number of display
types (e.g., bar and strip charts, three-dimensional plots, and state transition
diagrams). These graphical displays are the subject of the remainder of this
paper.
3 HyperView: A Hypercube Visualization Tool
In collaboration with the Center for Supercomputing Research and
Development, we have designed and implemented HyperView, a prototype performance
visualisation tool for distributed memory parallel processors configured as
hypercubes. HyperView dynamically displays architectural and system activity
via a multiplicity of system views. Detailed performance measurements also are
provided via standard statistical displays.
² The events vary but the specification and display mechanism need not.
HyperView was inspired by Seecube [3], a hypercube visualisation system
built for the SunView³ window environment. Although many of the HyperView
displays were borrowed from Seecube, the implementation is based on the X
window environment [18] and the user interface libraries provided by the Faust
parallel programming environment being developed at the University of
Illinois Center for Supercomputing Research and Development [10] [11] [12]. The
portability provided by X permits use of HyperView in a variety of workstation
environments. Because X supports a client-server paradigm, the data
analysis and display portions of HyperView are decoupled, potentially executing on
different systems. This decoupling not only makes the visualisation portions
independent of message passing hardware and system software, it also is
crucial if real-time performance display and dynamic system reconfiguration are to
be supported. Thus, HyperView contains three cooperating modules -- data
capture, state analysis, and visualisation.
3.1 Data Capture
The HyperView visualisation component accepts event traces generated by the
processors of a message passing system. Because the data capture is decoupled
from visualisation, the event trace can be generated via simulation, permitting
study of new message passing architectures, or from program execution. At
present, the HyperView visualisation is driven by data obtained from
simulation of communication hardware for different message passing paradigms [9],
including store-and-forward message switching, circuit switching, staged circuit
switching, and wormhole circuit switching. Our experience has shown that
visual comparison of system dynamics quickly reveals differences in
communication paradigms.
When an event is detected by the performance instrumentation, an event
identifier, a timestamp, and any additional event data are written to a trace
buffer. For our message passing simulations, we instrumented the simulator to
record the following information about message events at each hypercube node.
The following events suffice to display message passing activity for fixed-path
and adaptive variants of both circuit and store-and-forward message switching.
i <time> <nodes> <size>                  Initial message
m <time> <msg id> <from> <to> <size>     Create message
q <time> <msg id> <at>                   Enqueue message
Q <time> <msg id> <at>                   Dequeue message
e <time> <msg id> <at>                   Circuit establishment
v <time> <msg id> <from> <to> <link>     Visit node via link
t <time> <msg id>                        Begin message transmission
T <time> <msg id>                        End message transmission
V <time> <msg id> <from> <to> <link>     Delete link between nodes
E <time> <msg id> <from> <to>            Circuit termination
M <time> <msg id>                        Message delivery
³ SunView is a trademark of Sun Microsystems.
For both circuit and packet switching, messages may require several
transmissions and may cross multiple communication links to reach their final
destination. Hence, the events recorded by a single hypercube node are insufficient
to reconstruct the history of a message. Thus, the from and to arguments in
the message creation event represent the point of message origination and the
final destination. Because we are studying routing paradigms that can choose
one of many paths to the destination node, link traversal information must be
saved to reconstruct the routing path.
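As a minimal sketch of how such a trace might be read, the following Python fragment parses a few of the event records listed above and rebuilds a routing path from link-visit events. The whitespace-separated text layout, the integer field values, the field names (`src`, `dst`, and so on), and the function names are assumptions made for illustration; the report does not specify the on-disk trace format.

```python
from dataclasses import dataclass

# Field names per event code, following the event list in the text.
# The text layout and these names are assumptions made for this sketch.
EVENT_FIELDS = {
    "m": ("msg_id", "src", "dst", "size"),  # create message
    "q": ("msg_id", "at"),                  # enqueue message
    "Q": ("msg_id", "at"),                  # dequeue message
    "v": ("msg_id", "src", "dst", "link"),  # visit node via link
    "t": ("msg_id",),                       # begin message transmission
    "T": ("msg_id",),                       # end message transmission
    "M": ("msg_id",),                       # message delivery
}


@dataclass
class Event:
    code: str
    time: int
    fields: dict


def parse_event(line):
    """Parse one trace record: an event code, a timestamp, then event data."""
    parts = line.split()
    code, time = parts[0], int(parts[1])
    names = EVENT_FIELDS[code]
    values = [int(p) for p in parts[2:]]
    if len(values) != len(names):
        raise ValueError(f"event {code!r} expects {len(names)} fields: {line!r}")
    return Event(code, time, dict(zip(names, values)))


def routing_path(events, msg_id):
    """Rebuild the node sequence a message followed from its link-visit events."""
    hops = sorted(
        (e for e in events if e.code == "v" and e.fields["msg_id"] == msg_id),
        key=lambda e: e.time,
    )
    if not hops:
        return []
    return [hops[0].fields["src"]] + [e.fields["dst"] for e in hops]
```

For example, a message created at node 0 for node 3 that visits node 1 on the way yields the path `[0, 1, 3]` once its two link-visit records have been parsed.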
Although the instrumentation events just described suffice to display
communication traffic and queueing delays, other events are needed to display
system software and application behavior. Thus, we are developing software and
hardware instrumentation for an Intel iPSC/2 hypercube that will permit near
real-time data capture of user, system and hardware events, including support
for local event buffering, global timestamp synchronisation, and trace
processing; see §5 for additional details.
3.2 State Analysis
In a distributed memory parallel system such as a hypercube, each node must
record events based only on local knowledge; the absence of global memory
precludes data sharing with the granularity necessary to dynamically maintain
a consistent, global state. Moreover, the nodes of many distributed memory
systems are individually clocked, the clocks often are not synchronised, and the
clocks may tick at different rates. Thus, the event trace at best defines a partial
time order, and the timestamps may be logically inconsistent with the logical
order of events.
To recover global state during trace analysis, the trace timestamps must be
reconciled and enough event data must be saved to correlate distributed events.
The analysis requires interpreting each event in sequence and incrementally
modifying the current system state; for complete details see [3].
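One simple way to picture timestamp reconciliation is sketched below. It is not the procedure of [3]; it is an illustrative simplification that assumes each node's clock differs from the others only by a constant offset (differing tick rates are ignored) and that every message takes at least `min_latency` ticks. The function name and the tuple layout are hypothetical.

```python
def reconcile_offsets(messages, n_nodes, min_latency=1):
    """Find per-node clock offsets so that every receive follows its send.

    messages: (src, send_time, dst, recv_time) tuples in local clock ticks.
    Assumes constant clock offsets; raises if no such offsets exist.
    """
    offset = [0] * n_nodes
    # Repeatedly push receivers' clocks forward until all pairs are ordered.
    # With consistent constraints this settles within n_nodes passes.
    for _ in range(n_nodes):
        changed = False
        for src, sent, dst, received in messages:
            earliest = sent + offset[src] + min_latency
            if received + offset[dst] < earliest:
                offset[dst] = earliest - received
                changed = True
        if not changed:
            return offset
    raise ValueError("timestamps cannot be reconciled with constant offsets")
```

Adding the returned offset to every timestamp recorded by a node yields a trace in which no message is received before it was sent, which is the minimum needed to replay events in a logically consistent order.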
3.3 Performance Visualization
The HyperView user interface permits simultaneous display of the dynamic
system state via a variety of differing views. Each view emphasises certain system
aspects (e.g., the network topology, the multiplicity of partially overlapping
paths from a source to a destination node, or queues of waiting messages). Each
view provides a different insight; collectively they convey system dynamics.
Although Hamming's dictum applies, numbers are often necessary and
important. In addition to graphical displays, HyperView provides statistical
displays at both macroscopic levels (e.g., number of messages transmitted) and
microscopic levels (e.g., link utilisation). Finally, HyperView permits selective
display of message traffic and statistics, permitting the performance analyst to
isolate anomalous behavior for further study.
[Figure 3: HyperView Top-Level Display (menus: Trace, System, Node, Link)]
Because HyperView is a dynamic performance visualisation system, much is
lost in description of static, monochrome images. Despite these limitations, we
discuss HyperView as a performance analyst might encounter it, beginning with
the top-level user interface shown in Figure 3. User menus are shown at the top
of the screen. Pulling down the Trace menu lists the Description, Execution
Control, and Statistics items shown in Figure 4.
3.3.1 Trace Description
In the Trace Description window, a performance analyst can select, by clicking
the mouse on the Trace File item, a trace file that contains the event
information captured during system execution. A dialogue box (not shown) will pop
up requesting the user to enter the trace file name. After reading the trace
file, HyperView begins the state analysis needed to recover the time varying
global state of the message passing system. During state analysis, HyperView
computes the number of events, messages and bytes transmitted. Throughout
the visualisation session, these statistics can be viewed by selecting the Trace
Description menu.
3.3.2 Execution Control
As the name suggests, the Execution Control window controls updates to the
graphical and statistical displays. During state analysis, HyperView identifies
a series of globally consistent display points. During updates of the trace
display, HyperView moves between these display points. The current point can
be marked by a position in time and/or location in the event trace. Thus,
the performance analyst can select a current display time either by clicking the
mouse on Current Frame Time and entering a time, or by clicking the mouse
somewhere within the Frame Time slider bar. Event trace positions are selected
similarly. When a new time or event is selected, HyperView moves to the next
consistent system state and its corresponding display. Because the event trace
is processed a posteriori, the performance analyst can move both forward and
backward in time.⁴
[Figure 4: HyperView Trace Window]
⁴ §5 discusses both the advantages and disadvantages of time independent browsing.
[Figure 5: HyperView System Menu]
A Frame in HyperView corresponds to a displayed system state. The user
can change three aspects of frame display -- mode, rate, and state differential.
Frames can be displayed either in single-step mode or continuously. If the mouse
is clicked on the SINGLE STEP button, the user must explicitly request display
of the next frame. Conversely, CONTINUOUS mode automatically advances to
the next frame specified by the frame rate and differential controls.
Via the Frame Rate control, the performance analyst can adjust the interval
between display of new frames. The third aspect of frame control is the change
in system state, in events or time, between successive, displayed frames. This
state difference is the minimum of the specified number of Events per Frame and
the number of Clock Ticks per Frame. By adjusting the display mode, frame
rate, and state differential, the performance analyst can study gross behavior,
examining a small subset of all states, or examine the trace event by event.
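The state differential rule can be sketched as follows. This is one plausible reading, not HyperView's actual code: a frame ends as soon as either the Events per Frame budget or the Clock Ticks per Frame budget is exhausted, with the tick budget measured from the first unprocessed event. The function and parameter names are hypothetical.

```python
def next_frame(times, pos, events_per_frame, ticks_per_frame):
    """Return the trace position at which the next frame is displayed.

    times: sorted event timestamps; pos: index of the first unprocessed event.
    The frame covers events until either budget runs out, whichever is first.
    """
    if pos >= len(times):
        return pos  # end of trace, nothing left to display
    limit_time = times[pos] + ticks_per_frame
    end = pos
    while (end < len(times)
           and end - pos < events_per_frame
           and times[end] < limit_time):
        end += 1
    return max(end, pos + 1)  # always advance by at least one event
```

With timestamps `[0, 1, 2, 10, 11, 12, 13]`, a budget of 5 events and 4 ticks stops after three events (the tick budget binds), while a budget of 2 events and 100 ticks stops after two (the event budget binds). Setting both budgets to 1 steps through the trace event by event.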
3.3.3 Global Statistics
The Statistics window shows the global system state, both cumulative message
statistics and current node and link activity. Because the performance analyst
can browse the trace, cumulative statistics are not monotonic -- they reflect
performance data relative to the current trace state.
3.3.4 System Displays
Figure 5 shows the menu of dynamic system views provided by HyperView.
Figures 6 and 7 show the CUBE, FFT, PASCAL, and QUEUE views. Each
display gives a different view of the hypercube that shows current system activity
as highlighted nodes and links. Each view emphasises certain system aspects
(e.g., the network topology, the multiplicity of partially overlapping paths from
a source to a destination node, or queues of waiting messages). For example,
the CUBE view is the "natural" multi-dimensional representation of a
hypercube. In contrast, the FFT view emphasises message routing paths. The GRAY
CODE view, not shown, emphasises subcube communication -- communication
links connecting the two (D - 1)-dimensional subcubes of a D-dimensional
hypercube appear as parallel lines [3]. The PASCAL view reflects the logarithmic
combining (e.g., global minimisation) when logical trees are embedded in the
hypercube topology [19]. The QUEUE view in Figure 7 shows the instantaneous
state of the message queues at each node. Each message awaiting transmission
is shown as a small box. Communication transients appear as bursts of
enqueued messages. Similarly, the effects of differing communication paradigms
(e.g., store-and-forward message switching and circuit switching) appear as
differences in mean queue size.
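The quantities behind a queue display of this kind can be derived directly from the enqueue (`q`) and dequeue (`Q`) events of §3.1. The sketch below is illustrative only; the tuple layout and function names are assumptions, and the time-weighted mean is one reasonable definition of "mean queue size", not necessarily the one HyperView reports.

```python
from collections import defaultdict


def queue_lengths(events):
    """Track each node's queue length over time.

    events: iterable of (code, time, msg_id, node) with code 'q' for enqueue
    and 'Q' for dequeue. Returns {node: [(time, length), ...]}.
    """
    length = defaultdict(int)
    history = defaultdict(list)
    for code, time, _msg_id, node in sorted(events, key=lambda e: e[1]):
        if code == "q":
            length[node] += 1
        elif code == "Q":
            length[node] -= 1
        else:
            continue  # other event codes do not change queue length
        history[node].append((time, length[node]))
    return dict(history)


def mean_queue_size(samples, end_time):
    """Time-weighted mean of one node's queue length from its first event to end_time."""
    if not samples or end_time <= samples[0][0]:
        return 0.0
    area = 0
    for (t0, n), (t1, _) in zip(samples, samples[1:] + [(end_time, 0)]):
        area += n * (t1 - t0)
    return area / (end_time - samples[0][0])
```

A burst of enqueue events with no intervening dequeues shows up as a run of rising lengths in the history, which is the numerical counterpart of the bursts of boxes described above.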
In all views, colors emphasise activity -- links change color when messages
are sent, nodes flash when processing messages. Moreover, each system view
supports pull-down menus for inquiries about nodes, links, messages, and cir-
cuits. In each topological view (i.e., CUBE, FFT, and PASCAL), unwanted
detail can be elided via the Node and Link menus. For example, display of
any combination of transmitting, active, or receiving nodes and links can be
disabled. Figure 8 shows the Link menu; the Node menu is similar. All, Active,
and Transmitting select the displayed link states.
3.3.5 Message and Circuit Tracking
In addition to elision of unwanted node and link details, HyperView supports
message tracking and circuit tracking.⁵ After identifying source and
destination nodes, only those messages in transit between the specified nodes are
displayed. Figure 9 illustrates message and circuit tracking in a system with circuit
switched communication. In the figure, nodes 0 and 20 have been selected for
circuit tracking and message tracking, respectively. Node 0 is transmitting a
message to node 15 along the path shown. The intermediate nodes on the path
are not active because only circuit connections have been established there.
Concurrently, node 0 is sending a message to node 20, and node 20 is sending a
message to node 29.
Message and circuit tracking have proven invaluable when comparing com-
munication paradigms. By eliding extraneous detail, the dynamics of circuit
establishment in both fixed and adaptive routing paradigms can be easily
compared.
4 Application Performance Displays
Performance visualisation at both the hardware and system software levels
provides important insight for system design and analysis. And because system
performance is manifest in the application software executed during performance
analysis, system level performance visualisation indirectly provides application
performance insight. However, insight from visualisation of system performance
must be coupled with insight from application performance visualisation to
understand the interactions of different performance levels. To illustrate these
interactions and the importance of integrated visualisation tools, we use a
parallel implementation of the simplex linear optimisation algorithm [6] [19] as
an example. Like many parallel algorithms, the performance of the simplex
method varies greatly with input data, and these variations are intimately
related to both the algorithm and its interaction with the hardware and system
software.
⁵ The choice and semantics depend on the underlying hardware communication paradigm.
[Figure 6: CUBE and FFT Displays]
[Figure 7: PASCAL and QUEUE Displays]
[Figure 8: Link Menu]
4.1 Linear Optimization: An Example
Large, sparse, linear systems of equations arise frequently when constructing
mathematical models of natural phenomena. Most often, these linear systems
are fully constrained and can be solved via a variety of direct or iterative
techniques. However, one important problem class requires solutions to
underconstrained linear systems that maximise some objective function. These linear
optimisation problems often contain hundreds of equations with thousands of
variables. Mathematically, this can be stated as:
\[
\begin{aligned}
\text{Minimize:} \quad & c^T z \\
\text{Subject to:} \quad & Az = b \\
& b \geq 0 \\
& z \geq 0
\end{aligned}
\]
Here, $c^T$ is an n vector of variable coefficients that defines the objective function
(i.e., the function being minimized). For a maximization problem, the negative
of the objective function can be minimised. The objective function can thus be
viewed as a cost function, where the goal is to minimize total cost. The m × n
linear system Az = b defines the linear constraints on the objective function z.
Each of the m rows of the matrix A defines a constraint on the n variables of
the objective function.
[Figure 9: Message and Circuit Tracking]
The optimisation problem arises because the linear system Az = b is
underconstrained (i.e., m is smaller than n, and the matrix A contains many more
columns (variables) than rows (constraints)).⁶ Consequently, there are many
possible z vectors that satisfy the system Az = b. A fundamental theorem of
linear programming states that an optimal solution, if it exists, occurs when
n - m elements of z are zero (i.e., when there are precisely m non-zero elements
of z). This corresponds to the solution of an m × m linear system, the basis,
obtained by selecting m of the n columns of the matrix A.
⁶ See Figure 11 for an example.
[Figure 10: Data Placement for Simplex Row and Column Partitions:
(a) Column partitioning, (b) Row partitioning]
Clearly, exhaustive solution of the $\binom{n}{m}$ possible linear systems is not
feasible. The simplex method is a search algorithm that decreases the value of
the objective function at each iteration by selecting a non-zero element of z, a
so-called basic variable, and replacing the corresponding column of A with
another column. The simplex method provides a systematic way of moving from
one basic feasible solution (i.e., one satisfying the constraints) to another. This
systematic movement, called pivoting, must
• identify a new basis column that decreases the objective (cost) function
value,
• identify the column to remove from the basis that maximises the decrease
in the objective function value while still satisfying the constraints, and
• replace the old basis column with the new one.
These transformations are realised by standard techniques from numerical linear
algebra (i.e., Gauss-Jordan elimination).
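The three pivoting steps above can be sketched as a small, sequential, dense tableau simplex. This is a textbook formulation given for illustration, not the parallel implementation of [19]. It assumes a starting basis whose columns form an identity matrix with b non-negative, picks the entering column by most negative reduced cost, picks the leaving row by the usual minimum ratio test, and has no anti-cycling safeguard. The function name and argument layout are hypothetical; exact rational arithmetic is used so the result is easy to check.

```python
from fractions import Fraction


def simplex(A, b, c, basis):
    """Minimise c.z subject to A z = b, z >= 0, from a feasible starting basis.

    basis[i] is the column that is basic in row i. Those columns of A are
    assumed to form an identity matrix, and b is assumed non-negative.
    Returns (z, objective value).
    """
    m, n = len(A), len(c)
    # Tableau rows are [A | b]; `cost` holds reduced costs and -objective.
    T = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    cost = [Fraction(v) for v in c] + [Fraction(0)]
    basis = list(basis)
    for i, j in enumerate(basis):  # price out the starting basis
        cost = [cv - cost[j] * tv for cv, tv in zip(cost, T[i])]
    while True:
        # Step 1: entering column, the most negative reduced cost.
        col = min(range(n), key=lambda j: cost[j])
        if cost[col] >= 0:
            break  # no column can decrease the objective further
        # Step 2: leaving row, the minimum ratio test keeps z feasible.
        rows = [i for i in range(m) if T[i][col] > 0]
        if not rows:
            raise ValueError("objective is unbounded")
        row = min(rows, key=lambda i: T[i][-1] / T[i][col])
        # Step 3: Gauss-Jordan pivot on (row, col) swaps the basis column.
        pivot = T[row][col]
        T[row] = [v / pivot for v in T[row]]
        for i in range(m):
            if i != row and T[i][col] != 0:
                factor = T[i][col]
                T[i] = [v - factor * p for v, p in zip(T[i], T[row])]
        factor = cost[col]
        cost = [v - factor * p for v, p in zip(cost, T[row])]
        basis[row] = col
    z = [Fraction(0)] * n
    for i, j in enumerate(basis):
        z[j] = T[i][-1]
    return z, -cost[-1]
```

As a usage example, minimising -z0 - 2 z1 subject to z0 + z1 + z2 = 4 and z0 + 3 z1 + z3 = 6 (with z2 and z3 as the starting basis) takes two pivots and ends at z = (3, 1, 0, 0) with objective value -5.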
4.2 Parallel Simplex Variants
In message passing architectures, interprocessor communication is much more
expensive than local memory access. Hence, many algorithm implementation
details are constrained given the mapping of data to processors. The simplex
algorithm shares similar characteristics with solution of linear systems, matrix
multiplication, and other common matrix operations. Previous work on
distributed matrix algorithms has advocated row or column partitioning of
matrices [1] [8] [15]. We have considered similar schemes for distributing the matrix
of constraints across the nodes of a hypercube [19].
In the column partitioned method, shown in Figure 10, complete columns
are divided equally among the processors. To identify the column to enter the
basis, each hypercube node must first find the local minimum of the objective
values for those columns in its local memory, then cooperate with other nodes
to identify and distribute the identity of the column containing the minimum
objective value. Conversely, the single node containing the pivot column must
identify the column to leave the basis. Thus, partitioning the matrix by columns
creates both parallel and sequential computation phases.
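The entering-column search in the column partitioned method can be illustrated with a sequential simulation. Each simulated node finds its local minimum, then the nodes combine results in D exchange steps, one per hypercube dimension. This sketch uses a pairwise-exchange variant in which every node ends up holding the global minimum; an implementation could equally combine toward a root and broadcast the result. The function names and the contiguous block layout of columns are assumptions.

```python
def local_minimum(objective_row, first_column):
    """Smallest objective value among one node's columns, with its global column index."""
    value, offset = min((v, j) for j, v in enumerate(objective_row))
    return value, first_column + offset


def global_minimum(local_values, dimension):
    """Simulate logarithmic combining on a hypercube of 2**dimension nodes.

    local_values[node] is that node's (value, column) pair. Returns the pair
    held by each node after `dimension` exchange steps.
    """
    n_nodes = 1 << dimension
    if len(local_values) != n_nodes:
        raise ValueError("need one local value per node")
    held = list(local_values)
    for d in range(dimension):
        # Every node exchanges with its neighbour across dimension d
        # and keeps the smaller of the two pairs.
        held = [min(held[node], held[node ^ (1 << d)]) for node in range(n_nodes)]
    return held
```

The local searches are independent and can proceed in parallel, while the combining step costs D message exchanges regardless of how many columns each node holds. That contrast is the source of the parallel and communication phases summarised for the entering column.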
In the row partitioned strategy, complete rows of the matrix are divided
equally among the processors. As Table 2 shows, this approach also creates
both parallel and sequential computation phases.⁷ Despite the similarities
suggested by Table 2, the performance of simplex algorithms based on row and
column data partitions can be strikingly different. Why? Distributed linear
systems solvers process n × n matrices. The constraint matrices processed by
the simplex method contain many fewer rows than columns. Moreover, the ratio
of the number of rows to columns can vary dramatically. This variance, coupled
with the differences in matrix sparsity, is manifest in the relative costs of
communication, sequential computation, and parallel computation. Hence, neither
row nor column partitioning is uniformly superior. To understand the dynamics
of algorithm interaction with matrix structure, application visualization tools
are necessary.
Partition  Entering Basis Column  Departing Basis Column  Gauss-Jordan Elimination
Column     Parallel computation   Sequential computation  Column global send
           Global minimisation                            Parallel computation
Row        Parallel computation   Parallel computation    Row global send
           Global minimisation    Global minimisation     Parallel computation
Table 2: Hypercube Simplex Variations
4.3 Simplex Performance Visualization
Earlier study [19] suggested that, despite variations in matrix structure, row
partitioned simplex implementations often yielded better performance. How-
ever, counterexamples exist; Figure 11 shows the non-zero matrix structure of
one such problem. Although the 7:1 ratio of columns to rows suggests the
reason that column partitioning is preferable, the details are best grasped via
visualization.
Figures 12 and 13 show four views of the number of messages sent between
tasks of the row partitioned simplex algorithm on a 16 node l.ntel iPSC/2 hy-
percube. Recall that in a D-dlmensional hypercube, a node with address n is
directly connected only to those other nodes with addresses whose binary expan-
sions differ grom n in exactly one bit. Although messages must cross multiple
communication links to reach some nodes, the maximum distance between any
two nodes is D. When exploring performance at the hardware and system
software levels, understandin$ node connectivity is crucial. However, at the ap-
plication level, messages are exchanged by tasks, not hypereube nodes. Hence,
rI.n reality, there are many subvari,tlons of both row sad colmms partitioning, and each
hu dL_erin8 performance; see [19] for complete detsi]_.
17
Figure 11: Simplex Benchmark SCSD1
(Plot of the non-zero matrix structure; vertical axis: Row, 0 to 100; horizontal axis: Column, 0 to 850.)
Figure 12⁸ and subsequent figures show the logical interaction of tasks, not the
physical transfer of data. We emphasize that complete understanding requires
performance visualization at all levels: hardware, system software, algorithm,
and application. By separating the levels, the performance contributions of
each level are manifest.
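The hypercube connectivity rule recalled above can be made concrete with a small sketch. The function names `neighbors` and `hops` are illustrative only and do not come from the iPSC/2 software; the code merely restates the one-bit-difference rule and the resulting maximum distance of D.

```python
def neighbors(node: int, dim: int) -> list[int]:
    """Nodes directly linked to `node` in a `dim`-dimensional hypercube."""
    # Flipping exactly one of the `dim` address bits yields each neighbor.
    return [node ^ (1 << bit) for bit in range(dim)]


def hops(src: int, dst: int) -> int:
    """Minimum number of links a message crosses from `src` to `dst`."""
    # The distance equals the number of address bits in which the nodes differ.
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    D = 4  # a 16 node hypercube, as in the experiments described here
    print("neighbors of node 5:", neighbors(5, D))
    print("largest distance from node 0:",
          max(hops(0, n) for n in range(1 << D)))
```

Because the displays in this section are indexed by task rather than by node, such a mapping matters only when application views are correlated with the hardware level.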
In Figure 12 the peaks represent the logarithmic combining necessary to
identify global minima. In Figure 13 the logarithmic combining appears as
lightly shaded regions in the density view and as clustered contour lines in
the contour view. Because task zero is the root of the combining tree, during
each simplex iteration it must broadcast the identity of the task containing the
global minimum. The identified task then broadcasts the needed row to all other
tasks. If the workload were perfectly distributed, each task would broadcast an
equal number of times. Excluding messages due to the logarithmic combining,
all other variations in communication traffic are attributable to this load
⁸ In the 3-dimensional displays, counts greater than thirty were clipped, hence the uniformity.
imbalance. The multiplicity of views reflects our belief that an integrated
performance visualization system should permit the performance analyst to select
those views that correspond to his or her personal preference and needs.
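To illustrate how the communication pattern just described produces a task-by-task message count matrix, the following sketch tallies logical messages for the row partitioned scheme. It rests on assumptions not stated in the text: the combining tree is taken to be a binomial tree rooted at task zero, and each broadcast is counted as one logical message per destination. The names `combine_pairs` and `count_messages` are hypothetical.

```python
def combine_pairs(num_tasks):
    """(source, destination) pairs of a binomial combining tree rooted at task 0."""
    pairs = []
    step = 1
    while step < num_tasks:
        # Tasks whose lowest set address bit is `step` send their partial
        # minimum to the partner obtained by clearing that bit.
        for task in range(step, num_tasks, 2 * step):
            pairs.append((task, task - step))
        step *= 2
    return pairs


def count_messages(num_tasks, owners):
    """Return counts[src][dst]; `owners` names the task holding the
    global minimum in each simplex iteration."""
    counts = [[0] * num_tasks for _ in range(num_tasks)]
    tree = combine_pairs(num_tasks)
    for owner in owners:
        for src, dst in tree:                # logarithmic combining
            counts[src][dst] += 1
        for dst in range(1, num_tasks):      # task 0 announces the owner
            counts[0][dst] += 1
        for dst in range(num_tasks):         # owner broadcasts the needed row
            if dst != owner:
                counts[owner][dst] += 1
    return counts


if __name__ == "__main__":
    balanced = count_messages(16, list(range(16)))
    skewed = count_messages(16, [3] * 16)    # one task always owns the minimum
    print("sent per task, balanced:", [sum(row) for row in balanced])
    print("sent per task, skewed:  ", [sum(row) for row in skewed])
```

Under these assumptions, the combining and announcement traffic is identical in both runs; only the row broadcasts shift toward the overloaded task, which is the load imbalance the displays are meant to expose.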
Figure 14 shows the volume of data exchanged between tasks. Comparing
Figure 14 with Figure 12 shows that tasks exchanging many messages do not
exchange a large volume of data. Why? The many messages necessary to
realize the combining tree are small; the row broadcasts require fewer, larger
messages. The performance ramifications of this bimodal distribution of
message sizes can only be understood by examining hardware and system software
performance displays. These displays show that message passing systems like
the Intel iPSC/2 have large message preparation times relative to
communication link bandwidth, penalizing small messages. Hence, message count is the
important performance metric, not message volume.
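The penalty on small messages can be seen with a simple linear cost model, in which each message pays a fixed preparation time plus a size-dependent transfer time. This is a sketch under stated assumptions: the model itself and the constants below are illustrative and are not measured iPSC/2 values.

```python
def transfer_time(num_messages, bytes_per_message, startup_s, bandwidth_bytes_per_s):
    """Total time to send `num_messages` equal-sized messages when each one
    costs a fixed startup time plus its size divided by the link bandwidth."""
    per_message = startup_s + bytes_per_message / bandwidth_bytes_per_s
    return num_messages * per_message


if __name__ == "__main__":
    STARTUP_S = 1e-3   # hypothetical per-message preparation time (seconds)
    BANDWIDTH = 1e6    # hypothetical link bandwidth (bytes per second)

    # Many small messages (combining tree) versus few large ones (row broadcasts).
    many_small = transfer_time(1000, 8, STARTUP_S, BANDWIDTH)
    few_large = transfer_time(10, 8000, STARTUP_S, BANDWIDTH)
    print(f"1000 messages of 8 bytes:  {many_small:.3f} s for 8000 bytes")
    print(f"10 messages of 8000 bytes: {few_large:.3f} s for 80000 bytes")
```

With these illustrative constants, the small messages carry one tenth of the data yet take more than ten times as long, which is why message count dominates message volume when startup costs are large.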
Finally, Figures 15 and 16 show the message count and volume for the column
partitioned simplex algorithm. As before, a combining tree is used to identify
global minima. However, as Table 2 shows, this global operation is used only
when finding an entering basis column. Because each task contains columns,
a sequential computation is used to identify the departing basis column. This
reduces the number of small message transmissions at the expense of reduced
parallelism. More importantly, however, broadcasting matrix columns is much
less expensive than the row broadcasts of the row partitioned algorithm. The
scales for Figures 14 and 16 differ significantly; this is the reason the column
partitioned variant is superior for the matrix of Figure 11.
5 Current Research
The hardware and application performance visualizations just described are ad
hoc and are not integrated. First, HyperView was designed primarily to
display hardware performance. As such, it is not easily extensible to display of
application performance, nor should it be -- display techniques for system and
application performance differ. Second, the simplex application visualizations
required manual instrumentation of the simplex code and extensive preprocessing
before they could be displayed using Mathematica, a symbolic manipulation
system and mathematician's assistant not intended for this use. HyperView
and the simplex application visualization are facsimiles or rapid prototypes of
what is desired: an integrated performance specification, instrumentation, and
visualization system for message passing systems.
Figure 17 illustrates the ideal. This hypothetical system, called Tapestry,
would weave together elements of the hardware, system software, and
application levels. The hardware and system levels, shown at the top of the figure,
would, like HyperView and Seecube, dynamically display internode
communication traffic⁹ using multiple colors. Dynamic displays would include current
⁹ Although the figure shows a four-dimensional hypercube, other views, such as the Pascal
messages, cumulative traffic (either counts or volume), and link utilization. Via
a mouse, the performance analyst also could choose from an extended menu of
performance displays for each node and link:
• external input/output (i.e., file accesses),
• processor utilization,
• context switches,
• system calls,
• memory utilization,
• memory reference patterns (i.e., reference localities),
• virtual memory paging activity, and
• message counts and volume by destination.
Each of these could be displayed in a variety of formats (e.g., perspective,
histogram, strip chart, contour, or density).
The application performance level, illustrated at the bottom of Figure 17,
would display the logical graph of the intertask communication pattern, not the
physical graph of the underlying interconnection network. By dragging graph
nodes and edges with a mouse, the topological orientation of the graph could
be modified to reflect the performance analyst's preferences. The application
performance level, like the hardware level, would include dynamic displays of
message traffic on the parallel program graph and via perspective, density, and
contour plots. In addition, pull-down menus for tasks would include:
• message counts and volume by destination,
• delays for message transmission or receipt,
• dynamic procedure call graphs, and
• execution profiles.
Finally, the visualization system would permit correlation of system and
application performance.
The astute reader will have realized that near real-time processing and
display of such detailed performance data (e.g., memory reference patterns) implies
prodigious, indeed unrealistic, computing, storage, and display requirements.
Below, we discuss those features we believe are necessary to achieve the goal of
an integrated performance visualization system.
triangle, Gray code, or Fl_r would be supported also.
5.1 Instrumentation and Visualization
The major limitation of HyperView, Seecube, and the simplex application
visualization is the absence of near real-time behavior. A posteriori examination of
performance data means that all data of potential interest must be captured a
priori. Despite the consequent increase in storage requirements, this is
sometimes desirable -- it permits performance data browsing across the entire
interval of execution, and it permits data capture at a level of detail incompatible
with near real-time processing. However, a posteriori examination also precludes
dynamic system or application reconfiguration based on observed performance.
Real-time display of even a portion of the captured data would permit the
performance analyst to selectively enable and disable performance instrumentation
based on observed behavior, reducing the storage requirements.
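The idea of selectively enabling and disabling instrumentation can be sketched as a bounded event buffer that records only the event classes currently switched on. Everything here is hypothetical: the `TraceBuffer` class, its method names, and the event class strings are illustrative and do not describe HyperView, Seecube, or any iPSC/2 facility. Dropping the oldest record when the buffer is full is one possible policy among several.

```python
from collections import deque


class TraceBuffer:
    """Bounded per-node event buffer with switchable event classes."""

    def __init__(self, capacity):
        # With maxlen set, appending to a full deque discards the oldest record.
        self.events = deque(maxlen=capacity)
        self.enabled = set()

    def enable(self, kind):
        self.enabled.add(kind)

    def disable(self, kind):
        self.enabled.discard(kind)

    def record(self, kind, timestamp, payload):
        """Store the event only if its class is enabled; report whether it was kept."""
        if kind not in self.enabled:
            return False
        self.events.append((timestamp, kind, payload))
        return True


if __name__ == "__main__":
    buf = TraceBuffer(capacity=1000)
    buf.enable("message_send")
    buf.record("message_send", 0.001, {"dst": 3, "bytes": 8})
    buf.record("memory_ref", 0.002, {"addr": 0x1000})  # ignored: class not enabled
    print(len(buf.events), "event(s) stored")
```

An analyst watching a live display could call `enable` or `disable` as interesting behavior appears, so that only the relevant event classes consume buffer space.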
Despite the manifest advantages of interactive performance instrumentation
and display, on most message passing systems, including the Intel iPSC/2
hypercube, a posteriori data display is unavoidable because there is
insufficient communication bandwidth to transmit performance data to an external
host without distorting the performance being measured. Moreover, the
limited memory at each node constrains the volume of performance data that can
be buffered for subsequent transmission. Clearly, hardware support for
performance data recording is crucial, and we assume its existence. However, detailed
discussion of hardware designs for performance instrumentation is beyond the
scope of this paper. See [2] for a discussion of the hardware requirements for
performance instrumentation.
A visualization system must be evolutionary, adapting to the changing
demands of hardware, system software, applications, and users. Thus, the
implementation must be extensible, permitting addition of new display formats
and performance metrics, and portable, permitting use with a variety of
systems. These twin goals, extensibility and portability, suggest a modular,
object-oriented design that separates interface from implementation. Using the X
client-server paradigm [18] would provide portability and ensure future
extensibility based on an emerging standard for window systems. However, X alone
provides neither the necessary abstractions (e.g., hierarchical performance
displays) nor the rapid prototyping support; object-oriented window libraries such
as InterViews [14] are needed.
5.2 Current Status
Based on the lessons learned with HyperView, we are implementing an initial
version of Tapestry for the Intel iPSC/2 using X and InterViews. Initially,
software instrumentation of NX/2, the iPSC/2 operating system, will provide
data on system performance; a hardware monitor will be added later.
Application performance data are captured by instrumenting application and system
libraries, by modifying a compiler to automatically instrument application code,
and by manually inserting instrumentation directives in application code.
Acknowledgments
Craig Stunkel designed the simplex code and conducted the experiments that
motivated the application performance displays. Stephen Wolfram's Mathematica
[20] was used to generate Figures 12-16. We are grateful for use of
prototype versions of this software; the ability to quickly generate many data views
has been immensely useful. Finally, Seecube [3] provided the inspiration for
HyperView; many of the ideas and displays were borrowed from this pioneering
work.
References
[1] AYKANAT, C., AND OZGUNER, F. Large Grain Parallel Conjugate Gradient
Algorithms on a Hypercube Multiprocessor. In Proceedings of the
1987 International Conference on Parallel Processing (St. Charles, IL, Aug.
1987), pp. 641-644.
[2] CARPENTER, R. J. Report of the Distributed Memory Discussion Group.
In These Proceedings.
[3] COUCH, A. L. Graphical Representations of Program Performance on
Hypercube Message-Passing Multiprocessors. PhD thesis, Tufts University,
Department of Computer Science, Apr. 1988.
[4] DENNING, P. J. Working Sets Past and Present. IEEE Transactions on
Software Engineering SE-6, 1 (Jan. 1980), 64-84.
[5] DENNING, P. J., AND BUZEN, J. P. The Operational Analysis of Queueing
Network Models. ACM Computing Surveys 10, 3 (Sep. 1978), 225-261.
[6] FOULDS, L. R. Optimization Techniques: An Introduction.
Springer-Verlag, New York, NY, 1981.
[7] FRENKEL, K. A. The Art and Science of Visualizing Data.
Communications of the ACM 31, 2 (Feb. 1988), 110-121.
[8] GEIST, G. A., AND HEATH, M. T. Matrix Factorization on a Hypercube
Multiprocessor. In Proceedings of the First Conference on Hypercube
Computers and Concurrent Applications (Knoxville, TN, Aug. 1985), pp.
161-180.
[9] GRUNWALD, D. C., AND REED, D. A. Networks for Parallel Processors:
Measurements and Prognostications. In Proceedings of the Third
Conference on Hypercube Computers and Concurrent Applications (Pasadena,
CA, Jan. 1988).
[10] GUARNA, V., GANNON, D., GAUR, Y., AND JABLONOWSKI, D. An
Environment for Programming Scientific Applications. In Proceedings of
Supercomputing '88 (Orlando, Florida, Nov. 1988).
[11] GUARNA, V., AND GAUR, Y. A Portable User Interface for a Scientific
Programming Environment. In Proceedings of the Siggraph Symposium on
User Interface Software (Banff, Alberta, Canada, Oct. 1988).
[12] JABLONOWSKI, D., AND GUARNA, V. A Dynamic Graph Tool and Its Use
in an Integrated Programming Environment. Tech. Rep. CSRD Report No.
746, University of Illinois at Urbana-Champaign, Center for
Supercomputing Research and Development, June 1988.
[13] KUCK, D. J., DAVIDSON, E. S., LAWRIE, D. H., AND SAMEH, A. H.
Parallel Supercomputing Today and the Cedar Approach. Science 231
(Feb. 1986).
[14] LINTON, M. A., AND CALDER, P. R. The Design and Implementation
of InterViews. In Proceedings of the USENIX C++ Workshop (Santa Fe,
NM, Nov. 1987), pp. 256-267.
[15] MOLER, C. Matrix Computation on Distributed Memory
Multiprocessors. In Proceedings of the First Conference on Hypercube Computers and
Concurrent Applications (Knoxville, TN, Aug. 1985), pp. 181-195.
[16] RATTNER, J. Concurrent Processing: A New Direction in Scientific
Computing. In Conference Proceedings of the 1985 National Computer
Conference (1985), AFIPS Press, pp. 157-166.
[17] REED, D. A., AND FUJIMOTO, R. M. Multicomputer Networks:
Message-Based Parallel Processing. The MIT Press, 1987.
[18] SCHEIFLER, R. W., AND GETTYS, J. The X Window System. ACM
Transactions on Graphics 5, 2 (Apr. 1986), 79-109.
[19] STUNKEL, C. B., AND REED, D. A. Hypercube Implementation of the
Simplex Algorithm. In Proceedings of the Third Conference on Hypercube
Computers and Concurrent Applications (Pasadena, CA, Jan. 1988).
[20] WOLFRAM, S. Mathematica: A System for Doing Mathematics by
Computer. Addison-Wesley, July 1988.
Figure 12: Perspective Views of Message Counts (Row Partitioning)
(Two perspective plots; axes: Message Count and Destination Task, 0 to 15.)
Figure 13: Density and Contour Views of Message Counts (Row Partitioning)
(Two plots; axes: Source Task, 0 to 15, and Destination Task, 0 to 15.)
Figure 14: Perspective Views of Message Volume (Row Partitioning)
(Two perspective plots; axes: Message Volume, Source Task, and Destination Task, 0 to 15.)
Figure 15: Perspective Views of Message Counts (Column Partitioning)
(Two perspective plots; axes: Message Count and Destination Task, 0 to 15.)
Figure 16: Perspective Views of Message Volume (Column Partitioning)
(Two perspective plots; axes: Message Volume, Source Task, and Destination Task, 0 to 15.)
Figure 17: Tapestry Performance Visualization System
(Panel labels: Memory Localities, Processor Utilization, Dynamic Call Graph.)

BIBLIOGRAPHIC DATA SHEET
1. Report No.: UIUCDCS-R-88-1465
3. Recipient's Accession No.:
4. Title and Subtitle: Visualizing Parallel Computer System Performance
5. Report Date: September 1988
7. Author(s): Allen D. Malony and Daniel A. Reed
8. Performing Organization Rept. No.: R-88-1465
9. Performing Organization Name and Address:
Department of Computer Science
1304 W. Springfield Ave.
240 Digital Computer Lab
Urbana, IL 61801
10. Project/Task/Work Unit No.:
11. Contract/Grant No.: see page 1
12. Sponsoring Organization Name and Address:
National Science Foundation, Washington, DC 20550
Air Force Office of Scientific Research, Washington, DC 20550
Department of Energy, Washington, DC 20550
NASA Langley Res[...], Hampton, VA 236[...]
13. Type of Report & Period Covered:
15. Supplementary Notes:
16. Abstracts:
Parallel computer systems are among the most complex of man's creations, making
satisfactory performance characterization difficult. Despite this complexity, there
are strong, indeed, almost irresistible, incentives to quantify parallel system
performance using a single metric. The fallacy lies in succumbing to such temptations.
A complete performance characterization requires not only an analysis of the system's
constituent levels, it also requires both static and dynamic characterizations.
Static or average behavior analysis may mask transients that dramatically alter system
performance.
Although the human visual system is remarkably adept at interpreting and identifying
anomalies in false color data, the importance of dynamic, visual scientific data
presentation has only recently been recognized. Large, complex parallel systems pose
equally vexing performance interpretation problems. Data from hardware and software
performance monitors must be presented in ways that emphasize important events while
eliding irrelevant details. Design approaches and tools for performance
visualization are the subject of this paper.
17. Key Words: visualization, performance evaluation, parallel processing
17b. Identifiers/Open-Ended Terms:
17c. COSATI Field/Group:
18. Availability Statement: unlimited
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages: 32
22. Price:

