The interconnection network is a decisive component of parallel and distributed computer systems. With its merits of simplicity and efficiency, the 2-dimensional (2-D) mesh has been a popular choice for large MIMD interconnection networks. Mesh, however, has known weaknesses in scalability and connectivity. Packed Exponential Connections (PEC) is a newly proposed network designed to improve the scalability and connectivity of the 2-D mesh while maintaining its merits. In this study, the performance of PEC relative to the mesh network is carefully examined through computer simulation. Characteristics of PEC networks are revealed. A novel routing scheme is proposed and used in the PROTEUS environment to simulate the performance of the 2-D PEC network. Simulation and analytical results show that for many applications where non-local communications are required, the PEC network provides superior performance to that of mesh. Based on simulation results, a structural modification is also suggested to further enhance the performance of the PEC network.
Introduction
Performance study is an important aspect of parallel system research. It can be used to select the best architecture platform for a class of applications, to choose the best algorithm for solving a particular application on a given hardware platform, to predict the performance of large application instances, to identify application and architecture bottlenecks suggesting application restructuring and architectural enhancements, and to gain insight into the interaction between an application and an architecture, predicting the performance of other application-architecture pairs.
There are three commonly used techniques in performance evaluation: experimentation, theoretical and analytical modeling, and software simulation. Parallel and distributed computer systems are quite complex. Accurate modeling of a parallel system is very difficult to accomplish. On the other hand, building a real machine for performance experimentation is expensive at best and unrealistic for most situations. For parallel computing, simulation has played an important role at all levels of the design and analysis of multiprocessor systems, ranging from architectures to runtime systems and from algorithms to applications.
Packed Exponential Connections (PEC) is a newly proposed network which has recently gained intensive attention [8, 9, 11]. PROTEUS [2, 3] is a simulation system which can provide fast and accurate simulation for parallel algorithms. In this research, the performance of the PEC network is studied via simulation on the PROTEUS system. The characteristics of the PEC network are examined. A practical routing scheme, R-Route, is proposed, analyzed, and simulated solving two applications on PEC and mesh networks. The PROTEUS software system is appropriately enhanced to carry out the simulation of the PEC network. Simulation results of PEC and mesh networks are then compared and analyzed. Simulation results confirm analytical results: the PEC network is a promising network; it provides superior performance to mesh for applications requiring global communications, especially when system ensemble size and problem size are large. Based on simulated performance, a modification of the PEC network is also proposed to further enhance the performance.
This paper is organized as follows. The structure and functionality of PROTEUS are introduced in Section 2. A detailed description of the PEC network is presented in Section 3. Properties of 1-D and 2-D PEC networks are analyzed, and the similarities and differences between PEC and mesh are also described. In Section 4 a 1-D PEC routing scheme, R-Route, is introduced and analyzed. Based on it, a 2-D PEC routing algorithm is proposed and implemented in the PROTEUS environment.
In Section 5 two binary-exchange scheme algorithms, namely all-to-all broadcast and Fast Fourier Transform (FFT), which are used in the simulation study, are introduced. Section 6 presents the simulation results of PEC and mesh network. Comparison and analysis of the performance are also provided. Finally, Section 7 gives the conclusion and summary.
The PROTEUS Simulator
PROTEUS is a parallel system simulator developed at MIT [2, 3]. In the PROTEUS environment, each simulated parallel processing node consists of a processor, a network interface, a cache, and a memory. The processor is a sequential processor extended with instructions for network access and cache coherence. The network interface connects the processor with the interconnection network. The cache, which is optional, can be used to handle cache coherence and to provide remote memory access. The memory at each node is divided into two modes, a shared mode and a private mode. The shared mode maps to the global address space, which can be accessed from other processor nodes via the interconnection network. The private section is not accessible from the interconnection network. If the memory of each processor is all in shared mode and memory access time is uniform, the structure is a COMA model [4]. If we consider the effect of the interconnection network, it is a NUMA model. If all the memories of processors are in private mode and no global address space is used, then it is the message-passing model.
To simplify replacement and adjustment for different applications, PROTEUS is designed with a modular structure. It includes operating system modules, shared memory modules, cache modules, data collection modules, and network modules. The operating system module provides a simulated kernel interface for the parallel environment, such as thread scheduling and management, memory management, and interrupt and trap handling. It also provides InterProcessor Interrupts (IPI) and handlers, which serve as the message handler facility in a message-passing architecture. The shared-memory module provides access to local shared memory, handles memory management, and provides atomic synchronization operations such as test-and-set. The data collection module is responsible for information collection and display. The network module simulates the movement of data within the interconnection network.
In general, a parallel-system simulator may face potential drawbacks on two fronts: speed and accuracy. To achieve accuracy, PROTEUS applies execution-driven mode, which ensures network contention ordering of program events and permits accurate simulation of contention and process interaction. To achieve high speed, PROTEUS avoids interpreting user application code whenever possible, thus removing the overhead of interpretation for most instructions. It is designed to make the entire system run in a single address space. PROTEUS also provides users high flexibility in choosing or customizing the level of accuracy and gives users control of the tradeoff between accuracy and speed. These and other features make it faster and more accurate than most existing simulators.
In addition to performance, a primary asset of PROTEUS is its support for monitoring and debugging. PROTEUS provides an internal debugging mode called "snapshot", which allows users to examine the status of threads, processors, locks, and memory. PROTEUS also provides repeatability: users can rerun the simulations to find bugs. PROTEUS provides an integrated subsystem for system configuration, data collection, and result display. Its graphic capabilities simplify system configuration and the performance evaluation of algorithms and architectures.
Like other simulators, however, PROTEUS has limitations. Though PROTEUS is a fast, high-level simulator compared to its counterparts, it still takes a long time to simulate comparatively complex algorithms.
One reason for the long running time is that PROTEUS is a sequential program. A parallel version is not available. The other limitation of PROTEUS is that it does not provide options for recording simulation results. The simulator automatically records much information about the execution. When massively parallel processing (number of processors > 500) is simulated, information recording becomes very resource consuming.
Packed Exponential Connections Network
Many factors contribute to the design of an interconnection network. In addition to speed and connectivity, simplicity and scalability are two other important concerns in network design [10]. Some networks, such as the complete connection network, may provide high analytical performance. But difficulties in VLSI fabrication and hardware scalability render them impractical when building actual parallel computers. A simple and effective interconnection network is the mesh. Mesh, however, lacks support for long-distance connections, which makes it very inefficient when system ensemble size is large and non-local communication is needed.
Packed Exponential Connections (PEC) network is a new interconnection network that solves the problems of scale and connectivity by augmenting a two dimensional mesh with additional long and q is a non-negative integer. We say PEC(i) = h.
Note that the PEC value of a node i is the least significant bit position of i where a "1" appears. The rightmost bit position is counted as position 1.
Definition 2 A PEC network is a graph $G = \langle V, E \rangle$, where $V = \{0, 1, \ldots, 2^n - 1\}$ and E is a set of links, $E = \{(i, j) \mid (i = j \pm 1) \text{ or } (i = j \pm 2^h \text{ and } PEC(i) = PEC(j) = h)\}$, where $i, j \in V$.
In general, a node i with PEC(i) = h has at most four links, which connect it to nodes $i + 1$, $i - 1$, $i + 2^h$, and $i - 2^h$. Figure 2 shows an alternative representation of a 1-dimension PEC network of size 32.
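The link rule of Definition 2 can be illustrated with a minimal Python sketch. The function names `pec` and `neighbors` are hypothetical and chosen only for illustration. The sketch assumes that nodes are numbered 0 to size − 1 and that node 0, which has no "1" bit and therefore no PEC value, has only nearest-neighbor links.

```python
def pec(i: int) -> int:
    """Return the PEC value of node i: the 1-based position of its lowest set bit."""
    if i <= 0:
        raise ValueError("PEC value is only defined here for positive node numbers")
    return (i & -i).bit_length()


def neighbors(i: int, size: int) -> list[int]:
    """List the nodes linked to node i in a 1-D PEC network with nodes 0..size-1."""
    candidates = [i - 1, i + 1]            # nearest-neighbor links
    if i > 0:                              # assumption: node 0 has no long links
        jump = 1 << pec(i)                 # long link length 2**h for PEC(i) = h
        candidates += [i - jump, i + jump]
    # Drop link endpoints that fall outside the network.
    return sorted(n for n in candidates if 0 <= n < size)
```

For example, node 12 (binary 1100) has PEC value 3, so in a network of size 32 its long links of length 8 reach nodes 4 and 20, in addition to its nearest neighbors 11 and 13.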
2-D PEC Network
In a rectangular mesh network, each node has four nearest-neighbor connections (E, W, S, N). PEC adds four additional connections that are part of a second 2-D mesh of a different scale. The nearest-neighbor mesh will be called PEC(0). A mesh that connects every second node is a 2-D mesh of PEC(1). In general, a 2-D mesh of PEC(k) will connect the processors $2^k$ neighbors away. 2-D PEC has several properties which make it an attractive alternative to other networks.
Definition 3 A processor farm groups the nodes in one area of the processor array, and uses these processors together to implement a function, or a set of functions within the context of a larger application.
Simulation Design
The heart of parallel-system simulation design is the design of routing schemes. The introduction of a 1-D routing scheme, the extension of the new routing to 2-D PEC, and the incorporation of the new scheme into the PROTEUS simulation environment are presented in the following four sections, respectively.
A Routing Scheme on 1-D PEC
Lin and Prasanna have proposed G-Route, which can set up a path from node 1 to node i in a PEC graph with $2^n - 1$ nodes, where $1 \le i \le 2^n - 1$ [8]. G-Route reaches the theoretical lower bound of routing on a 1-D PEC network. However, it has drawbacks. First, instead of providing a general routing scheme from node i to node j, G-Route only gives the route from node 1 to node j. Second, G-Route requires the user to supply the optimal characteristic set, $H_n$, which is not easy to acquire. When the PEC network is not a perfect PEC network, that is, when the network size is not $2^n$ but $m 2^n$ (for example, 24, 48, etc.), determining an optimal characteristic set is extremely difficult, and, therefore, the applicability of G-Route is very limited in practice. (Though we can combine routing from i to 1 and from 1 to j to make an i-to-j routing scheme, doing so makes node 1 the bottleneck in communication.) On a 1-dimension PEC, a routing path from node i to node j can be determined recursively when we know at each stage the longest link it takes. The R-Route routing scheme proposed in this study is based on this property. Assuming i and j are the source and destination nodes on a 1-dimension PEC, respectively, and assuming the longest link between i and j passes through nodes with PEC level k, where $k \ge 0$, then R-Route can be generated using the following steps:
1.) Put all the links through level t = k nodes between nodes i and j into PATH, and mark the remaining links between nodes i and j as unresolved.
2.) Search the unresolved links for links through level t = t - 1 nodes that do not overlap with any link in PATH. If there are any, put them into PATH.
3.) Repeat Step 2 until both nodes i and j are reached by PATH.
4.) PATH is the R-Route routing path.
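The following Python sketch gives one possible reading of the steps above, not the implementation used in the study. It takes all links of the highest level that fit between the two endpoints, then fills the unresolved segments on either side recursively with lower-level links, and falls back to nearest-neighbor links when no long link fits. The names `r_route_sketch` and `is_link` are hypothetical, and the node numbering and link rule are assumed to follow Definition 2.

```python
def pec(i: int) -> int:
    """1-based position of the lowest set bit of i (i > 0)."""
    return (i & -i).bit_length()


def r_route_sketch(i: int, j: int) -> list[int]:
    """Return a node sequence from i to j built longest-link-first."""
    if i > j:                                  # route right-to-left by symmetry
        return r_route_sketch(j, i)[::-1]
    for h in range(max(j.bit_length(), 1), 0, -1):
        base, step = 1 << (h - 1), 1 << h      # level-h nodes: base + q * step
        q_lo = max(0, -(-(i - base) // step))  # first level-h node at or after i
        q_hi = (j - base) // step              # last level-h node at or before j
        if q_hi > q_lo:                        # at least one level-h link fits
            a, b = base + q_lo * step, base + q_hi * step
            middle = list(range(a + step, b + 1, step))
            # Resolve the segments left of a and right of b with lower levels.
            return r_route_sketch(i, a) + middle + r_route_sketch(b, j)[1:]
    return list(range(i, j + 1))               # only nearest-neighbor links fit


def is_link(u: int, v: int) -> bool:
    """Check that (u, v) is an edge under Definition 2."""
    d = abs(u - v)
    return d == 1 or (u > 0 and v > 0 and pec(u) == pec(v) and d == 1 << pec(u))
```

Under these assumptions, routing from node 0 to node 31 yields the path 0, 1, 2, 6, 7, 8, 24, 25, 26, 30, 31, which takes 10 hops instead of the 31 nearest-neighbor hops a linear array would need. No characteristic set is required, and the sketch makes no claim of optimality.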
An example of an R-Route path is given in Figure 5. There are two major differences between G-Route and R-Route: first, there is a trace-back operation in G-Route but not in R-Route; second, a characteristic set is needed in G-Route but not in R-Route. The most salient feature of R-Route is that it does not require knowledge of a characteristic set [8]. The other advantage is that it can be applied to an imperfect PEC network.
2-D PEC Routing Scheme
PEC network is a relatively new network. Optimal routing schemes on PEC networks are not available, especially for 2-dimension or higher PEC networks. By theoretical analysis, an optimal route between farms in a PEC network could give routing distances similar to those found in a binary tree [9]. Using high-level connections for optimal routing schemes is still a subject of current research [5]. There is no near-optimal routing scheme available on 2-D PEC. Figure 8 gives an intuitive sense of the complexity of utilizing high-level PEC connections on 2-D PEC. The 2-D PEC routing algorithm has the following three-step structure. In our simulation, we assume that X-Y routing will be accomplished within each farm. That means a data value on a given processing node will be sent to a target node by following horizontal and vertical paths (or vice versa) which utilize either nearest-neighbor or 1-D routing contributions. Figure 9 gives the 2-D routing program. To avoid generating high contention in a particular direction, the program chooses the direction with the longest distance as the first routing direction.
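The direction-ordering rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program of Figure 9. The names `route_2d` and `unit_route` are hypothetical, the 1-D router is passed in as a parameter with a nearest-neighbor placeholder as its default, and ties between the two distances are broken in favor of the X direction.

```python
from typing import Callable

Node = tuple[int, int]


def unit_route(a: int, b: int) -> list[int]:
    """Placeholder 1-D router: nearest-neighbor steps from a to b."""
    step = 1 if b >= a else -1
    return list(range(a, b + step, step))


def route_2d(src: Node, dst: Node,
             route_1d: Callable[[int, int], list[int]] = unit_route) -> list[Node]:
    """X-Y routing that travels the longer dimension first."""
    (sx, sy), (dx, dy) = src, dst
    if abs(dx - sx) >= abs(dy - sy):           # X leg is longer (or equal): X first
        first = [(x, sy) for x in route_1d(sx, dx)]
        second = [(dx, y) for y in route_1d(sy, dy)]
    else:                                      # Y leg is longer: Y first
        first = [(sx, y) for y in route_1d(sy, dy)]
        second = [(x, dy) for x in route_1d(sx, dx)]
    return first + second[1:]                  # drop the duplicated turning node
```

Any 1-D routing function that returns a node sequence from its first argument to its second, such as a PEC router, can be supplied in place of the placeholder.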
Embedding the Routing Scheme into PROTEUS
The network modules in PROTEUS [2] provide the functionality to simulate data movement in an interconnection network. Instructions that affect remote nodes are implemented using simulator requests. The network module uses three types of requests. The first is the send request, which signifies that the processor is ready to send a packet to the network interface. The second type is the route request, which determines the next node for a packet, computes the arrival time of the packet, and calculates the contention involved for each message. The third type is the receive request, which occurs when the packet reaches the target node. The receive request either interrupts the processor or notifies the cache, depending on the packet. In PROTEUS, only the network module generates route and receive requests; all other modules generate only send requests.
The route request decides the next node to which the packet should be forwarded. For example, in a k-ary n-cube the route request determines, based on the target node, which output link to use, the incoming link, etc. The new release of PROTEUS provides a router function that gives users an interface to define a special interconnection. The function resides in the program net.exact.c. The format of the function is:
default_router (int from, int to, int prev, int curr, int *line)
Here from is the number of the source node, to is the target node, prev is the last position, curr is the current position, and line is a list of dimensions passed. The return value of this function is the next node position corresponding to curr. default_router() is called each time there is a single data movement from one node to a neighbor node. To speed up the simulation, the PEC routing scheme can be generated beforehand and stored in a 2-D file, routetab. The contents of routetab are generated by a separate program using the R-Route method. This avoids the tedious recursive search at each step of the PEC routing. For comparison, we also implement the mesh routing scheme in default_router(). Because mesh routing is direct and easy to execute, its routing function is placed inside default_router() [2, 6].
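The precompute-then-look-up idea behind routetab can be sketched in Python as below. This is not PROTEUS code and does not reproduce the routetab file format. The names `build_route_table`, `table_router`, and `unit_path` are hypothetical, and a nearest-neighbor path function stands in for R-Route.

```python
def unit_path(a, b):
    """Placeholder path function: nearest-neighbor steps from a to b."""
    step = 1 if b >= a else -1
    return list(range(a, b + step, step))


def build_route_table(n, path_fn):
    """Precompute table[curr][to], the next node after curr on the way to node `to`."""
    table = [[curr] * n for curr in range(n)]  # default: stay put when curr == to
    for curr in range(n):
        for to in range(n):
            if curr != to:
                table[curr][to] = path_fn(curr, to)[1]
    return table


def table_router(table, curr, to):
    """Per-hop lookup, playing the role of the router callback."""
    return table[curr][to]
```

Each per-hop call then costs a single table lookup. Note that the table stores the next hop of the path computed from the current node, so the sketch assumes that re-routing from an intermediate node remains acceptable to the routing scheme.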
The number of messages passed and the total contention involved on each node are the two 
Application and Analytical Results
Two binary-exchange scheme algorithms, namely all-to-all broadcast and Fast Fourier Transform (FFT), have been implemented to test the performance of the PEC network. A generic binary-exchange computation can be defined as one that repeatedly uses data values that are a power of two apart. Given an initial set of n data values, $a_0, a_1, \ldots, a_{n-1}$, we consider patterns of processing that involve pairs of the form $a'_i = a_i \circ a_{i+2^k}$ and $a'_{i+2^k} = a_i \circ a_{i+2^k}$ for $i = j, \ldots, j + 2^k - 1$; $j = 0, 2^{k+1}, 2 \cdot 2^{k+1}, 3 \cdot 2^{k+1}, \ldots$, where the operator $x \circ y$ denotes any arithmetic, comparator, or set operation. The range of k can be either from 0 up to $(\log_2 n) - 1$, or vice versa [11].
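As a minimal sketch of this pattern, the Python code below enumerates the index pairs for a given k and applies the same operator to both members of each pair, with k running from 0 upward. The names `exchange_pairs` and `binary_exchange` are hypothetical, and n is assumed to be a power of two.

```python
from operator import add


def exchange_pairs(n, k):
    """Index pairs (i, i + 2**k) for i = j .. j + 2**k - 1, j = 0, 2**(k+1), ..."""
    d = 1 << k
    return [(i, i + d) for j in range(0, n, 2 * d) for i in range(j, j + d)]


def binary_exchange(values, op):
    """Apply op across all pairs for k = 0 .. log2(n) - 1."""
    a = list(values)
    n = len(a)                                 # assumed to be a power of two
    for k in range(n.bit_length() - 1):
        for i, p in exchange_pairs(n, k):
            a[i] = a[p] = op(a[i], a[p])       # both partners get the combined value
    return a
```

With addition as the operator, `binary_exchange([1, 2, 3, 4], add)` leaves the global sum 10 at every position, which is the all-reduce behavior typical of this pattern.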
The main feature of the binary-exchange scheme is the communication pattern required for completion. As we mentioned above, if $a_i$ and $a_{i+1}$ are close to each other, then $a_i$ and $a_{i+2^k}$ will be considered far from each other (see Figure 10). If the computer system has a single shared memory, then the time required to access either a nearest neighbor or a remote node is the same. However, this is not true for distributed-memory machines. On distributed systems, the latency of local and remote data access is different, and the performance of a binary-exchange algorithm varies with network support. In addition, binary-exchange is a widely used computation pattern in scientific and non-scientific applications. They are good candidates for testing the performance of 
All-to-All Broadcast
In one-to-all broadcast, a single value that resides on one of the PEC nodes is to be copied to all other nodes. In all-to-all broadcast (also called total-data-exchange), a single value that resides on each of the PEC nodes is to be copied to all other nodes. One way to accomplish all-to-all broadcast is to conduct one-to-all broadcast on each node concurrently and send the value using n + 1 phases as discussed in Section 3. The all-to-all broadcast algorithm, which is implemented on each node, is listed in Figure 11. Since each node will hold all N data values at the end, at least N memory locations have to be allocated on each node before data transfer, where N is the total number of nodes in the PEC network.
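The algorithm of Figure 11 is not reproduced here. As an illustration only, the Python sketch below shows a generic recursive-doubling all-to-all broadcast that follows the binary-exchange pattern: in phase k, node p exchanges everything it has collected so far with node p XOR 2^k. The function name is hypothetical, N is assumed to be a power of two, and the sketch uses log2 N exchange phases, which may differ from the phase count of the algorithm in the paper.

```python
def all_to_all_broadcast(values):
    """Simulate recursive-doubling all-to-all broadcast; return each node's buffer."""
    n = len(values)                              # assumed to be a power of two
    # buffers[p][q] holds node q's value once node p has received it.
    buffers = [[None] * n for _ in range(n)]     # N locations allocated per node
    for p in range(n):
        buffers[p][p] = values[p]
    for k in range(n.bit_length() - 1):
        snapshot = [row[:] for row in buffers]   # all nodes exchange simultaneously
        for p in range(n):
            partner = p ^ (1 << k)               # node a power of two away
            for q, v in enumerate(snapshot[partner]):
                if v is not None:
                    buffers[p][q] = v
    return buffers
```

After the last phase every node holds all N values, and the amount of data sent per phase doubles, so the longest exchanges carry the most data.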
Fast Fourier Transforms
Fast Fourier Transform (FFT) plays an important role in many scientific applications, including time series and wave analysis, solutions to linear partial differential equations, and image filtering. The computation pattern of the sequential FFT algorithm is a reversed binary-exchange (called butterfly) scheme. Three different data layouts exist for parallelizing sequential FFT algorithms: cyclic layout, block layout, and hybrid layout. Cyclic layout assigns the $(i + jN)$th row of the butterfly to the ith processor, where N is the number of nodes in the network and j is a non-negative integer. Block layout places the first m/N rows of the butterfly on the first node, the next m/N rows on the second node, and so on, where m is the problem size. Hybrid layout starts with cyclic layout and switches to block layout in the middle of the computation [7]. Both cyclic layout and block layout need one local computing, at the end and at the beginning respectively, and compute at each stage of the butterfly. The hybrid layout consists of two local computings and one global data transpose. Due to the transpose communication, algorithms based on the hybrid data layout are also called transpose algorithms [7]. Transpose algorithms provide better performance than the other two approaches when the problem size is large. Our implementation is based on the hybrid approach, and the data transpose is accomplished by using all-to-all personalized broadcast. All-to-all personalized broadcast has the same communication pattern as all-to-all broadcast, the binary-exchange pattern, but with an increased data size in communication.
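The cyclic and block row-to-node mappings can be written down directly. The short Python sketch below uses hypothetical function names and assumes that rows and nodes are numbered from 0 and that the number of nodes N divides the problem size m.

```python
def cyclic_owner(row, num_nodes):
    """Cyclic layout: row i + j*N of the butterfly lives on node i."""
    return row % num_nodes


def block_owner(row, problem_size, num_nodes):
    """Block layout: each node holds m/N consecutive rows."""
    rows_per_node = problem_size // num_nodes    # assumes N divides m
    return row // rows_per_node
```

With m = 8 and N = 4, the cyclic layout places rows 0 to 7 on nodes 0, 1, 2, 3, 0, 1, 2, 3, while the block layout places them on nodes 0, 0, 1, 1, 2, 2, 3, 3.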
Simulation Results and Analysis
Simulations have been conducted in the PROTEUS [2] environment to simulate the performance of 2-dimension PEC and mesh networks with 64 nodes and 256 nodes. Since data transfer is the main concern of performance evaluation, two different kinds of statistical graphs are presented below, for network contention and for the sum of data flow hops that occurred on each processing node, respectively.
For all the graphs, we assume m to be the total number of data being accessed; the Y-axis shows the respective measured parameter, while the X-axis shows the number of processors.
Simulation Result
Two sets of graphs are presented to demonstrate the data flow hops and contention on each processor, respectively. Data flow hops on each node are the total number of messages which have been sent out by the node, including both messages generated and messages passed. This parameter gives a general idea of the "intensity" of usage for each node. Contention on each node is the sum of time taken by all messages waiting to be sent out on the node. This parameter gives a general idea of the contention involved in each node. Two different data sizes and two different ensemble sizes are used for testing. The simulated results on PEC and on mesh are presented side by side for easy comparison. On the data flow figures, the shape of PEC is much steeper than that of mesh. The reason is that on the PEC network, because R-Route utilizes the longest path in each X and Y direction as much as possible, the chance that a message routes through a higher-level link on each X- and Y-dimension is higher than through a lower-level link. On mesh, however, all nodes have the same four nearest-neighbor connections and no long-jump route exists. The sum of data flow hops on all nodes of the PEC and mesh networks is plotted in Figure 25 and Figure 27. Figure 25 shows the total data flow hops on 64 processors. Figure 27 shows the total data flow hops on 256 processors. From Figures 25 and 27 we can see that for all-to-all broadcast the PEC network encountered less data flow than mesh. What is more, the relative performance difference of the two networks increases with ensemble size and problem size. This shows that the PEC network is more scalable than the mesh network in the measure of data flow hops. From Figure 26, we can see that when the number of nodes is 64, the contention on PEC is higher than the contention on mesh. The reason is that in the PEC structure all messages try to route through the highest-level PEC nodes available on each X- and Y-dimension.
This longest-possible routing scheme generates high contention on some high-level nodes of PEC. When the number of processors is 256, we find that the contention on PEC becomes lower than that of mesh. The reason is that, with a large ensemble size, a message must route through more nodes before reaching the target node on mesh, while on the PEC network more high-level paths become available and contention on high-level links is scaled down (see Figure 28). The long links of PEC make it more efficient for large ensemble constructs. The contention advantage of PEC over mesh is expected to grow further when the ensemble size scales up from 256. PEC is more scalable than mesh in the measure of contention as well. Though communication data flow and contention of FFT with 64 nodes are not presented, the scalability discussion of the performance of all-to-all broadcast is applicable to FFT performance as well. PEC is a more scalable network than mesh.
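The data flow hops metric defined in this section can be tallied from message paths. The Python sketch below is illustrative only: the function name is hypothetical, and each path is assumed to be the list of nodes a message visits from its source to its target. Contention is not modeled, since it depends on message timing in the simulator.

```python
from collections import Counter


def data_flow_hops(paths):
    """Count, per node, the messages it sends out (generated plus passed on)."""
    sends = Counter()
    for path in paths:
        sends.update(path[:-1])      # every node except the target sends once
    return sends
```

For the two paths [0, 1, 2] and [1, 2, 3], node 1 is counted twice, once for passing a message on and once for generating one.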
Architecture Enhancement
Combining the performance analysis of the two applications, we arrive at the following findings.
The PEC network is superior to mesh, in the measure of contention and in the measure of the number of messages passed per node. For the binary-exchange scheme, the contention generated on each node of PEC is more uniform than that on mesh, which means load balance, in terms of communication utilization, is better accomplished on PEC. PEC is more scalable than mesh, both in the measure of contention and in the measure of data flow per node. On the PEC network, contention is concentrated on some high-level links along the X and Y directions. Further network enhancement is possible by increasing the bandwidth of these highly utilized links.
The reason for high contention on some high-level links can be traced back to the routing scheme used on the PEC network. To reduce traffic, 1-dimension R-Route is designed to use long-connection (high-level) links whenever possible. Therefore, the success of R-Route means a high utilization of long-jump links when global communication is
Like most parallel-system simulations, this study includes multiple phases: understanding the characteristics of the network topology, developing or mastering and modifying an existing simulator, identifying algorithms and applications for benchmarking, and conducting the simulation and analysis of the simulation results. Unlike most simulation studies, since there is no routing scheme available for the PEC network, a considerable effort was made in this study to develop a practical routing scheme, R-Route. Applications with two different global communication patterns, namely total-data-exchange and the global data transpose communication pattern, are used for performance evaluation. Their performance on 2-D PEC is simulated in the PROTEUS software environment, a widely used parallel system simulator developed at MIT. Simulation results show that the PEC network provides better support for communication than the 2-D mesh in terms of contention and messages passed per node. In addition, PEC is more scalable and is expected to deliver even better performance relative to mesh when the system ensemble size and problem size increase. A Truncated Fat PEC network is suggested, based on our simulation findings, to further improve the performance of the PEC network.
The simulation results obtained in this study are subject to the newly proposed routing scheme, R-Route. In contrast to a 2-D mesh, where an efficient routing scheme is readily available, routing on a 2-D PEC network is not an easy task. R-Route is a one-dimensional routing scheme. It is not an optimal routing scheme on a 2-D PEC and, in fact, it is not an optimal routing scheme on a 1-D PEC. The development of optimal routing schemes and the search for the lower bound of routing on a 2-D PEC are still subjects of research. With an optimal or improved routing scheme, the performance of the PEC network would be further improved.
