Introduction
In conjunction with a consortium of institutions led by Cal Tech, Intel has developed a massively parallel, distributed-memory processor called the Touchstone DELTA system. The DELTA processors are interconnected by a mesh rather than by the hypercube topology used in Intel's earlier parallel systems (iPSC/1, iPSC/2, and iPSC/860). The DELTA uses the Intel i860 processor as its computing element, the same processor used in the iPSC/860 hypercube. This report describes the DELTA system and contrasts its performance with that of the iPSC/860 hypercube and the Ncube 6400 hypercube.
The DELTA project is a prototype intended to demonstrate that a mesh topology using byte-wide communication channels is a more cost-effective utilization of "communication wires" than the bit-wide hypercube topology. That is, given a fixed number of wires, can a more effective parallel processor be constructed by using more wires among fewer adjacent processors rather than by using those wires to provide more direct connectivity? Since the DELTA and iPSC/860 use the same processor, measuring and comparing communication performance between these two parallel processors should provide some answers.
The mesh has some potential advantages over a hypercube topology. Though both topologies are extensible, in practice commercial hypercubes have a fixed maximum dimension. For example, the largest iPSC/860 has seven dimensions, or 128 processors. Hypercubes must be expanded in powers of two, which is often prohibitively expensive. Meshes can be expanded at linear cost by adding a row or column. Of course, the hypercube topology has advantages as well. The maximum distance between two processors in an n-processor system is only log2 n for a hypercube, compared with a distance on the order of √n for the mesh. The lower connectivity of the mesh may lead to communication "hot spots" in the mesh or to slower aggregate communication operations such as barriers.
In the following section, the hardware specifications of the DELTA mesh are described. In section 3, the communication performance of the mesh is described and contrasted with hypercube communication performance. In section 4, the performance of the attached file system is measured. Section 5 calculates some communication metrics and contrasts the performance of two parallel applications on the mesh and on hypercubes.
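The distance comparison above can be made concrete with a short calculation. The sketch below is illustrative only: it assumes a two-dimensional mesh without wraparound links and shortest-path routing, and the function names are hypothetical rather than taken from the report.

```python
def hypercube_diameter(n: int) -> int:
    """Maximum hop count between two nodes of an n-node hypercube."""
    if n < 1 or n & (n - 1):
        raise ValueError("hypercube size must be a power of two")
    # In a hypercube of dimension d = log2(n), the farthest pair differs in all d bits.
    return n.bit_length() - 1


def mesh_diameter(rows: int, cols: int) -> int:
    """Maximum hop count in a rows x cols mesh (assumed to have no wraparound links)."""
    # The farthest pair sits at opposite corners of the mesh.
    return (rows - 1) + (cols - 1)


if __name__ == "__main__":
    for n, (rows, cols) in [(128, (8, 16)), (512, (16, 32))]:
        print(f"{n} nodes: hypercube {hypercube_diameter(n)} hops, "
              f"{rows} x {cols} mesh {mesh_diameter(rows, cols)} hops")
```

Under these assumptions a 16 x 32 mesh has a worst-case distance of 46 hops, against 9 hops for a 512-node hypercube, which is why the per-hop cost measured later in the report matters.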
DELTA configuration
The Intel Touchstone DELTA system is a mesh-connected parallel processor, consisting of 528 i860 compute nodes, 32 80386 I/O nodes, two 80386 network interface nodes, six service nodes, and two tape nodes [5]. Each compute node has 16 million bytes of memory and is connected to a Mesh Routing Chip (MRC) through a Mesh Interface Module (MIM). Each MRC channel is 8 bits wide and has a bandwidth of 65 million bytes/second (MB/s), but the FIFOs on the MIM have only a 26.7 MB/s data rate (Figure 2.1). (As of August 1991, the peak data rate of the communication system was limited to 22 MB/s [6].) The largest mesh available to an application is 16 x 32. The programming environment, attached file system (Concurrent File System, CFS), and network interface nodes are the same as those used on the Intel iPSC/860 [3]. At the time of the tests, the DELTA operating system was NX 3.3, X012 transmittal. The computational performance of a single i860 node is the same as the iPSC/860 node performance reported in [3]. The main differences between the DELTA and iPSC/860 are the number of nodes, 512 versus 128, and the architecture and speed of the communications. The iPSC/860 has bit-serial channels with a peak data rate of 2.8 MB/s connected in a hypercube network.
Thus the iPSC/860 has more but slower channels than the DELTA. 
Communication Performance
In this section, we analyze the communication performance of the DELTA mesh, first looking at adjacent node performance, then at communication to more distant nodes. A simple echo test is used, where a message is sent and echoed back by the receiver. The sender measures the round-trip time for 1000 iterations. Additional tests are performed with an artificial load on the communication channel (contention) and with multiple senders (concurrency).
Node-to-node communication
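The structure of such an echo test can be sketched as follows. This is not the benchmark code used for the measurements: it substitutes a local socket pair and a thread for two mesh nodes, and it assumes that the one-way time is taken as half the averaged round-trip time, which the text does not state. All function names are hypothetical.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock, looping over partial reads."""
    chunks = []
    remaining = nbytes
    while remaining:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_server(sock, nbytes, iterations):
    """Receiver side: return every message to the sender unchanged."""
    for _ in range(iterations):
        sock.sendall(recv_exact(sock, nbytes))


def echo_test(nbytes, iterations=1000):
    """Return (one-way time in seconds, data rate in bytes/second)."""
    sender, receiver = socket.socketpair()
    worker = threading.Thread(target=echo_server, args=(receiver, nbytes, iterations))
    worker.start()
    payload = bytes(nbytes)
    start = time.perf_counter()
    for _ in range(iterations):
        sender.sendall(payload)
        recv_exact(sender, nbytes)
    elapsed = time.perf_counter() - start
    worker.join()
    sender.close()
    receiver.close()
    # Assumption: one-way time is half of the average round-trip time.
    one_way = elapsed / (2 * iterations)
    return one_way, nbytes / one_way


if __name__ == "__main__":
    for size in (8, 1024, 8192, 100000):
        one_way, rate = echo_test(size, iterations=200)
        print(f"{size:7d} bytes: {one_way * 1e6:9.1f} us one-way, {rate / 1e6:8.2f} MB/s")
```

Averaging over many iterations, as the test in the text does, keeps timer resolution from dominating the result for small messages.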
Figure 3.1 shows the data rate for two adjacent nodes echoing messages of various message lengths. The data rate increases linearly with message sizes from 8 to 8192 bytes, with the DELTA reaching a peak of about 13.1 MB/s for a message size of 100,000 bytes. For large message sizes, the data rate is over a factor of four greater than the iPSC/860 data rate. Also shown in the figure are the data rates for two non-adjacent nodes in a 64-node ensemble, where the distance between the two nodes is 16 hops for an 8 x 8 mesh and 6 hops for a dimension-6 hypercube. The multi-hop data rates for the DELTA and Ncube are nearly the same as their respective one-hop rates. (The discontinuities in the Intel data rates will be explained later.) The extra time required for a multi-hop message is more clearly seen if we look at the latency for small messages (Figure 3.2). Though the bandwidth between nodes has increased on the DELTA in comparison to the iPSC/860, the latency has remained about the same. The latency is dominated by housekeeping chores (argument checking, context switch on interrupt, etc.) on the i860 on both the sending and receiving nodes. In a separate study ([2]), the time to handle the time-slice interrupt on the iPSC/860 was about 50 microseconds, which suggests that interrupt context switch overhead could be the dominant factor in message latency.
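One common way to reason about measurements like these is a linear cost model, in which the time to move a message is a fixed startup latency plus the message length divided by an asymptotic bandwidth. The sketch below is an illustration of that model, not the analysis used in the report. The startup and bandwidth values are placeholders chosen for the example, not measured DELTA or iPSC/860 figures.

```python
def transfer_time(nbytes, startup_s, bandwidth_bytes_per_s):
    """Linear model: fixed startup cost plus time proportional to message length."""
    return startup_s + nbytes / bandwidth_bytes_per_s


def effective_rate(nbytes, startup_s, bandwidth_bytes_per_s):
    """Observed data rate (bytes/second) for a message of nbytes under the model."""
    return nbytes / transfer_time(nbytes, startup_s, bandwidth_bytes_per_s)


def half_performance_length(startup_s, bandwidth_bytes_per_s):
    """Message length at which the observed rate is half the asymptotic bandwidth."""
    return startup_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    # Placeholder parameters for illustration only (not measured values).
    startup = 100e-6      # seconds of software overhead per message
    bandwidth = 10e6      # bytes/second asymptotic channel rate
    for size in (8, 1024, 8192, 100000):
        rate = effective_rate(size, startup, bandwidth)
        print(f"{size:7d} bytes: {rate / 1e6:6.2f} MB/s")
    print("half-performance length:", half_performance_length(startup, bandwidth), "bytes")
```

In this model the rate for short messages is governed almost entirely by the startup term, which is consistent with the observation that two systems with similar software latency behave alike for small messages even when their channel bandwidths differ widely.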
Contention
All of the communication data rates that we have reported have been measured on idle systems. In actual applications, other message traffic may compete for the communication channels, either from the application itself or from applications in other sub-cubes (sub-meshes). Other sub-cubes may need to use another sub-cube's communication channels to reach the host processor, I/O processor, or other service nodes. The iPSC/860, DELTA, and Ncube 6400 use circuit switching to manage the communication channels. When a message is to be sent, a header packet is sent to reserve the channels required. When this "circuit" is established, the message is transmitted, and an end-of-message indicator releases the channels.
A program was developed to measure the effect of contention on the data rate of a communication channel. For the hypercubes, the program has node 0 continuously send messages to node 7. The messages from node 0 to node 7 pass through node 1 and node 3. The amount of load (measured as a percentage of the total available bandwidth of a single channel) presented by node 0 is varied by selecting various message sizes. With a communication load from node 0 to node 7, node 1 then sends a stream of messages to node 3. Node 3 measures the data rate of messages arriving from node 1 under various loads and for various [...] effect from contention under the tested loads, whereas both hypercubes (Figure 3.5) exhibit the expected behavior: as the load from node 0 to node 7 increases, the data rate from node 1 to node 3 decreases. The effect of contention can vary from run to run and can slow down an application. Since the DELTA mesh has fewer channels between nodes than the hypercube architecture, one would expect increased contention for the mesh channels. The higher data rates of the DELTA mesh channels will reduce the effect of contention, but communication "hot spots" may still develop for some mesh applications.
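The reserve, transmit, and release sequence described above can be modeled in a few lines. The sketch below is a deliberately simplified illustration: it treats reservation as all-or-nothing, whereas real routing hardware would typically hold a partially built circuit while the header waits at a busy channel. The class and method names are hypothetical.

```python
class CircuitSwitchedChannels:
    """Toy model of circuit switching over directed node-to-node channels."""

    def __init__(self):
        self._owner = {}  # (from_node, to_node) -> id of the message holding the channel

    def open_circuit(self, msg_id, path):
        """Header step: reserve every channel along path, or fail if any is busy."""
        channels = list(zip(path, path[1:]))
        if any(ch in self._owner for ch in channels):
            return False  # simplification: the caller retries later
        for ch in channels:
            self._owner[ch] = msg_id
        return True

    def close_circuit(self, msg_id):
        """End-of-message step: release all channels held by msg_id."""
        for ch in [c for c, owner in self._owner.items() if owner == msg_id]:
            del self._owner[ch]


if __name__ == "__main__":
    net = CircuitSwitchedChannels()
    print(net.open_circuit("long", ["a", "b", "c", "d"]))   # circuit established
    print(net.open_circuit("short", ["b", "c"]))            # blocked: channel b->c is held
    net.close_circuit("long")
    print(net.open_circuit("short", ["b", "c"]))            # succeeds after release
```

The second message is delayed for as long as the first holds the shared channel, which is the mechanism behind the contention effects measured in this section.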
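The reason the two message streams interfere on the hypercube can be seen by writing out their routes. The sketch below assumes dimension-ordered (e-cube) routing that corrects address bits from the lowest bit upward. The report does not name the routing algorithm, but this assumption reproduces the stated path from node 0 to node 7 through nodes 1 and 3. The function names are hypothetical.

```python
def ecube_path(src, dst):
    """Nodes visited when address bits are corrected from lowest to highest."""
    path = [src]
    node = src
    diff = src ^ dst
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit   # cross the channel in this dimension
            path.append(node)
        diff >>= 1
        bit += 1
    return path


def shared_channels(path_a, path_b):
    """Directed channels used by both paths."""
    used_by_a = set(zip(path_a, path_a[1:]))
    return [ch for ch in zip(path_b, path_b[1:]) if ch in used_by_a]


if __name__ == "__main__":
    load_path = ecube_path(0, 7)
    test_path = ecube_path(1, 3)
    print("load path:", load_path)
    print("test path:", test_path)
    print("shared channels:", shared_channels(load_path, test_path))
```

Under this routing assumption the load stream and the measured stream both need the channel from node 1 to node 3, so the measured data rate falls as the load on that channel rises.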
Concurrent Communication
The message-passing performance of a node may be improved by utilizing more than one of its communication channels [...]
In summary, the DELTA mesh provides fewer but higher-bandwidth channels between adjacent nodes than the iPSC/860 hypercube. Both Intel communication systems have about the same latency for small messages, but the wider mesh channels provide nearly six times the bandwidth. The high bandwidth and fast routing of the DELTA mesh further help reduce contention. Although there are more hops in a mesh than in a hypercube with the same number of nodes, the multi-hop penalty on the DELTA mesh is much smaller than that of the iPSC/860. It is so small, in fact, that an application can treat the mesh nodes as if they were all "adjacent" nodes.
File System Performance
The DELTA system provides 32 I/O nodes supporting the Concurrent File System (CFS). Files are striped across the drives, under the control of the user. The I/O nodes are directly connected to the mesh, but otherwise the hardware and software are the same as the iPSC/860 CFS. The iPSC/860 configuration had only 10 I/O nodes. As the figure illustrates, the CFS performance for the DELTA when using 10 or fewer I/O nodes differs little from the iPSC/860 performance. Even though the DELTA communication is faster, the communication rate from the compute nodes to the I/O nodes is only a small percentage of the total I/O performance data rate (which includes disk latency, SCSI bus data rate, and file-system software overhead). Using all 32 I/O nodes, the aggregate data rate for 32 compute nodes each reading their own 16 megabyte files is 11 MB/s. For both the iPSC/860 and DELTA, throughput decreases for this simple read test when the number of compute nodes exceeds the number of I/O nodes, mainly because of thrashing within the I/O node buffers.
DELTA provides two 80386-based network nodes, each with 4 megabytes of memory and an Ethernet interface; this is the same hardware and software that is used on the iPSC/860 network nodes. The TCP data rate for one of these Ethernet nodes is roughly the same as for the iPSC/860 [3], ranging from 40 KB/s (8-byte messages) to 304 KB/s (4096-byte messages). As with CFS, the mesh data rate is only a small factor in the Ethernet data rates.
Summary
The Intel DELTA mesh provides improved communication performance over the Intel iPSC/860 hypercube. The DELTA mesh provides wider and faster communication channels between nodes, plus faster routing hardware, but the reduced connectivity of the mesh slows some communication primitives such as barriers. The message startup times are nearly identical, being dominated by software overhead and the speed of the i860. To compare the performance of the DELTA machine to the earlier machines in an application involving both communication and computation, we solved a 1000 x 1000 linear system of equations (C double precision) using Cholesky
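The aggregate figure quoted above can be turned into elapsed time and per-node throughput with simple arithmetic. The sketch below only restates the numbers in the text (32 readers, 16 megabyte files, 11 MB/s aggregate). The function names are hypothetical, and it assumes that all readers share the aggregate rate equally.

```python
def total_read_time_s(readers, file_mb, aggregate_mb_per_s):
    """Seconds needed for all readers to finish at the given aggregate rate."""
    return readers * file_mb / aggregate_mb_per_s


def per_reader_rate_mb_per_s(readers, aggregate_mb_per_s):
    """Average rate seen by one reader, assuming the aggregate is shared equally."""
    return aggregate_mb_per_s / readers


if __name__ == "__main__":
    readers, file_mb, aggregate = 32, 16, 11.0
    print(f"total data: {readers * file_mb} MB")
    print(f"elapsed:    {total_read_time_s(readers, file_mb, aggregate):.1f} s")
    print(f"per reader: {per_reader_rate_mb_per_s(readers, aggregate):.2f} MB/s")
```

At roughly a third of a megabyte per second per compute node, the file system path is far slower than the mesh channels, which is consistent with the observation that mesh speed contributes little to I/O performance.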
Acknowledgements
Special thanks are given to the members of the Concurrent Supercomputing Consortium for their cooperation and contribution of time and access to the Touchstone DELTA system.
References
[1] J. Dongarra
