The computational performance of multiprocessors continues to improve by leaps and bounds, fueled in part by rapid improvements in processor and interconnection technology. I/O performance thus becomes ever more critical, to avoid becoming the bottleneck of system performance. In this paper we provide an introduction to I/O architectural issues in multiprocessors, with a focus on disk subsystems. While we discuss examples from actual architectures and provide pointers to interesting research in the literature, we do not attempt to provide a comprehensive survey. We concentrate on a study of the architectural design issues, and the effects of different design alternatives.
INTRODUCTION
As high-performance computers continue their stunning increases in computational performance, fueled in part by rapid improvements in processor and interconnection technology, I/O becomes an increasingly important component of overall system performance. This fact is especially true for parallel computers, where the combination of numerous processors boosts computational performance, leaving I/O as the serial bottleneck that limits scalability [2]. Indeed, many scientific and commercial applications have tremendous I/O requirements [20], both for moving data in and out of the parallel computer, as [...] bottleneck. Note also that the memory and memory-bus bandwidths need to be 2 to 4 times the total disk or network bandwidth, because each byte of I/O data crosses the memory and memory bus more than once.
We also assume that the reader is familiar with the fundamentals of parallel-computer architecture (for an introduction see [1] or [55, chapter 10]). In this paper we use Flynn's taxonomy [29] to distinguish between SIMD (single instruction stream, multiple data stream) and MIMD (multiple instruction stream, multiple data stream) architectures.
Among MIMD machines, we distinguish between multiple-address-space systems and shared-address-space systems (sometimes called shared-memory systems). In a multiple-address-space system, each processor has its own private physical address space, and the memory is physically distributed. Processors communicate explicitly by passing messages over an interconnection network.
In a shared-address-space system, the hardware provides a shared physical address space. If the shared memory is physically centralized, we call it a Uniform Memory Access (UMA) architecture. If the shared memory is physically distributed, we call it a Non-Uniform Memory Access (NUMA) architecture. In either case, communication is implicit, with hardware translating accesses to remote addresses into messages on the interconnection network. Note that both architectures can support many different programming paradigms, including shared-memory and message-passing.
We often refer to processors, or processor-memory units, as "nodes," a name that comes from a vision of processors as nodes in the graph of an interconnection network.
EXAMPLE ARCHITECTURES
We use the following machines as examples during our discussion of several issues in the design of parallel I/O architecture. Although there are many interesting parallel machines, we chose each of these as an interesting representative of an architectural category. We introduce each briefly below, and cover more details in later sections.
Shared-address-space UMA: DEC AlphaServer 2100
UMA (shared-memory) multiprocessors usually connect several CPUs to a single memory with a single bus. Today, small shared-memory multiprocessors are common, sold by nearly every Unix workstation vendor (they are sometimes called SMPs, for Symmetric MultiProcessors). In the simplest case, a UMA multiprocessor looks like the uniprocessor in Figure 1, but with multiple CPUs attached to the CPU-memory bus.
The DEC AlphaServer 2100 [59], sketched in Figure 2, includes at least three buses in a hierarchy. This structure allows connection of I/O devices designed either for the fast, new standard PCI bus or the slower, old standard EISA and SCSI buses. Since the PCI bus can sustain 132 MB/s, and one SCSI bus can handle 10-20 MB/s, it is possible to connect several SCSI buses to the PCI bus.
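As a rough check on the arithmetic above, the following Python sketch computes how many SCSI buses one PCI bus could feed at full rate, using only the sustained rates quoted in the text. The function name is ours, and protocol overhead and bus contention are ignored, so the result is an upper bound rather than a configuration rule.

```python
def max_full_rate_buses(parent_mb_s: float, child_mb_s: float) -> int:
    """Number of child buses the parent bus can feed at full rate.

    Overhead and contention are ignored (a simplifying assumption).
    """
    return int(parent_mb_s // child_mb_s)


PCI_MB_S = 132  # sustained PCI rate quoted in the text

for scsi_mb_s in (10, 20):  # the SCSI range quoted in the text
    buses = max_full_rate_buses(PCI_MB_S, scsi_mb_s)
    print(f"SCSI at {scsi_mb_s} MB/s: up to {buses} buses per PCI bus")
```

Under these assumptions the bound is 13 buses at 10 MB/s and 6 buses at 20 MB/s.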
Shared-address-space NUMA: KSR 2
There are many different varieties of NUMA architecture, but perhaps the most recent common system is the KSR-2 [45]. Custom KSR microprocessors are interconnected by a hierarchy of rings, and specialized hardware manages nearly all of the memory in the machine as a shared cache, migrating sub-pages (cache lines) from processor to processor. A SCSI-bus adapter may be connected to any processor node. Other NUMA systems with interesting I/O architectures include the BBN Butterfly Plus [5], which had VME-bus adapters connected directly to the multistage omega interconnect; the NCR 3600 [50], with a tree interconnect and specialized I/O nodes at the leaves; and the Convex Exemplar [12], with a dedicated I/O processor for each cluster of computational processors.
Multiple-address-space, hypercube interconnect: nCUBE/ten
Some of the earliest large multiprocessors were based on a hypercube interconnect, and there have been many I/O studies specifically aimed at hypercube-interconnected multiprocessors [30, 32, 35, 58, 70]. Thus, we consider this class of machines separately from other multiple-address-space machines.
We sketch the I/O architecture of the nCUBE/ten and nCUBE/2 in Figure 
DISK I/O
In this section we discuss some of the architectural issues in parallel disk subsystems, and specific ways in which our example architectures deal with those issues. After a review of disk arrays, we focus on five fundamental issues in parallel-I/O architecture design: connection, management, placement, buffering, and availability.
Disk arrays and RAID
Although disk arrays are not the focus of this paper, they represent a fundamental form of parallel I/O. We thus review the topic of disk arrays and redundant disk arrays (RAID) for readers who may not be familiar with the topic. Chen et al. [13] and Gibson [33] provide more detailed surveys.
To improve the capacity and bandwidth of the disk subsystem, we may group several disks into a disk array, and distribute a file's data across all the disks in
[Figure: I/O node with node controller, buffer memory, network interface, and SCSI bus adapter.]
There is no universal agreement on the definition of these terms, but common usage seems to indicate that declustering means any distribution of a file's data across multiple disks, whereas striping is a declustering based on a round-robin assignment of data units to disks. Interleaving is less commonly used now, but some have used it to mean striping when the disks are rotationally synchronized.
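To make the round-robin definition of striping concrete, here is a minimal Python sketch that maps a logical striping unit to a disk and to a position on that disk. The function name and the choice to number units and disks from zero are our assumptions; real file systems add a striping-unit size and per-file starting disks on top of this.

```python
def stripe_location(unit: int, num_disks: int) -> tuple[int, int]:
    """Map a logical striping unit to (disk index, unit offset on that disk).

    Units are assigned to disks round-robin: unit 0 on disk 0, unit 1 on
    disk 1, and so on, wrapping back to disk 0 after the last disk.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return unit % num_disks, unit // num_disks


# The first six units of a file striped over four disks.
for unit in range(6):
    disk, offset = stripe_location(unit, 4)
    print(f"unit {unit} -> disk {disk}, offset {offset}")
```

Any other assignment of units to disks would still be a declustering in the sense above, but only this round-robin pattern is striping.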
Early work by Kim [42] and Salem [60] [...] typical values in the hundreds of thousands of hours. If file data are striped across N disks, then the failure of any one disk essentially causes the loss of the file. If the disks are assumed to fail independently and with an exponential failure rate, then an N-disk array will fail (lose data) N times as often as a single disk, i.e., $\mathrm{MTTF}_N = \mathrm{MTTF}_1 / N$. Some form of fault-tolerance is necessary to protect data against disk failure.
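The 1/N scaling follows because the minimum of N independent exponential lifetimes is itself exponential, with N times the failure rate. A minimal sketch of the resulting arithmetic follows; the 200,000-hour single-disk MTTF and the array sizes are illustrative assumptions, chosen only to be consistent with the range mentioned above.

```python
def array_mttf_hours(single_disk_mttf_hours: float, num_disks: int) -> float:
    """MTTF of a non-redundant striped array of independent disks."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return single_disk_mttf_hours / num_disks


HOURS_PER_YEAR = 24 * 365
SINGLE_DISK_MTTF = 200_000  # hours; an illustrative assumption

for n in (1, 10, 100):
    mttf = array_mttf_hours(SINGLE_DISK_MTTF, n)
    print(f"N={n:3d}: {mttf:9.0f} hours ({mttf / HOURS_PER_YEAR:.1f} years)")
```

With these numbers a 100-disk array is expected to lose data after about 2,000 hours, that is, in under three months.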
In 1988 Patterson, Gibson, and Katz presented "a case for redundant arrays of inexpensive disks (RAID)" [54], in which they argued that disk arrays could be faster, cheaper, smaller, and more reliable than traditional large disks, and categorized several techniques for using redundancy to boost the availability of disk arrays. We summarize the work here. Their RAID "levels" are [...]
RAID Level 2. Hamming code. Data are striped across N data disks. Compute a Hamming code [36] for each group of N bits, one taken from each data disk at corresponding positions, to produce a larger set of bits. Add several "check" disks, so that you can distribute the coded bits one per disk. Since a Hamming code is designed to detect and correct errors, the bit lost due to a disk failure can be recovered using the extra Hamming-code bits stored on the check disks. For N = 10 disks, 4 check disks are required; for N = 25 disks, 5 check disks are required. Thus, fewer disks are required than in RAID level 1. The Thinking Machines DataVault [64] was one successful RAID 2 product.
RAID Level 3. Single-bit parity. Since, when a disk fails, it is known to have failed, and the identity of the failed disk is known, a single parity bit for each N-bit data word is sufficient to reproduce the lost bit in that word. Thus, RAID level 3 uses only one "parity disk" for any group of size N.
RAID Level 4. Block-sized striping unit. RAID level 3 is effective for large reads and writes, each of which spans all of the disks. Some workloads, such as transaction processing, tend to make smaller read and write requests. RAID level 4 uses blocks instead of bits as the striping unit, although parity is computed in the same way: one parity bit is produced from N bits, one from each disk at corresponding positions. Thus, it is possible to concurrently read different blocks of data from each data drive, unlike in RAID 3.
RAID Level 5. Rotated parity blocks. Notice that in a workload of small reads and writes, RAID level 4 requires four one-block I/Os to write a single data block: read the old data and parity blocks, compute the new parity block, and write the new data and parity blocks. Although the data reads and writes are spread over N disks, the parity disk is used for every write request, and thus becomes a bottleneck. RAID level 5 solves this problem by distributing parity blocks across all disks; each stripe still contains N data blocks and one parity block, but their positions are different on each stripe.
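The redundancy schemes above can be illustrated with a short Python sketch. It assumes a single-error-correcting Hamming code, for which r check bits suffice when 2^r >= N + r + 1 (this reproduces the check-disk counts quoted for RAID level 2), and it works on bytes rather than single bits for the parity examples. All function names are ours, and the rotation rule in `parity_disk` is only one possible layout, since the text does not specify one.

```python
from functools import reduce


def hamming_check_disks(data_disks: int) -> int:
    """Smallest r with 2**r >= data_disks + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_disks + r + 1:
        r += 1
    return r


def parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks (RAID levels 3 to 5)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Recover the block of the one failed disk from the survivors and parity."""
    return parity(surviving + [parity_block])


def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity for a one-block write: old parity XOR old data XOR new data."""
    return parity([old_parity, old_data, new_data])


def parity_disk(stripe: int, total_disks: int) -> int:
    """Hypothetical RAID 5 rotation: parity moves one disk per stripe."""
    return stripe % total_disks


stripe = [b"\x0a\x01", b"\x0b\x02", b"\x0c\x03"]
p = parity(stripe)
print(hamming_check_disks(10), hamming_check_disks(25))
print(reconstruct([stripe[0], stripe[2]], p) == stripe[1])
```

The small-write rule is what makes the four-I/O sequence sufficient: the new parity depends only on the old data, the old parity, and the new data, so the other data disks in the stripe need not be read.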
The most common RAIDs in use are RAID level 0 (when reliability is not an issue), RAID level 1 (primarily in critical database applications), RAID level 3 (for high-bandwidth large-read and -write applications), and RAID level 5 (for applications with small I/O requests).
There are numerous RAID implementations from many vendors, some implemented in software (in the file system or device driver), and some implemented in hardware and firmware (in the disk controller). There are a few software-RAID systems that distribute data around a network [37, 48, 62]. These systems are intended to support traditional distributed-workstation workloads.
One group at Hewlett-Packard has extensively examined the question of parallel RAID management, beginning with DataMesh [68] and later TickerTAIP [11]. Although these systems were designed primarily for uniprocessors, they do have the potential to be connected to multiple independent processors. In their most recent work they show how to use a hierarchy of RAID level 1 and level 5 to construct an easy-to-use, cost-effective, high-performance disk array [69].
One extreme is to connect the I/O nodes, or even I/O-device adapters, directly to the primary interconnection network. Another extreme is to provide an entirely separate I/O network, to which each processor is connected. Or, a compromise is to connect each I/O node to a few points in the main network using an "extra" link; most communications between computational nodes and I/O nodes are routed through the main network as well as the link to the I/O node. This distinction is important, because I/O-related network traffic often has different characteristics from other interprocessor network traffic.
Connection
These issues are critical because an I/O system depends on an ability to move data. Too many systems have fast interconnection networks that are limited to slow performance by an inefficient network interface. Without DMA, for example, the CPU must use programmed I/O, requiring an interrupt to feed each packet into the network (the IBM SP-1 had this restriction, limiting the performance of its parallel file system [28]). Furthermore, while simple DMA makes a big difference, more sophisticated DMA functionality can be extremely useful. For example, if the DMA unit can gather discontiguous memory chunks into a message, or scatter a message into discontiguous memory chunks, extra memory-memory copies can be avoided. Several parallel file systems have found it advantageous to support discontiguous file accesses [19, 28, 53], for which data-reorganizing DMA support would be helpful.
Since many parallel file systems are implemented as a user-level library on the compute nodes, and a kernel-level server on the I/O nodes, performance improves if messages can be sent and received through the network interface from user level, without kernel intervention, because there is less overhead on the compute nodes. Several research projects demonstrate the benefits of user-level network interfaces [8, 67].
Shared-address-space systems, by definition, have specialized hardware support for load and store, to remote memories if necessary, from user level. I/O activity would make good use of a block-transfer mechanism, which can be viewed as a form of DMA to or from remote memory. The BBN Butterfly had this feature [5].
CM-5: device controllers are attached to specialized I/O nodes, which are attached to the interconnection network. I/O nodes have special DMA controllers that can scatter data from the buffer RAM, through the network interface, to multiple compute nodes, in a wide variety of patterns. Alternatively, they can gather data from multiple remote nodes into the buffer.
This ability to reorganize data is an important component of their ability to provide a traditional linear-file model, striped across disks in 16-byte striping units, and yet be able to map the data in the file to different application "geometries" of processors and virtual processors. The compute-node network interface is accessible at user level.
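The gather and scatter operations discussed in this section can be sketched in software as follows. This is only an illustration of the data movement that a scatter/gather DMA unit would perform in hardware without extra memory-to-memory copies; the function names and the (offset, length) chunk descriptors are our assumptions, not the interface of any machine described here.

```python
Chunk = tuple[int, int]  # (byte offset, length) descriptor; a hypothetical format


def gather(memory: bytes, chunks: list[Chunk]) -> bytes:
    """Collect discontiguous chunks of memory into one contiguous message."""
    return b"".join(memory[off:off + length] for off, length in chunks)


def scatter(message: bytes, memory: bytearray, chunks: list[Chunk]) -> None:
    """Distribute a contiguous message into discontiguous chunks of memory."""
    position = 0
    for off, length in chunks:
        memory[off:off + length] = message[position:position + length]
        position += length


source = b"abcdefghij"
layout = [(0, 2), (5, 3)]          # bytes 0-1 and bytes 5-7
message = gather(source, layout)   # the two chunks, packed back to back

destination = bytearray(len(source))
scatter(message, destination, layout)
print(message, bytes(destination))
```

A user-level library that supports discontiguous file accesses would otherwise have to perform this packing and unpacking with the CPU before and after each transfer.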
Management
Input/Output refers to the process of moving data into memory from a peripheral device, or out from memory to a peripheral device (such as disk, tape, or network). In a multiprocessor, there may be many memories (typically one for each processor) and many peripheral devices. A key issue, then, is management: which processors manage access to the devices? There are three common solutions, shown in Figure 8, where the management is (A) centralized on one processor, (B) distributed among all processors, or (C) distributed among a subset of processors that are dedicated to I/O.
Typically, as shown in Figure 8, the devices are attached to their managing processor.
The centralized approach is common in SIMD systems, where most management is centralized anyway and the programming model is synchronous. In large MIMD systems, however, it represents a serious potential bottleneck, especially when used with an asynchronous programming model.
The number of I/O nodes and devices may be chosen independent of the number of computational nodes, allowing more flexible system configuration. I/O nodes may be constructed differently, e.g., with a different CPU, more or less memory, specialized DMA hardware, and of course adapters for peripherals and I/O buses. Fewer adapters may be needed. System packaging may be simpler, since compute nodes may have different physical characteristics than I/O nodes. Each may fit into different types of racks, for example. I/O-service activity does not impact application computation by stealing cycles or memory, or causing unexpected interrupts.
Figure 8. Three common solutions for management of parallel I/O: A) centralized, B) fully distributed, and C) distributed over a dedicated subset.
On the other hand, distributing I/O management among all processors could lead to better locality, if each processor could focus its I/O activity on its local I/O devices. It is difficult to characterize the performance tradeoffs of this locality [43], especially given the wide variety of workloads and interconnection-network architectures, but it seems likely that local disks would be useful for paging and other forms of virtual-memory support for out-of-core computations [15, 17].
Management in example architectures
DEC 2100: Theoretically, it is possible for any processor to manage the devices, although some operating systems may choose to centralize the management on one processor. In a "symmetric" (SMP) operating system, management of all disks is distributed across all processors.
KSR 2: Management and devices are distributed among a subset of processors, though they are not typically dedicated to I/O. Once disk data are read into memory, and that memory is mapped into the application's virtual address space, the shared-memory system handles the movement of data to the appropriate processors.
Network-attached storage devices
There is an increasing trend to separate device management into high-level and low-level components and to attach the device controller directly to an interconnection network, rather than to a specialized I/O bus. Then a host CPU in one location provides high-level management, while the low-level details are handled by the device controller. This trend is partially a result of the ever-increasing sophistication of device controllers, and of the potential for better performance by moving data directly from the device to the network, bypassing an I/O bus, I/O adapter, and any I/O node's memory. The CM-5 is one specialized example. Other important examples include the RAID-II [24] and HPSS [16, 18] projects. The trend toward network-attached storage devices (NASD) is still new and may have a significant effect on parallel and distributed I/O architecture.
Placement
All multiprocessors have an interconnection network, and all networks have some topology. Many topologies are more complex than a bus or a ring, such as a hypercube or a mesh. Communication latency, bandwidth, and contention in these networks often depend on the relative position of the endpoints of the communication. Thus, the position of the I/O nodes or devices in the network topology can have a significant impact on the performance of the I/O system. There are three typical approaches: Position is largely irrelevant in some networks, such as buses and many rings.
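A small sketch can show why position matters in a topology such as a mesh. It assumes a two-dimensional mesh without wraparound links and minimal routing, so that the hop count between two nodes is their Manhattan distance; contention and bandwidth are ignored. The function names and the 5-by-5 mesh are our illustrative choices.

```python
Node = tuple[int, int]  # (x, y) coordinates in the mesh


def mesh_hops(a: Node, b: Node) -> int:
    """Minimal hop count between two nodes of a 2-D mesh without wraparound."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mean_hops_to_io_node(io_node: Node, width: int, height: int) -> float:
    """Average hop count from every mesh position to a single I/O node."""
    total = sum(
        mesh_hops((x, y), io_node)
        for x in range(width)
        for y in range(height)
    )
    return total / (width * height)


corner = mean_hops_to_io_node((0, 0), 5, 5)
center = mean_hops_to_io_node((2, 2), 5, 5)
print(f"corner: {corner:.1f} hops, center: {center:.1f} hops")
```

In this example the corner placement averages 4.0 hops and the center placement 2.4 hops, which suggests how strongly latency can depend on where an I/O node sits.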
Buffering
Buffering and caching are important aspects of any I/O system. Buffering is important, for example, between a disk drive and an interconnection network, to compensate for the different speeds, different granularity (blocks or packets), and burstiness due to device characteristics (disk seeks) or load (network congestion). A buffer cache, which is an associatively addressed buffer pool holding recently used blocks, is important because it can often avoid I/O entirely. A buffer cache can be particularly important in the I/O node of a multiprocessor, because it can take advantage of interprocessor locality, when multiple processors are accessing different parts of the same block [44].
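A buffer cache of the kind described above can be sketched in a few lines of Python. The least-recently-used replacement policy and the `read_block` callback (a stand-in for a real disk read) are our assumptions; the text does not prescribe a policy.

```python
from collections import OrderedDict
from typing import Callable


class BufferCache:
    """Block cache addressed by block number, with LRU replacement."""

    def __init__(self, capacity: int, read_block: Callable[[int], bytes]):
        self.capacity = capacity
        self.read_block = read_block      # hypothetical device-read callback
        self.blocks: OrderedDict[int, bytes] = OrderedDict()
        self.misses = 0                   # number of reads that went to the device

    def get(self, block_id: int) -> bytes:
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.read_block(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict least recently used
        return data


# Two "processors" read different parts of block 7: one device read serves both.
cache = BufferCache(capacity=2, read_block=lambda b: bytes([b]) * 8)
first_half = cache.get(7)[:4]
second_half = cache.get(7)[4:]
print(cache.misses)
```

The second request for block 7 is served from the cache, which is the interprocessor locality that makes a cache at the I/O node attractive.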
All I/O systems have buffering in several places. We expect to see small speed-matching buffers in the interconnection network, network interfaces, and device adapters. We expect to see buffers and caches inside the disk or tape controllers, and memory caches in CPUs and processor boards. And, of course, operating systems often use some RAM for a file-system buffer cache. Of interest here are systems that have explicit buffer or cache hardware set aside for I/O, beyond the usual hardware described above.
In this machine, the processor nodes are arranged in a binary-tree interconnect, with I/O nodes and disk drives at the leaves of the tree, specialized data-merging processors in the internal nodes, and one control processor at the root. This structure is thus designed for the selection, merging, and sorting operations common in database queries. It appears to be specialized for intra-query parallelism rather than inter-query parallelism. DeWitt and Gray discuss parallel database machines in more detail [21].
TAPE I/O
Most modern multiprocessors support tape devices, because many multiprocessors are used for data-intensive scientific or commercial applications, and tapes are a cost-effective form of tertiary storage. Most connect standard tape drives through a SCSI or VME bus, just like any disk drive. The CM-5 actually has a specialized tape node, which is quite similar to the disk node in Figure 5. A more interesting approach is tape striping, in which data from a single file are striped across several tapes in several tape drives, for increased bandwidth [14, 23]. It appears to be difficult to obtain high performance from tape striping unless the workload is primarily large, sequential transfers [23].
NETWORK I/O
Multiprocessors have always supported external networks; the early generation (BBN Butterfly I, Intel iPSC/1, Cosmic Cube, etc.) typically had an Ethernet connection but no local disk drives. Most modern multiprocessors connect to external networks by attaching a network interface to one of the processor nodes. With a fast external network, such as a HIPPI network, it is important to consider how to smooth the flow of data from compute nodes to the I/O node and thence to the external network, or vice versa, especially when the data must be gathered from (or scattered to) many compute nodes [9, 34]. On the other side of the network interface, an industry-government consortium has defined a protocol for parallel data transfers across multiple network connections between distributed supercomputers and network-attached peripherals [7].
The CM-5 has specialized HIPPI-network nodes [66]; they are similar to the disk node in Figure 5 except that they have eight interfaces to the CM-5 data network. These eight 20 MB/s connections provide 160 MB/s in aggregate, more than enough to service the 100 MB/s HIPPI bandwidth.
The nCUBE/2 also supports HIPPI by using multiple internal-network connections to feed one HIPPI network [19]. As with the disk and graphics boards, the HIPPI-network board has 16 I/O nodes and 128 connections to compute nodes. The I/O-node memory is dual-ported video RAM, and shared with the HIPPI DMA hardware. Thus, compute nodes send data to the I/O nodes, which write it into buffers in the RAM. The HIPPI interface reads data out of those buffers and writes it onto the network.
The Maspar MP-2 attaches a HIPPI controller to its I/O bus, much like the disk array in Figure 6 [51]. Again the I/O RAM serves as a buffer between the HIPPI network and the internal global router.
SUMMARY
We describe the fundamentals of I/O architecture for multiprocessors, including a review of uniprocessor I/O architecture and disk arrays. Our discussion focuses on disk subsystems, and in particular the following design issues: connection, management, placement, buffering, and availability. We use several machines as recurring examples, including the DEC AlphaServer 2100, KSR 2, nCUBE/ten, CM-5, and Maspar MP-2. We also briefly cover database systems, tapes, external networks, and graphics.
