We propose a simple extension to the optical network of a scalable multiprocessor that optimizes page swap outs. More specifically, we propose to extend the network with an optical ring that not only transfers swapped-out pages between the local memories and the disks of the multiprocessor but also acts as a systemwide write cache for these pages. This extended optical network confers several performance benefits: It provides a staging area where swapped-out pages can reside until the disk is free, it increases the possibility of combining several writes to disk, and it acts as a victim cache for pages that are swapped out and subsequently accessed by the same or a different processor. To evaluate the extent to which these benefits affect performance, we use detailed execution-driven simulations of several out-of-core parallel applications that run on an eight-node scalable multiprocessor. Our results demonstrate that our optical ring provides consistent performance improvements that derive mostly from faster page swap outs and victim caching. To show that our optical ring can also be applied successfully to traditional multiprocessors in which processors are interconnected with electronic networks, we evaluate its benefits for a mesh-connected multiprocessor. This latter evaluation shows that our optical ring improves performance for a traditional multiprocessor by roughly the same amount as it does for an optically interconnected multiprocessor. On the basis of these results and our parameter-space study our main conclusion is that our optical ring is highly efficient under several architectural assumptions and for most out-of-core parallel applications. Even though our study focuses on optimizing page swap outs, we believe that caching data with an optical ring can be beneficial for other types of disk-write traffic as well.
Introduction
Applications frequently access far more data than can fit in the main memory. Reducing disk-access overhead is the most serious performance concern for these out-of-core applications. For this reason programmers of these applications typically code them with explicit input-output (I/O) calls to the operating system. However, writing applications with explicit I/O has several disadvantages 1: Programming often becomes a very difficult task, 2 I/O system calls involve data-copying overheads from user- to system-level buffers and vice versa, and the resulting code is not always portable (performancewise) between machine configurations with different memory resources, I/O resources, or both, such as different amounts of memory or I/O latency. In contrast to the explicit I/O style of programming, we advocate that out-of-core applications be based solely on the virtual-memory mechanism and that disk-access overheads be alleviated by the underlying system. Essentially, our preference for virtual-memory-based I/O is analogous to favoring shared memory instead of message passing as a more appropriate parallel-programming model.
Basically, virtual-memory-based I/O involves reading pages to memory and writing (or swapping) pages out to the disk. Page reads can usually be dealt with efficiently by the dynamic prefetching of data to the main memory (or to the disk-controller cache) ahead of its use (see, e.g., Refs. 3 and 4). In cases in which dynamic prefetching is not effective by itself, prefetching can be improved with compiler involvement 1 or with future-access hints. 5 Page swap outs are more difficult to optimize, however, even though these writes happen away from the critical path of the computation. The performance problem with page swap outs is that they are often quite bursty, and, as a result, the operating system must at all times reserve a relatively large number of free-page frames to avoid having to stall the processor to wait for swap-out operations to complete. In fact, the more effective the page-prefetching technique, the greater the number of free-page frames the operating system must reserve. This situation is especially problematic for scalable multiprocessors for which not all nodes are I/O enabled (see, e.g., Refs. 6 and 7) because both disk-latency and bandwidth limitations delay page swap outs.
In this paper, we propose a simple extension to the optical interconnection network of a scalable multiprocessor that optimizes page swap outs. Our proposal is based on the observation that the extremely high bandwidth of optical media provides data storage on the network itself, and thus these networks can be transformed into fast-access (temporary) data-storage devices. More specifically, we propose to extend the network with an optical ring that not only transfers swapped-out pages between the local memories and the disks of the multiprocessor but also acts as a systemwide write cache for these pages. When there is room in the disk-controller cache, pages are copied from the optical ring to the cache such that the pages swapped out by a node are copied together.
The extended network requires a few changes to the original network interface and two simple modifications to the operating system's virtual-memory-management code. In return, the extended network has several performance benefits: It provides a staging area where swapped-out pages can reside until the disk is free, it increases the possibility of combining several writes to disk, and it acts as a victim cache for pages that are swapped out and subsequently accessed by the same or a different processor.
To assess how these benefits affect overall performance, we evaluated our optical ring as an extension of OPTNET (optimized optical network), a cost-effective optical network that we proposed previously. 8, 9 The extension basically adds an optical ring with several wavelength-division multiplexing (WDM) cache channels to the OPTNET network. These cache channels are used to store the swapped-out pages. We refer to the combination of OPTNET and our ring as OWCache (optical write cache).
We use detailed execution-driven simulations of several out-of-core parallel applications that run on an eight-node scalable cache-coherent multiprocessor with four I/O-enabled nodes. We consider the two extremes in terms of the page-prefetching techniques: optimal prefetching and naive prefetching. Under the optimal-prefetching strategy our results demonstrate that the OWCache improves swap-out times by 1 to 2 orders of magnitude with respect to OPTNET results. The OWCache benefits are not as significant under the naive-prefetching strategy but are still considerable. Overall, the OWCache provides execution-time improvements of as much as 64% under optimal prefetching and as much as 39% under naive prefetching, again, in comparison with OPTNET performance results. The OWCache performance benefits come mostly from faster page swap outs (optimal prefetching) and victim caching (naive prefetching). Our results also show that an OPTNET-based multiprocessor often requires a huge amount of disk-controller-cache capacity to approach the performance of our system.
To show that our optical ring can also be applied successfully to traditional multiprocessors in which processors are interconnected with electronic networks, we assess its benefits for a mesh-connected multiprocessor. In this case the optical ring is designed as a simple extension of the I/O subsystem of the multiprocessor. In contrast to the OWCache, this extension, which is also an optical write cache and is called NWCache, does not require modifications to the multiprocessor hardware, is more flexible, and is modular because the NWCache interface can be plugged into the I/O bus of each node of any multiprocessor. Our evaluation of the NWCache shows that our optical ring improves performance for a traditional multiprocessor by roughly the same amount as it does for an optically interconnected multiprocessor. However, the NWCache improves performance not just as a result of faster page swap outs and victim caching but also by the reduction of data traffic across the multiprocessor's interconnection network and memory buses.
On the basis of these results and our parameter-space study our main conclusion is that our optical ring is highly efficient under several architectural assumptions and for most out-of-core parallel applications. Note, however, that, even though our study focuses on optimizing page swap outs, caching data with an optical ring should be beneficial for other types of disk-write traffic as well.
The remainder of this paper is organized as follows: Section 2 presents some background material on virtual memory, WDM, optical networks, and optical delay-line memories. Section 3 describes the architecture of the OWCache in detail. Section 4 presents our experimental methodology and application workload. Section 5 presents the results of our base experiments and our parameter-space study with the OWCache. The discussion of the NWCache and some of its most important results are presented in Section 6. Section 7 is a discussion of related research. Finally, Section 8 summarizes our findings and concludes the paper.
Background
In this section, we discuss the background behind our study. Four main topics bear a direct relation to the ideas proposed in this paper: virtual memory, WDM, OPTNET, and delay-line memories. In the following, we discuss each of these topics in turn.
A. Virtual Memory
Virtual memory allows a program or a process that is larger than the physical memory of the computer to be executed. The idea is that only the code and the data that are currently being used by the processor need to reside in physical memory. Thus the code and the data must be transferred back and forth between disk and memory during run time.
The actual implementation of the virtual-memory concept involves a combination of software and hardware. In simple terms virtual-memory implementations work as follows: Compilers generate code that accesses a virtual-address space. At run time virtual-memory addresses are translated by means of hardware to physical addresses, usually according to a demand-paging approach. In this approach a virtual address is broken down into a virtual page number and a displacement within the page. The virtual page number is then used to index a page table (maintained by the operating system) that contains the number of the physical page where the virtual page is currently stored. If the page table does not contain a valid translation for the virtual page (thus causing a page fault) the corresponding page must be brought by the operating system to the physical memory from disk. This operation may, in turn, force a page (the least recently used page) to be written back (swapped out) to the disk by the operating system to create space in the physical memory. The combination of the physical page number and the displacement within the page is the actual physical address to be accessed. The translation from a virtual address to a physical address can thus involve one or more memory accesses even when the page referred to is already in physical memory. To eliminate this overhead, it is usual for processors to have on-chip translation-lookaside buffers (TLB's) to cache the most recent translations performed. TLB's are extremely effective at improving performance in both uniprocessors and multiprocessors but must be managed carefully in multiprocessors. The reason for this need for care is that TLB's are not kept coherent in hardware, so the operating system must make sure that a modification made to a machinewide page table by one of the processors is seen by other processors. To accomplish this, the operating system interrupts all processors and asks them to invalidate their translations for the corresponding page. This operation is called a TLB shootdown.
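The translation step described above can be illustrated with a minimal sketch. It assumes a flat, single-level page table held in a dictionary and a 4096-byte page; the names `translate`, `PageFault`, and `PAGE_SIZE` are hypothetical and are not taken from the system studied here.

```python
PAGE_SIZE = 4096  # bytes per page; an assumed value, not taken from the paper


class PageFault(Exception):
    """Raised when the page table holds no valid translation."""


def translate(virtual_address, page_table, page_size=PAGE_SIZE):
    """Map a virtual address to a physical address via a flat page table.

    page_table is a dict: virtual page number -> physical page number.
    """
    # Split the address into a virtual page number and a displacement.
    virtual_page, displacement = divmod(virtual_address, page_size)
    if virtual_page not in page_table:
        # The operating system would now bring the page in from disk.
        raise PageFault(virtual_page)
    # Physical page number plus displacement gives the physical address.
    return page_table[virtual_page] * page_size + displacement
```

A real processor would consult its TLB before walking the page table; that shortcut is omitted here for brevity.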
B. Wavelength-Division Multiplexing Networks
Through careful fabrication of optical fibers, transmitters, and receivers it is now possible to build dispersion-free optical communication systems with low attenuation and high bandwidth. The maximum bandwidth achievable over an optical fiber is of the order of terabits per second. 10 However, because the hardware associated with the end points of an optical communication system is usually of an electronic nature, transmission rates are currently limited to the gigabit-per-second level. To approach the full potential of optical communication systems requires that multiplexing techniques be utilized.
WDM is one such multiplexing technique. With WDM several independent communication channels can be implemented in the same fiber. Optical networks that use this multiplexing technique are called WDM networks. The simplest way of implementing a WDM network is through a passive star coupler and a set of receivers and transmitters. 11 The star coupler broadcasts every WDM channel to the processing nodes connected to the network. Nodes usually do not listen to all channels, however, as the number of optical devices ultimately determines the cost of the network.
Because of the rapid development of the technology used in its implementation WDM has become one of the most popular multiplexing techniques. WDM multiplexers and demultiplexers with more than 100 channels can now be found commercially.
C. OPTNET
OPTNET 8, 9 is an optical interconnection network and an associated coherence protocol that exploits several beneficial characteristics of optics in the design of scalable multiprocessors. The network uses WDM to provide independent high-bandwidth communication channels. These WDM channels are divided into broadcast and point-to-point groups. Three channels (a request channel and two coherence channels) are assigned to the first group, whereas the other p channels (p is the number of nodes in the multiprocessor), called home channels, are assigned to the second group.
The request channel is used for requesting memory blocks. The response to such a request is sent by the block's home node (the node responsible for providing up-to-date copies of the block) on its corresponding home channel. The coherence channels are used for broadcasting coherence and synchronization transactions. Broadcasting memory-request and coherence transactions simplifies the hardware by obviating the need for directories. In addition, broadcasting coherence transactions optimizes the coherence protocol by informing the processors of changes to the shared data more efficiently. Thus the cache-coherence protocol of the multiprocessor is tailored to the network and based on update coherence. Finally, the grouping of channels improves performance by decoupling the memory-write traffic from the more time-critical memory-block read operations.
Overall, each OPTNET interface requires three fixed transmitters (one for the request channel, one for the home channel, and the last for one of the coherence channels), three fixed receivers (for the broadcast channels), and one tunable receiver (for the home channels). Hence the hardware cost of the OPTNET is 7 × p optical components. Figure 1 shows an overview of the OPTNET architecture.
Finally, it is important to note that the OPTNET, the OWCache, and the NWCache architectures focus on WDM technology because of its immediate availability, but there is nothing in our proposals that is strictly dependent on this specific type of multiplexing.
D. Delay-Line Memories
Given that light travels at a constant and finite propagation speed in the optical fiber (approximately 2.1 × 10^8 m/s), a fixed amount of time elapses between when a datum enters and leaves an optical fiber. In effect, the fiber acts as a delay line. Connecting the ends of the fiber to each other (and regenerating the signal periodically) allows the fiber to become a delay-line memory 12 because the data sent to the fiber will remain there until it is overwritten. An optical delay-line memory exhibits characteristics that are hard to achieve by other types of delay lines. For instance, because of the high bandwidth of optical fibers, it is possible to store a reasonable amount of data in a relatively short length of fiber (e.g., at 10 Gbits/s, approximately 5 kbits can be stored on one 100-m-long WDM channel).
Optical Write Cache
In this section, we review the basic multiprocessor architecture and operating system (virtual-memory management) that we consider. We then move on to describe the architecture and the usage of the optical ring that implements the OWCache network.
A. Multiprocessor Architecture and Virtual-Memory Management
As our basic multiprocessor architecture, we consider a scalable cache-coherent multiprocessor in which processors are connected by means of a slightly modified OPTNET network. The slight modification improves the OPTNET performance in the presence of page swap-out traffic. The modification amounts to adding a fourth broadcast channel, called the swap channel, that takes care of all the swap-out traffic. Just like the coherence channels in OPTNET the swap channel uses time-division multiplexing access with variable time slots for medium-access control. The extra channel requires the addition of one more fixed receiver and one more fixed transmitter per node, leading to a final hardware complexity of 9 × p optical components. As can be seen from Fig. 2, each node in the system includes one processor, a TLB, a coalescing write buffer, first-level and second-level caches, local memory, and a network interface.
Each I/O-enabled node also includes one disk and its controller, which is connected to the I/O bus. The only part of the multiprocessor operating system that we must consider is its virtual-memory-management code. Again, we assume a standard strategy here. More specifically, our base system implements a single machinewide page table, each entry of which is accessed by the different processors with mutual exclusion. Every time the access rights for a page are downgraded a TLB-shootdown operation takes place.
The operating system maintains a minimum set of free-page frames per node of the multiprocessor. On a page fault the operating system sends a request for the page to the corresponding disk across the OPTNET request channel. For simplicity of presentation, hereafter we do not differentiate a virtual-memory page from a disk block. (We assume that pages are stored in groups of 32 consecutive pages. The parallel file system assigns each of these groups to a different disk in round-robin fashion.) For each request the disk controller reads the page from its cache (cache hit) or disk (cache miss) and sends it to the requester through the corresponding node's home channel. That is, a page-request operation is similar to a memory-block request in the base OPTNET network.
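The page-to-disk assignment mentioned in the parenthetical remark can be sketched as follows. The group size of 32 comes from the text; the exact round-robin rule (group index modulo the number of disks) and the function name `disk_for_page` are assumptions made for illustration.

```python
PAGES_PER_GROUP = 32  # group size stated in the text


def disk_for_page(page_number, num_disks, pages_per_group=PAGES_PER_GROUP):
    """Return the index of the disk that holds a given page."""
    group = page_number // pages_per_group  # consecutive pages share a group
    return group % num_disks                # groups rotate over the disks
```

Under this assumed rule and four disks, pages 0 to 31 map to disk 0, pages 32 to 63 map to disk 1, and page 128 wraps around to disk 0 again.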
When the page that was faulted on arrives from its disk the global page table is updated, allowing all other processors to access (and cache) the page data remotely. If the arrival of this page reduces the number of free-page frames on the node to less than the minimum, the operating system uses the least recently used criterion to pick a page to be replaced. If the page is dirty a page swap-out operation is started. Otherwise, the page frame is simply marked free.
A page that is swapped out of memory is sent to the corresponding disk-controller cache through the swap channel. The disk controller responds to this message with an acknowledgment (ACK) if it was able to place the page in its cache. Writes are given preference over prefetches in the cache. The ACK allows the swapping node to reuse the space occupied by the page in memory. The disk controller sends a no-acknowledgement (NACK) response if there is no space left in its cache (i.e., the disk controller's cache is full of swap outs). The controller records the NACK in a first-in-first-out (FIFO) queue. When room becomes available in the controller's cache the controller sends an OK message to the requesting node, which prompts it to resend the page. ACK, NACK, and OK messages are sent through the request channel.
Fig. 1. Overview of the OPTNET architecture: Tx represents fixed transmitters that are associated with the request channel (R), with a coherence channel (C), and with home channel i. Rx represents fixed receivers that are associated with the request channel (R), with a coherence channel (C1), and with a second coherence channel (C2). TRx represents a tunable receiver. MUX, multiplexer; DEMUX, demultiplexer; NI, network interface.
Fig. 2. Overview of the node architecture: P, processor; WB, coalescing write buffer; L1 and L2, first- and second-level caches; LM, local memory; NI, network interface.
Page prefetching is not within the scope of this study. Thus we consider the two extreme prefetching situations: optimal prefetching and naive prefetching. Optimal prefetching attempts to approximate the performance achieved by highly sophisticated compilers 1 or application hints 5 that can prefetch data from disks to disk-controller caches or to memory. Our idealized technique then assumes that all page requests can be satisfied directly from the disk-controller cache. In effect, this technique hides the cost of accessing the magnetic media (the disks themselves). Thus under optimal prefetching faulting on a page involves only the cost of transferring the page from a (possibly remote) disk-controller cache to the main memory of the faulting processor.
Under the naive-prefetching situation the only prefetching that takes place occurs on a disk-controller cache miss when the controller fills its cache with pages that sequentially follow the missing page. We believe this technique is very naive for three reasons: (a) files in our system are striped across several disks, (b) stripes are sometimes accessed concurrently by several nodes, and (c) several of our applications do not fetch pages sequentially. Faulting on a naively prefetched page involves only the cost of transferring the page from a (possibly remote) disk-controller cache to the main memory of the faulting processor. However, the cost of accessing the disk is still observed by the application when the page that was faulted on has not previously been prefetched to the disk-controller cache.
B. OWCache Equals OPTNET Plus the Optical Ring
Overview
Figure 3 shows an overview of the OWCache system: a basic OPTNET network plus an optical ring. (Note, however, that the ring obviates the need for the swap channel of OPTNET.) The ring is used only to transfer pages that were recently swapped out of memory by the different multiprocessor nodes to the disks, while storing these pages on the network itself. All other traffic flows through the regular OPTNET channels, as described in Subsection 3.A.
The swapped-out pages are continually sent around the cache channels in one direction. In essence the ring acts as a write cache, where a page that is swapped out of the main memory resides until enough room for it exists in its disk-controller cache. If the page is requested again while it is still stored in the OWCache it can be remapped to the main memory and taken off the ring.
The storage capacity of the ring is completely independent of the individual or the combined local memory sizes. The storage capacity of the ring is simply proportional to the number of available channels and the channels' bandwidth and length. More specifically, the storage capacity of the ring is given by

$$\text{capacity in bits} = \frac{\text{number of channels} \times \text{fiber length} \times \text{transmission rate}}{\text{speed of light in the fiber}},$$

where the speed of light in the fiber equals $2 \times 10^{8}$ m/s.
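The capacity relation can be checked numerically with a short sketch. The function name `ring_capacity_bits` is hypothetical; the speed of light in the fiber is the 2 × 10^8 m/s value quoted with the formula.

```python
LIGHT_SPEED_IN_FIBER = 2e8  # m/s, the value used in the capacity formula


def ring_capacity_bits(num_channels, fiber_length_m, rate_bits_per_s,
                       light_speed=LIGHT_SPEED_IN_FIBER):
    """Bits in flight on the ring: channels x (length / speed) x rate."""
    transit_time_s = fiber_length_m / light_speed
    return num_channels * transit_time_s * rate_bits_per_s
```

For example, `ring_capacity_bits(1, 100, 10e9)` evaluates to about 5000 bits, which matches the figure of roughly 5 kbits on one 100-m-long channel at 10 Gbits/s given in the discussion of delay-line memories.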
Ring Management
Each cache channel transfers and stores the pages swapped out by a particular node. A page can be swapped out to the OWCache if there is room available on the node's cache channel. An actual swap out to the OWCache allows the swapping node to reuse the space occupied by the page in memory right away. At the time of swapping the node must set a bit (the ring bit) that is associated with the page-table entry for the page, indicating that the page is stored in the OWCache. The swapping node must also send a message to the OWCache interface of the corresponding I/O-enabled node. This message is sent through the OPTNET request channel and includes only the number of the page that was swapped out. The remote OWCache interface then saves the number of the swapping node (implicitly obtained from the message) and the number of the page in a FIFO queue that is associated with the corresponding cache channel. Every time the disk controller attached to the node has room for an extra page in its cache the network interface starts snooping the most heavily loaded cache channel and copies as many pages as possible from it to the disk-controller cache. After the page is sent to the disk-controller cache an ACK is sent to the node that originally swapped the page out. The ACK is understood by the node to mean that it can now reuse the space that was occupied by the page on the ring and prompts it to reset the ring bit that is associated with the page.
Two important characteristics of how pages are copied from the OWCache to the disk-controller cache increase the spatial locality of the different writes in the cache: (a) Pages are normally copied in the same order as they were originally swapped out, and (b) the interface starts snooping another channel only after the swap outs on the current channel have been exhausted. When a node swaps consecutive pages out these two characteristics allow several writes to be batched to the disk in a single operation.
Pages can be remapped into memory straight from the OWCache itself. On a page fault the faulting node checks whether the ring bit for the page is set. If it is not, the page fault proceeds just as was described in Subsection 3.A. Otherwise, the faulting node uses the last virtual-to-physical translation for the page to determine the node that last swapped the page out. Then the faulting node can simply snoop the page off the correct cache channel. Moreover, the faulting node must send a message through the request channel to the OWCache interface of the I/O-enabled node responsible for the page that informs the interface of the page number. This message tells the remote OWCache interface that the page does not have to be written to disk because there is again a copy of it in the main memory. The remote interface then takes the page number off the cache channel's FIFO queue and sends the ACK to the node that originally swapped the page out to the OWCache.
Note that the OWCache does not suffer from coherence problems because we do not allow more than one copy of a page beyond the disk controller's boundary. The single copy can be in the main memory or in the OWCache.
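The ring-management rules above (swap out, drain to the disk-controller cache, and remap on a fault) can be summarized in a toy bookkeeping model. This is only a sketch under simplifying assumptions: a single disk controller serves all cache channels, messages and timing are not modeled, and the class and method names (`OWCacheModel`, `swap_out`, `drain`, `fault`) are hypothetical.

```python
from collections import deque


class OWCacheModel:
    """Toy model of OWCache bookkeeping; not a timing-accurate simulator."""

    def __init__(self, num_nodes, pages_per_channel):
        self.pages_per_channel = pages_per_channel
        # One cache channel per node; each holds pages in swap-out order.
        self.channels = [deque() for _ in range(num_nodes)]
        self.ring_bit = set()   # pages whose page-table ring bit is set
        self.disk_cache = []    # pages copied to the disk-controller cache

    def swap_out(self, node, page):
        """Place a page on the node's cache channel if there is room."""
        channel = self.channels[node]
        if len(channel) >= self.pages_per_channel:
            return False        # no room: the node has to wait
        channel.append(page)    # the queue entry stands in for the stored page
        self.ring_bit.add(page)
        return True

    def drain(self, free_slots):
        """Copy pages from the most heavily loaded channel to the disk cache."""
        if free_slots <= 0 or not any(self.channels):
            return []
        channel = max(self.channels, key=len)
        copied = []
        while channel and len(copied) < free_slots:
            page = channel.popleft()     # same order as the swap outs
            self.ring_bit.discard(page)  # the ACK resets the ring bit
            copied.append(page)
        self.disk_cache.extend(copied)
        return copied

    def fault(self, page):
        """Return True if the page is remapped straight from the ring."""
        if page not in self.ring_bit:
            return False        # regular fault: fetch from the disk side
        for channel in self.channels:
            if page in channel:
                channel.remove(page)  # page no longer needs a disk write
                break
        self.ring_bit.discard(page)
        return True
```

Because a page is either on a channel or in memory in this model, never both, it reflects the single-copy rule that keeps the OWCache free of coherence problems.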
Software Cost
The software cost of the OWCache is negligible (provided that the kernel code is available, of course).
The operating-system code must be changed to include the ring bits and to drive the OWCache interface.
Hardware Cost
The electronic hardware cost of our optical ring is restricted to the memory-bus interface, the FIFO's, and the buffers and the drivers that interface with the electronic and the optical parts of the OWCache interface. The optical hardware requirements of the ring are also minimal. The OWCache interface at each node can read any of the cache channels but can write to only the cache channel associated with the node and thus does not require arbitration.
The OWCache interface regenerates, reshapes, and reclocks this writable-cache channel. To accomplish these functions requires that the interface have two tunable receivers, one fixed transmitter, and one fixed receiver, as shown in Fig. 3. One of the tunable receivers is responsible for reading the pages to be written to disk from the ring, whereas the other tunable receiver is used to search the ring for a page that has been faulted on locally. The fixed transmitter is used to insert new data into the writable-cache channel. In conjunction with this transmitter the fixed receiver is used to recirculate the data on the writable-cache channel. Thus the optical hardware cost for the ring is only 4 × p optical components, where p is the number of nodes and cache channels in the multiprocessor. The overall cost of the OWCache interface (OPTNET plus the optical ring) is then 11 × p optical components. This cost is a little higher than that of OPTNET but is still acceptable even for today's optical technology costs. Mass production of optical components and further advances in optical technology will soon lower these costs even further.
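The component counts quoted for OPTNET, for OPTNET with the swap channel, and for the OWCache can be tallied with a small sketch. The per-node counts are those enumerated in the text; the dictionary and function names are hypothetical.

```python
# Per-node optical components, as enumerated in the text.
OPTNET = {"fixed_tx": 3, "fixed_rx": 3, "tunable_rx": 1}
SWAP_CHANNEL = {"fixed_tx": 1, "fixed_rx": 1}
OPTICAL_RING = {"fixed_tx": 1, "fixed_rx": 1, "tunable_rx": 2}


def components_per_node(*parts):
    """Sum the optical components of the given building blocks."""
    return sum(sum(part.values()) for part in parts)


def total_components(num_nodes, *parts):
    """Total optical components for a machine with num_nodes nodes."""
    return num_nodes * components_per_node(*parts)
```

With these counts, `components_per_node(OPTNET)` gives 7, adding `SWAP_CHANNEL` gives 9, and `components_per_node(OPTNET, OPTICAL_RING)` gives 11, so an eight-node OWCache machine would need 88 optical components in total.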
Note that, even though our optical ring acts as a cache for disk data, the ring does not guarantee nonvolatility (as some, but not all, disk-controller caches do), in much the same way as using idle-node memory for swap outs (see, e.g., Ref. 13). This is not a serious problem, however, because both of these approaches optimize the virtual-memory management for applications that do not involve stringent reliability constraints, such as scientific applications.
Note also that, although we extended the OPTNET network with our optical ring, the same idea can be applied to any optical network. We selected OPTNET because the combination of the OPTNET network and its coherence protocol strikes the best cost-performance ratio of all similar systems under most architectural assumptions and for most applications, 8, 9 but nothing in our design depends on the base optical network.
Methodology and Workload
We are interested in evaluating the performance of our proposed OWCache-based cache-coherent multiprocessor under varying architectural assumptions. Hence we use simulations of real applications for our studies.
A. Multiprocessor Simulation
We use a detailed execution-driven simulator [based on the MINT (MIPS interpreter) front end 14] of OPTNET- and OWCache-based cache-coherent multiprocessors. Memory, I/O, and network contention are fully modeled. The simulation of the multiprocessor's operating system is limited to the part that really matters for our purposes: virtual-memory management.
Our simulation parameters and their default values are listed in Table 1. The cache and the main-memory sizes that we simulate were purposely kept small because simulation-time limitations prevent us from using real-life input sizes. In fact, we reduced first-level, second-level, optical ring, and disk-cache storage capacities by a factor of 32 and main-memory sizes by a factor of 512 with respect to their usual sizes in real systems. Our goal with these reductions was to produce roughly the same swap-out traffic in our simulations as in real systems.
Our choice of the amount of optical storage deserves further remarks. The default storage capacity that we assume is 512 kbytes; 64 kbytes can be stored on each of eight channels. At 10 Gbits/s (80 Gbits/s of total bandwidth) this amount of storage requires 10 km of fiber, which might pose a management problem, even if not an insurmountable one. To avoid lengthening the fiber further, we can accomplish an increase in the size of the OWCache storage only by increasing the transmission rate, using more cache channels, or both. Admittedly, however, increasing the capacity of the simulated OWCache by a factor of 32 would not be practical with the current commercially available optical technology.
Nevertheless, we believe that in the near future an increase in size of this magnitude will be possible. In fact, a few commercial WDM products (such as those of Hitachi) can currently deliver 4 times the amount of bandwidth that we assume. In addition, producers of telecommunications infrastructure (such as NTT, Alcatel, and Nortel) have been able to achieve WDM bandwidths in excess of 6 Tbits/s over long distances. Thus our capacity assumptions can be considered to be highly conservative with respect to the future potential of optics, especially if we consider multiplexing techniques such as optical time-division multiplexing (OTDM), which could potentially support 5000 channels. 15 With excess bandwidth, we can shorten the fiber to a more manageable length.
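As a rough check of the 10-km figure quoted for the default configuration, the ring-capacity relation of Section 3 can be solved for the fiber length $L$ of one channel. The worked arithmetic below assumes that 1 kbyte equals 1024 bytes and uses the speed of light in the fiber of $2 \times 10^{8}$ m/s; the symbols $C$, $v$, and $R$ (per-channel capacity, propagation speed, and transmission rate) are introduced here only for illustration.

```latex
L = \frac{C \, v}{R}
  = \frac{(64 \times 1024 \times 8\ \text{bits}) \times (2 \times 10^{8}\ \text{m/s})}
         {10 \times 10^{9}\ \text{bits/s}}
  = \frac{524\,288 \times 2 \times 10^{8}}{10^{10}}\ \text{m}
  \approx 1.05 \times 10^{4}\ \text{m}
```

That is, roughly 10.5 km of fiber, in line with the approximate 10 km stated above. Because $L$ is inversely proportional to $R$, a higher per-channel transmission rate would shorten the required fiber proportionally.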
The default values in Table 1 represent our perception of what is reasonable for current multiprocessors. The parameter-space study presented in Section 5 allows us to investigate the influence of our most important architectural assumptions.
B. Workload
Our application workload consists of seven parallel programs: Em3d, fast Fourier transform (FFT), Gauss, LU, MG, Radix, and SOR. FFT, LU, and Radix are from the SPLASH-2 suite and are described in detail elsewhere. 16 MG is a parallel implementation of the multigrid benchmark of the NAS suite, which is described in detail in Ref. 17. Em3d is from the University of California at Berkeley, 18 Berkeley, California, and simulates electromagnetic-wave propagation through three-dimensional (3-D) objects. Gauss and SOR were developed at the University of Rochester, Rochester, New York. Gauss performs unblocked Gaussian elimination without pivoting or backsubstitution. SOR performs successive overrelaxation on a grid of elements. Table 2 lists the applications, their input parameters, and the total data sizes that the inputs lead to.
All the applications MMAP their files for both reading and writing and access them by means of the standard virtual-memory mechanism. (Note that the UNIX MMAP call forces the user to specify the largest possible size of a writable file. This was not a problem for us because it was always possible to determine the exact final size of each writable file for our applications. Nevertheless, we feel that the UNIX MMAP call is much too restrictive for a pure virtual-memory-based style of programming.) The applications themselves (not their traces) are run by our simulator as they would run on a real multiprocessor, i.e., the simulator reproduces the correct sequence of events (including synchronization operations and operating-system calls) by use of event-duration feedback from the simulated memory system.
Experimental Results
In this section, we evaluate the performance of an OWCache-based multiprocessor by comparing it with the performance of an OPTNET-based multiprocessor. We start by studying the OWCache system under our two prefetching techniques and the default architectural and operating-system parameters. Following this study, we assess the impact of our most important assumptions.
A. Performance Benefits
First, it is important to determine the best minimum number of free-page frames for each combination of prefetching technique and multiprocessor. We performed experiments in which we varied this minimum number for each of the applications in our suite. In the presence of the OWCache the vast majority of our applications achieved their best performance with a minimum of only two free-page frames, regardless of the prefetching strategy.
The best configuration for the OPTNET-based multiprocessor is not obvious. Under optimal prefetching, three of our applications (Gauss, LU, and SOR) favor large numbers (≤16) of free-page frames, whereas two of them (Em3d and FFT) achieve their best performance with only two or four free-page frames. The other two applications, MG and Radix, require 8 and 12 free-page frames, respectively, for best performance. On the other hand, under naive prefetching all applications except SOR favor small numbers (<4) of free-page frames. Thus we picked 12 and 4 frames as the best minimum numbers of free-page frames under optimal and naive prefetching, respectively. All the results in this subsection are presented for these configurations.
We are interested in assessing the extent to which the benefits provided by the OWCache actually produce performance improvements. As we mentioned in Section 1, the OWCache achieves several performance benefits: It provides a staging area where swapped-out pages can reside until the disk is free, it increases the possibility of combining several writes to disk, and it acts as a victim cache for pages that are swapped out and subsequently accessed by the same or a different processor. We now look at statistics related to each of these benefits in turn.
Write Staging
Tables 3 and 4 show the average time [in pcycles (1 pcycle = 5 ns)] that it takes to swap a page out of memory under the optimal- and the naive-prefetching strategies, respectively. The tables show that swap-out times are 1 to 2 orders of magnitude lower when the OWCache is used. The main reason for these results is that the OWCache effectively increases the amount of disk cache from the viewpoint of the memory. A swap out is delayed in the presence of the OWCache only when the swapping node's cache channel fills up. In contrast, when the OWCache is not assumed, swap outs are much more frequently delayed because of a lack of space in the disk-controller cache. As expected, Tables 3 and 4 also show that the swap-out times are much higher under the optimal-prefetching technique than under its naive counterpart. This difference is a consequence of the fact that, under optimal prefetching, the reduced page-read times effectively cluster the swap outs in time, increasing contention.
Write Combining
Given the way swap outs are copied from the OWCache to the disk-controller cache, the locality of the writes in the cache is often increased. When consecutive pages can be found in consecutive slots of the disk-controller cache, the writes of these pages can be combined in a single disk-write access. The data in Tables 5 and 6 confirm this claim. The tables present the average number of swap outs that are combined in each disk-write operation. The maximum possible combining factor is 4, which is the maximum number of pages that can fit in a disk-controller cache. The results show that increases in write combining are moderate under the naive-prefetching strategy (≤28%) but can be significant under the optimal-prefetching strategy (≤77%). Again, the temporal clustering of swap outs under optimal prefetching is responsible for this result: It becomes more common for the disk controller to find consecutive writes in its cache at the same time.
Victim Caching
Table 7 presents the page-read hit rates in the OWCache under the optimal- and the naive-prefetching techniques. The table shows that hit rates are slightly higher under optimal prefetching than under naive prefetching except for Gauss and SOR, again because of the temporal characteristics of the page swap outs under the two techniques. In addition, these results show that hit rates can be as high as 61% (Gauss) or as low as 7% (Em3d). These results derive from a combination of two factors: the size of the memory's working set and the degree of data sharing in the applications. Gauss, MG, and Em3d exhibit a significant amount of sharing, but only Gauss and MG have working sets that can (almost) fit in the combined memory-OWCache size. The other applications achieve hit rates in the 8%-31% range.
Tables 8 and 9 list the performance effects of victim caching for the optimal- and the naive-prefetching strategies, respectively. The tables present the average amounts of time (in pcycles) that it takes to bring a page into the local main memory as a result of a page fault. As can be seen from the tables, under naive prefetching the average overhead is reduced substantially when the OWCache is assumed. This is a direct consequence of the victim-caching effect provided by the OWCache, which reduces the number of disk accesses by 8% for FFT and by as much as 66% for Gauss and 54% for MG. In contrast, under optimal prefetching the average overhead actually increases when the OWCache is assumed. The reason for this effect is that, on average, pages can be read faster from a disk-controller cache than from the OWCache. Nevertheless, these degradations do not noticeably affect the performance of our system because page-fault overheads represent a negligible fraction of the overall running times under optimal prefetching, as we discuss in Subsection 5.A.4.
Overall Performance
The results that we have presented thus far confirm that the OWCache indeed benefits performance significantly in several ways. The greatest performance gains come from fast swap outs when using optimal prefetching and from victim caching when using naive prefetching. Figures 4 and 5 show the normalized execution times of each of our applications under optimal and naive prefetching, respectively. From top to bottom, each bar in the graphs is divided into (a) the stall time resulting from the lack of free-page frames (No Free), (b) the overhead for page faults (Fault), (c) the overhead of TLB misses and TLB shootdowns (TLB), and (d) the components of the execution time that are not related to virtual-memory management (Other), including processor busy, first- and second-level cache misses, and synchronization times.
Figure 4 shows that, under the optimal-prefetching strategy, no-free overheads are always very significant for the OPTNET-based multiprocessor, especially for Gauss and SOR. The operating system frequently runs out of free-page frames on the OPTNET-based multiprocessor because page reads are completed relatively quickly, whereas page swap outs are very time consuming. When the multiprocessor is upgraded with the OWCache, no-free times are reduced quite significantly as a result of its much faster page swap outs. Figure 4 also demonstrates that, for several applications, the time used on non-virtual-memory operations is reduced significantly in the presence of the OWCache. These reductions come mainly from the better synchronization behavior produced by the significant reduction in the load imbalance. Overall, we can see that the OWCache provides performance improvements that average 39% and range from 16% (for FFT) to 60% (for MG) and 64% (for Gauss) when optimal prefetching is assumed. In fact, improvements are greater than 30% in all cases except Em3d and FFT.
The performance results when naive prefetching is assumed are quite different. Under this technique execution times are dominated by page-fault latencies because disk-controller-cache hit rates are never greater than 15%. The page-fault latencies then provide the much needed time for swap outs to complete. As a result, no-free times almost vanish, diminishing the importance of the fast swap outs in the OWCache architecture.
Under naive prefetching the addition of the OWCache to the multiprocessor improves performance by 5% (for Radix) to 39% (for Gauss) for all applications except FFT, which degrades by 4%. OWCache-related improvements come from reasonable reductions in page-fault latencies, which result from reading pages off the optical cache and from alleviating disk contention.
Summary
In summary, we showed that the OWCache is extremely useful when prefetching is effective, mostly as a result of fast swap outs. The OWCache is not as efficient when prefetching is either ineffective or absent even though victim caching improves the performance of many applications significantly. We expect results for realistic and sophisticated prefetching techniques 1, 5 to lie between these two extremes. In addition, as prefetching techniques improve and optical technology develops, we should see greater gains coming from the OWCache architecture.
B. Impact of Architectural Assumptions
In this subsection, we study the effects of our most important simulation parameters: the number of I/O-enabled nodes, the sizes of the OWCache, the disk-controller caches, and the main memory, and the swap-out bandwidth that is used in the different multiprocessors.
Number of Input-Output-Enabled Nodes
We varied the number of I/O-enabled nodes in our eight-node system from two to eight. Figures 6 and 7 present the results of these experiments for our seven applications under optimal and naive prefetching, respectively. Each point in the figures represents the execution time of the OWCache-based multiprocessor when normalized to the corresponding execution time of the OPTNET-based multiprocessor. The results show that the performance improvements provided by the OWCache under optimal prefetching decrease as we increase the number of I/O-enabled nodes (i.e., the I/O throughput of the system) because no-free overheads decrease as a percentage of the total execution times of an OPTNET-based multiprocessor. Nevertheless, OWCache improvements remain significant, with all applications benefiting by an average of 35% with eight I/O-enabled nodes. On the other hand, increasing the number of I/O-enabled nodes usually slightly increases the performance improvement that is achievable by the OWCache under naive prefetching because disk contention becomes less of a problem for the OWCache-based multiprocessor.
Size of the OWCache
Fixing the number of I/O-enabled nodes at four, we also varied the storage capacity of the OWCache from 256 kbytes to 1 Mbyte of data by varying the length of our optical ring. Figures 8 and 9 show the results under optimal and naive prefetching, respectively. Again, each point in the figures represents the OWCache performance normalized to the OPTNET performance. Under optimal prefetching, we find that increasing the capacity improves the performance for almost all applications because swap outs complete faster on average. In fact, faster swap outs have a substantial impact on performance under optimal prefetching, as no-free times invariably represent a large fraction of the total execution times.
Similarly, increases in OWCache storage capacity improve the performance of several applications under naive prefetching. Here performance is improved for applications that can benefit substantially from victim caching (i.e., Gauss, LU, MG, and SOR). Gauss and SOR are the applications that benefit the most from capacity increases. The reason is that the additional space and the timing of the swap outs allow a significant increase in the number of page faults that can be satisfied from the OWCache itself. More specifically, OWCache hit rates for Gauss go from 45% to 61% to 79% as we increase the OWCache capacity from 256 kbytes to 512 kbytes to 1 Mbyte, respectively, whereas the SOR hit rates go from 12% to 31% to 72% over the same capacity increases.
Size of the Disk-Controller Caches
We also performed experiments in which the sizes of the disk-controller caches in the OPTNET-based architecture were varied. Figures 10 and 11 show these results. Each point in the figures represents the OPTNET execution time normalized to the OWCache execution time. These figures allow us to answer two important questions: (1) whether an OWCache-based multiprocessor would outperform an OPTNET-based system with the same overall cache storage capacity, and (2) how much combined disk-cache capacity would be required for an OPTNET-based multiprocessor to outperform our system.
To answer the first question, we compared the performance of two multiprocessors with four I/O-enabled nodes and a total cache space of 576 kbytes: one of the multiprocessors has a 512-kbyte OWCache and 16-kbyte disk-controller caches, whereas the other has 144-kbyte disk-controller caches only. The results of this experiment show that the OWCache-based multiprocessor performs best for most applications even under these conditions.
Under optimal prefetching our system is only slightly outperformed for Em3d (10%), FFT (1%), and LU (1%). Our performance advantage comes from significant reductions in the no-free overheads, which are promoted by faster page swap outs. Swap outs complete faster in our system as a consequence of the greater flexibility of the OWCache storage. More specifically, the OWCache can deal with uneven distributions of swap outs across the I/O nodes by caching more data that are directed to certain nodes than to others at each point in time. In fact, the OWCache can even be temporarily filled up with swap-out data that are directed to a single node, if necessary. Large disk-controller caches are not as flexible because the amount of data that can be directed to a node without causing delays is equivalent to the size of a single disk-controller cache. This relation illustrates the main advantage of a shared cache (the OWCache) compared with multiple local caches (disk-controller caches) that are combined to achieve the same size.
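The flexibility argument can be illustrated with a toy model of a single burst of swap outs that is skewed toward one I/O node. The sketch below is not the simulator used in this study. It ignores disk draining during the burst, assumes 4-kbyte pages (inferred from four pages fitting in a 16-kbyte disk-controller cache), and assumes that the swap outs originate from enough different nodes that no single cache channel fills up. The function names and the burst itself are hypothetical.

```python
# Toy model of one burst of swap outs; disk draining during the burst is ignored.
# ASSUMPTIONS: 4-kbyte pages; capacities follow the 576-kbyte comparison above.
PAGE_KBYTES = 4


def delayed_pages_partitioned(burst: dict[int, int], cache_kbytes_per_node: int) -> int:
    """Pages that must wait when each I/O node has only its own controller cache."""
    slots = cache_kbytes_per_node // PAGE_KBYTES
    return sum(max(0, pages - slots) for pages in burst.values())


def delayed_pages_shared(burst: dict[int, int], shared_kbytes: int,
                         cache_kbytes_per_node: int) -> int:
    """Pages that must wait when a shared cache backs small per-node caches."""
    local_slots = cache_kbytes_per_node // PAGE_KBYTES
    overflow = sum(max(0, pages - local_slots) for pages in burst.values())
    shared_slots = shared_kbytes // PAGE_KBYTES
    return max(0, overflow - shared_slots)


# Hypothetical burst: pages directed to each of four I/O nodes, skewed to node 0.
burst = {0: 100, 1: 10, 2: 10, 3: 10}

partitioned = delayed_pages_partitioned(burst, cache_kbytes_per_node=144)
shared = delayed_pages_shared(burst, shared_kbytes=512, cache_kbytes_per_node=16)
print(f"delayed pages, 4 x 144-kbyte caches: {partitioned}")
print(f"delayed pages, 512-kbyte shared + 4 x 16-kbyte caches: {shared}")
```

With the same 576 kbytes of total storage, the partitioned configuration delays every page beyond the 36 that fit in node 0's cache, whereas the shared configuration absorbs the whole burst in this example.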
Under naive prefetching the performance advantage of our system is not as clear because faster swap outs are irrelevant in terms of overall performance. Here the OWCache-based multiprocessor with 576 kbytes of total storage performs best for only three applications (Gauss, MG, and SOR). However, our system is outperformed by no more than 8%. For Gauss, MG, and SOR our system takes advantage of its victim-caching capability. Our victim-caching policy effectively produces OWCache hit rates that are higher than the combined disk-controller-cache hit rates of the OPTNET-based multiprocessor for these three applications. Overall, these results demonstrate that our system is clearly more efficient than an OPTNET-based multiprocessor under optimal prefetching when both systems involve the same amount of storage. On the other hand, naive prefetching leads to mixed results.
To answer the second question, we compared the performance of an OWCache-based multiprocessor with the same configuration as in our previous experiments with that of the OPTNET-based multiprocessor with 64, 192, 320, 576, 1088, and 2112 kbytes of combined disk-controller-cache capacity (Figs. 10 and 11). Again, each point in the figures represents the OPTNET performance normalized to the OWCache performance.
Under optimal prefetching, we find that the OPTNET-based multiprocessor requires 2112 kbytes to outperform our system for Gauss and SOR, while requiring 1088 kbytes for MG and Radix. The other three applications require only 576 kbytes. Under naive prefetching, on the other hand, the OPTNET-based multiprocessor requires 2112 kbytes to outperform our system for Gauss and SOR, while requiring 1088 kbytes for MG and 576 kbytes for LU. The other three applications require 320 kbytes or less to outperform our system. Although these disk-controller caches are not outrageously large for real systems, note that our simulations assume a total of only 2048 kbytes of main memory for the multiprocessors.
These results show that an OPTNET-based multiprocessor often requires a huge amount of disk-controller-cache capacity to approach the performance of our system. Given the simplicity of the OWCache hardware, the fact that it is only an extension of an existing network, the quickly decreasing costs of optical components, and the fact that significantly increasing the amount of disk-controller memory simply to satisfy out-of-core applications is clearly overkill, our optical solution to this problem seems to be a better one than its electronic counterpart.
Fig. 10. Performance of the OPTNET multiprocessor for several combined disk-cache sizes under optimal prefetching. OWC, OWCache.
Fig. 11. Performance of the OPTNET multiprocessor for several combined disk-cache sizes under naive prefetching. OWC, OWCache.
Size of the Main Memory
We also varied the size of the main memory of each node. To maintain the out-of-core characteristics of our applications, we also varied their input sizes by the same proportion. Specifically, we increased the main memory of each node from 256 to 320 kbytes. We observed that the performance improvements of the OWCache-based multiprocessor in relation to the OPTNET-based one are the same in both cases under naive prefetching. On the other hand, under optimal prefetching the variations in performance improvements depend on the application. For instance, the improvement of the OWCache-based system in relation to the OPTNET-based system grows from 64% to 68% for Gauss, whereas for SOR that improvement falls from 53% to 35%. The reduction in the improvement obtained for SOR is due to the decrease in its swap-out rate, which falls from 5.5 to 3.8 swap outs every 10^6 cycles. Thus we can conclude that the improvements obtained by use of the OWCache will be significant for most out-of-core applications even when the main memory in each node is larger, because those improvements depend basically on the swap-out rate. In our experiments, we observed significant improvements when the swap-out rate is greater than two swap outs for every 10^6 cycles.
Swap-Out Bandwidth
The significant difference in swap-out bandwidth between the two multiprocessors that we consider could play an important role in our comparisons. To show that this is not the case, we collected information on the amount of contention in the access to the swap channel in the OPTNET-based architecture. We find that swap-channel contention is virtually nonexistent. Moreover, when we configured the OPTNET-based multiprocessor to use one swap channel per node (and thus avoid arbitration altogether) the only applications that improved their performance were MG (by 4%) and Radix (by 9%) under optimal prefetching. The other five applications under optimal prefetching and all the applications under naive prefetching presented the same performance results as with a single swap channel. We can conclude from the above that the higher bandwidth used by the cache channels has nothing to do with the OWCache performance improvements. The true origins of these improvements are the capabilities of the channels to act as a staging area and as a victim cache.
Summary
The above results show that the OWCache is extremely effective under both types of prefetching even when compared with large disk-controller caches. In addition, our results showed that variations in the OWCache size have a beneficial effect under optimal prefetching but not necessarily under naive prefetching.
Extending a Traditional Multiprocessor
Our study so far has shown that our optical ring can be beneficial to optically interconnected multiprocessors. However, these multiprocessors are not yet widely available. In this section, to show that our optical ring can be applied successfully in more traditional and widely available multiprocessor architectures, we present the extension of a mesh-connected multiprocessor with our optical disk-write cache. This extension, called the NWCache, is more modular and flexible than the OWCache because the NWCache interface can be plugged into the I/O bus of each node of any multiprocessor. The architecture of the NWCache and its most important performance results are discussed below.
A. Architecture
We use a traditional cache-coherent multiprocessor architecture in which processors are connected through a traditional wormhole-routed mesh network as the basis for the implementation of the NWCache. The structure of each node in the system is the same as that shown in Fig. 2. We extend this multiprocessor with our optical ring simply by plugging the NWCache interface into one of the I/O bus slots of each node. The disk controller of the I/O-enabled nodes can be plugged into the NWCache interface. Thus the NWCache interface connects the node to the optical ring and filters some of the accesses to the disk, just as the OWCache interface does. Some of these accesses can be satisfied by the NWCache itself.
To implement the NWCache requires no modifications to the standard multiprocessor hardware. The virtual-memory-management code is very similar to that already described in Subsection 3.A for the OWCache.
The NWCache interface is similar to the OWCache interface presented in Fig. 3. The only differences between the interfaces are (a) the memory-bus interface in the OWCache design becomes an I/O-bus interface, and (b) the OPTNET interface logic in the OWCache design is replaced by a disk interface. As a result, the electronic hardware cost of the NWCache is restricted to the I/O bus and the disk interfaces, the FIFOs, and the buffers and the drivers that interface between the electronic and the optical parts of the NWCache interface. The optical hardware cost of the NWCache is also minimal, only 4 × p optical components, where p is the number of nodes and cache channels in the multiprocessor.
Again, the ring-management code is almost the same as that discussed in Subsection 3.B. The messages that are not involved with virtual-memory management still flow across the mesh network. However, in contrast to the OWCache, the messages sent to the I/O-enabled node after a page is swapped out must also include the number of the swapping node because the NWCache interface would have no way of knowing the source of the swap out otherwise.
B. Experimental Results
To evaluate the performance of our NWCache-based multiprocessor, we used detailed execution-driven simulations of a DASH-like (directory architecture for shared memory-like 19) cache-coherent multiprocessor with and without the NWCache. Our simulation parameters are the same as those listed in Table 1. The mesh network assumes a router latency of 4 pcycles and a link-transfer rate of 200 Mbytes/s. In addition, we use the same set of applications and input parameters that are shown in Table 2 as our application workload.
Performance Benefits
The experiments performed to determine the best minimum number of free-page frames show that, in the presence of the NWCache, the vast majority of our applications achieved their best performance with a minimum of only two free-page frames, regardless of the prefetching strategy. On the other hand, the best configuration for the standard multiprocessor is not obvious. Under optimal prefetching, four of our applications (Gauss, LU, Radix, and SOR) favor large numbers (>12) of free-page frames, whereas two (Em3d and MG) achieve their best performance with only two free-page frames. The other application, Radix, requires eight free-page frames for best performance. Under naive prefetching, on the other hand, all applications except SOR favor small numbers (two or four) of free-page frames. Thus we picked 12 and 4 frames as the best minimum numbers of free-page frames under optimal and naive prefetching, respectively. All the following results are presented for these configurations.
In the same way as the OWCache, the NWCache improves performance by providing a staging area where swapped-out pages can reside until the disk is free, by increasing the possibility of combining several writes to disk, and by acting as a victim cache for pages that are swapped out and subsequently accessed by the same or a different processor. These characteristics of the NWCache improve performance in much the same way as they do for the OWCache. In summary, the swap-out times become 1 to 3 orders of magnitude lower when the NWCache extension is used, increases in write combining are moderate under naive prefetching (≤16%) but can be significant under optimal prefetching (≤58%), and hit rates in the optical ring are slightly higher under optimal prefetching than under naive prefetching and range from 9% (Em3d) to approximately 60% (Gauss and MG).
Besides the above three performance benefits that are shared with the OWCache, the NWCache also reduces the data traffic across the multiprocessor's interconnection network and memory buses because (a) page swap outs are not transferred across the interconnection network, and (b) page reads that hit in the NWCache are transferred neither across the network nor across the I/O node's memory bus (whenever the request for the corresponding I/O node can be aborted in time). This reduction in data traffic produces a reduction in the contention observed on the network. To evaluate the benefit of the NWCache in terms of contention alleviation, we collected statistics on the average latency of a page read from the disk-controller cache. A comparison of the statistics from the traditional multiprocessor with those of the NWCache-capable multiprocessor provides a rough estimate of the amount of contention that is eliminated. Under naive prefetching the results show that the NWCache reduces disk-controller-cache hit latencies by as much as 63%. For most applications reductions range from 24% to 38%. Given that it takes approximately 6 kpcycles to read a page from a disk-controller cache in the total absence of contention, we can see that the contention reductions provided by the NWCache are always significant. For instance, the average disk-controller-cache hit takes approximately 21 kpcycles for LU when it is running on a standard multiprocessor. Out of these cycles approximately 15 kpcycles are due to contention of several forms. In the presence of the NWCache, the number of pcycles that are due to contention in LU is reduced to approximately 14 kpcycles, which means a 7% reduction. At the other extreme, consider the reduction in disk-controller-cache hit latency achieved by MG, 63%. Out of the 19 kpcycles that MG takes to read pages from disk-controller caches on the standard multiprocessor approximately 13 kpcycles are due to contention. In the presence of the NWCache the number of pcycles that are due to contention in MG is reduced to approximately 700 pcycles, which means a 95% reduction in contention.
On the other hand, under optimal prefetching the NWCache is not as successful at alleviating contention because there is usually not enough time to prevent a page transfer through the network and the I/O bus when the page-read request hits in the NWCache.
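The contention figures quoted above for naive prefetching can be checked with a few lines of arithmetic. The sketch below treats a disk-controller-cache hit as the uncontended 6 kpcycles plus a contention component, which is our reading of the text rather than a formula stated in it; the function names are hypothetical, and all inputs are the approximate values quoted above.

```python
# Approximate figures quoted in the text, in kpcycles.
# ASSUMPTION: hit latency = uncontended read time + contention component.
UNCONTENDED_HIT_KPCYCLES = 6.0


def contention_reduction(before: float, after: float) -> float:
    """Fractional reduction in the contention component of a hit."""
    return (before - after) / before


def latency_reduction(before: float, after: float) -> float:
    """Fractional reduction in total hit latency implied by the contention change."""
    total_before = UNCONTENDED_HIT_KPCYCLES + before
    total_after = UNCONTENDED_HIT_KPCYCLES + after
    return (total_before - total_after) / total_before


lu_contention = contention_reduction(15.0, 14.0)
mg_contention = contention_reduction(13.0, 0.7)
mg_latency = latency_reduction(13.0, 0.7)

print(f"LU contention reduction: {lu_contention:.0%}")
print(f"MG contention reduction: {mg_contention:.0%}")
print(f"MG implied hit-latency reduction: {mg_latency:.0%}")
```

The LU and MG contention reductions come out at about 7% and 95%, matching the text. The implied MG hit-latency reduction is roughly 65%, which is in line with the reported 63% given that the quoted inputs are rounded.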
Overall Performance
Under optimal prefetching no-free times are reduced significantly as a result of the much faster page swap outs allowed by the NWCache. In addition, the time taken on non-virtual-memory operations is significantly reduced in the presence of the NWCache. These reductions come from improvements in the cost of remote data accesses and better synchronization behavior, both of which are a result of the significant reduction of the traffic through the memory system (network and memories). Overall, we see that the NWCache provides performance improvements that average 41% and range from 23% (Em3d) to 60% (MG) and 64% (Gauss) when optimal prefetching is assumed. In fact, improvements are greater than 28% in all cases except Em3d.
Under naive prefetching, on the other hand, the addition of the NWCache to the multiprocessor improves performance by 3% (Radix) to 42% (Gauss) for all applications except FFT, which degrades by 3%. NWCache-related improvements come from reasonable reductions in page-fault latencies, which result from reading pages off the optical cache and from alleviating contention.
The above results confirm that the NWCache indeed enhances performance significantly in several ways. Its greatest performance gains come from fast swap outs, victim caching, and contention reduction. Recall that the OWCache provides performance gains as a result of fast swap outs and victim caching only; contention for the optical network in OWCache is negligible. The execution time of each application shows that the NWCache and the OWCache implementations of our optical ring provide similar performance improvements. The impact of architectural variations is also similar for both systems. These results are no coincidence, obviously. For several of our out-of-core applications most of the execution time is spent on virtual-memory operations, which benefit similarly from the two implementations. For the other applications the NWCache is able to reduce contention for the electronic network enough to make it behave almost as well as an optical network.
Related Work
A few research areas are related to our proposal, such as the use of WDM networks in computers, the use of optics as delay-line memories, and optimization of disk-write operations.
A. Wavelength-Division Multiplexing and Computers
A common approach to using optical communication in computer networks is through WDM networks. 20 The use of this type of network has become widespread as a result of recent advances in tunable transmitters and receivers and integrated-optics technology. 21, 22 A WDM network is ideal for smallto medium-scale parallel computing because it can provide point-to-point channels between each pair of nodes on a single optical medium with broadcasting capability. Larger systems can be constructed by the replacement of this single-hop scheme with multihop or multidimensional WDM approaches. 23, 24 WDM networks have already been part of several parallel-computer designs ͑see, e.g., Refs. 25 and 26͒, but the only aspects of optical communication that these designs exploit are its high bandwidth and the capability of broadcasting to a large number of nodes. Besides these aspects the OWCache and the NWCache architectures exploit the data-storage potential of optics to their benefit.
OTDM networks have been proposed as an alternative to WDM networks (see, e.g., Refs. 15 and 27). OTDM networks do have some advantageous characteristics in comparison with WDM networks, but the OTDM technology is not yet mature.
B. Delay-Line Memories
Delay-line memories have been implemented in optical communication systems 28 and in all-optical computers. 29 As far as we are aware, the only system that exploits the data-storage potential of optics in the design of scalable multiprocessors is one of our previous systems, NetCache. 30 In the NetCache system an optical ring is used to store memory blocks as if it were a third-level cache that is shared by all processors. Both the NetCache and the OWCache are simple extensions to the OPTNET network, whereas the NWCache is a shared disk-write cache attached to the I/O bus of a conventional multiprocessor. An architectural advantage of the OWCache and the NWCache systems over the NetCache is their reduced optical hardware cost: the number of optical components in the NetCache system is 25 × p, where p is the number of processors.
C. Disk-Write Operations
Several researchers have set out to improve the performance of write operations for various types of disk subsystems. These efforts include research on improving the small-write performance of redundant arrays of inexpensive disks (RAIDs) (see, e.g., Ref. 31), using nonvolatile RAM as a write cache (see, e.g., Ref. 32), logging writes and then writing them to disk sequentially (see, e.g., Ref. 33), using idle- or underloaded-node memory for storing page swap outs, 13 and using a log disk to cache writes that are directed to the actual data disk. 34 The last two types of research are the closest to ours.
Storing page swap outs in another node's memory is appropriate only for workstation networks in which one or more nodes may be idle or underloaded at each point in time. The same technique could not be applied in our computing environment (a multiprocessor running an out-of-core parallel application) because processors are always part of the computation at hand and usually do not have spare memory that they could use to help each other.
The storage architecture proposed in Ref. 34, the disk-caching disk (DCD), places the log disk between the RAM-based disk cache and the data disk. The DCD effectively creates an extra level of buffering for writes. New data to be written to disk are first stored in the RAM cache and later written sequentially to the log disk. Overwriting or reading a block requires searching the log disk to find the corresponding block. When the data disk is idle the data are copied from the log disk to the data disk. This scheme improves performance because it reduces seek and rotational latencies significantly when writing new data to the log disk and, as a result, frees space in the RAM cache more rapidly. Overwriting or reading, however, involves seek and rotational latencies that are comparable with those of accesses to the data disk.
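The DCD write path described above can be illustrated with a minimal Python sketch. It is a toy model under our own assumptions, not the implementation of Ref. 34: the class name, the method names, the block-granularity RAM capacity, and the policy of flushing the whole RAM cache when it fills are all hypothetical, and seek and rotational latencies are not modeled.

```python
class DiskCachingDiskSketch:
    """Toy model of the DCD write path: RAM cache -> log disk -> data disk."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity  # hypothetical capacity, in blocks
        self.ram_cache = {}               # block id -> data held in RAM
        self.log_disk = []                # append-only list of (block id, data)
        self.data_disk = {}               # block id -> data at its home location

    def write(self, block_id, data):
        """New data goes to the RAM cache; a full cache is flushed to the log first."""
        is_new = block_id not in self.ram_cache
        if is_new and len(self.ram_cache) >= self.ram_capacity:
            self.flush_ram_to_log()
        self.ram_cache[block_id] = data

    def flush_ram_to_log(self):
        """Append every cached block to the log disk in one sequential pass."""
        self.log_disk.extend(self.ram_cache.items())
        self.ram_cache.clear()

    def read(self, block_id):
        """Check RAM, then search the log (newest entry first), then the data disk."""
        if block_id in self.ram_cache:
            return self.ram_cache[block_id]
        for logged_id, data in reversed(self.log_disk):
            if logged_id == block_id:
                return data
        return self.data_disk.get(block_id)

    def destage_when_idle(self):
        """Assumed to run when the data disk is idle: copy logged blocks home."""
        for block_id, data in self.log_disk:  # later entries overwrite earlier ones
            self.data_disk[block_id] = data
        self.log_disk.clear()
```

In this sketch a read that misses in RAM has to scan the log, which mirrors the search cost that the text attributes to overwriting or reading a block in the DCD.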
Like the DCD, our optical ring attempts to improve write performance by creating an extra level of buffering. However, the optical ring places this buffer between the main memory and the disk caches and thus does not require any modifications to standard disk controllers. In addition, overwriting or reading data from the optical cache is as efficient as writing new data to the optical ring. Another advantage of our proposal is that it creates an exclusive path for disk writes to reach the disk controllers. Nevertheless, the technology used to implement the additional buffering in the DCD allows it much more buffer space than our optics-based buffer.
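To make the contrast with the DCD concrete, the following Python sketch models a write buffer placed between main memory and the disk, in the spirit of the optical ring. It is only an illustration under our own assumptions: the class and method names, the page-granularity capacity, the oldest-first drain order, and the choice to leave a page in the buffer after a victim hit are hypothetical and are not taken from the OWCache or NWCache designs.

```python
class RingWriteCacheSketch:
    """Toy model of a write buffer placed between main memory and the disk."""

    def __init__(self, ring_capacity):
        self.ring_capacity = ring_capacity  # hypothetical capacity, in pages
        self.ring = {}   # page id -> contents currently held in the buffer
        self.disk = {}   # page id -> contents already written to disk

    def swap_out(self, page_id, contents):
        """Stage a swapped-out page; return False if the buffer is full."""
        if page_id not in self.ring and len(self.ring) >= self.ring_capacity:
            return False  # the caller would have to wait for the buffer to drain
        self.ring[page_id] = contents  # an overwrite takes the same path as a new write
        return True

    def drain_one(self):
        """Assumed to run when the disk is free: move the oldest staged page to disk."""
        if self.ring:
            page_id = next(iter(self.ring))  # dicts keep insertion order
            self.disk[page_id] = self.ring.pop(page_id)

    def page_fault(self, page_id):
        """Serve a fault from the buffer if possible (victim hit), else from disk."""
        if page_id in self.ring:
            return self.ring[page_id], "victim hit"
        return self.disk.get(page_id), "disk read"
```

Unlike the DCD sketch of a log search, a lookup here is a single membership test, which reflects the claim that overwriting or reading buffered data costs about the same as writing new data.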
Conclusions
In this paper, we have proposed the OWCache: a simple extension of the optical network of a cache-coherent multiprocessor that improves the performance of out-of-core applications significantly through the optimization of page swap outs. The most important advantages of the OWCache are its fast swap-out and victim-caching aspects. Through a large set of detailed simulations, we have shown that an OWCache-based multiprocessor can easily outperform an OPTNET-based multiprocessor for most out-of-core applications; performance differences in favor of the OWCache-based multiprocessor can be as large as 64% and depend on the type of disk-data prefetching used. We have also shown that our optical ring can be applied successfully to a traditional, electronic, mesh-connected multiprocessor in a modular and flexible fashion. To that end, we have proposed the NWCache: a plug-in optical ring device for the I/O subsystem of these multiprocessors. The most important advantages of the NWCache are its fast swap-out, victim-caching, and contention-reduction aspects. Our execution-time results demonstrate that the NWCache and the OWCache implementations of our optical ring provide similar performance improvements even though they optimize performance in slightly different ways.
