We analyze the bandwidth needed for transmitting addresses in future symmetric multiprocessor (SMP) machines, which are constructed around a shared bus because of the critical obligation to preserve the coherence of the memory hierarchy. We show that an address-transaction bandwidth as high as several hundreds of Gbit/s will be necessary to avoid slowing down the execution of most applications in large SMP's. This communication bandwidth seems incompatible with the operation constraints of shared electrical busses, making it necessary to search for other implementations of the address transmission network. We consider the introduction of optical interconnects (OI) in this context. We review several solutions in ascending order of complexity of the optical subsystems, as one critical issue concerns the degree of sophistication of the optical solutions and their cost. We first consider simple point-to-point OI's for an SMP chipset. The interest of OI's comes from their low energy consumption and from the possibility, in the future, of integrating several thousands of optical input/outputs per electronic chip. Then we consider the implementation of an optical bus, that is, a multipoint optical line involving more optical functionality. We discuss the possibility of multiple accesses to the bus, and the constraints related to the necessity of maintaining the coherence of caches.
INTRODUCTION
The logical way to reach a performance level not accessible with a mono-processor consists of connecting several processors through an interconnection network. One must distinguish distributed multicomputer systems (DMS) from tightly bound multiprocessors (TBM). A DMS may include from a few tens to several hundreds of computers separated by an intercomputer distance ranging from a few meters to a few hundreds of meters. Such systems are best suited for transmitting steady data streams between nodes (video, files of a few Mbytes, etc.). DMS's are the subject of numerous investigations, partly motivated by the aim of developing more modular, low-cost hardware that would simplify maintenance and compatibility issues for manufacturers. However, they are suitable for some, but not all, applications because the latency of internode communications becomes extremely long (with respect to the processor cycle, 1 ns today) when the inter-computer distance reaches a few meters (1 meter = 5 ns). Therefore, a DMS will execute many applications (especially those requiring numerous internode exchanges) much more slowly than a TBM enclosed in a single cabinet.
We focus in this work on the communication challenges in TBM's and on the possible role of optical interconnects (OI's) in this context. The processor distance in these machines ranges from a few centimeters to a few tens of centimeters. Latency is the key parameter of the memory-processor interactions because the processors exchange very short bursts of information (at least one word, i.e., four bytes, and more often one cache block at a time, i.e., 32 or 64 bytes). They never establish a steady communication data stream with the memory. Although hundreds of network topologies have been proposed, only a handful have been implemented commercially, mostly shared busses, rings, meshes, tori and central switches. Obviously, all topologies are of equal conceptual interest. However, symmetric multiprocessors (SMP) represent a non-negligible fraction of the market, contrary to other topologies, which remain very marginal in practice. SMP's are built with a shared bus, which grants a uniform access time to the shared memory, provides a single addressing space and gives the programmer transparent access to computational resources. In SMP's, it is possible to distinguish three contributions to the memory access latency, namely: 1) The intrinsic memory latency, typically of the order of 40-60 ns today.
2) The bus contention latency, which critically depends on network saturation. The bus bandwidth directly controls the contention latency depending on the number of connected processors.
3) The coherence latency. Maintaining the coherence of the caches is absolutely mandatory in tightly bound machines and requires broadcasting (or multicasting) coherence messages through the communication network. This coherence preservation contributes to slowing down the memory access. Snooping protocols in bus-based machines rest on the property that the shared bus represents a synchronisation point watched by all processors, enabling simultaneous update or invalidation of duplicated copies of data in caches [1].
The paradox of SMP's is that the shared bus is simultaneously an enormous advantage and a terrible weakness. It is particularly advantageous as it enables implementing snooping protocols, which are much simpler and faster than the directory-based protocols that have to be used in distributed networks [2]. It is also a terrible weakness because the bus serialises the communications with the memory and therefore represents a transmission bottleneck, which increases the memory access latency. Modern machines mitigate this issue by separating the transfers of addresses and data, as will be shown below. However, the bandwidth of the electronic address bus represents the most problematic communication bottleneck of current and future SMP's. Although the transfer of data requires much more bandwidth, it is not a real issue as each processor can be connected to the crossbar switch by a private link (see Figure 1), the bandwidth of which can be increased with no coherence constraints.
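To make the role of the shared bus as a synchronisation point more concrete, the following Python sketch models a heavily simplified write-invalidate snooping scheme with MESI-like states. It is only an illustration under our own assumptions: the class names (`SnoopingBus`, `Cache`), the transaction names (`BusRd`, `BusRdX`) and the omission of write-backs and data transfers are all hypothetical simplifications, not a description of any specific machine.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class SnoopingBus:
    """Hypothetical shared address bus: each transaction is seen by all caches."""

    def __init__(self):
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast(self, sender, op, address):
        # The bus serialises transactions: one broadcast at a time, snooped by
        # every cache except the sender. Returns True if another cache held a copy.
        other_copy = False
        for cache in self.caches:
            if cache is not sender:
                other_copy |= cache.snoop(op, address)
        return other_copy


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}  # address -> State (absent means INVALID)
        bus.attach(self)

    def state(self, address):
        return self.lines.get(address, State.INVALID)

    def read(self, address):
        if self.state(address) is State.INVALID:
            other_copy = self.bus.broadcast(self, "BusRd", address)
            self.lines[address] = State.SHARED if other_copy else State.EXCLUSIVE

    def write(self, address):
        if self.state(address) in (State.INVALID, State.SHARED):
            # Read-for-ownership: all other copies get invalidated by snooping.
            self.bus.broadcast(self, "BusRdX", address)
        self.lines[address] = State.MODIFIED

    def snoop(self, op, address):
        if self.state(address) is State.INVALID:
            return False
        if op == "BusRd":
            # Simplification: a real design would also write back MODIFIED data.
            self.lines[address] = State.SHARED
        elif op == "BusRdX":
            self.lines[address] = State.INVALID
        return True


bus = SnoopingBus()
p0, p1 = Cache(bus), Cache(bus)
p0.read(0x40)   # p0 holds the only copy: EXCLUSIVE
p1.read(0x40)   # both copies become SHARED
p1.write(0x40)  # p1 becomes MODIFIED, p0 is invalidated
print(p0.state(0x40).name, p1.state(0x40).name)
```

The sketch relies on exactly the property stressed above: because every transaction passes through one serialised broadcast, all caches observe the same order of events and can invalidate their copies without any directory.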
The rest of the work is organised as follows: 1) We describe the current state of SMP's, particularly the communication performance; 2) Using a multiprocessor simulator, we study the address bandwidth requirements of the next generation of SMP's. Considering the performance of future superscalar processors, this study shows that an address-transaction bandwidth of 80-100 Gbit/s will be necessary to avoid slowing down the execution of applications in future large SMP's (including for instance 64 processors operating at 1 GHz). Reaching such a high bandwidth is particularly difficult using electric solutions because of the operation limitations of shared electric busses. It seems necessary to search for other implementations of the address transmission network in future SMP's; 3) We consider the introduction of OI's in this context. We review several solutions in ascending order of complexity of the optical subsystems, as one critical issue concerns the degree of sophistication, the feasibility and the cost of the optical solutions. We first consider the sole introduction of point-to-point OI's in the framework of an SMP chipset. The interest of OI's comes from their low energy consumption and small size, enabling the implementation of several thousands of optical input/outputs (IO's) per electronic chip, as needed by a multiprocessor chipset. Then we consider the introduction of an optical U-bus, which operates in the synchronous mode. The U-bus operates with no access arbitration and enables simultaneous transfers of different transactions to the bus. However, the price to pay is that the coherence of caches cannot be preserved by means of the usual snooping protocols (MESI protocol). New cache coherence protocols are under investigation.
CURRENT SMP ARCHITECTURES
The fast increase of the processor speed has induced a dramatic growth of the bandwidth needs in recent SMP's, as shown by the evolution of shared-bus machines. For instance, the XDbus from Sun is a 64-bit multiplexed address/data bus running at 40 MHz (10 slots), which uses a split-transaction protocol. Each bus transaction requires 2 bus cycles for an address transfer and 9 bus cycles for a cache block transfer. The raw bandwidth is 320 MB/s, but the effective bandwidth is only ¾ of this value. In the SGI Power Challenge [4], the split-transaction protocol is enhanced to take into account the presence of two busses clocked at 47.6 MHz (13 slots), one of 40 bits dedicated to the address transfer and another one of 256 bits used for the transfer of cache blocks. The two busses are synchronized and enable the transfer of the address and the data in 5 bus cycles. The busses can support up to 8 outstanding read requests. The raw bandwidth is 1.5 GB/s but the effective bandwidth reduces to 4/5 of this value. The Sun Enterprise 6000 also uses a non-multiplexed split bus, running at 83.5 MHz with 256 bits for the data and 41 bits for the address. This high operation frequency (for a shared bus) is due to a special backplane design (so-called "Centerplane" technology) where cards are connected on both sides (8 slots per side). The raw bandwidth is 2.6 GB/s and the number of outstanding read requests that can be in progress at the same time is 112 (up to 7 from each board, each processing board containing two processors). In recent SMP designs, such as the G30 machine from IBM or the 10000 from SUN [5], the data network is a crossbar, but the address network remains a shared bus used for maintaining coherency. Above a certain number of processing elements, multiple address busses are needed (see Figure 1).
For example, two XDbusses were implemented in the SparcCenter 2000 to attain a maximum number of 30 processors, and four busses in the Enterprise 10000 to attain 64 processors. With a smart design of the crossbar switch, the direct transfer of blocks from cache to cache becomes possible without updating the main memory. The most critical problem at the moment is the address/snoop bandwidth discussed in the next section.
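The raw bandwidths quoted in this section follow directly from the data width and the bus clock. The short Python sketch below recomputes them; the function name `raw_bandwidth` and the table layout are ours, and the effective fractions are simply the ones quoted in the text (none is given for the Enterprise 6000).

```python
def raw_bandwidth(width_bits, clock_hz):
    """Raw bus bandwidth in bytes per second: data width times clock rate."""
    return width_bits / 8 * clock_hz


# (data width in bits, clock in Hz, effective fraction quoted in the text or None)
BUSES = {
    "XDbus": (64, 40e6, 3 / 4),
    "SGI Power Challenge": (256, 47.6e6, 4 / 5),
    "Sun Enterprise 6000": (256, 83.5e6, None),
}

for name, (width, clock, fraction) in BUSES.items():
    raw = raw_bandwidth(width, clock)
    line = f"{name}: raw {raw / 1e6:.0f} MB/s"
    if fraction is not None:
        line += f", effective about {raw * fraction / 1e6:.0f} MB/s"
    print(line)
```

The computed raw values (320 MB/s, about 1.5 GB/s and about 2.7 GB/s) agree with the figures quoted above to within rounding.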
ADDRESS BUS TRAFFIC AND LIMITATIONS OF ELECTRIC BUSSES
We studied the address bus traffic using an instruction-driven simulator that enables tracing the bus operation during the execution of standard applications. The main features of superscalar processors (i.e., multiple instruction issue, dynamic scheduling, non-blocking reads, register renaming and speculative execution) are taken into account by the simulator and can be modified by means of adjustable parameters. Reference 6 describes the simulator and previous simulation results in detail, showing in particular the address traffic traces, which reveal the high irregularity of communications. To characterize the impact of the address bus bandwidth on the execution time of the applications, we assumed that the network and the memory are contention-free, so that the network transmits all the requests at each cycle.
The main practical conclusions are reported in Figure 2, which displays the execution time of different applications (Radix, Ocean, LU, FFT) of the SPLASH suite [7] as a function of the bus bandwidth. We assumed an intrinsic memory access time of 75 processor cycles, which is equivalent in practice to considering processors running at around 1-1.2 GHz. For clarity, all execution times have been normalized to 1 for an infinite bus bandwidth. Figure 2 shows that the execution time of most applications increases only very weakly when the address bus can transmit from 0.5 to 1 request per processor cycle (RPC). The worst case is that of the Radix application, with an execution time approximately multiplied by 2 when the bus bandwidth is 0.5 RPC. Now, each address transaction generally requires 4-5 bus cycles due to the operations of access arbitration, transmission and preservation of the coherence (see section 2). Thus, for processors operating at F = 1 GHz, the transmission of 0.5 RPC requires a bus operating 4 to 5 times faster (i.e., around 2-2.5 GHz). With 40-bit addressing, the required peak bus bandwidth is then in the range of 80-100 Gb/s. To avoid any confusion, it must be stressed that this bandwidth is not a point-to-point bandwidth, but that of a multipoint line, with the additional requirement that the propagation time must be shorter than one bus cycle (1 ns in our example) to be consistent with the bus operation principle. These constraints seem far beyond the capabilities of electric shared busses, which today are based on the Gunning Transceiver Logic (GTL) technology. They operate as a transmission line [8, 9]. Two points are important:
• The propagation time of logical signals per unit length of the electric bus is $\tau = \sqrt{LC}$, with L and C being the linear inductance (typ. 5-10 nH/cm) and the linear capacity of the bus, respectively. The capacity is mostly due to the transceivers that connect the different processors, each transceiver contributing typically $C_T \approx 20$ pF. The propagation time when connecting N processors over the length l is $\tau = \sqrt{L N C_T / l}$ (Eq. 1). For instance, $\tau$ equals 25 ns/m when one processor is connected every second cm.
• The bus operation frequency $F_B$ must be low enough to ensure that the bus has time to attain an equipotential state (despite propagation effects) during the bus cycle.
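As a rough check of Eq. 1, the Python sketch below evaluates the propagation delay of a bus whose capacity is dominated by the transceivers. The function name and argument names are ours; the numerical values (5-10 nH/cm, 20 pF per transceiver, one transceiver every 2 cm) are those quoted above.

```python
import math


def propagation_delay_s_per_cm(l_per_cm, c_transceiver, transceivers_per_cm):
    """Delay per cm of a loaded line, tau = sqrt(L * C), with the linear
    capacity C taken as the transceiver capacity spread along the bus."""
    c_per_cm = c_transceiver * transceivers_per_cm
    return math.sqrt(l_per_cm * c_per_cm)


for l_per_cm in (5e-9, 10e-9):  # linear inductance in H/cm
    tau = propagation_delay_s_per_cm(l_per_cm, 20e-12, 0.5)
    print(f"L = {l_per_cm * 1e9:.0f} nH/cm -> {tau * 100 * 1e9:.1f} ns/m")
```

For the two ends of the typical inductance range, the sketch gives roughly 22 and 32 ns/m, which brackets the 25 ns/m quoted in the text.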
With these constraints, even assuming a perfect adaptation of the termination load and no line reflection, the operation frequency of the bus cannot exceed $F_M = 1/(l\tau)$, or using Eq. 1, $F_M = 1/\sqrt{l L N C_T}$ (Eq. 2). This formula leads to very profound limitations. For instance, a 1.28 m-long bus connecting 64 processors (with $C_T \approx 20$ pF) could not operate at a frequency higher than 30 MHz. The address bandwidth, assuming 41-bit address transmission as in the SUN Enterprise 6000, would be of the order of 1.2 Gb/s, i.e., about two orders of magnitude lower than that deduced from the simulations! The only "palliative" solution consists of multiplying the address busses (for instance using 10 or 20 busses) to reduce the operation frequency of each one. With $N_B$ busses transmitting with a parallelism $N_W$ (typically 40), the accessible bandwidth B reads $B = N_B N_W F_M$. Using Eq. 2, the maximum bus length to transmit the bandwidth B with $N_B$ busses is:
$l = \frac{N_B^2 N_W^2}{B^2 L N C_T}$
Let us consider a numerical example with B = 100 Gb/s (result of simulations), $N_W = 40$ (bus width), N = 64 (number of processors), L = 5 nH/cm and $C_T = 20$ pF per transceiver. We deduce $l \le 2.4 \times 10^{-2} N_B^2$, with l in cm. Thus, with $N_B = 10$ busses, the maximum acceptable bus length would be 2.4 cm to connect 64 processors! The corresponding interprocessor board distance, shorter than 1 mm, seems unrealistic. Moreover, the system would be complicated, as each processor would have to be connected to 10 caches and to 10 arbiters (one per bus). To go around this issue, one may group several processors per daughter board. For instance, using quadriprocessor cards, the number of connected cards reduces to N' = 16. A bus length l = 10 cm becomes possible, with an interboard distance of 6 mm. Of course, neither the large number of busses, nor their necessary short length, nor some additional energy dissipation issues are insuperable obstacles that will block the feasibility of fully electronic SMP's in the future, but a technological break seems necessary. We discuss in section 4 several strategies for introducing OI's in SMP's.
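The chain of estimates in this section can be retraced with a few lines of Python. The sketch below is ours, not the simulator: the function names are hypothetical, and the bus-length bound is obtained by assuming that the per-bus frequency B/(N_B N_W) is substituted into Eq. 2 and solved for l.

```python
import math


def required_address_bandwidth(rpc, processor_hz, bus_cycles_per_transaction, address_bits):
    """Address bandwidth (bit/s) needed to carry `rpc` requests per processor cycle."""
    bus_hz = rpc * processor_hz * bus_cycles_per_transaction
    return bus_hz * address_bits


def max_bus_frequency(length_cm, l_per_cm, n_nodes, c_transceiver):
    """Eq. 2: F_M = 1 / sqrt(l * L * N * C_T), with l in cm and L in H/cm."""
    return 1.0 / math.sqrt(length_cm * l_per_cm * n_nodes * c_transceiver)


def max_bus_length_cm(bandwidth, n_buses, width_bits, l_per_cm, n_nodes, c_transceiver):
    """Longest bus (cm) for which Eq. 2 still allows the per-bus frequency
    B / (N_B * N_W) needed to carry the total bandwidth B."""
    f_required = bandwidth / (n_buses * width_bits)
    return 1.0 / (f_required ** 2 * l_per_cm * n_nodes * c_transceiver)


# 0.5 RPC at 1 GHz, 4 to 5 bus cycles per transaction, 40-bit addresses
for cycles in (4, 5):
    b = required_address_bandwidth(0.5, 1e9, cycles, 40)
    print(f"{cycles} bus cycles/transaction -> {b / 1e9:.0f} Gb/s")

# 1.28 m bus, 64 processors, L = 5 nH/cm, C_T = 20 pF
print(f"F_M = {max_bus_frequency(128, 5e-9, 64, 20e-12) / 1e6:.0f} MHz")

# B = 100 Gb/s over 10 busses of 40 bits, with 64 nodes or 16 quad-processor cards
for nodes in (64, 16):
    l_max = max_bus_length_cm(100e9, 10, 40, 5e-9, nodes, 20e-12)
    print(f"N = {nodes}: l_max = {l_max:.1f} cm")
```

With L = 5 nH/cm the sketch gives about 35 MHz for the 1.28 m bus and about 2.5 cm and 10 cm for the two bus-length cases, of the same order as the 30 MHz, 2.4 cm and 10 cm quoted in the text; the small differences depend on the inductance value and rounding used.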
IMPLEMENTATION OF OPTICAL COMMUNICATIONS IN SMP'S
There are many technological arguments to suggest the introduction of OI's. Perhaps the following two are the most important:
• Transmitting a few Gb/s with OI's is much simpler than with electric lines. There are no line adaptation problems and no capacity effects to slow down the propagation velocity. Remember that the propagation time equals 5 ns/m in optical fibers, which is much less than the value of 25 ns/m deduced previously for the GTL bus with one transceiver connected every second cm (see Eq. 1). Thus, an optical bus (which is of course a much more complicated optical system than a simple point-to-point optical link) could a priori operate at a higher frequency than an electric one.
• The switch energy per bit is lower for OI's than for electric lines. The energy dissipated for transmitting one bit in an electric line is $E = C_{IO} V^2$, where V is the chip operation voltage and $C_{IO}$ the IO capacity, mainly due to packaging. The study of the evolution of VLSI's shows that $C_{IO}$ is of the order of 10 pF and does not evolve with time, despite the reduction of the lithographic features and of the size of transistors in the chips. Assuming an operation voltage of 2 V, the switch energy is 40 pJ/bit. Let us now consider the switch energy of an optoelectronic interface. Today, VCSEL's with a threshold current of 0.1 mA have been demonstrated. They can operate with a modulation current of 0.2 mA. At 1 GHz, the emission energy per bit is of the order of 0.2 pJ/bit (i.e., 0.2 mA × 2 V × 500 ps). The estimation of the energy involved in the reception stage is more complicated and depends on the detector capacity $C_{PD}$ (which is proportional to the detector area). $C_{PD}$ is typically of the order of 0.5 fF/µm². There is a necessary tradeoff between making the detector as small as possible (to run fast and to minimize the consumption) and maintaining an acceptable coupling efficiency of the diode with the external incoming beam, which requires a minimum detector surface. Assuming for instance a detector area of 20×20 µm² and an operation voltage of 2 V, the detector switch energy is 0.8 pJ/bit (again from $E = C V^2$).
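The energy figures of this comparison can be reproduced with a short Python sketch. The function names are ours, and the detector capacity of 0.2 pF used below is an assumption inferred back from the quoted 0.8 pJ/bit at 2 V, not a measured value.

```python
def capacitive_switch_energy(capacitance_f, voltage_v):
    """Energy per bit of a capacitive load, E = C * V^2 (as used in the text)."""
    return capacitance_f * voltage_v ** 2


def emitter_energy(current_a, voltage_v, on_time_s):
    """Energy per bit of the emitter: modulation current * voltage * on-time."""
    return current_a * voltage_v * on_time_s


e_electric = capacitive_switch_energy(10e-12, 2.0)   # 10 pF IO pad at 2 V
e_vcsel = emitter_energy(0.2e-3, 2.0, 500e-12)       # 0.2 mA, 2 V, 500 ps
e_detector = capacitive_switch_energy(0.2e-12, 2.0)  # assumed 0.2 pF detector
e_optical = e_vcsel + e_detector

print(f"electric IO : {e_electric * 1e12:.1f} pJ/bit")
print(f"optical IO  : {e_optical * 1e12:.1f} pJ/bit "
      f"(emitter {e_vcsel * 1e12:.1f} + detector {e_detector * 1e12:.1f})")
print(f"ratio       : {e_electric / e_optical:.0f}x")
```

Under these assumptions the optical interface needs about 1 pJ/bit against 40 pJ/bit for the electric IO, i.e., a factor of about 40.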
The total switch energy is of the order of 1 pJ/bit, that is to say much smaller than that deduced for the electric interconnect. One may expect an important reduction of the energy consumption of interchip communications. We stress that reducing the energy dissipated in interchip communications is crucial for the evolution of VLSI circuits, as the energy is more and more dissipated by the communications rather than by the logical processing [12]. We review several solutions in ascending order of complexity of the optical subsystems, as one critical issue for the viability of optical solutions concerns their degree of sophistication and their cost.
Multiprocessor Chipset
In fact, most of the limitations of the electric bus arise from the sole fact that it is a transmission line, which slowly reaches the equipotential state due to the capacity of the connected transceivers. Thus, integrating the whole bus (or possibly different busses, as shown in Figure 1) in a single VLSI would dramatically diminish the electric constraints, particularly those resulting from the "equipotentiality" condition of the bus. Note that integrating the bus does not necessarily imply maintaining the bus architecture in the new chip (particularly the presence of long parallel electric lines). One may consider constructing a logical circuit that mimics the bus operation, mainly the mechanism of arbitration, the serialization of the address transactions with the memory, and the preservation of the cache coherence. Replacing the address bus by a central switch is nothing but the continuation of an idea already put into practice in the monoprocessor architecture with the chipset integration. However, the generalization to multiprocessors with the definition of a multiprocessor chipset (MPC) is not simple for several technological reasons. The most important obstacle is the chip connectivity. Connecting N processors to the MPC would require at least 64×N pins (assuming 64-bit wide transmissions), plus 64×M extra connections to M memory chips, plus some more pins for the power supply, ground, etc. With N = 32 processors and M = 4 memory chips, the number of pins would be around 2300. This estimation does not include the data transfer network (i.e., the private links to the crossbar), so that integrating both the data and the address network would require of the order of 5000 pins. The mechanical feasibility of such a high number of pins has so far not been demonstrated.
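The pin estimate for the address part of the MPC is a simple count, sketched below in Python. The function name `mpc_signal_pins` is ours, and power, ground and the data network are deliberately left out, as in the estimate above.

```python
def mpc_signal_pins(n_processors, n_memory_chips, link_width_bits=64):
    """Signal pins for the address network only: one link of
    `link_width_bits` per processor and per memory chip."""
    return link_width_bits * (n_processors + n_memory_chips)


print(mpc_signal_pins(32, 4))  # 32 processors, 4 memory chips
```

For 32 processors and 4 memory chips this gives 2304 signal pins, consistent with the figure of about 2300 quoted in the text.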
OI's might contribute to solving this problem. The vertical emission of VCSEL's enables integrating optical IO's anywhere on a VLSI chip, not only on the perimeter. Arrays of 8x8 emitters have been demonstrated using the hybrid integration of III-V VCSEL's on top of CMOS 0.25 µm drivers. Similarly, arrays of Si-diode detectors with their transimpedance amplifiers have also been integrated in CMOS circuits. The inter-emitter (or inter-receiver) distance, of the order of 400 µm, enables considering a density of about 600 optical IO/cm2. Therefore, OI's represent a technological breakthrough that could make it possible to reach the IO densities in the range of several thousands necessary for an MPC. This will also be possible because the low switch energy of OI's (0.2 mW for a VCSEL operating at 1 GHz with a current of 0.2 mA) would dramatically contribute to reducing the energy consumption of the chip. There are several solutions to connect the emitters and the receivers of different chips. One may consider free space interconnects [14], or guided transmissions through an optical backplane [15] or ribbons of fibers. The different solutions seem viable, but the simplicity of implementation, the mechanical stability (which is a critical feature of optical systems) and above all the cost will be the final decisive arguments.
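The IO density quoted above follows from the emitter pitch on a square grid. The Python sketch below makes the arithmetic explicit; the function names are ours, and the chip area printed for 5000 IO's is only an order-of-magnitude illustration under the assumption that the whole area is tiled at the same pitch.

```python
def io_density_per_cm2(pitch_um):
    """Optical IO's per cm^2 for a square grid with the given pitch in micrometres."""
    per_cm = 10_000 / pitch_um  # 1 cm = 10,000 micrometres
    return per_cm ** 2


def area_for_ios_cm2(n_ios, pitch_um):
    """Area (cm^2) needed to place n_ios on a square grid of the given pitch."""
    return n_ios / io_density_per_cm2(pitch_um)


print(f"{io_density_per_cm2(400):.0f} IO/cm2 at a 400 um pitch")
print(f"{area_for_ios_cm2(5000, 400):.1f} cm2 for 5000 IO's")
```

A 400 µm pitch gives 625 IO/cm2, i.e., the "about 600" quoted in the text.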
Optical bus
The MPC approach suggested in the previous paragraph only requires point-to-point optical interconnects and is based on the underlying idea of extending the bus bandwidth by electronic integration without questioning the basic operation principle of the system, in particular the serialization of the address transactions to the memory. An alternative way around the address bandwidth bottleneck in future SMP's might consist of relaxing the bus access arbitration. Two options are possible:
• Considering this approach in the framework of the MPC. It is possible to consider the integration of a true optical bus with the tools of integrated optics, i.e., optical guides, Y switches, etc. It is also possible to consider a purely electric solution with a new design of an MPC architecture that would enable simultaneous multiple accesses.
• Implementing a real optical U-bus as shown in Figure 4 [16, 17]. The bus operation is described below. Naturally, the construction of an optical bus (on-chip or off-chip) implies a greater involvement of optical technologies than the MPC approach described in section 4.1. The bus is a multipoint optical line. In particular, it requires that each node adds and drops optical data, with concomitant energy loss and energy balance issues.
The U-bus may operate in the synchronous and asynchronous modes. The synchronous operation is based on two phases, namely:
• The LOAD phase. Each processor inserts (if needed) one optical pulse in the bus LOAD zone (see Fig. 4). Although the pulse duration depends on the optical emitter technology, durations as short as a few tens of picoseconds are easily achieved today with mode-locked lasers or fibers. The only geometric restriction to avoid access arbitration is that the internode distance L must be longer than the optical pulse (i.e., $L > c W_p$, where $W_p$ is the duration of the pulse). In that case, there is no overlapping of data from adjacent processors during the load phase. Considering a bus as wide as the address size makes it possible to transmit one full address per cycle.
• The propagation phase. It follows the LOAD phase and lasts until the address inserted by the first node has passed the last one (nodes 1 and 5, respectively, in Figure 4). This time T is necessary to completely free the LOAD zone from data.
T depends on the length of the LOAD zone, and can be as short as $N W_p$, where N is the number of connected processors. Thus, an optical bus may potentially transmit one address to the memory every $W_p$ seconds. For instance, with 10-ps pulses and N = 64 processors, the bus access latency could be as short as 640 ps and the bandwidth as high as 100 GAddress/s. This extremely high bandwidth is too large for existing memories, but it demonstrates the potential of the optical bus. In practice, the solution to match the bus bandwidth with that of the memory consists of moving the nodes further apart to slow down the bus traffic (which also increases the bus access latency). For instance, with one node every 20 cm, the U-bus bandwidth reduces to 1 GAddress/s. Bus operation up to a few GHz (or higher) becomes possible (in replacement of electric busses operating around a few tens of MHz) because the transmission of optical pulses in guides is not penalized by the capacity effects and critical load adaptations encountered for electrical transmissions in a multipoint line. Moreover, the parallel transmission through optical lines is almost skew-free in the GHz domain for transmission over a few tens of meters. This simplifies data sampling in the case of parallel transmissions. As a result, the SMP architecture (i.e., the processor, the bus and the memory) would become more scalable.
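The timing relations of the synchronous U-bus can be summarised in a small Python sketch. The function names are ours, and the propagation velocity is an assumption taken from the 5 ns/m figure quoted earlier for optical fibers (i.e., 2×10^8 m/s).

```python
C_GUIDE_M_PER_S = 2.0e8  # assumed velocity in the guide, from 5 ns/m


def min_node_spacing_m(pulse_s):
    """Smallest internode distance that avoids pulse overlap: L > c * W_p."""
    return C_GUIDE_M_PER_S * pulse_s


def propagation_phase_s(n_nodes, pulse_s):
    """Shortest propagation phase T = N * W_p (nodes at minimum spacing)."""
    return n_nodes * pulse_s


def peak_address_rate(pulse_s):
    """One address every W_p seconds on a bus as wide as the address."""
    return 1.0 / pulse_s


def address_rate_for_spacing(spacing_m):
    """Address rate when the nodes are moved apart to `spacing_m` metres."""
    return C_GUIDE_M_PER_S / spacing_m


w_p = 10e-12  # 10-ps pulses
print(f"min spacing  : {min_node_spacing_m(w_p) * 1e3:.0f} mm")
print(f"latency (64) : {propagation_phase_s(64, w_p) * 1e12:.0f} ps")
print(f"peak rate    : {peak_address_rate(w_p) / 1e9:.0f} GAddress/s")
print(f"20 cm spacing: {address_rate_for_spacing(0.20) / 1e9:.0f} GAddress/s")
```

With 10-ps pulses the sketch reproduces the 640 ps latency and 100 GAddress/s of the example, and a 20 cm node spacing brings the rate down to 1 GAddress/s.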
Dropping the access arbitration to the U-bus simplifies the implementation and extends the bandwidth, but unfortunately there are also several bad consequences. The price to pay is very high in terms of system operation, due to the impossibility of maintaining the coherence of caches using the snooping protocols of standard busses [2]. It must be stressed that maintaining the coherence of caches is absolutely mandatory to preserve the machine performance. Dropping the coherence would have two dramatic consequences:
• A very important increase of the bus traffic, as each processor would have to fetch the data operands in the memory (caching shared data would not be allowed).
• A dramatic increase of the data access time because of the obligation to read in the memory. In the presence of the full hierarchy, the average data access time is of the order of 2-3 processor cycles, and would be close to 100 cycles without caches.
These simple considerations demonstrate that changing the SMP architecture might be meaningless if the coherence of caches cannot be preserved. Of course, maintaining the coherence is always possible with the implementation of directory-based protocols, but they are much slower than the snooping solutions due to the increase of the coherence traffic. At the moment, it is not demonstrated that the advantages gained by dropping the access arbitration in a U-bus will offset the consequences related to the implementation of directory-based protocols. A new snooping solution might be possible, as the U-bus serializes the accesses to the memory so that all processors can simultaneously watch the memory transactions at point S in Figure 4. Nevertheless, the difficulty is that the atomicity of transactions is not preserved, which complicates the actions needed to maintain the coherence of caches. Adapted coherence protocols are under investigation.
CONCLUSION
We showed that the increase of the processor power leads to dramatic bandwidth needs for future SMP's, especially concerning the address transmission, which is complicated by the obligation to maintain the coherence of caches. Neither the increase of the number of busses, nor their necessary small length, nor some possible energy dissipation issues are insuperable obstacles that will block the feasibility of fully electronic SMP's in the future, but a technological break seems necessary. The integration of OI's can help in designing new SMP's. We considered two possible solutions.
• The MPC approach requires only point-to-point optical interconnects and is based on two underlying ideas: 1) Including the minimum of optical technology that significantly improves the communication system, because of the economic risk factor; 2) Extending the bus bandwidth by electronic integration in a single VLSI without questioning the basic operation principle of the system, in particular the serialization of the address transactions to the memory. The interest of OI's comes from their low energy consumption and small size, enabling the integration of the several thousands of optical input/outputs per electronic chip needed by the MPC.
• The optical bus approach is based on the construction of a multipoint optical line. This alternative requires more sophisticated optical components, which are often expensive and not very integrated at the moment. Optical solutions with ribbons of Y switches are not available. The only possible solution seems to be an optical backplane with diffractive gratings for coupling the different processors to the optical bus. It is clear that this technology is at the moment expensive and in a preliminary development stage. The most critical question for the viability of this approach is the preservation of the coherence of caches, which is mandatory for the efficient operation of an SMP.
