We briefly review some of the communication issues in computers and conclude that the introduction of optical interconnects might contribute more to solving the foreseeable communication bottlenecks in multiprocessors than in monoprocessors.
Introduction
Perhaps the greatest challenge of computer architecture is that the performance of microprocessors has increased much faster than that of the memory system. Thus, the time needed by the processor to read instructions or data from the main memory has continuously increased (in terms of processor cycles), making direct exchanges with the main memory more and more penalizing for the global performance of all machines. Of course, the architecture of chips and computers has continuously evolved. However, as it has been impossible to change the memory technology and to shorten the memory access latency (MAL) in accordance with the processor demand, the evolution has mostly consisted of hiding the MAL with hardware or software solutions [1]. Modern microprocessors are designed to hide the latency using speculative and out-of-order execution. Multithreading [2] is a software solution to hide the MAL. Moreover, all general-purpose microprocessors operate with a hierarchy of caches L1, L2 (and sometimes L3) to limit the frequency of accesses to the main memory.
We analyze the opportunity to insert optical interconnects (OI) in this context. Communication issues are best described in terms of latency [3]. Concretely, an access to L1 costs 1 processor cycle and an access to L2 typically 4-5 cycles. The latency T_R(N) of a READ operation in the memory is particularly critical as it can stall the processor for a long time. It lasts from 50 to 100 processor cycles in monoprocessors, and up to several thousand cycles in complex multiprocessors. T_R(N) can be decomposed approximately as follows:
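Eq. 1 itself is not reproduced in this text. Assuming that the terms defined just below simply add up, with a transfer time N/(N_P F) and a per-hop cost (T_IN + T_BN) repeated K times, it can be sketched as follows. This additive form is an assumption that agrees with the later statements that T_IN vanishes for K=1 and that N/N_P=1 corresponds to a one-cycle transfer; it is not necessarily the authors' exact expression.

```latex
% Reconstruction under the stated assumption (additive latency terms)
T_R(N) \approx T_M + \frac{N}{N_P\,F} + K\,\bigl(T_{IN} + T_{BN}\bigr) + T_C \qquad (1)
```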
where T_M is the intrinsic memory access time (today, typically 50-100 ns, directly related to the DRAM technology), N is the number of bits transferred from the memory in the READ operation (often a cache block, i.e., 32 or 64 bytes), N_P is the internode transmission parallelism, F is the internode transmission frequency per line (so that F·N_P is the internode transmission bandwidth), T_IN is the internode transmission latency, directly related to the average internode distance, T_BN is the node bypass latency, which depends on the node switch electronics, K is the average number of nodes between the processor and the memory (i.e., the average message distance), and T_C is the latency to maintain the coherence of caches, which can be long in large multiprocessors operating with a directory protocol [1]. Eq. 1 is approximate but it clearly shows that: 1) the fundamental memory access problem consists of reducing the dominant latency term (DLT); 2) believing that increasing the internode transmission bandwidth F·N_P (possibly with OI's) automatically improves the system performance (somewhat as in telecommunication networks) is misleading when the transfer term N/(N_P·F) is not the DLT.
We briefly discuss Eq. 1 in monoprocessors, weakly bound systems, symmetric multiprocessors (SMP) and k-ary N-cubes. We conclude that the introduction of OI's seems more favorable in multiprocessors. The interest in introducing OI's in computers comes more from technological arguments (e.g., higher interconnect density, high parallelism, low energy consumption, improved noise immunity and security) than from a potential renewal of computer architectures.
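The role of the dominant latency term can be explored numerically. The sketch below assumes that the latency terms of Eq. 1 simply add up (memory access, block transfer, K network hops, cache coherence). The class name, the function name and all numerical values are illustrative assumptions, not measurements from the cited works.

```python
from dataclasses import dataclass


@dataclass
class ReadLatency:
    """Terms of an assumed additive READ-latency model, all in nanoseconds."""
    t_m: float       # intrinsic memory access time T_M
    transfer: float  # block transfer time N / (N_P * F)
    network: float   # K * (T_IN + T_BN)
    t_c: float       # cache-coherence latency T_C

    def total(self) -> float:
        return self.t_m + self.transfer + self.network + self.t_c

    def dominant(self) -> str:
        """Name of the dominant latency term (DLT)."""
        terms = {
            "T_M": self.t_m,
            "transfer": self.transfer,
            "network": self.network,
            "T_C": self.t_c,
        }
        return max(terms, key=terms.get)


def read_latency(n_bits, n_p, f_gbps, t_m, k, t_in, t_bn, t_c):
    # f_gbps is in Gb/s per line, i.e. bits per ns, so the transfer term is in ns.
    return ReadLatency(t_m, n_bits / (n_p * f_gbps), k * (t_in + t_bn), t_c)


# Hypothetical monoprocessor-like values: 512-bit block, 64 lines, T_M = 60 ns.
base = read_latency(n_bits=512, n_p=64, f_gbps=1.0,
                    t_m=60.0, k=1, t_in=0.0, t_bn=2.0, t_c=0.0)
# Same system with a tenfold increase of the line frequency F.
wide = read_latency(n_bits=512, n_p=64, f_gbps=10.0,
                    t_m=60.0, k=1, t_in=0.0, t_bn=2.0, t_c=0.0)

print(base.dominant(), base.total(), wide.total())
```

With these assumed values, T_M is the DLT in both cases, and a tenfold bandwidth increase shortens the READ latency only from 70 ns to about 62.8 ns, i.e., by roughly 10%.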
Monoprocessors
OI's add an optoelectronic conversion latency of the order of 100-300 ps [4], which degrades the global performance when it is crucial to reach ultrashort transfer times between processing units, registers or caches. Simulations show that introducing OI's between the registers and L1, or between L1 and L2, degrades the performance of the monoprocessor [5]. The logic of chip integration is to implement the functional units that are very sensitive to the transfer latency side by side, without resorting to OI's. OI's could be reserved for "long-distance" transmissions, as simple calculations show that the latency of an OI is shorter than that of an electric resistive line for distances typically longer than 1 cm [6]. However, the limitations of the L2-memory transfers might become severe in 5 to 10 years from now if the processor power keeps growing exponentially as observed over the last two decades. Future processors could execute up to one hundred billion instructions per second (i.e., 10^11 instructions/s), with a few percent of these instructions requiring memory accesses due to cache misses. This demand will put severe stress on the memory. The most urgent task will be to improve the memory system (which is the DLT) and, depending on the memory evolution, to increase the communication bandwidth between the memory and L2. OI's could help here to reach a bandwidth in the range of 100 Gb/s between L2 and the memory with cost-effective solutions.
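To see why a conversion latency of 100-300 ps matters close to the processor, it can be expressed in processor cycles and compared with the 1-cycle L1 access. This is a minimal sketch; the function name and the clock frequencies of 1 and 3 GHz are assumptions chosen for illustration.

```python
def conversion_overhead_cycles(latency_ps: float, clock_ghz: float) -> float:
    """Optoelectronic conversion latency expressed in processor cycles."""
    cycle_ps = 1000.0 / clock_ghz  # one cycle lasts 1000/f picoseconds at f GHz
    return latency_ps / cycle_ps


for clock_ghz in (1.0, 3.0):  # assumed clock frequencies
    low = conversion_overhead_cycles(100.0, clock_ghz)
    high = conversion_overhead_cycles(300.0, clock_ghz)
    print(f"{clock_ghz} GHz: {low:.2f} to {high:.2f} cycles")
```

Under these assumptions the conversion alone costs 0.1 to 0.3 cycle at 1 GHz and 0.3 to 0.9 cycle at 3 GHz, which is a large fraction of an L1 access and grows with the clock frequency.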
Multiprocessors: weakly bound systems
T_M is generally not the dominant latency term in multiprocessor machines. Because of the small granularity of transfers, interprocessor communications rest on distributed control and on the transmission of packets rather than on the reservation of paths. The latency primarily depends on the network topology and on the conditions of propagation through the network, i.e., on the internode propagation time T_IN and on the node bypass time T_BN. When T_IN is much larger than T_BN (the case of distant nodes), there is no simple solution to shorten the latency of interprocessor communications, as T_IN is incompressible. Indirect methods consist of: 1) increasing the granularity (i.e., increasing N) and/or the size of caches to reduce the frequency of transfers; 2) increasing the network connectivity to reduce the distances; 3) increasing the number of threads. OI's cannot compress T_IN. They can help to increase the internode bandwidth and to reduce the network cost.
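The incompressibility of T_IN and the benefit of a larger granularity can be illustrated with a single-hop estimate. The sketch assumes a signal velocity of 0.2 m/ns (about two thirds of the speed of light, a typical order of magnitude for both optical fibers and electrical cables), neglects T_BN, and uses hypothetical function names, distances and link parameters.

```python
def propagation_ns(distance_m: float, velocity_m_per_ns: float = 0.2) -> float:
    """Internode propagation time T_IN; the velocity is an assumed value."""
    return distance_m / velocity_m_per_ns


def transfer_ns(n_bits: int, n_p: int, f_gbps: float) -> float:
    """Transfer time N / (N_P * F), with F in Gb/s per line (bits per ns)."""
    return n_bits / (n_p * f_gbps)


def message_ns(distance_m: float, n_bits: int, n_p: int, f_gbps: float) -> float:
    """One-hop message latency, node bypass time neglected."""
    return propagation_ns(distance_m) + transfer_ns(n_bits, n_p, f_gbps)


# Two nodes 10 m apart (assumed), 512-bit packet, 16 parallel lines.
slow = message_ns(10.0, 512, 16, 1.0)    # 1 Gb/s per line
fast = message_ns(10.0, 512, 16, 10.0)   # 10 Gb/s per line
# Larger granularity: 4096 bits per message over the faster link.
per_bit_small = fast / 512
per_bit_large = message_ns(10.0, 4096, 16, 10.0) / 4096

print(slow, fast, per_bit_small, per_bit_large)
```

With these values the latency drops from about 82 ns to about 53 ns when the line rate is multiplied by ten, but it cannot go below the 50 ns propagation floor. Sending larger messages does not shorten any single transfer; it only spreads T_IN over more bits.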
Symmetric Multiprocessors
An SMP architecture operates as a multiplexer/demultiplexer, which serializes the communications of all processors with the memory [1, 7, 8]. It is logically equivalent to a single symmetric node, so that K=1 and T_IN vanishes in Eq. 1. Serialization is a paradoxical property. On the one hand, it preserves the critical role of caches in the multiprocessor environment by quickly solving the cache coherence issue with some snooping protocol [1]; on the other hand, it generates a communication bottleneck, which extends the memory access latency. T_BN becomes the DLT when the number of processors increases (i.e., T_BN > T_M). Simulations show that an address-bus bandwidth in the range of 100-200 Gb/s, apparently beyond the capabilities of shared electric buses [9], will be needed so as not to slow down the operation of future superscalar SMP's [10]. A possible strategy to get around this limitation might consist of integrating the bus in a single VLSI chip to reduce all dimensions and to relax the electric constraints, particularly those resulting from the "equipotential-bus" condition. Replacing the address bus by a central switch is nothing but the continuation of an idea already put into practice in the monoprocessor architecture with the so-called chipset integration. However, its generalization to a multiprocessor chipset (MPC) is not trivial. The most important obstacle is the chip connectivity. Connecting N processors to the MPC would require at least 64*N pins for addresses (the factor 64 corresponds to today's 64-bit transmission parallelism), plus 64*M extra connections to M memory chips, plus some more pins for the power supply, ground, etc. For instance, with N=64 processors and M=4 memory chips, the number of pins would be of the order of 4500. This estimate does not include the data transfer network, so that connecting both the data and the address networks to the MPC might require about 10000 pins.
So far, the mechanical feasibility of such a high number of electrical pins has not been demonstrated. By contrast, it is possible with OI's [11]. The integration of OI's in an optoelectronic MPC is a solution that might help in pushing back the communication bottleneck in SMP's [9].
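The pin estimate for the MPC can be checked with a few lines of arithmetic. The function name is hypothetical, and power and ground pins are left out because the text does not quantify them.

```python
def mpc_address_pins(n_processors: int, n_memory_chips: int, width_bits: int = 64) -> int:
    """Signal pins of the address network: one width_bits-wide port per
    processor and per memory chip (power and ground pins not counted)."""
    return width_bits * n_processors + width_bits * n_memory_chips


address_pins = mpc_address_pins(n_processors=64, n_memory_chips=4)
print(address_pins)
```

The result, 4352 signal pins, is consistent with the order of 4500 once supply pins are added. Assuming a data network of comparable width roughly doubles the count, which approaches the 10000 pins quoted above.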
Multiprocessors: k-ary N-cubes
Though a wide variety of networks has been studied and implemented for massively parallel machines (MPM, see for instance a brief overview in ref. [12]), most of today's MPM's use meshes or tori, which belong to the general family of k-ary N-cubes [13]. The communication performance of these machines is primarily controlled by the conditions of packet propagation through the interprocessor network. The average message distance K is large when there are few interconnects leaving each node. Increasing the connectivity reduces K but makes the node router more complex and therefore increases T_BN. So, which is the best network in terms of latency? It is easy to calculate K as it only depends on topological considerations. It is much more difficult to calculate T_BN as it depends on the node routing strategy. The challenge is to make the router latency T_BN as short as possible (in terms of electronic cycles) without reducing the router functionality, while keeping the circuit complexity (i.e., the cost) within acceptable limits. An extensive discussion of these constraints is reported in [14] with numerous references. Depending on the cost/performance ratio, the choice of the network connectivity, topology, routing strategy and node complexity is open.
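Since K depends only on the topology, it can be computed directly. The sketch below assumes uniformly distributed destinations (the source node included) and minimal routing, and computes the average distance by brute force along one dimension. The function names are hypothetical, and `n_dims` stands for the dimension N of the k-ary N-cube.

```python
def avg_ring_distance(k: int) -> float:
    """Average hop count along one dimension with wraparound links (torus)."""
    return sum(min(d, k - d) for d in range(k)) / k


def avg_line_distance(k: int) -> float:
    """Average hop count along one dimension without wraparound (mesh)."""
    return sum(abs(i - j) for i in range(k) for j in range(k)) / (k * k)


def avg_distance(k: int, n_dims: int, wraparound: bool = True) -> float:
    """Average message distance K: the dimensions contribute independently."""
    per_dim = avg_ring_distance(k) if wraparound else avg_line_distance(k)
    return n_dims * per_dim


# Three ways to connect 4096 nodes (k ** n_dims == 4096 in each case).
for k, n_dims in ((64, 2), (16, 3), (2, 12)):
    print(k, n_dims, avg_distance(k, n_dims))
```

For 4096 nodes, K falls from 32 (2D torus) to 12 (3D torus) and 6 (hypercube), while the number of links per node rises from 4 to 6 and 12, which is the connectivity versus router complexity trade-off described above.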
What could be the role of OI's in this context? They can help to reduce the transfer term N/(N_P·F) by increasing the internode transmission frequency F or the parallelism N_P. N_P is currently 16 or 32 and could be extended to 256 or 512 (i.e., the cache block size in bits) to transfer cache blocks between nodes in one cycle (i.e., N/N_P=1). The role of OI's can be important, although it is clear that the heart of the matter is essentially the router optimization.
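The effect of the parallelism on the block transfer can be made concrete by counting transmission cycles. This is a minimal sketch with a hypothetical function name, assuming one bit per line and per cycle.

```python
def transfer_cycles(block_bits: int, n_p: int) -> int:
    """Number of transmission cycles N / N_P, rounded up to a whole cycle."""
    return -(-block_bits // n_p)  # integer ceiling division


for n_p in (16, 32, 256, 512):
    print(n_p, transfer_cycles(512, n_p))
```

A 64-byte (512-bit) cache block needs 32 cycles with N_P=16, 16 cycles with N_P=32, and a single cycle with N_P=512.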
