Abstract. The extremely high data rates of optical computing technology (100 Mwords/s and upward) present unprecedented challenges in the dynamic memory design. An optical fiber loop used as a delay line is the best candidate for primary, dynamic memory at this time. However, it poses special problems in the design of algorithms due to synchronization requirements between the loop data and the processor. We develop a theoretical model, which we call the loop memory model (LLM), to capture the relevant characteristics of a loop-based memory. An important class of algorithms, ascend/descend-which includes algorithms for merging, sorting, discrete Fourier transformation (DFT), matrix transposition, and multiplication and data permutation-can be implemented without any time loss due to memory synchronization. We develop both sequential and parallel implementations of ascend/descend algorithms and some matrix computations. Some lower bounds are also demonstrated.
Introduction

Success of Very Large Scale Integration Theory
Over the last 15 yr, very large scale integration ͑VLSI͒ has moved from being a theoretical abstraction to being a practical reality. As VLSI design tools and VLSI fabrication facilities such as the metal-oxide semiconductor system ͑MOSIS͒ became widely available, the algorithm design paradigms such as systolic algorithms, 1 which were thought to be of theoretical interest only, have been used in highperformance VLSI hardware. Along the same lines, the theoretical limitations of VLSI predicted by area-time tradeoff lower bounds 2 have been found to be important limitations in practice.
Promise of Optical Computing
The optical computing technology is considered to be one of the several technologies that could provide a boost of two to three orders of magnitude in computing speed over the currently used semiconductor ͑silicon͒-based technology. The field of electro-optical computing, however, is in its infancy stage; comparable to the state of VLSI technology, say, 10 yr ago. Fabrication facilities for many key electro-optical components are not widely availableinstead, the crucial electro-optical devices must be specially made in the laboratories. However, a number of prototype electro-optical computing systems have been built recently, perhaps most notably at Bell Laboratories under Huang 3, 4 and also at Boulder under Jordan ͑see later͒. In addition, several optical message-routing devices have been built at Boulder, 5 Stanford, and the University of Southern California 6-9 ͑USC͒. Thus optical computing technology has matured to the point that prototypical computing machines can be built in the research laboratories. But it certainly has not attained the maturity of VLSI technology. What is likely to occur in the future? The technology for electro-optical computing is likely to advance rapidly through the 1990s, just as VLSI technology advanced in the late 1970s and 1980s. Therefore, following our past experience with VLSI, it seems likely that the theoretical underpinnings for optical computing technology, namely, the discovery of efficient algorithms and of resource lower bounds, are crucial to guide its development. This seems to us to be the right moment for algorithm designers to get involved in this enterprise. A careful interaction between the architects and the algorithm designers can lead to a better-thought-out design. This paper is an attempt in that direction.
Data Storage: Key Problem in Optical Computing
The optical computing technology can obtain extremely high data rates beyond what can be obtained by current semiconductor technology. Therefore, to sustain these data rates, the dynamic storage must be based on new technologies that are likely to be completely or partially optical. Jordan at the Colorado Optoelectronic Computing Systems Center 10 and some other groups have proposed and used optical delay line loops for dynamic storage. In these data storage systems, an optical fiber, whose characteristics match the operating wavelength, is used to form a delay line loop. In particular, the system sends a sequence of optically encoded bits down one end of the loop and after a certain delay ͑which depends on the length and optical characteristics of the loop͒, the optically encoded bits appear at the end of the loop, to be either utilized at that time and/or once again sent down the loop. This idea of using propagation delay for data storage dates back to the use of mercury delay loops in early electronic computing systems before the advent of large primary or secondary memory storage. Jordan 10 has been able to store 10 4 bits per fiber loop with the fiber length of approximately 1 km. This was achieved in a small, low-cost prototype system with a synchronous loop without very precise temperature control. Jordan used such a delay loop system to build the second known purely optical computer ͑after Huang's͒, which can simulate a counter. Note that this does not represent the ultimate limitations on the storage capacity of optical delay loops, which could in principle provide very large storage using higher performance electro-optical transducers and multiple loops. Maguire and Prucnal 11 suggest using optical delay lines to form disk storage.
The main problem with such a dynamic storage is that it is not a random access memory. A delay line loop cannot be tapped at many points since a larger number of taps leads to excessive signal degradation. This implies that if an algorithm is not designed around this shortcoming of the dynamic storage, it might have to wait for the whole length of the loop for each data access. Systolic algorithms ͑Kung 1 ͒ also exhibit such a tight interdependence between the dynamic storage and the data access pattern.
Loop Memory Model and Our Results
In this paper, we propose the loop memory model ͑LMM͒ as a model of sequential electro-optical computing with delay-line-based dynamic storage. The LMM contains the basic features that current delay loop systems use, as well as the features that systems in the future are likely to use. It would seem that the restrictive discipline imposed on the data access patterns by a loop memory would degrade the performance of most algorithms, because the processor might have to idle waiting for data. However, we demonstrate that an important class of algorithms, ascend/descend algorithms, can be realized in the LMM without any loss of efficiency. Note that many problems including merging, sorting, discrete Fourier transformation ͑DFT͒, matrix transposition and multiplication, and data permutation are solvable with an ascend/descend algorithm ͑described in Sec. 3͒. In fact, we give sequential algorithms covering a broad range for the number of loops required. A parallel implementation performing the optimal amount of work is also shown. The work performed by a parallel algorithm running on p processors in time T is given by p*T.
An ascend or descend phase takes time O(n log n) in the LMM using log n loops of sizes 1, 2, 4, . . . , n. Note that a straightforward emulation of a butterfly network with O(n log n) time performance requires O(n) loops: n loops of size 1, n/2 of size 2, n/4 of size 4, . . . , 1 of size n. It can be implemented in time n 1.5 just with two loops of sizes ͱn and n each. This can be generalized into an ascenddescend scheme with time nkϩn 1.5 2 Ϫk/2 with 2рk рlog n loops. At this point in time, a loop is a precious resource in optical technology, and hence tailoring an algorithm around the number of available loops is an important capability. The k loop adaptation of the ascend/descend algorithm provides just this capability. A single loop processor takes n 2 time. A matching lower bound also exists for this case, which is derived in Sec. 6 from one tape Turing machine crossing sequence arguments. Matrix multiplication and matrix transposition can also be performed in an LMM without any loss of time.
We also consider a butterfly network with p log p LMM processors, where 1Ͻ pрn. The work ͑number of processors, time product͒ of this network for ascend-descend algorithms is shown to be O(n log n). Note that a butterfly network performs n log n work. This shows that the ascend-descend algorithms can be redesigned in such a way as not to incur any work loss due to the restrictive nature of the loop memories.
Related Work
Several models for virtual memory have been considered recently. The random access machine ͑RAM͒ was proposed as a simple model of sequential computation without any virtual memory in Aho et al. 12 Aggarwal et al. 13, 14 introduced the first hierarchical memory model, which also incorporates block transfer. They extended 15 it to the parallel computation model PRAM. Vitter and Shriver 16 consider the PRAM case where parallel block transfers are permitted. In all these models, the algorithm optimization consists of exploiting either temporal or spatial locality given that the access to a block of size b at address x costs time f (x)ϩb, where f (x) is the seek time. In our model, the synchronization of the loops ͑primary storage͒ with processor is the main objective in algorithm design. In a more fully developed optical computing system, it seems almost imperative that a memory hierarchy will exist, with either fast semiconductor memory or laser disks at the bottom of the hierarchy. An analysis of the traffic between the loop memory and secondary memory could be similar to the previous work. In the early days of computing, acoustic delay lines were used as mass storage. But using a delay line as a secondary storage medium does not pose the problems that are encountered when it is used as the primary storage. Its use as a secondary storage can be analyzed by the techniques developed for the hierarchical memory model 13 ͑HMM͒. The related work in the theory of optical computing is very sparse. Barakat and Reif 17 introduced VLSIO ͑electro-optical VLSI͒, a model of optical computing. They considered volume-time trade-offs and lower bounds in this model. We demonstrated 18 energy and energy-time product lower and upper bounds for optical computations. We introduce 19 the DFT-PRAM model, where a DFT can be computed in unit time to model the power of optical computing. We develop several O(1) time algorithms in this model.
The work most related to this paper is that of Jordan, 10 Heuring et al., 20 and Jordan. 21, 22 They describe an optical delay loop system and show that a number of networks could be simulated by the use of optical delay loops. However, their work does not imply our results; in particular, they did not look at the algorithmic aspects of the loop memory systems.
Optical waveguides have been used in several optical computing systems, but primarily for computing and not storage. Daeshik et al., 23 Jordan et al., 24 and Kyungsook 25 use delay lines for time slot permutation and sorting. Barbarossa and Laybourn 26 use optical delay lines for processing. Wagner et al. 27 perform adaptive array processing with recirculating loops. There are several applications of microwave processing with delay line loops. 28 The algorithms presented in this paper schedule the memory loops for their data needs. However, all the problems solvable with an ascend-descend algorithm entertain structured data and hence data scheduling can be built into the algorithm as in systolic algorithms case. For random data, the memory loop scheduling ͑similar to register allocation in current-day compilers 29 ͒ poses an interesting problem. Tyagi 30 proposes some algorithms for scheduling data on loops.
Organization
The rest of the paper is organized as follows. Section 2 introduces both the sequential processor and parallel LMM. The various sequential processor implementations of the ascend-descend algorithms are described in Sec. 3. Section 4 sketches the p processor version of ascend-descend algorithms. A brief sketch of matrix transpose and multiplication is given in Sec. 5. Some lower bounds are shown in Sec. 6. An alternative model for parallel optical computation where the delay line loops serve as both a communication link and a dynamic memory is presented in Sec. 7. Section 8 concludes the paper.
Model
The intent here is to develop a model for a processor with the primary dynamic storage in a delay line loop memory. The processor can have an instruction set similar to that of the RAM in Aho et al. 12 To make it a little more reasonable, we assume that the processor also contains a constant number of registers, each capable of holding one word of data. Note that a register operating at gigahertz rate (10 9 /s) can be built using a very short delay line with a directional coupler.
10 Figure 1 shows a processor with three memory loops.
Memory Loops
A processor can have several delay line memory loops of prespecified lengths associated with it. Let there be k loops L i of lengths n i for 0рiϽk. Note that the length of the fiber ͑as a multiple of the wavelength͒ determines the number of bits stored in it. We assume that the size of data stored in each loop L i is a pair of words, i.e., it is a wordparallel implementation. This could mean that each conceptual loop L i is implemented using 64 physical loops for a 32-bit word. An alternative method could be to encode 2 bits into each wave, thereby requiring only 32 physical loops. In either case, L i can store n i word pairs. The processor has a tap into each of these loops. A simple conceptual model is to assume that the current word under the tap is stored in a register associated with the loop. In practice, a memory loop could be directly coupled to a computational device without any need for an intermediate register.
In the following, we make the assumption that for all practical purposes, the CPU accesses a loop through its tap register. A write into the loop is also performed from its tap register. We refer to the value in the tap register of the loop L i by v(L i ). We use the notation v(L i )ϭe to denote a write of the result of an expression e into L i . Similarly, x ϭe͓v(L i )͔ denotes a read of L i for the evaluation of the expression e.
Note that the digital optical computer at Colorado 20, 22, 31 uses a bit-serial design. Each loop stores 64 words of 16-bits each in a bit-serial manner. The primary reason for choosing a bit-serial design was the cost of the optical switches. In their memory loop implementation, 31 each loop has a distinct read and write port. A lithium niobate directional coupler is used at each of these ports. Each data word recirculates serially through a delay line passing the input/output ports once each memory cycle ͑20.5 s, time to go through the loop once͒. A memory counter keeps track of the address of the memory word at the read/write port. There are no latches in the system. The entire processor is designed with ''time-of-flight'' synchronization.
Read/Write Delays
The delay to read or write from a memory loop is a fraction of time required for computation. The situation is similar to the register access time being a fraction of cycle time. In this paper, we assume that both read/write require unit time. Recall that the Colorado optical computer 22 has a memory cycle of 20.5 s with a loop storing 64, 16-bit words. If these loops were to be made word-parallel, then the time between successive words is 20 ns, which corresponds to a 50-MHz processing rate. This indeed is the processing rate of the Colorado digital optical computer. Hence our assumption about each loop access taking a machine cycle is not unrealistic. We also blur the distinction between an instruction cycle and a machine cycle in the algorithms presented in Sec. 3. We assume that instructions finish in one machine cycle as well, or that the cycles per instruction 32 ͑CPI͒ is nearly 1. The algorithms in this paper can be easily modified to fit any other delay model.
The operands for the instructions can be any of the local registers or one of the loop tap registers.
The parallel butterfly model is a direct extension of the sequential processor model. The interesting scenario occurs only when an n-input problem instance needs to be solved on a p log p processor butterfly network for pϽn. In such a case, the loop memory at each processor is used to maintain the copies of multiple data resulting from folding. For a description of a butterfly network, see Ullman. Note that the prototype of the electro-optical computing system being built at Colorado by Jordan 10 resembles our model. All the register and main memory is realized with fiber delay line storage loops. The processing logic is also derived from the optical technology in the form of a directional coupler, where the electric field between the waveguides is used as the control mechanism. 10 
Ascend-Descend Algorithms
The CASCADE algorithms ͑ascend-descend͒ introduced in Preparata and Vuillemin 34 are an important class of algorithms. Consider an algorithm with nϭ2 k input data items. Let the i'th input be located at address i for all 0рiр(n Ϫ1) initially. An algorithm is an ascend algorithm if it operates successively on pairs of data items that are located 2 0 , 2 1 , 2 2 , . . . ,2 kϪ1 ϭn/2 distance apart. This involves ascending right to left on the k address bits. A descend algorithm is defined similarly to operate on pairs of data items 2 kϪ1 , 2 kϪ2 , . . . , 2 0 apart. We define CASCADE to be the class of algorithms that are composed of a sequence of ascend and descend algorithms. Many problems such as merging, sorting, DFT, matrix transposition and multiplication, and data permutation have CASCADE algorithms. Hence, an efficient implementation of ascend and descend algorithms translates into an efficient implementation of all these problems. Besides, the topology of many interconnection networks such as butterfly, perfect shuffle, hypercube, and cube-connected cycles is ideally suited for ascend/ descend algorithms.
In this section, we show a single processor implementation of an ascend-descend algorithm with following resources: ͑1͒ time O(n log n), log n loops of sizes 1, 2, 4, . . . , n; ͑2͒ time n 1.5 , two loops of sizes ͉ͱn͉ and n; and ͑3͒ time nkϩn
, k loops, with 2рkрlog n, of sizes (n/2 k ) 1/2 and n/2 k , n/2 kϪ1 , . . . , n. Note that the first implementation performs optimal amount of work Ϫ⌰(n log n), demonstrating that an ascend/descend algorithm can be realized without any loss in time performance due to high latency evident in loop memories. Also note that the number of loops in a real optical processor might be limited. The k loop realization is helpful in this situation as it provides an implementation of ascend/descend algorithms on any optical processor.
Loop Primitive Operations
In the following algorithms, we perform same set of actions on loops many times. Hence we encapsulate these actions into loop primitives. We use the notation ͉L i ͉ to denote the length of the loop L i , also given by n i . Recall that we have assumed that each location in the loop is capable of holding two words, as if there were two conceptual tracks on the loop. We refer to these words by v u (L i ) and v l (L i ) for upper and lower track words, respectively. We also refer to the words
The notation L i j denotes the word that would be at the read/write head ͑tap register͒ of L i in j steps from the current time. Figure 2 shows a loop L i with data circulating left to right. The right end of the loop is connected to the left end. Sometimes, we need to tag a value in a loop L i for reasoning about the correctness of our algorithms. In that case, we assume that we can associate a memory counter with each word in the loop L i with values between 0 and n i Ϫ1. We refer to the upper ͑lower͒ word with memory counter value j by the notation
Input copying. Our initial assumption is that the primary program input values reside in some other memory. We would copy different input values into a given loop ͑of size 1͒ over time. The primitive copy -input(x i ,L j ) copies the first argument x i to the upper track of the loop L j at the word aligned with the write port when this primitive is invoked, i.e., v u (L j 0 )ϭx i . We assume that copy -input takes one cycle.
Copying. In ascend-descend algorithms in the next section, many copying steps involve copying each value from a loop Figure 3 shows the copy pattern.
The second copy of L i ͑through second ᭙ at line 2͒ is synchronized for L i and L j . After line 1 has completed copying L i into the first half of L j , the word at L j n i at the entry into copy -upper is now at the read/write port ͑cur-rent L j 0 ͒. At the entry into copy -upper, L i 0 , is again at the read/write port. We need another similar primitive for copying the upper track words from L i to the lower track of L j . Note that this copying takes time n j ϭ͉L j ͉. 
end copy -lower
Loop computation. Another step in the ascend-descend algorithms is to apply a function f of two arguments to words on the upper and lower tracks for an entire loop.
end compute -loop
An assumption made here is that the reads of two words 
The time taken by this computation is n i ϭ͉L i ͉, assuming one cycle read-compute-write.
The log n Loop Version
We number the levels of a butterfly network from 0 through log n ͑bottom-up, as in Fig. 4͒ such that the crossconnections between level i and iϩ1 have a span 2 i . The processors on the same level are numbered 0 through n Ϫ1 from left to right. Let x i, j for 0рiр(nϪ1) and 1р j рlog n be the value computed in the processor p i, j . The input value x i corresponds to x i,0 for 0рiр(nϪ1). Note that for the ascend phase x i, j ϭ f (x i, jϪ1 ,x iϩ2 jϪ1 , jϪ1 ) when the j'th bit from the right in the binary representation of i is 0, and
Here f is some operation performed in the processors. Similarly, for the descend phase x i, j , for 0рiр(nϪ1) and 0р j р(log nϪ1), is given by the following expressions: x i, j ϭg(x i, jϩ1 ,x iϩ2 jϩ1 , jϩ1 ) when the ( jϩ1)st bit from the right in the binary representation of i is 0, and x i, j ϭg(x i, jϩ1 ,x iϪ2 jϩ1 , jϩ1 ) otherwise. The skeleton of the implementation is as follows. A butterfly computation graph is scanned in the order shown in Fig. 4 . We demonstrate only the ascend-phase of the algorithm. The descend phase is very similar.
We assume that log n memory loops The ascend phase is implemented by a call to Process-Ascend(0,n) ͑Fig. 5͒, where n is assumed to be a power of 2. It takes time 3n log nϩn as T͑n ͒ϭ2T͑ n/2͒ϩ3n, T͑1 ͒ϭ1.
The two recursive calls to Process-Ascend in steps 2 and 4 of Fig. 5 give the term 2T(n/2) and the two copying steps in steps 3 and 5 lead to the 2n term. The computation in step 6 adds another n to this equation. These equations have a solution in T(n)ϭ3n log nϩn.
The main concern in determining the correctness of this implementation and its timing analysis is the synchronization of all the loop accesses. For instance, at the return from the call Process-Ascend(0,n/2) at step 2, are the loops L n/2 and L n synchronized so that copying takes place in n steps? Note that the usage pattern of the loops by this algorithm has the following form: 8 , and so on. The following inductive proof es- tablishes that the timing of the algorithm is correct, as thesynchronization is done by ''time-of-flight.'' Note that we have not defined the term ''synchronization'' used in the statement of the following theorem Theorem 1. The ascend phase algorithm in Fig. 5 has synchronized data transfer and computation to emulate the behavior of a butterfly network.
Proof. The proof is inductive. Assume that all the data transfers and computation are synchronized for Process-Ascend(i,k) for all values of kϽn. The inductive basis for kϭ1 is trivial, as it involves only the transfer of x i into L 1 . The inductive step is to prove that Process-Ascend(0,n) is synchronized given that Process-Ascend(0,n/2) and Process-Ascend(n/2,n/2) are synchronized.
Assuming that nϾ1, Process-Ascend(0,n) first calls Process-Ascend(0,n/2), copies L n/2 into L n , calls Process-Ascend(n/2,n/2), copies L n/2 into L n , and finally computes L n . After Process-Ascend(0,n/2) is done, the assumption ͑part of inductive hypothesis͒ is that L n/2 is scanning x 0,log nϪ1 , or v u (L n/2 0 )ϭx 0,log nϪ1 . In addition to showing that all the copying steps work correctly, we also need to prove that at the end of Process-Ascend(0,n), v u (L n 0 )ϭx 0,log n . Let Tϭ0 when the copy -upper at step 3 starts copying L n/2 into L n . Let the memory counter of the word copied into L n at 0рTϭkрnϪ1 be k. Similarly, let the memory counter in L n/2 for the word copied at 0рTϭk
u ϭx j,log nϪ1 . The copying is completed at TϭnϪ1, when Process-Ascend(n/2,n/2) is called. This call is completed in (3n/2)(log nϪ1)ϩn/2 steps. Once again, at time Tϭnϩ(3n/2)(log nϪ1)ϩn/2, v u (L n/2 0 )ϭx n/2,log nϪ1 by inductive hypothesis. The key question from synchronization perspective is which word is at the read/write port of L n at Tϭnϩ(3n/2)(log nϪ1)ϩn/2? We would like v u (L n 0 ) to be either L n ͓0͔ u or L n ͓n/2͔ u , which holds if Tϭn ϩ(3n/2)(log nϪ1)ϩn/2 is a integral multiple of n/2. Since n/2 divides the time elapsed since the last access to memory counter 0 in L n , our synchronization holds for this copying step for the following reason. After copy -lower of step 5 is completed at Tϭ(5n/2)ϩ(3n/2)(log nϪ1), L n ͓ j ͔ l ϭL n ͓ jϩn/2͔ l ϭx n/2ϩ j,log nϪ1 for 0р jϽn/2 and we already had L n ͓ j ͔ u ϭL n ͓ jϩn/2͔ u ϭx j,log nϪ1 . Hence when compute -loop of step 6 starts at time Tϭ(5n/2) ϩ(3n/2)(log nϪ1), v u (L n j )ϭx j,log nϪ1 and v l (L n j ) ϭx n/2ϩ j,log nϪ1 for 0р jϽn/2, as expected by the ascend algorithm, resulting in L n ͓ j ͔ u ϭx j,log n . Similarly, v u (L n j ) ϭx jϪn/2,log nϪ1 and v l (L n j )ϭx j,log nϪ1 for n/2р jϽn, and hence L n ͓ j ͔ u ϭx j,log n . This establishes one part of the inductive hypothesis. At time Tϭ(7n/2)ϩ(3n/2)(log nϪ1), when compute -loop is done, v u (L n 0 )ϭx 0,log n , establishing the second part of the inductive hypothesis. ᮀ Figure 5 gives a recursive version of the ascend phase implementation. We give an iterative version in Fig. 6 as it helps understand the timing of the ascend phase from a different perspective. The key point to note here is that Process-Ascend(i,k) calls one computing primitive ͑compute -loop usually, or copy -input for an input bit with kϭ1͒ and one copying primitive ͑copy -upper or copy -lower͒ for each feasible value of (i,k). The temporal order in which these two primitives for different (i,k) pairs are called is as follows: ͑0,1͒,͑1,1͒,͑0,2͒,͑2,1͒,͑3,1͒,͑2,2͒,͑0,4͒,͑4,1͒,͑5,1͒,͑4,2͒,͑6,1͒, ͑7,1͒,͑6,2͒,͑4,4͒,͑0,8͒,...,͑0,n͒. Note that iϩk denotes a diagonal in the butterfly network of Fig. 4 . Also note that all the (i,k) pairs are invoked in the increasing order of their diagonal dϭiϩk. Of course, some of the diagonal entries are never invoked, for instance ͑iϭ1, kϭ2͒ is not used due to the nature of this computation. Line 3 in Fig. 6 checks for the validity of each (i,k) pair. Lines 1 and 2 set up these (i,k) pairs going over diagonals d from 1 to n.
The preceding description assumes that the n input words are available in some other memory ͑not the optical loops͒. This is not a realistic assumption since the processor contains only a small sized ͓O(1)͔ local storage in addition to the memory loops. The only other place to hold the input is the slow secondary memory. In the following, we show that if the input words are resident in the loop L n initially, then they can be trickled down to the smaller loops in synchrony with Process-Ascend without any loss of efficiency. Here, we assume that either there is a separate third track on each loop to hold the input values or there is a separate set of log n input loops. In either case, let I k denote the input loop of size k, which is either part of the loop L k or is an independent loop. Let the n input words be in I n initially. In fact, we assume that the memory counter for the input word x j is j, i.e., I n ͓ j ͔ϭx j for 0р jϽn. We define two more primitives for the input loops: copy -input -left and copy -input -right. The primitive copy -input -left(I k ,I k/2 ,i) copies I k 's left half ͑k/2 words͒ to I k/2 starting with input word x i , which would be at memory counter value 0.
copy -input -left(I k ,I k/2 ,i) ͕comment: Argument i is only for reading convenience. Whenever this primitive is called, we maintain the invariant that I k ͓0͔ϭx i .͖ 
We define copy -input -right(I k ,I k/2 ,iϩk/2) to copy the second ͑right͒ half of I k ͑k/2 words͒ into I k/2 . This task could have been performed by copy -input -left, but again for presentation clarity we also define copy -input -right. Once again, the specification of i ϩk/2 is really redundant. The expectation is to copy I k ͓k/2ϩ j ͔ϭx iϩk/2ϩ j for 0р jϽk/2 to I k/2 ͓ j ͔. The invariant maintained is that whenever this primitive is called
copy -input -right(I k ,I k/2 ,iϩk/2) ͕comment: Argument iϩk/2 is only for reading convenience. Whenever this primitive is called,
The procedure Process-Ascend also needs to be modified as follows:
Process-Ascend(i,k) ͕comment: Process the k input bit positions starting at x i for ascend phase.͖ ͕comment: k is assumed to be power of 2.͖ if kϭ1 then return;
for 0рiр(kϪ1).͖ end if ͕comment: Note that x i,k/2 and x iϮk/2,k/2 are available in L k at position i due to previous two copying steps.͖ end Process-Ascend
We now need to argue that this new version of ProcessAscend works correctly and that the input words are also synchronized. All that we have modified in ProcessAscend is to insert copy -input -left(I k ,I k/2 ,i) at line 2 and copy -input -right(I k ,I k/2 ,iϩk/2) at line 5. Also, the action taken when kϭ1 ͑at line 1͒ has changed from copying the input word x i into L 1 to doing nothing. This version of ascend algorithm takes time 4n log n. Note that T͑n ͒ϭ4nϩ2T͑ n/2͒, T͑1 ͒ϭ0.
The input copying steps at lines 2 and 5 each take time n/2. This results in extra n term for the expression for T(n). Since nothing is done for kϭ1 case, the expression for T(1) is 0. These equations have a solution in T(n) ϭ4n log n.
We now give a proof sketch for the timing of this version. The key invariant is that right before Process-Ascend (i,k) is invoked I k contains the input words Let us establish the hypothesis for Process-Ascend͑0,n͒ assuming it is true for all kϽn. Note that by assumption v(I n 0 )ϭx 0 and hence copying step at line 2, copies x 0 ,x 1 ,...,x n/2Ϫ1 into I n/2 establishing the same invariant for the Process-Ascend(0,n/2) on line 3. Line 2 copy -input -left takes n/2 steps. Hence at the entry into Process-Ascend(0,n/2), v(I n 0 )ϭx n/2 ϭI n ͓n/2͔. Process-Ascend(0,n/2) takes time 2n(log nϪ1), which is an integral multiple of n. Note copy -upper(L n/2 ,L n ) is synchronized to the beginning of L n/2 by hypothesis 2. This copying at line 4 takes n steps. At the end of this copying, v(I n 0 ) still is x n/2 ͓2n(log nϪ1)ϩn steps since entry into line 3 Process-Ascend(0,n/2)͔. Hence copy -input -left at line 5 is entered with the correct invariant copying x n/2 ,...,x nϪ1 into I n/2 , establishing hypothesis 1 for the following Process-Ascend (n/2,n/2) call. Note copy -lower is synchronized by the argument used in Theorem 1 since the number of steps elapsed from the end of copy -upper ͑line 4͒ is 2n(log nϪ1)ϩn/2, which is an integral multiple of n/2. Since copy -lower takes n steps, compute -loop is also synchronized leaving v u (L n 0 )ϭx 0,log n , establishing hypothesis 2. ᮀ
Two-loop Version
An ascend-descend algorithm can be implemented with fewer loops ͑two͒ but then it takes longer time ͓O(n 1.5 )͔. There seems to be a trade-off between the number of loops deployed and the time taken by the algorithm. The early optical computers are likely to have a small number of loops and hence this trade-off provides a welcome opportunity.
The basic idea for this algorithm comes from the square implementation of a barrel shifter as described in Ullman ͑Ref. 33, page 69͒. It is shown in Fig. 7 . The n bits to be shifted are stored along a ͱn ϫ ͱn array. We number the rows bottom-up from 0 to ͱn Ϫ1 and columns left-right from 0 to ͱn Ϫ1. The least significant input bit x 0 is stored at ͑0,0͒ position, x ͱn Ϫ1 at ( ͱn Ϫ1,0), x ͱn at ͑0,1͒, and x nϪ1 at ͑ ͱn Ϫ1, ͱn Ϫ1͒. In the following, we use ͱn instead of ͱn to reduce the clutter of notation. The log n control bits c log nϪ1 c log nϪ2 ...c 1 c 0 form the word specifying the shift amount 0рsϽn. The least significant (log n)/2 control bits specify the vertical shift amount. This is because the bits c (log n/2)Ϫ1 ...c 1 c 0 encode a shift amount 0рs v Ͻͱn. Note that any bits shifted out of column j for 0р jϽͱnϪ1 move into appropriate entry of column ( j ϩ1). Similarly, the most-significant half control bits c log nϪ1 ...c (log n)/2 specify the horizontal shift amount. Note that each 1-bit shift horizontally corresponds to a shift by 2 log n/2 ϭͱn in the original word. This shifting takes time O(ͱn), in particular, at most 2ͱn.
For an ascend-descend algorithm, the processing at level j ͑as shown in Fig. 4͒ requires communication between words x i, jϪ1 and x iϮ2 jϪ1 , jϪ1 . This communication is achieved through a square-shifter emulating mechanism in the two-loop version. We use one loop of size n to store all the n words, entire array of the square shifter. This loop is called main loop. Another loop of size ͱn is used to process one column or row of the square shifter array at a time.
We refer to the ͱn size loop by context loop. The input data is initially stored in the main loop L n . Note that any row or column can be copied from the main loop L n to the context loop L ͱn within n steps.
We process the ͱn columns first followed by the ͱn rows. We copy each column and row from the main loop into the context loop in turn and process it. The processing of column i for 0рiϽͱn corresponds to processing lower log n/2 levels ͑levels 1-log n/2͒ for the columns i * ͱn-(iϩ1) * ͱnϪ1 of the butterfly network. Similarly, the computation of the j'th row in the square shifter ͑for 0р jϽͱn͒ processes upper log n/2 levels ͑levels log n/2 ϩ1-log n͒ for the columns j, jϩͱn, . . . , nϪͱnϩ j. Note that if n is chosen to be a power of 2, all the words needed by level i computation ͑for 1рiϽlog n/2͒ are contained in loop L ͱn . If n were not a power of 2, the context loop would need to be 2ͱn long to be able to contain entries for columns i and iϩ1 when column i is being processed. For instance, for nϭ9, the context loop has three entries. When level 1 ͑see Fig. 4͒ is computed, x 0 Ͼͱn. We compute kϭ2 log n /2 values for each iteration of the context loop for a square shifter column ͑except the last one͒. For simplicity of illustration, we assume in the following that n is a power of 2.
For the two-loop algorithm, we are assuming the synchronization of the loops is done through memory counters for the loops. Let us assume that the i'th butterfly column value x i,l for all levels l is at memory counter i for L n . We use the upper track on L ͱn to keep the level lϩ1 values computed from the level l values in the lower tracks. For the square shifter column i, resulting in butterfly columns iͱn, iͱnϩ1, . . . ,(iϩ1)ͱnϪ1 being copied into L ͱn , the memory counter values are 0, 1, . . . , ͱnϪ1, respectively. The primitives are copy -column(i,L n ,L ͱn ) ͕comment: copies square shifter column i from L n to L ͱn . Either track can be used on L n and the lower track is used on L ͱn .͖
end copy -column
The primitive copy -column can take up to nϩͱn Ϫ1 steps. Thus, L n may have to wait for up to nϪ1 steps for the L n memory counter to get to iͱn and ͱn steps are needed to copy after that. We also need a primitive to move the computed values back from the upper track of L ͱn to L n .
end move -column -to -main move -column -to -main also takes up to nϩͱnϪ1 steps. The following primitive copy - ͕comment: copies square shifter row i from L n to L ͱn . Either track can be used on L n and the lower track is used on L ͱn .͖ for ͑ jϭ0; jϽͱn; jϭ jϩ1͒
end copy -row
Note that copy -row can take up to 2nϪ1 steps, n Ϫ1 for the L n memory counter synchronization and n steps for the copying. Again, we need a reverse primitive to move the upper track of a row from L ͱn back to L n .
end move -row -to -main move -row -to -main can also take up to 2nϪ1 steps. The following primitive computes a level l value from two level lϪ1 values in L ͱn .
This version of compute -value takes up to ͱn steps collecting the two level lϪ1 values. Figure 8 shows the algorithm for the two-loop version. This algorithm takes time n 1.5 log nϩ6n 1.5 . The 6n 1.5 term is due to copying of context to and from the main loop. This is because for each level of butterfly, each row/ column takes time proportional to n. It can be easily converted into a O(n 1.5 ) algorithm as follows. In compute -value, waiting for one rotation to access x iϩk is wasteful. If we pipeline this computation, then in one rotation of the context loop L ͱn ,(ͱn/2k) rather than one value can be computed. For instance, for the l'th level with column 0, in the first rotation x 0,l , x 2 l ,l , x 2 lϩ1 ,l , . . . , x ͱn,l . Thus for column ͑row͒ computation, the total computation time per column ͑row͒ is ͚ iϭ1 log n/2 2 i ͱn, which is n. This gives a total time of . The term 2n 1.5 in the pipelined case compares favorably with the term n 1.5 log n. To implement the pipelined version, the algorithm in Fig. 8 would have to be modified at two places. Lines 4 and 5 would be replaced by for jϭ0 up to 2 lϪ1 do ͑4͒
compute -pipelined -value
end for
Similarly, lines 10 and 11 are replaced by
The new primitive compute -pipelined -value (x jϩ2k,lϪ1 , x jϩ2kϮk,lϪ1 ) ,... in one rotation of the loop L ͱn . It is is defined as follows: 
end for end compute -pipelined -value
Note that this algorithm also provides a O(n 2 ) one loop implementation. The processor contains only one loop of length n. The i'th level of butterfly takes time 2 i n for 1 рiрlog n.
A k Loop Version
What if we only had 2рkϩ1Ͻlog n loops available? A hybrid algorithm based on the preceding two algorithms works in time nkϩn The two-loop algorithm for each mesh takes time 8n 1.5 /2 1.5k . Hence total time taken in the two-loop phase for all the n/2 k meshes is (8n
). The top k levels of the butterfly are processed using the diagonal scan algorithm mentioned earlier. This phase takes 4nk steps. Thus the total time is bounded by 4nkϩ8n
Ϫk/2 steps.
Parallel Ascend-Descend Algorithm
All the algorithms presented so far have assumed that a single processor is used to solve the problem. In this section, we are interested in using the optical fiber loops for local storage for processors that are part of a parallel computer. In particular, we assume that we have a p-processor machine organized with butterfly connectivity. The problem occurs when an ascend-descend problem with n input words such that nϾp is solved on this butterfly network of p nodes. In such a case, multiple instances of data need to be stored at each processor. One loop of size n/p per processor seems to suffice to hold this data except for the bottom level processors, which require log nϪlog p loops.
Recall that a p-node butterfly has log pϩ1 rows and each row has p processors. As stated earlier, we number the i'th processor in the j'th row p i, j for 0рiϽp and 0р j рlog p. Note that we have a log pϩ1 level, p-node butterfly, for 1р pϽn. For the time being, let us assume that n is a multiple of p. Let p i, j Ј refer to the i, j'th processor in the n-node computation graph for 0рiϽn and 0р jрlog n. The following mapping from p i, j Ј to p k,l will be used in our algorithm. This mapping has been used in earlier work on parallel algorithms. Note p i, j
The bottom log nϪlog p levels of the n-word ascend-descend problem are solved sequentially at one of the level 0 processors. Levels l ϩlog nϪlog p of the ascend-descend problem are mapped into level l of the butterfly network.
Once again, let us consider the implementation of the ascend phase. Each processor in the bottom level of the p-node butterfly simulates an n/p-node butterfly sequentially, as shown in Sec. 3, Algorithm 1. Thus, each of these p processors contains log nϪlog p loops of sizes 1, 2, 3, . . . , n/p. The time taken for this processing is 4(n/p)log(n/p). The top log p levels of the n-node butterfly are pipelined in the p-node butterfly. The j'th level of the p-node butterfly processes the log (n/p)ϩj'th level of the n-node butterfly. Each processor p k,l contains n/p consecutive data values as dictated by the mapping from p i, j Ј to p k,l . The connectivity of the p-node butterfly is adequate to emulate the connections of the n-node butterfly. To see this, note that the span of the connections between level j and jϩ1 in an n-node butterfly is 2 j . The j'th level is mapped into the iϭ jϪlog (n/p)'th level of the p-node butterfly. The span of connections between the i'th and i ϩ1st levels is 2 i /(n/p). Thus the l'th element in p j,i ͓which would have been in p ( jn/p)ϩl,iϩlog(n/p) Ј ͔ has a crossconnection with the l'th element of p jϮ2 i ,iϩ1 ͓which would be p n/p͓ jϮ2 i ͔ϩl,iϩ1ϩlog(n/p) Ј .͔ This is the right crossconnection. The loops between two consecutive levels are also synchronized. This can be proven formally with an The total time taken by this algorithm is 4(n/p)log(n/p)ϩlog pϩn/p. Then the work of this algorithm is O(n log n), which is the work performed by an n-node butterfly network as well.
Matrix Computations
The matrix multiplication of two nϫn matrices can be accomplished in O(n 3 ) steps, while the transpose takes O(n 2 ) time. Most matrix computations are very similar to their systolic counterparts. 1 In a systolic algorithm, there are several active data streams interacting with each other. Each element of a data stream could be interacting with a distinct data stream in parallel. For instance, a commonly used systolic algorithm for matrix multiplication uses the data streams as shown in Fig. 10 . Two nϫn matrices A and B are multiplied in time O(n 3 ) by this algorithm. Processor p i, j accumulates the value for C͓i, j ͔ such that CϭAϫB. The i'th row of A is input from the left edge of the array ͑Fig. 10͒ such that (iϩ1)'st row is one time step behind i'th row. For instance, A͓0,0͔ arrives at p 0,0 at Tϭ0, but A͓1,0͔ arrives at p 1,0 at Tϭ1. Similarly, j'th column of B arrives at p 0,j at time Tϭ j. The rows of A and columns of B keep marching down horizontally and vertically respectively, while p i, j collects the product term A͓i,k͔ * B͓k, j ͔ one by one. This systolic algorithm could be emulated in loop-based processors by allocating a loop to each data stream ͑2n data streams in this case, one for each row of A and one for each column of B͒. However, the values for C͓i, j ͔ are static-sitting in one place. Where should the matrix C be mapped? In general, all the systolic algorithms consisting of only moving data streams can be mapped to a loop algorithm. Most of the time, a static data stream ͑such as matrix C͒ in this example, can also be mapped on a loop with proper synchronization. Several transformations for systolic algorithms exist to convert spatial data streams ͑such as C͒ into temporal ͑moving͒ data streams or vice versa. [35] [36] [37] [38] There is one problem that seems to occur in synchronization of data streams for several matrix operations. In the following, we present the problem along with a solution. We use matrix transposition as the example for the illustration of this problem. It appears as if for transposition, one loop of size n 2 and n loops of size n each would give an obvious n 2 time algorithm. The outline of the algorithm could be as follows. Initially, the nϫn matrix A is stored in the row-major order in the loop L n 2. Each n-size loop accumulates one row of the transposed matrix A T ͑which is a column of A͒ during the rotation of the main loop, L n 2. On the surface, this step seems to be trivial. Let us illustrate it with Fig. 11 . The matrix A is stored in row-major order on the loop L n 2 initially. Let us assume that L n 2͓ 0͔ϭA͓0,0͔, with the data recirculating right to left. There are n loops of sizes n each for accumulating the i'th row of A T ͑in L n,i in Fig. 11͒ . Let us evaluate the synchronization requirements of this problem. Assume at Tϭ0, L n 2 scans A͓0,0͔, L n 2͓ 0͔ϭA͓0,0͔, which can be copied to L n,0 ͓0͔ at Tϭ0. Similarly, at Tϭk for 0рkϽn, we can place A͓0,k͔ from L n 2͓ k͔ into L n,k ͓0͔. The problem occurs at the copy of the next row of A starting at time Tϭn. At Tϭn, L n 2 is scanning A͓1,0͔, which goes to L n,0 ͓1͔. However, at Tϭn, L n,0 is scanning L n,0 ͓0͔. The same problem occurs at T ϭnϩ j for 0р jϽn since L n, j is scanning L n, j ͓0͔, while we need to access L n, j ͓1͔. In short, L n 2 and L n, j are not synchronized. A solution for it may be to read from L n 2 at time Tϭ j and write it into L n, j ͓1͔ at time Tϭnϩ jϩ1. However, the problem gets worse for the third entry or at time Tϭ2nϩ j for 0р jϽn. We scan A͓2,j ͔ in L n 2, which goes to L n, j ͓2͔, but we scan L n, j ͓0͔ at this time. In general, at time Tϭi * nϩ j, we need to scan L n, j ͓i͔, but we scan L n, j ͓0͔. In other words, L n 2 runs ahead of L n, j by i words for Tϭi * nϩ j for 0рi, jϽn. One solution for this problem is to introduce nϪ1 delay loops of sizes 1, 2, 3, . . . ,nϪ1. The other solution is to have just one delay loop of size nϪ1 with nϪ1 taps into it. Each of these taps requires a directional coupler switch. However, a delay loop seems to be very useful for synchronization problems encountered in matrix computations. We assume that we have a single delay loop D nϪ1 with nϪ1 taps. The word at tap i is referred to by D nϪ1 ͓i͔ for 0рiϽnϪ1. The word D nϪ1 ͓0͔ is the word written at current time, D nϪ1 ͓1͔ is the word written one time unit earlier and so on. This loop acts as a shifting loop. Figure 12 shows this loop. Now the transposition can be done as follows:
end if end for end for
This version of transposition takes time n 2 to get all the rows of A T into n loops of size n. If these rows were to be copied back into an n 2 size loop, this would take another n 2 steps. An alternative is to use a butterfly network with 2 log n loops. As shown in Stone, 39 for a matrix stored in row-major order, the transpose corresponds to log n shuffle steps. This can be accomplished in log n ascend-descend steps. The sequential version of ascend-descend then can transpose a matrix in 4n log 2 n steps. Matrix multiplication can similarly be performed in n 3 steps. For the computation of CϭAϫB where nϫn matrices A and B are initially stored in row-major order on loops L n 2 ,A and L n 2 ,B , respectively, the following straightforward method can be used. Let us assume that C is to be stored on L n 2 ,C in row-major order as well. First transpose 
This matrix multiplication takes time O(n 3 ) with n, n-size and 3, n 2 -size loops. Most of the systolic matrix algorithms in Mead and Conway ͑Ref. 40, pages 271-285͒ map into loop algorithms very nicely.
Lower Bounds
This model offers a rich milieu for lower bounds. As we saw, a higher number of loops seems to lead to a lower time. Not just the number of loops, but also the period of data circulation in the loops is important. Ideally, we need a combinatorial technique that can decompose the data access pattern of a problem into some canonical periods. Unfortunately, at this time, we do not have such a technique. However, we do have some simple lower bounds that are tight for their subcases. Let us first consider the single-loop version.
Note that a one-loop processor is like a single-tape Turing machine, where the head is forced to move in the same direction at each time step. The tape is also wrapped around to form a simple loop. The crossing sequence arguments given in Hennie 41, 42 can be used to show that the ascenddescend algorithms need ⍀(n 2 ) time on a one-loop processor. The number of distinct input patterns is 2 n , assuming a binary input. Note that ⍀(n) crossing sequences can be shown to have length ⍀(n), leading to the lower bound.
The second technique quantifies the lower bound in terms of the size of the smallest loop. The assumptions are that there is only a constant amount of ''register'' storage available to the processor in addition to the loop memories. Note that all these loops have exactly one read/write port, ), which matches the time of the algorithm. Similarly, for the two-loop version, the smallest loop has size ͱn and hence the lower bound translates into ⍀(n 1.5 ). This is also a tight bound.
Theorem 2. Let an optical computing system consisting of kϾ1 loops have the smallest loop of size s. The computation of an n-word ascend-descend algorithm requires time ⍀(snϩn log n).
Proof. We assume that the data words at level l are stored in a memory loop in the butterfly column order x 0,l , x 1,l , . . . ,x nϪ1,l . For the lower bound purposes, any order that is a permutation of (0, 1, . . . ,nϪ1) will do, but the proof is more understandable with this order. Another claim is that the lower mϭ log s levels of the butterfly network are best computed in the smallest sized (s) loop. The following discussion will prove this point.
Level l computation for ascend phase consists of lϪ1 such values must be remembered for the next scan of the level l data and hence that is also the size of the backward communication.
If we compute the lower mϭlog s levels of butterfly ͑for 1рlрm͒ on the smallest loop L s , the forward communication requirement limits us as follows. We need loops of sizes 1, 2, 4, and 2 mϪ1 so that forward communication does not require additional scans of the level lϪ1 and level l data in L s . However, sϭ2 m is the smallest size loop available. What is the penalty of this limitation? Since only a constant amount of register storage is available on the processor, for each level l we can remember at most a constant number of words to carry forward. Let this constant be 1 without loss of generality. Hence, we can compute at most n/2 l level l values in each scan of L s . In fact, this is exactly what is done by the two-loop, pipelined computation described earlier. Hence we need at least ͚ lϭ1 log s 2 l scans of L s to compute s of the n lower log s level values. This gives at least 2s scans of L s for the computation of s values. Hence, the computation of n words for lower log s levels requires n/s * 2sϭ2n scans of L s . Since each scan of L s takes s time steps, the time needed for 2n scans is at least 2sn. Is this the lowest time for computing lower log s levels of the ascend phase? The answer is that any other loop ͑of larger size͒ would take at least 2sn steps as well. Let us consider computing these lower log s levels in loop L s Ј for sЈϾs. The number of scans for computing s values from the lower log s level would again be at least 2s. The number of iterations for computing all the n words would be n/s and hence the total time would be at least 2nsЈ, which is no lower than 2ns.
Note that we did not even consider the backward communication requirements to get this lower bound. We can derive the n log n lower bound based on the structure of the butterfly network. Since we are dealing with a sequential computation model, and there are n log n computation nodes in the butterfly graph for an ascend phase, the lower bound follows trivially.
Alternative Parallel Loop Model
In Sec. 4, we used the loop memory for holding the data blocks due to problem folding. The communication links between processors in a butterfly network are also implemented in fiber technology to keep up with the data rates. We propose another way of using the fiber loops so as to serve as both the data holders and the communication links. The disadvantage of this approach is that it requires many taps into a fiber. Consequently, at this point in time, this approach is not technically feasible. Figure 13 shows the scheme. Between the two levels shown the four left-hand processors need to communicate with the four processors right below them and with the four right-hand processors on the right. The two loops shown serve this purpose. Each processor drops a data value into either the vertical connection loop or the cross-connection loop. This value is picked up by the target processor after 2 i ticks for the loops connecting levels i and iϩ1.
Conclusions
The optical computing technology has matured to the point that prototypical computing machines can be built in the research laboratories. This also is the right moment for the algorithm designers to get involved in this enterprise. A careful interaction between the architects and the algorithm designers can lead to a better-thought-out design. This paper is an attempt in that direction. We have studied the repercussions of the use of memory loops on algorithm design. The use of delay line loops as memories is necessitated by the required data rates ͑upward of 100 M words/ s͒. We develop a computational model, the LMM, to capture the relevant characteristics of the memory loops.
It would seem that the restrictive discipline imposed on the data access patterns by a loop memory would degrade the performance of most algorithms, because the processor might have to idle waiting for data. We demonstrate that an important class of algorithms, ascend-descend algorithms, can be realized in the LMM without any loss of efficiency. In fact, the sequential realizations span a broad range for the number of loops required. A parallel implementation performing the optimal amount of work is also shown. Some matching lower bounds are illustrated, as well.
Optical computing is an emerging field. There are a myriad of open questions. We have only covered one class of algorithms in the LMM, the ascend-descend class. Many more problems and algorithms need to be investigated in this model. As we mentioned earlier, a large secondary memory that would be either semiconductor based or optical technology based would almost certainly be needed. In fact, the memory hierarchy is likely to have many more levels in this case due to the large mismatch in the speeds of the dynamic and secondary storage. A model similar to HMM ͑Ref. 13͒ to capture the hierarchical nature of memory would help.
The lower bounds in the LMM also pose many interesting problems. A new set of techniques seems to be required. The data access pattern of a problem has to be classified in terms of some basic frequencies. In summary, we believe that this field would serve as a fertile ground for future research.
