Abstract: This paper introduces a new concept by which it is possible to design and implement arithmetic processors using permutation networks. To demonstrate this concept, several optoelectronic arithmetic units combining optical directional coupler switches and cyclic permutation networks are designed. The designs show that addition, subtraction, and multiplication can all be performed in $O(\log n)$ time in the residue code domain using $O(n^2)$ directional coupler switches and gates, where $n = \log M$, and M is the integer range of interest. These arithmetic units also have the capability of concurrent error detection and fault-tolerance, and they can be used to construct constant time inner product processors.
INTRODUCTION
A major disadvantage of conventional arithmetic algorithms is that their realizations require different hardware structures. For example, addition and multiplication cannot use the same hardware unless a repeated-addition algorithm is used to perform multiplication operations. But this means a reduction in speed which, in general, is not tolerable. Another disadvantage of conventional arithmetic algorithms is that they cannot easily be parallelized due to carry propagation.
One way to alleviate these disadvantages is to transform arithmetic operations into a residue code domain and then perform the requisite computations in this domain. In the residue code domain, each binary number is represented as a set of residue digits and the arithmetic operations are performed on residue digits in parallel. This provides carry-free realizations of binary addition, multiplication, and other operations, but these realizations typically do not have any overlap, and require different hardware for each operation, as in conventional algorithms [16], [27].
Another way to overcome the disadvantages of conventional arithmetic algorithms is by using optical computing [3] , [5] . The motivation behind optical computing is the observation that photons in optical fibers, optical integrated circuits, and free space travel faster than electrons do in electronic circuits since they do not have to charge a capacitor. Furthermore, photons are uncharged and do not interfere with one another as readily as electrons. This implies that optical signals can be easily handled in parallel [20] .
Many implementations of optical logic operations have been suggested in the literature. Among these, spatially invariant and spatially variant techniques are dominant. Spatially invariant techniques include symbolic substitution logic (SSL) which is implemented by additive methods [5] , [6] or convolution techniques [7] , [12] , shadow-casting logic [28] , programmable logic [21] , [22] , and the combinatorial logic-based system [10] .
The basic idea behind SSL is to first detect the presence of one or more patterns in an input plane and then substitute an appropriate pattern for each detected pattern. The major disadvantage of this approach is that, for n inputs, it requires $\Omega(2^n)$ hardware [19]. In the programmable logic technique, a set of n input variables is mapped to an input array along with their complements. A series of n + 1 interconnection stages consisting of one crossover network and one mask per stage is then used, along with n + 1 arrays of AND gates, to generate all the minterms of the desired function. Finally, the appropriate minterms are combined through a similar series of n + 1 interconnection stages, along with n + 1 arrays of OR gates, to generate the function. Thus, the total number of gates required by this technique is $O(n 2^n)$. The basic idea of the spatially variant technique proposed in [19] is to build a set of spatially variant elementary logic functions. These elementary functions are used to construct larger functions as in conventional logic designs. As an example, an n-bit carry look-ahead adder can be implemented by this approach with $O(n^3)$ gates and $O(\log n)$ delay.
All these approaches, however, are based on full spatial optical principles, and they could not be easily realized by integrated optics [20] . Recently, a number of optical computing schemes that mix electronic and optical techniques have also been proposed [1] , [2] . These mainly use directional coupler switches as building blocks. In essence, a directional coupler switch is a five-terminal gate with two inputs, two outputs, and one control input. The two outputs are logical functions of two throughgoing input signals and the control input signal.
Benner et al. [1], [2] have proposed a scheme that shows how to design oscillators and divide-by-N counters using optical directional coupler switches. Although such gates do not constitute truly all-optical devices, the logical input-output signals are optical, and the use of electronic signals in the control input provides some flexibility that is not yet available in well-developed optical switches. This also provides an opportunity for optical computing based on integrated optics [20].
Recently, Lea [14] designed photonic interconnection networks using directional coupler switches. In these networks, electronic switches are replaced by directional coupler switches, and single-mode waveguides are used to establish the connections between directional coupler switches. Historically, interconnection networks were introduced to enhance the communication bandwidth in telephone systems and parallel processors. In this paper, we explore the possibility of using interconnection networks to perform arithmetic operations. The rationale behind this is that any algebraic computation can be viewed as a sequence of permutation operations. The key problem is how to choose an appropriate mapping function that maps all of the desired input patterns into corresponding permutations. Thus, the underlying network must realize enough permutations to cover all of the input patterns.
We refine these ideas, and combine optical directional coupler switching, residue arithmetic, and permutation networks to obtain a novel optoelectronic computational model, and to develop various optoelectronic arithmetic processor modules. Optical devices reduce the signal propagation delay and therefore increase the computation speed. Permutation networks provide a way of computation which replaces the propagation delays in conventional logic circuits by the transmission delay of light through waveguides, and the residue arithmetic keeps the cost of permutation networks to an acceptable level.
The remainder of the paper is organized as follows: Section 2 summarizes the mathematical preliminaries used in this paper. Section 3 shows how to design addition, subtraction, and multiplication modules, and Section 4 shows how to design an input encoder and output decoder using directional coupler switches and cyclic permutation networks. Section 5 analyzes the performance of various arithmetic modules designed in the previous sections. The paper is concluded in Section 6.
PRELIMINARIES
We begin with a review of some basic mathematical facts [8] , [23] .
Finite Groups
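Let a, b, and n be integers with n being positive. Then the expression
$$a \equiv b \pmod{n}$$
means that n divides a − b. The set $[a]_n = \{a + kn : k \in \mathbb{Z}\}$ is called the congruence class modulo n of a. Writing $b \in [a]_n$ is the same as writing $b \equiv a \pmod{n}$. The set of all such congruence classes is denoted as:
$$Z_n = \{[0]_n, [1]_n, \ldots, [n-1]_n\}$$
or more simply as:
$$Z_n = \{0, 1, \ldots, n-1\}.$$
For a given modulus n, define the operations $+_n$ and $\times_n$ as
$$a +_n b = (a + b) \bmod n, \qquad a \times_n b = (a \cdot b) \bmod n.$$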
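Then $(Z_n, +_n)$ forms an additive group of order n. Similarly, the set $Z_n^* = \{a \in Z_n : \gcd(a, n) = 1\}$, that is, the set of all elements relatively prime to n, forms a multiplicative group $(Z_n^*, \times_n)$. An isomorphism $\phi$ from a group $(G, \cdot)$ to a group $(G', \circ)$ is a one-to-one and onto (i.e., bijective) mapping from G to G′ that preserves the group operation. That is,
$$\phi(a \cdot b) = \phi(a) \circ \phi(b) \quad \text{for all } a, b \in G.$$
If there is an isomorphism from G to G′, we say that G and G′ are isomorphic and denote it by $G \cong G'$.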
THEOREM 1. Every finite cyclic group with order n is isomorphic to
, where p is prime.
Now we define permutations and permutation groups. A permutation of a set A is a function from A to A, that is both one-to-one and onto. A permutation group of a set A is a set of permutations of A that forms a group under function composition. The degree of a permutation group on a set A is the cardinality of A.
A convenient way to denote a permutation is by using an array notation. Let $A = \{0, 1, \ldots, n-1\}$ be a set of n elements and $\alpha$ be a permutation on set A. Then we write
$$\alpha = \begin{pmatrix} 0 & 1 & \cdots & n-1 \\ \alpha(0) & \alpha(1) & \cdots & \alpha(n-1) \end{pmatrix}.$$
The set of all permutations of n elements is called the symmetric group of degree n and is denoted by $S_n$.
Composition of permutations expressed in array notation is carried out from left to right by going from top to 
The cycle notation (0 4 2 3)(1) expresses the same permutation in a more compact way.
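As a small illustration of the two notations, the following Python sketch expands the cycle form (0 4 2 3)(1) into the bottom row of its array form and composes two permutations from left to right. The helper names `cycles_to_array` and `compose` are ours and are used only for illustration.

```python
def cycles_to_array(cycles, n):
    """Return the bottom row of the array notation for a permutation on {0, ..., n-1}."""
    image = list(range(n))                      # elements not mentioned are fixed points
    for cycle in cycles:
        for pos, element in enumerate(cycle):
            image[element] = cycle[(pos + 1) % len(cycle)]
    return image


def compose(first, second):
    """Left-to-right composition: apply `first`, then `second`."""
    return [second[first[i]] for i in range(len(first))]


alpha = cycles_to_array([(0, 4, 2, 3), (1,)], 5)
print(alpha)                  # [4, 1, 3, 0, 2]
print(compose(alpha, alpha))  # [2, 1, 0, 4, 3]
```

Reading the first printed row under the top row 0 1 2 3 4 gives the array notation of the same permutation.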
THEOREM 3 (cyclic permutation group).
A set of permutations generated by permutation function
forms a cyclic group of order n under function composition.
Finite groups are related to permutation groups by the following well-known Cayley Theorem.
THEOREM 4 (Cayley Theorem [8] 
As an example, the regular representation of (Z n , + n ) is given by ( ,
and T i is given by
The next corollary immediately follows. COROLLARY 1. Every cyclic group is isomorphic to a cyclic permutation group.
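The regular representation can be checked numerically. The sketch below assumes that the permutation $T_i$ associated with element i of $(Z_n, +_n)$ is the map $x \mapsto (x + i) \bmod n$; this is the usual Cayley construction for an additive cyclic group, but the formula is our assumption, and the function names are hypothetical.

```python
def regular_representation(n):
    """Return [T_0, ..., T_{n-1}], where T_i maps x to (x + i) mod n (assumed form)."""
    return [[(x + i) % n for x in range(n)] for i in range(n)]


def compose(first, second):
    """Apply `first`, then `second`."""
    return [second[first[x]] for x in range(len(first))]


n = 5
T = regular_representation(n)

# The map i -> T_i turns addition modulo n into composition of permutations.
assert all(compose(T[i], T[j]) == T[(i + j) % n]
           for i in range(n) for j in range(n))

# Repeatedly composing with T_1 visits all n permutations, so the group is cyclic.
powers, current = [], T[0]
for _ in range(n):
    powers.append(current)
    current = compose(current, T[1])
assert sorted(powers) == sorted(T)
print(T[3])  # [3, 4, 0, 1, 2]
```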
Realizations of Permutation Groups
In the previous section, we have established the relationship between a finite group and a permutation group through the well-known Cayley Theorem. In this section, we will show how to realize a permutation group on a permutation network. A permutation network is an interconnection network with an input port and an output port, where each port has s lines, labeled from top to bottom as 0, 1, ..., s − 1. An example is shown in Fig. 1. To represent the permutation map 
input 0 is connected to output 3, input 1 to output 4, and so on. Fig. 1 shows the composition of two permutations. The time taken to add two numbers is basically the time taken to compose two permutations, which depends on the kind of permutation network used. Although we show a direct connection between the inputs and outputs, this is in general not necessary. All that is required is that there be a path from each input to a distinct output as specified by the permutation map. One way to realize a permutation network such as the one shown in Fig. 1 is to use a rearrangeable interconnection network [4], [29]. However, for n inputs, these networks use O(n log n) switches, and they are not an attractive choice for realizing cyclic groups of order n, as the latter require only n permutations. For cyclic groups such as those we consider in this paper, a cyclic permutation network, performing n permutations, is sufficient, as implied by the Cayley Theorem. However, even with n permutations, the realization can become prohibitively difficult as n gets large if we use the Cayley isomorphism directly. The reason is that the Cayley isomorphism of a cyclic group requires n permutations of degree n, where n is the order of the group. This problem can be overcome by decomposing the group into a set of subgroups whose orders are relatively prime. We then apply the Cayley regular representation to each subgroup.
is the set of all r-tuples for which the ith component is an element of G i and the operation is componentwise, that is,
In general, $G_1, G_2, \ldots, G_r$ are groups with the same operation but different orders.
An example is shown in Fig. 2. $(Z_{30}, +_{30})$ is decomposed into three smaller groups, $(Z_2, +_2)$, $(Z_3, +_3)$, and $(Z_5, +_5)$. Each of these groups is then mapped into a permutation group by using the Cayley Theorem. In terms of permutations, three disjoint cycles of order 2, 3, and 5 are used to represent the groups $(Z_2, +_2)$, $(Z_3, +_3)$, and $(Z_5, +_5)$, respectively. Now, we extend the above direct product group into a direct product of commutative rings as follows.
The zero is $(0, 0, \ldots, 0)$ and the identity element is $(1, 1, \ldots, 1)$. It is readily seen that $R_1 \times R_2 \times \cdots \times R_r$ is also a commutative ring with identity element.
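The decomposition of $(Z_{30}, +_{30})$ can be verified with a few lines of Python. This is only a sketch; the names `MODULI`, `to_tuple`, and `add_componentwise` are ours.

```python
MODULI = (2, 3, 5)  # pairwise relatively prime, product 30


def to_tuple(x):
    """Map an element of Z_30 to its components in Z_2 x Z_3 x Z_5."""
    return tuple(x % m for m in MODULI)


def add_componentwise(a, b):
    """Group operation of the direct product: add each component modulo its own m."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))


# The 30 tuples are distinct, so the map is one-to-one and onto.
assert len({to_tuple(x) for x in range(30)}) == 30

# The map preserves addition, so it is an isomorphism.
assert all(to_tuple((x + y) % 30) == add_componentwise(to_tuple(x), to_tuple(y))
           for x in range(30) for y in range(30))

print(to_tuple(17), to_tuple(25), add_componentwise(to_tuple(17), to_tuple(25)))
# (1, 2, 2) (1, 1, 0) (0, 0, 2)
```

The three component groups act on 2 + 3 + 5 = 10 lines in total, compared with 30 lines for a direct regular representation of $(Z_{30}, +_{30})$.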
Input Encoding and Output Decoding
The input encoding procedure is based on the following theorem [25] . 
The isomorphism is defined as
The r-tuple $(x_1, x_2, \ldots, x_r)$ is called the residue code of X. The relationship between X and its residue codes is bridged by the Chinese Remainder Theorem [25].
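A minimal Python sketch of residue encoding and of recovery through the Chinese Remainder Theorem follows. It uses the textbook reconstruction formula $X = \sum_i x_i M_i (M_i^{-1} \bmod m_i) \bmod M$ with $M_i = M/m_i$, which we assume is equivalent to the form used in the paper; the function names are hypothetical. The modular inverse via `pow(a, -1, m)` needs Python 3.8 or later.

```python
from math import prod


def encode(X, moduli):
    """Residue code (x_1, ..., x_r) of X for pairwise relatively prime moduli."""
    return tuple(X % m for m in moduli)


def decode(residues, moduli):
    """Recover X in [0, M) from its residue code with the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for x, m in zip(residues, moduli):
        Mi = M // m
        total += x * Mi * pow(Mi, -1, m)   # pow(.., -1, m) is the inverse modulo m
    return total % M


moduli = (3, 5, 7)
code = encode(87, moduli)
print(code, decode(code, moduli))  # (0, 2, 3) 87

# Arithmetic is carried out digit by digit in the residue domain.
a, b = encode(9, moduli), encode(11, moduli)
product = tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))
print(decode(product, moduli))     # 99
```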
ARITHMETIC UNIT DESIGN
This section introduces various optoelectronic arithmetic circuits. All of these optoelectronic arithmetic circuits consist of three basic components: cyclic permutation networks, directional coupler switches, and Y-junctions.
Basic Components

Cyclic Permutation Networks
In this section, we will show how to realize a cyclical permutation network, which will be used to realize cyclic groups. A cyclical (right shift) permutation network with degree m, denoted by CRPN(m), has m inputs and m outputs and is a cascade of $\lceil \log m \rceil$ switches, numbered from left to right $\lceil \log m \rceil - 1, \ldots, 2, 1, 0$ in that order, each having m inputs and m outputs, such that the switch in stage k, for all $k = 0, 1, \ldots, \lceil \log m \rceil - 1$, has two switching states:
state 0: input i is passed directly to output i, for all $i = 0, 1, \ldots, m-1$;
state 1: input i is connected to output j such that
$$j = (i + 2^k) \bmod m$$
for all $i = 0, 1, \ldots, m-1$.
It is easy to verify that a right circular shift of the sequence of elements $0\,1\,2 \cdots m-1$ r times, $0 \le r \le m-1$, can be realized on this network by expressing r in binary as $a_0 2^0 + a_1 2^1 + \cdots + a_{\lceil \log m \rceil - 1} 2^{\lceil \log m \rceil - 1}$ and setting all those switches for which $a_i = 0$ to state 0 and those for which $a_i = 1$ to state 1.
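The switch-setting rule can be simulated directly. The sketch below assumes that state 1 of stage k connects input i to output $(i + 2^k) \bmod m$, which is what the modulo 5 adder example later in the paper suggests; `crpn_shift` is a hypothetical name, and the model ignores all optical details.

```python
def crpn_shift(lines, r):
    """Pass `lines` (one signal per line) through a CRPN(m) set up for a right shift by r."""
    m = len(lines)
    if not 0 <= r < m:
        raise ValueError("shift amount must satisfy 0 <= r <= m - 1")
    stages = max(1, (m - 1).bit_length())       # equals ceil(log2 m) for m >= 2
    for k in reversed(range(stages)):           # stages are traversed from left to right
        if (r >> k) & 1:                        # control bit a_k = 1: state 1
            step = (2 ** k) % m
            shifted = [None] * m
            for i, signal in enumerate(lines):
                shifted[(i + step) % m] = signal
            lines = shifted
        # control bit a_k = 0: state 0, every input passes straight through
    return lines


print(crpn_shift(list("abcde"), 3))  # ['c', 'd', 'e', 'a', 'b']

# Every shift amount r is realized correctly for several network sizes.
for m in range(2, 10):
    for r in range(m):
        out = crpn_shift(list(range(m)), r)
        assert all(out[(i + r) % m] == i for i in range(m))
```

With r = 3 and m = 5, input 0 emerges at output 3 and input 1 at output 4, using only three stages.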
Directional Coupler Switch
The directional coupler switch can be used like a 2-by-2 switch with two states: the straight state and cross state [13] , [24] . In the straight state, the signal of the upper input goes to the upper output while the signal of lower input goes to the lower output. In the cross state, the signal of the upper input goes to the lower output while the signal of lower input goes to the upper output, as shown in Fig. 3a .
The logic model for the directional coupler switch is shown in Fig. 3b . By properly controlling their states, directional coupler switches can be used to implement permutations on permutation networks. Two factors affect the fabric size of integration when using directional coupler switches. These are the attenuation of signals passing through the device and the cross talk inside the device [11] . However, these two factors are minimal when the underlying network topology has a logarithmic depth and if we can ensure that no two inputs on a directional coupler switch are active at the same time [11] .
Here, only a subset of the functions of a directional coupler switch is required to design arithmetic circuits. More precisely, we will set b = 0 at all times while using the directional coupler switches. Since all cyclic permutation networks used to design arithmetic circuits have a logarithmic depth and one of the inputs of each directional coupler switch is set to 0, that is, no two inputs of the same directional coupler switch in the cyclic permutation networks are active simultaneously, the signal attenuation and cross talk are minimal.
The other limits on the fabric size of integration are the large length of directional coupler switches in relation to their width and the large minimum bending radius of the diffused waveguides. All these constraints add up to a maximum integration array size of 32 by 32 [11]. However, from Table 1 we see that the product range of the first 10 primes is already about $2^{32}$ and the largest prime is only 29. Therefore, the arithmetic circuits proposed in this paper are feasible within the domain of current integrated optics.
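The range claim is easy to confirm. The following check assumes that the moduli in question are the ten smallest primes; the variable names are ours.

```python
from math import prod

first_ten_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
M = prod(first_ten_primes)

print(M)                      # 6469693230
print(M > 2 ** 32)            # True, since 2**32 = 4294967296
print(max(first_ten_primes))  # 29, so every subnetwork fits in a 32-by-32 array
print(sum(first_ten_primes))  # 129 lines in total across all ten subnetworks
```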
Another component needed to realize a cyclic permutation network in optics is a Y-junction, which is a special kind of optical coupler, and sometimes called a combiner [1] . This device joins two signals into one that can then propagate to the next stage. Since the technology of integrated optics is still in its developing stages, the signal losses of directional coupler switches and Y-junctions are high but they are likely to be reduced to some extent in the near future. For practical purposes, if it is necessary, optical regenerators may be used to recover the signals [1] .
Input Encoder and Output Decoder
Occasionally the operands must be converted to residue representation and the results in residue representation must be converted to binary. We use encoder and decoder circuits to convert between binary and 1-out-of-m position codes. More specifically, the 1-out-of-m encoder in the input stage codes its $\lceil \log m \rceil$-bit binary input into a 1-out-of-m position code, and the 1-out-of-m decoder in the output stage decodes its 1-out-of-m position code into its $\lceil \log m \rceil$-bit binary equivalent. Positions are numbered 0, 1, ..., m − 1, and position i is identified with binary input i.
Since the encoder and decoder are electronic circuits, conversions from optical to electronic signals, and vice versa, are necessary in these circuits. We will see that optical-to-electronic and electronic-to-optical circuits are only needed in the encoder and decoder stages.
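The two code conversions can be sketched in Python as follows. This is a purely logical model with hypothetical function names; it says nothing about the electronic circuits themselves.

```python
def encode_1_out_of_m(value, m):
    """Turn a binary value in 0..m-1 into a position code with exactly one active line."""
    if not 0 <= value < m:
        raise ValueError("value must lie between 0 and m - 1")
    return [1 if position == value else 0 for position in range(m)]


def decode_1_out_of_m(code):
    """Return the index of the single active line; reject any other pattern."""
    if sum(code) != 1:
        raise ValueError("not a valid 1-out-of-m code")
    return code.index(1)


code = encode_1_out_of_m(3, 5)
print(code)                     # [0, 0, 0, 1, 0]
print(decode_1_out_of_m(code))  # 3
```

Because a valid code word has exactly one active line, a pattern with no active line or with several active lines is rejected by the decoder in this model.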
Modulo m Addition
As implied by the Cayley Theorem, the set $Z_m = \{0, 1, \ldots, m-1\}$ under addition modulo m is isomorphic to a cyclic permutation group of order m. Thus, it is sufficient to consider the implementation of a cyclic permutation group on the CRPN(m). Here, the implementation of a cyclic group refers to composing any two elements of the group to obtain another element in the group. This is to be done by entering the elements into the network, and the network will produce as output the composition of the elements. We first note that a cyclic group $(G, \bullet)$ of order m is generated by a generator g and its elements are $g^0, g^1, \ldots, g^{m-1}$. The product $g^x \bullet g^y$ is then computed by first cyclically right shifting the inputs x times, then copying the outputs back onto the inputs, and finally cyclically right shifting the inputs y times. This design reduces the hardware cost, but like the first design it requires a total of x + y cyclical right shifts. The number of shifts can be reduced to y by noting that one of the elements, say $g^x$, can be entered into the network directly by coding it in terms of binary m-tuples in which exactly one entry is 1. The position of that entry is specified by x.
The central component of this implementation is the CRPN(m), which is constructed as described above. Given two elements $g^x$ and $g^y$, the network receives exponent x to activate its xth input. The network then cyclically shifts the "1" down y positions and outputs it at output $x +_m y$. For $(Z_n, +_n)$, the operation is addition, so we use multiples instead of powers. We set its generator to "1" so that the operation $g^x \bullet g^y$ can be performed simply by using $x +_m y$. Fig. 4 shows an example of modulo 5 addition. To illustrate how the adder operates, suppose that the operation $1 +_5 3$ is required. Then input x (in this case 1) activates line 1. The other input y (in this case 3) is applied to the control inputs of the directional coupler switches through inverters and determines the switching state of the network. The final result, which is shown by the darker line, is $1 +_5 3 = 4$. We note that all inputs of stage i are shifted right (down) by $2^i \bmod 5$, $0 \le i \le 2$.
We must note that for a fixed m, the shift-m complementer has a fixed pattern of connections that does not change with the value of y. We must also note that addition and subtraction can be combined by adding the 0 state to the shift-m complementer, i.e., by connecting its ith input to its ith output, $0 \le i \le m-1$.
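The adder can be modeled in a few lines of Python. In this sketch x selects the active line and the bits of y set the stages; for subtraction we assume that the shift-m complementer replaces y by $(m - y) \bmod m$, which is our reading of its role rather than a description of its wiring. The function names are hypothetical.

```python
def modular_add(x, y, m):
    """Model of x +_m y: line x carries the single '1', the bits of y control the stages."""
    lines = [1 if i == x else 0 for i in range(m)]
    stages = max(1, (m - 1).bit_length())        # ceil(log2 m) stages for m >= 2
    for k in reversed(range(stages)):
        if (y >> k) & 1:                          # stage k shifts down by 2**k mod m
            step = (2 ** k) % m
            lines = [lines[(j - step) % m] for j in range(m)]
    return lines.index(1)                         # position of the '1' is the result


def modular_subtract(x, y, m):
    """Model of x -_m y, assuming the complementer maps y to (m - y) mod m."""
    return modular_add(x, (m - y) % m, m)


print(modular_add(1, 3, 5))       # 4
print(modular_subtract(1, 3, 5))  # 3
```

Only one line is active at any stage, which matches the condition that no two inputs of a directional coupler switch are active simultaneously.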
Modulo m Multiplication
The set $\{1, 2, \ldots, m-1\}$ under multiplication modulo m forms a cyclic group of order m − 1 when m is a prime. A generator of this group is an element h of order m − 1, that is, $h^{m-1} \equiv 1 \pmod{m}$ and $h^e \not\equiv 1 \pmod{m}$ for $0 < e < m-1$. In general, there is no straightforward method to determine the generators of a multiplicative group. However, for small primes, one can always find the generators by trial and error. Let $Z_m^*$ denote the multiplicative group mod m with a generator h. To realize $Z_m^*$ on the CRPN(m) (in reality, the network in this case has m − 1 inputs, but for notational convenience we will refer to it as a CRPN(m)), we set up an ... decoder and 1-out-of-m encoder are used to convert y into a power of the generator. In all these cases, the connections of the mip and mop boxes are fixed for a given m and a generator. Hence, they do not incur any additional cost. Fig. 7 illustrates the multiplication process on the CRPN(m) for x = 4, y = 3 and when the generator is 2.
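The idea of multiplying by adding exponents of a generator can be sketched as follows. The tables `index` and `power` stand in for the fixed input and output permutations; their correspondence to the mip and mop boxes is our assumption, as is the choice m = 5 for the example with x = 4, y = 3, and generator 2. The function names are hypothetical.

```python
def find_generator(m):
    """Trial-and-error search for a generator of the multiplicative group modulo a prime m."""
    for h in range(2, m):
        if len({pow(h, e, m) for e in range(1, m)}) == m - 1:
            return h
    raise ValueError("no generator found; m is assumed to be a prime greater than 2")


def modular_multiply(x, y, m, h):
    """Multiply nonzero residues x and y by adding their exponents modulo m - 1."""
    power = [pow(h, e, m) for e in range(m - 1)]         # exponent -> element
    index = {value: e for e, value in enumerate(power)}  # element -> exponent
    return power[(index[x] + index[y]) % (m - 1)]


print(find_generator(5))             # 2
print(modular_multiply(4, 3, 5, 2))  # 2, since 4 * 3 = 12 and 12 mod 5 = 2
```

The exponent addition modulo m − 1 is exactly the operation that a cyclic permutation network with m − 1 inputs performs; the residue 0 is not in the group and would need separate handling, which this sketch omits.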
INPUT ENCODING AND OUTPUT DECODING
Now that we have completed our discussion of how to implement various algebraic operations on a CRPN(m), we will show how to convert between residue codes and binary number systems on such a network. The problem of converting a binary number into its residue codes is called the input translation or input encoding problem. Likewise, the problem of converting residue codes into their binary form is called the output translation or output decoding problem [15].
Input Encoding
The first stage in the ith network computes
) mod m i , and inductively the last stage computes the entire expression.
Output Decoding
The output decoding circuit is based on the idea of base extension and scaling by $2^t$ [26]. First, we construct the residue decoder using an extension of Garner's algorithm [9]. Let $(x_1, x_2, \ldots, x_r)$ be the residue codes of X with moduli $m_1, m_2, \ldots, m_r$, respectively. Garner's algorithm receives $x_1, x_2, \ldots, x_r$ as input and produces a single output. The algorithm given below is a modified version of Garner's algorithm and computes t bits of the binary output at a time, where t is some positive number between 1 and $\log M$, and $M = m_1 m_2 \cdots m_r$. This step is carried out off line and is not part of the residue decoder.
Step 2:
Step 3: Base extension to mod $2^t$.
Step 5: Repeat Steps 3 and 4 for $\frac{\log M}{t} - 1$ times.
Compute
End
The correctness of the algorithm can be found in [9] , [26] . Fig. 9 depicts a network implementation of the residue decoder for r = 3, m 1 = 3, m 2 = 5, m 3 = 7, and t = 3. The number X (take (0, 2, 3) = 87 as an example) enters the circuit on the left in residue codes, and exits it in binary on the right in three iterations; the first iteration computes the least significant three bits (= 111), the second iteration computes the next three significant bits (= 010), and the last iteration computes the most significant bits (= 001).
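The example above can be reproduced with the following Python sketch. It is not the paper's step-by-step algorithm: it combines a standard mixed-radix (Garner) conversion for the base extension to mod $2^t$ with a scaling step that divides by $2^t$ in every residue channel, which requires all moduli to be odd. All function names are hypothetical, and `pow(a, -1, m)` needs Python 3.8 or later.

```python
from math import prod


def mixed_radix_digits(residues, moduli):
    """Garner-style digits a_i with X = a_0 + a_1*m_0 + a_2*m_0*m_1 + ..."""
    digits = []
    for i, (x, m) in enumerate(zip(residues, moduli)):
        acc, weight = 0, 1
        for a, mj in zip(digits, moduli[:i]):   # contribution of the known digits, mod m
            acc = (acc + a * weight) % m
            weight = (weight * mj) % m
        digits.append(((x - acc) * pow(weight, -1, m)) % m)
    return digits


def base_extend(residues, moduli, new_modulus):
    """Compute X mod new_modulus using only small-modulus arithmetic."""
    value, weight = 0, 1
    for a, m in zip(mixed_radix_digits(residues, moduli), moduli):
        value = (value + a * weight) % new_modulus
        weight = (weight * m) % new_modulus
    return value


def decode_t_bits_at_a_time(residues, moduli, t):
    """Return the t-bit chunks of X, least significant chunk first (odd moduli assumed)."""
    bits = (prod(moduli) - 1).bit_length()
    rounds = -(-bits // t)                      # ceiling of bits / t
    residues, chunks = list(residues), []
    for _ in range(rounds):
        low = base_extend(residues, moduli, 2 ** t)          # X mod 2^t
        chunks.append(low)
        # Scaling: X <- (X - low) / 2^t, done independently in each channel.
        residues = [((x - low) * pow(2 ** t, -1, m)) % m
                    for x, m in zip(residues, moduli)]
    return chunks


chunks = decode_t_bits_at_a_time((0, 2, 3), (3, 5, 7), 3)
print([format(c, "03b") for c in chunks])           # ['111', '010', '001']
print(sum(c << (3 * i) for i, c in enumerate(chunks)))  # 87
```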
Basically, a network arithmetic unit is a CRPN(m) modulo m subtracter cascaded with a constant modulo m multiplier. Since one input of the modulo m multiplier is a constant, it can be implemented by a fixed set of connections as in the shift-m complementer case (Fig. 6). A network arithmetic unit that is set to compute $2(y_2 -_5 r_1)$ is shown in Fig. 10.
PERFORMANCE ANALYSIS
Now we consider the hardware cost and delay for the arithmetic circuits designed above. Let e j e j , where n = log M.
Thus,
The delay through the network is given by the delay through the largest subnetwork, i.e., subnetwork CRPN(m r ). e j e j , we have
The delay through the multiplier is 
Recalling that $n = \lceil \log M \rceil$, and that $m_1 + m_2 + \cdots + m_r = O(\log^2 M / \log\log M)$ from [16], the cost of this network is
REMARK 1. We note that this input encoder uses 1-out-of-$m_i$ codes. This allows it to detect all unidirectional errors concurrently [17]. If we relax this property, then a binary tree algorithm can be used to reduce the cost and delay complexities of the input encoder to $O(n^2)$ and $O(\log n)$, respectively [16].
... the last summand is $O(tn^2/\log n + n^2)$ as established in [16]. Summing these three terms together we find
$$C_{\mathrm{OUTPUT}}(M) = O(n^3/\log n + t n 2^t/\log n + t n^2/\log n + n^2). \tag{26}$$
Now as for the delay through the network, we note that inputs are circulated through the network 
Now, if we let $t = \log\log M = \log n$, then we have
$$D_{\mathrm{OUTPUT}}(M) = O(n^2/\log n).$$
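For completeness, the same choice $t = \log n$ (so that $2^t = n$) can be substituted into the cost bound (26). The following worked step is our own expansion of that substitution, not a formula quoted from the paper:

```latex
\begin{align*}
C_{\mathrm{OUTPUT}}(M)
  &= O\!\left(\frac{n^3}{\log n} + \frac{t\,n\,2^t}{\log n}
      + \frac{t\,n^2}{\log n} + n^2\right) \\
  &= O\!\left(\frac{n^3}{\log n} + \frac{(\log n)\,n \cdot n}{\log n}
      + \frac{(\log n)\,n^2}{\log n} + n^2\right)
      \qquad (t = \log n,\ 2^t = n) \\
  &= O\!\left(\frac{n^3}{\log n} + n^2\right)
   = O\!\left(\frac{n^3}{\log n}\right).
\end{align*}
```

The first term dominates, which gives the $O(n^3/\log n)$ hardware cost for the output decoder.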
REMARK 2. These cost and delay complexities can be reduced to $O(n^2 \log n)$ and $O(\log^2 n)$ if we use an algorithm based on the Chinese Remainder Theorem [16]. However, an outstanding feature of the output decoder presented here is that, like the input encoder, it also has the capability of detecting all unidirectional errors concurrently [17]. Therefore, all modules described in this paper can be combined into a system that has the capability of detecting all unidirectional errors concurrently.
In summary, modulo addition, subtraction, and multiplication require $O(n^2)$ hardware and $O(\log n)$ delay in the residue code domain, and $O(n^3/\log n)$ hardware and $O(n^2/\log n)$ delay in the binary domain. The increase of complexity in the binary domain is due to the required conversions between binary and residue codes. Table 2 compares these complexities with the complexities of previously published designs.
CONCLUSION
In this paper, cyclic permutation networks are defined and used to construct various arithmetic circuits. Unlike conventional arithmetic units, these arithmetic circuits are based on coding numbers into permutation maps, and then carrying out arithmetic operations by composing permutations. The resulting permutations are then converted back into sums and products. It has been established that, in terms of order of complexity, optical implementations of these arithmetic circuits are more efficient and faster than the previously reported arithmetic architectures. Another advantage of these new arithmetic circuits is that they are inherently capable of concurrent error detection [17]. Moreover, they can be used to construct inner product processors with O(1) computation time [18].
One potential drawback of these new arithmetic circuits is their time overhead for converting between binary and residue codes. This may not be as critical in signal processing applications as it is in general purpose computations. Nonetheless, in the case of general purpose computations, much of the potential performance degradation due to conversion time overhead can be alleviated by pipelining the conversion steps over input encoding and output decoding circuits. 
