Abstract. This paper investigates the hardware implementation of arithmetical operations (multiplication and inversion) in symmetric and alternating groups, as well as in binary permutation groups (permutation groups of order $2^r$). Various fast and space-efficient hardware architectures are presented. High speed is achieved by employing switching networks, which effect multiplication in one clock cycle (full parallelism). Space efficiency is achieved by choosing, on the one hand, proper network architectures and, on the other hand, a proper representation of the group elements. We introduce a non-redundant representation of the elements of binary groups, the so-called compact representation, which allows low-cost realization of arithmetic for binary groups of large degrees such as 128 or even 256. We present highly optimized multiplier architectures operating directly on the compact form of permutations. Finally, we give complexity and performance estimations for the presented architectures.
Introduction
Several cryptosystems, such as RSA, elliptic curve systems, IDEA or SAFER, utilize operations in algebraic domains like polynomial rings or Galois fields. Efficient implementations of the basic arithmetical operations in those domains have been extensively studied, but not much attention has been paid to simpler constructs like permutation groups. Our research on permutation group arithmetic has been motivated by the implementation of a secret-key cryptosystem called PGM (Permutation Group Mapping) [11, 12], which utilizes certain generator sets, called group bases, for encryption.
Briefly, a basis for a permutation group G is an ordered collection β = (B 0 
(c)).
To accommodate the cryptosystem to some binary cleartext and ciphertext space $M = C = \mathbb{Z}_{2^k}$, an additional, fixed mapping $\lambda : M \to X$ has to be effected prior to, and $\lambda^{-1}$ after, the actual encryption. It is very natural to represent permutations in a computer in the so-called Cartesian form. Section 2 introduces the basic principle of multiplying two Cartesian permutations in a switching network. Section 3 presents different, mostly novel multiplier architectures operating in the symmetric group. Unfortunately, a symmetric carrier group $S_n$ has a serious drawback: any basis for $S_n$ ($n > 2$) has several blocks with length $r_i \neq 2^{k_i}$. It follows that $\lambda$, which can be seen as a conversion from a binary to a mixed radix $r = (r_0, r_1, \ldots, r_{w-1})$, is computationally rather intensive [14].
As opposed to a symmetric group, any basis for a binary group (a permutation group of order $2^r$) has block lengths $r_i = 2^{k_i}$, and thus the mapping $\lambda$ for such a carrier group is trivial. Since a binary group of degree $n$ is only a small subgroup of the symmetric group $S_n$, the use of some large degree ($n = 128 \ldots 256$) is indicated, which makes the multipliers designed for symmetric groups infeasible. The problem lies in the fact that the Cartesian representation of binary group elements contains a large amount of redundancy. In Sec. 4 we introduce a novel, non-redundant representation, the so-called compact representation, and present various multiplier architectures operating directly on the compact form of permutations. Finally, Sec. 5 gives complexity and performance estimations for the presented architectures. It turns out that a PGM system with a binary carrier group is indeed much more efficient than one based on a symmetric group of similar order.
Multiplication in Permutation Networks
To briefly recall, a permutation $p$ of degree $n$ is a bijection $p : L \to L$, where $L$ is a set of $n$ arbitrary symbols or points. In the arithmetic we propose, elements of the symmetric group $S_n$ are represented in the so-called Cartesian form. In this representation, $L = \{0, 1, \ldots, n-1\}$ and a permutation $p \in S_n$ is a vector $p = (p(0), p(1), \ldots, p(n-1))$ of the $n$ function values. Suppose now that the elements of vector $p$ are physically stored in their natural order in a block $P$ of registers, i.e. $P[i] = p(i)$.
Using the Cartesian representation, the product $q = a \cdot b$ can be computed by means of $n$ memory transfer operations:
$$Q[i] := B[A[i]]$$
for $0 \le i \le n-1$, where $A$, $B$ and $Q$ are the memory blocks storing $a$, $b$ and $q$, respectively. To simplify the notation, we do not distinguish register blocks from their content, but simply write $Q = A \cdot B$ to denote the product.
By definition, the inverse of a permutation $a$ is the permutation $q = a^{-1}$ for which $a \cdot a^{-1} = a^{-1} \cdot a = \iota$, where $\iota$ denotes the identity permutation (i.e. $\iota(i) = i$). The inverse $a^{-1}$ can be obtained in block $Q$ by applying the $n$ memory transfers $Q[A[i]] := i$. We denote the inverse simply as $Q = A^{-1}$. The memory transfers can be carried out either sequentially or in parallel. The former is the typical software implementation. The parallel implementation exploits the fact that the $n$ memory transfers are completely independent and can thus be carried out simultaneously in switching networks, as follows. For multiplication, the $A[i]$-th register of source block $B$ is connected to the $i$-th register of the destination block $Q$, i.e. $A$ is interpreted during routing as a vector of source addresses. After setting up the network, the content of $B$ is copied to $Q$ via the established connections, forming the product $A \cdot B$ in $Q$. Fig. 1a illustrates this principle on a small example. When $A$ is interpreted as a vector of destination addresses, the reverse connections are established. Copying the content of $B$ to $Q$ then yields the product $A^{-1} \cdot B$ in $Q$. By substituting $B = \iota$, the network delivers the inverse of $A$, as shown in Fig. 1b.
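The memory transfers described above can be sketched in software as follows. This is a minimal sequential illustration, assuming Python lists as register blocks; the function names are hypothetical and not taken from the paper.

```python
def multiply(a, b):
    """Return q = a . b via the transfers q[i] := b[a[i]] (a holds source addresses)."""
    return [b[a[i]] for i in range(len(a))]


def inverse(a):
    """Return q = a^-1 via the transfers q[a[i]] := i."""
    q = [0] * len(a)
    for i, target in enumerate(a):
        q[target] = i
    return q


def multiply_by_inverse(a, b):
    """Return q = a^-1 . b by reading a as destination addresses: q[a[i]] := b[i]."""
    q = [0] * len(a)
    for i, target in enumerate(a):
        q[target] = b[i]
    return q


a = [2, 0, 3, 1]
b = [3, 2, 1, 0]
identity = list(range(4))
print(multiply(a, b), inverse(a), multiply_by_inverse(a, identity))
```

Passing the identity as the second operand of `multiply_by_inverse` gives the inverse of `a`, which mirrors the substitution $B = \iota$ in the text.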
Arithmetic in Symmetric Groups
According to the computing principles introduced above, a multiplier network for $S_n$ should be an $n$-input, $n$-output network (briefly, an $(n, n)$ network). For the sake of full parallelism, the network must be able to connect the $n$ inputs to the $n$ outputs simultaneously, that is, without collision (or blocking) at any of the links. Since the network has to be completely re-routed after each multiplication, rearrangeable networks are preferable to the more complex non-blocking networks [3]. Moreover, since the routing operand may come from the entire symmetric group $S_n$, the network must be able to realize all possible $n$-to-$n$ connections. Exactly these characteristics describe a specific class of switching networks, called permutation networks.
Crossbar Networks
The crossbar network is the most fundamental single-stage permutation network [3]. As depicted in Fig. 2, it consists of $n \times n$ switches in matrix form, which pass the signals from input port $C$ to output port $Q$. In the following, we propose three different schemes for fast routing of the network.
In the first routing scheme, the pure crossbar network is equipped with a routing port $A$ and with corresponding horizontal lines, each controlling an $n$-to-1 multiplexer (mux), as shown in Fig. 2a. According to the control signal, each mux selects one of the $n$ signals of port $C$ and forwards it to port $Q$. Note that in this mechanism the data items entered at port $A$ are used as source addresses. Hence, when entering Cartesian permutations at $A$ and $C$, the network computes the product $Q = A \cdot C$. The second routing scheme in Fig. 2b adds routing port $B$ and corresponding vertical lines to the pure crossbar network. Each of these lines controls a 1-to-$n$ demultiplexer (dmux), which transmits the input signal through one of the $n$ output lines towards $Q$ while disconnecting it from all other output lines. Note that the data items of $B$ are interpreted in this mechanism as destination addresses. Accordingly, when entering Cartesian permutations at $B$ and $C$, the network computes the product $Q = B^{-1} \cdot C$. A combination of the above two routing schemes yields the third one, shown in Fig. 3. Both routing ports $A$ and $B$ are included here, and are connected to horizontal and vertical addressing lines, respectively. In addition, each switching cell is equipped with an equivalence comparator, which compares the addresses received from the neighboring addressing lines. If the addresses are equal, the comparator closes the switch; otherwise it opens it. By entering Cartesian permutations at ports $A$, $B$ and $C$, the result obtained at port $Q$ is the product $Q = A \cdot B^{-1} \cdot C$. A more detailed description of this architecture and of a bit-parallel realization has been published in [13].
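The third routing scheme can be mimicked by a small behavioural model. The sketch below assumes that row $i$ of the matrix feeds output $Q[i]$ and column $j$ is fed by input $C[j]$; the function name `crossbar_product` is hypothetical, and the model ignores all timing and bit-level detail.

```python
def crossbar_product(a, b, c):
    """Behavioural model of the comparator crossbar: returns q = a . b^-1 . c.

    Row i receives address a[i], column j receives address b[j]; the cell (i, j)
    closes its switch exactly when both addresses are equal.
    """
    n = len(a)
    q = [None] * n
    for row in range(n):              # one output line per row
        for col in range(n):          # one input line per column
            if a[row] == b[col]:      # equivalence comparator closes the switch
                q[row] = c[col]       # signal C[col] reaches output Q[row]
    return q


a = [2, 0, 3, 1]
b = [0, 2, 1, 3]
c = [3, 0, 2, 1]
identity = list(range(4))
print(crossbar_product(a, b, c))
```

Setting `b` to the identity reduces the model to the first scheme ($Q = A \cdot C$), and setting `a` to the identity reduces it to the second ($Q = B^{-1} \cdot C$).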
Sorting Networks
The Beneš network [3, 5] is known to be the most efficient rearrangeable multistage network topology based on elementary (2,2) switching cells. However, its routing algorithm, the so-called looping algorithm [5], is intrinsically sequential and can only be effected in a centralized control unit. Accordingly, the mechanism is rather slow and thus not suitable for a multiplier network. On the other hand, there exists a large class of multistage networks, the so-called digit-controlled (or delta) networks, which possess a very convenient routing algorithm, the so-called destination-tag routing (or self-routing) [5, 7, 8, 10]. This distributed control mechanism is very fast, since the individual cells decide independently and simultaneously. Unfortunately, delta networks are blocking networks. A sorting network is an $(n, n)$ multistage network effecting some deterministic sorting algorithm [2, 4, 6, 9]. The network is built from elementary (2,2) compare-exchange modules. Each module compares the two incoming numbers and routes them according to their magnitudes. No matter in which order the numbers are entered at the input, the network applies the proper permutation to them and delivers the sorted sequence. Hence, any sorting network can be regarded as a rearrangeable permutation network.
Though all known sorting networks are more complex than the Beneš network, they offer a way for destination-tag routing, as follows. Figure 4 illustrates the method on a small example. In Fig. 5 we introduce two classical sorting networks. The odd-even transposition sorter is the parallel implementation of the insertion sort and, at the same time, of the selection sort algorithm. The $n$ input numbers are sorted in $n$ stages, comprising $n(n-1)/2$ modules. Accordingly, we say that the network has depth $n$ and complexity $n(n-1)/2$. The arrows in the symbols of the compare-exchange modules indicate the direction in which the larger numbers are forwarded. Note that this network has a completely "straight" wiring topology, which is advantageous in view of wiring area. The bitonic sorter as well as Batcher's odd-even sorter [9] are known to be the most efficient regular topologies, having $O(\log^2 n)$ stages in a recursive structure. Note that many lines cross between certain stages, which directly increases the wiring area. In a straightforward realization of the compare-exchange modules, comparison is carried out first, and the result is then used to set the switches. In this method, no data can be transferred until the comparison is completed. Considerable acceleration can be achieved by recognizing that comparison can be performed sequentially, scanning from the MSBs towards the LSBs of the input numbers $X$ and $Y$, according to the following algorithm:
1. As long as $X_i = Y_i$ while scanning bits in decreasing order of $i$, it does not matter how the switch is set, and thus $X_i$ and $Y_i$ can be passed to the next stage;
2. as soon as a difference is noticed at bit $j$, i.e. $X_j \neq Y_j$, all switches for bits $i \le j$ can be set to the same state, which is determined by the relation of $X_j$ and $Y_j$.
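The two rules above can be traced in a bit-serial software model. This is only a sketch under the assumption that the module has a "high" and a "low" output and that the unswitched state passes $X$ to the high output; the function name is hypothetical.

```python
def bit_serial_compare_exchange(x, y, width):
    """Return (low, high) for two width-bit numbers, deciding the switch MSB first."""
    low = high = 0
    swap = None                           # None: switch state not yet decided
    for i in reversed(range(width)):      # scan from the MSB towards the LSB
        xi, yi = (x >> i) & 1, (y >> i) & 1
        if swap is None and xi != yi:     # rule 2: first differing bit fixes the state
            swap = (yi == 1)              # y is larger, so y must go to the high output
        # Rule 1: while undecided, xi == yi, so either setting forwards the same bits.
        hi_bit, lo_bit = (yi, xi) if swap else (xi, yi)
        high = (high << 1) | hi_bit
        low = (low << 1) | lo_bit
    return low, high


print(bit_serial_compare_exchange(0b1010, 0b1100, 4))
```

The point of the model is that every bit is forwarded in the same iteration in which it is read, so no bit has to wait for the full comparison.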
Separation Networks
Sorting networks are able to sort arbitrary number sequences. Note, however, that Cartesian permutations are special sequences in which each number of the range $0 \ldots n-1$ occurs exactly once. We call this kind of sequence a permutation sequence. In the following, we introduce a class of novel permutation network architectures which exploit this property to reduce hardware complexity. The new networks employ the radix sorting algorithm [1] for routing: destination addresses are represented as binary strings, starting with the MSB as first letter. Sorting proceeds as follows: first, the strings starting with a '1' are separated from those starting with a '0'. In the second step, both of the resulting subsequences are split up further so that strings having '1' as second letter are separated from those having '0' at the same position. The divide-and-conquer principle is followed in this way until the last step, where strings with trailing '1' are separated from those with trailing '0'. Since a permutation sequence contains a predetermined set of strings, the number of strings with '1' and with '0', respectively, at any particular position is constant, irrespective of the actual sequence. Due to this fact, the length of the separated subsequences is known and constant for all separation steps. In the specific case of $n = 2^m$, all separated subsequences are balanced, i.e. contain exactly as many 1's as 0's at any particular position. This property is the basis for the design of separator networks. Each separation step is effected in a dedicated separator stage. The first separator stage splits the input sequence into two halves of length $n/2$ (without actually achieving perfect ordering), the next stage produces subsequences of length $n/4$, and so on. Networks of degree $n \neq 2^m$ can be constructed by omitting parts of a network of degree $2^m$, where $n < 2^m$.
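The separation principle can be illustrated in software. The sketch below is a plain sequential radix sort on a permutation sequence of length $n = 2^m$, not a model of any particular separator hardware; the function names are hypothetical, and the internal assertion checks the balance property stated above.

```python
def separate(tags, bit):
    """Split tags by one bit: strings with '0' at that position, then those with '1'."""
    zeros = [t for t in tags if not (t >> bit) & 1]
    ones = [t for t in tags if (t >> bit) & 1]
    return zeros, ones


def radix_sort_permutation(tags, m):
    """Sort a permutation sequence of 0 .. 2^m - 1 in m separation steps, MSB first."""
    groups = [list(tags)]
    for bit in reversed(range(m)):
        next_groups = []
        for group in groups:
            zeros, ones = separate(group, bit)
            # Balance property for n = 2^m: both parts always have equal length.
            assert len(zeros) == len(ones)
            next_groups += [zeros, ones]
        groups = next_groups
    return [t for group in groups for t in group]


print(radix_sort_permutation([5, 2, 7, 0, 3, 6, 1, 4], 3))
```

Because the sizes of all subsequences are known in advance, each separation step can be mapped to a fixed block of hardware, which is what the separator stages exploit.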
The strength of the technique lies in the fact that any particular stage can achieve the separation by looking at corresponding single bits of the destination tags. The method can thus be considered as the generalization of the bit-controlled self-routing algorithm for permutation networks. Interestingly, comparing corresponding bits $X$ and $Y$ of two destination tags and routing them towards the proper outputs $H$ ("higher" value) and $L$ ("lower" value), respectively, requires no comparison logic at all. To see this, consider the truth table of the module: the switch state is forced only when $X \neq Y$, whereas the two cases with $X = Y$ are don't-cares. By choosing the switch state for the don't-cares so that it always follows $X$, the switches can be controlled directly by input bit $X$, whereas $H$ and $L$ can be formed by a single OR gate and AND gate, respectively. See [14] for implementation details.
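A behavioural sketch of such a "binary" comparator module is given below. It assumes that a switch control of 1 routes the $X$ input to output $H$ and a control of 0 routes it to $L$; this polarity and the function name are assumptions made for illustration only.

```python
def binary_compare_exchange(x_tag, y_tag, bit):
    """Route two destination tags according to a single tag bit.

    Returns (h_tag, l_tag, h_bit, l_bit): the tags leaving on outputs H and L,
    and the values of the examined bit as formed by the OR and AND gates.
    """
    x = (x_tag >> bit) & 1
    y = (y_tag >> bit) & 1
    h_bit = x | y                  # OR gate forms the examined bit on output H
    l_bit = x & y                  # AND gate forms the examined bit on output L
    # The switch is driven by x alone: x = 1 sends the X input to H, x = 0 to L.
    h_tag, l_tag = (x_tag, y_tag) if x else (y_tag, x_tag)
    return h_tag, l_tag, h_bit, l_bit


print(binary_compare_exchange(0b010, 0b110, 2))
```

In all four input combinations the gate outputs agree with the examined bit of the tag that was actually routed to that output, which is why no separate comparator is needed.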
In the following, we present a couple of novel separator network architectures. The first scheme is related to the bitonic sorter (Fig. 5b). The two bitonic sorters of length $n/2$ and a half-cleaner stage of this network form a so-called selection network [9], which separates the $n/2$ largest from the $n/2$ smallest elements. If a sequence of $n/2$ 1's and $n/2$ 0's is entered, the selection network separates the 1's from the 0's. Clearly, this is also achieved when the magnitude comparator modules are replaced with "binary" comparator modules. Such bitonic separator stages can be used to build a bitonic separator network, as illustrated in Fig. 6 for $n = 8$. The network has depth of order $O(\log^3 n)$.
Fig. 6. A separator network based on bitonic separators
Note that the (n/2,n/2)-sorters in front of the half-cleaner can actually be replaced by any kind of sorting network, for instance, by odd-even transposition sorters. We call the network obtained in this way the linear odd-even separator network. As the name suggests, it has depth of order O(n).
Separator stages can rely on other principles, too. The sorter depicted in Fig. 7 employs a novel separator type of depth $O(n)$, which we call a diamond separator. The underlying sorting principle is similar to that of the odd-even transposition sorter. The advantage of this architecture is the completely straight wiring pattern. Its drawbacks are that the network is rather deep and hard to lay out in a rectangular form.
The rotation separator offers lower depth, rectangular layout and still a "nearly" straight wiring topology. The separation principle can be followed in Fig. 8. Links running across the network are considered to be of two types: 0-lines, which are expected to deliver 0's at the output, and 1-lines, which should deliver 1's. A '1' on a 0-line is considered a 1-error, and a '0' on a 1-line a 0-error. Due to the balance in a permutation sequence of length $n = 2^m$, 1-errors are present in the same number as 0-errors at any particular stage. Each compare-exchange module receives input from a 0-line and a 1-line, and outputs to a 0-line and a 1-line. When a 1-error and a 0-error are received, they "neutralize" each other, i.e. both errors disappear.
Fig. 7. The "diamond" separator network with 8 inputs
The topology of the network implements the following strategy for eliminating all errors in the input sequence: 0-lines (potentially carrying 1-errors) are iteratively "rotated around" and combined pairwise with 1-lines (potentially carrying 0-errors). In order for all 0-lines to be combined with all 1-lines, $n/2$ rotation steps are needed, and hence the separator stage has depth $n/2$. The total depth of the entire network, summed over all separator stages, is $n/2 + n/4 + \ldots + 1 = n - 1$.
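The error-neutralizing strategy can be checked with a small simulation. The pairing rule used below (in step $t$, 0-line $i$ meets 1-line $(i + t) \bmod n/2$) is an assumption chosen so that every 0-line meets every 1-line exactly once; the actual wiring of Fig. 8 may differ, and the function name is hypothetical.

```python
def rotation_separator(bits):
    """Separate a balanced 0/1 sequence so that the 0-lines end with 0's, the 1-lines with 1's.

    The first half of `bits` enters on the 0-lines, the second half on the 1-lines.
    """
    half = len(bits) // 2
    zero_lines = list(bits[:half])        # expected to deliver only 0's
    one_lines = list(bits[half:])         # expected to deliver only 1's
    for step in range(half):              # n/2 rotation steps
        for i in range(half):
            j = (i + step) % half         # assumed pairing of 0-line i with 1-line j
            lo = zero_lines[i] & one_lines[j]
            hi = zero_lines[i] | one_lines[j]
            # A 1-error meeting a 0-error is exchanged; all other cases pass unchanged.
            zero_lines[i], one_lines[j] = lo, hi
    return zero_lines + one_lines


print(rotation_separator([1, 0, 1, 1, 0, 0, 1, 0]))
```

Within one step every line takes part in exactly one module, so each step corresponds to one stage of independent binary compare-exchange modules.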
Arithmetic in Binary Groups
As mentioned, any binary group of degree $n$ is a subgroup of $S_n$. Unfortunately, even a so-called Sylow-2 subgroup $H_s$ of $S_n$, which is of maximal order, is rather small; it has order $|H_s| = 2^{n-1}$ if $n = 2^s$. Hence, if a certain group size is required, the use of a binary group of some large degree is indicated. For instance, if a group order of at least $2^{127}$ is required, which is not unusual in cryptographic applications, either a symmetric group of degree $n = 34$ or a Sylow-2 subgroup of degree $n = 128$ may be chosen. Unfortunately, the storage of Cartesian permutations ($128 \cdot 7 = 896$ bits in the above example) as well as the multipliers based on permutation networks are very costly for binary groups of such large degrees. Note, however, that the Cartesian form is very redundant for representing binary group elements, and that the multiplier networks would be used rather inefficiently too, because most of the possible permutation patterns, namely those in $S_n$ but not in $H_s$, would never be configured.
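The figures quoted above can be reproduced with a few lines of integer arithmetic. The helper names below are hypothetical and serve only this check.

```python
import math


def smallest_symmetric_degree(order_bits):
    """Smallest n such that n! >= 2**order_bits."""
    n = 1
    while math.factorial(n) < 2 ** order_bits:
        n += 1
    return n


def cartesian_bits(n):
    """Storage of one Cartesian permutation: n entries of ceil(log2 n) bits each."""
    return n * (n - 1).bit_length()


def sylow2_order_bits(n):
    """log2 of the order of a Sylow-2 subgroup of S_n, for n a power of two."""
    assert n & (n - 1) == 0, "n must be a power of two"
    return n - 1


print(smallest_symmetric_degree(127), cartesian_bits(128), sylow2_order_bits(128))
```

For degree 128 the Cartesian form thus needs about seven times as many bits as the group order requires.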
A study of the indirect binary cube (ibc) network for $n = 2^s$ has shown that, though the set of permutations realized by the network is not a group, it embeds a Sylow-2 subgroup $H_s$ of $S_n$. Similar results can be obtained for the "inverse" of the ibc network, the so-called generalized cube (also called butterfly or SW-banyan) network, as well as for other $(n, n)$ delta networks, such as the omega, the baseline, the modified data manipulator (mdm) and their respective "inverses", the reverse omega (also called flip), the reverse baseline and the inverse mdm networks [5, 10]. The different delta networks realize various instances of $H_s$, while it is known that all Sylow-2 subgroups $H_s$ of $S_n$ are isomorphic.
A delta network of degree $n = 2^s$ comprises $s \cdot n/2$ switches in $s$ stages, and is thus considerably more efficient for $H_s$ than any of the permutation networks. We illustrate the construction of a multiplier on the ibc network of degree $n = 8$, depicted on the left of Fig. 9. The network contains 12 binary switches and, since it is a banyan network (i.e. there is one unique path from each input to each output), all of the $2^{12}$ different configurations realize different permutations. This permutation set of size $2^{12}$ is not a group, but it contains $H_3$, which is of order $2^7$. Clearly, some configurations will never be used if working in $H_3$. As illustrated in the figure, the ibc network can be configured by the bit-controlled self-routing algorithm, where switches in the first stage are controlled by the LSBs, and subsequent stages by succeeding bits of the destination tags. If the network is first routed with a Cartesian permutation $a \in H_3$ and then another permutation $b \in H_3$ is transferred, the network delivers, according to the general multiplication principle of Sec. 2, the product $q = a^{-1} \cdot b$. It turns out that if working in $H_3$, switches of certain switch groups are always set to a common state, and can thus be unified in one switching module, as depicted on the right side of the figure. The unified switches can be controlled by one common signal, which sets either a "straight" or a "swapping" connection pattern. We call an ibc network with unified switches a uibc network. The $2^{n-1} = 2^7$ different connection patterns of the uibc network realize exactly the elements of (a specific instance of) $H_3$.
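The behaviour described above can be explored with a small simulation for $n = 8$. Two things in the sketch are assumptions rather than facts taken from the paper: stage $k$ of the modelled ibc network pairs the lines whose positions differ in bit $k$, and the instance of $H_3$ is taken to be the automorphism group of a binary tree whose root splits on the MSB. All names are hypothetical.

```python
from itertools import product

S = 3                # number of stages
N = 1 << S           # network degree n = 8


def tree_permutation(c):
    """Cartesian permutation built from 7 swap bits of a binary tree (assumed H_3 instance).

    c[0] belongs to the root, c[1..2] to the nodes selected by bit 2 of i,
    c[3..6] to the nodes selected by bits 2 and 1 of i.
    """
    return [i ^ (c[0] << 2) ^ (c[1 + (i >> 2)] << 1) ^ c[3 + (i >> 1)] for i in range(N)]


def ibc_route(a, b):
    """Self-route payload b with destination tags a; return (outputs, switch states).

    Stage k sets bit k of each item's position to bit k of its tag.
    Raises ValueError if two items collide on one link (blocking).
    """
    lines = [(a[i], b[i]) for i in range(N)]
    states = []
    for k in range(S):
        nxt = [None] * N
        stage = {}
        for pos, (tag, payload) in enumerate(lines):
            dest = (pos & ~(1 << k)) | (tag & (1 << k))
            if nxt[dest] is not None:
                raise ValueError("collision in stage %d" % k)
            nxt[dest] = (tag, payload)
            if not pos & (1 << k):              # upper input decides the switch state
                stage[pos] = (tag >> k) & 1     # 0 = straight, 1 = swap
        lines = nxt
        states.append(stage)
    return [payload for _, payload in lines], states


def unified(states):
    """True if the stage-1 switches agree within each half and all stage-2 switches agree."""
    s1, s2 = states[1], states[2]
    return s1[0] == s1[1] and s1[4] == s1[5] and len(set(s2.values())) == 1


group = [tree_permutation(c) for c in product((0, 1), repeat=N - 1)]
outputs, states = ibc_route(group[5], group[37])
print(outputs, unified(states))
```

Under these assumptions every one of the 128 group elements routes without collision, the output satisfies $Q[a(i)] = b(i)$, i.e. $q = a^{-1} \cdot b$, and the switch states fall into $4 + 2 + 1 = 7$ independent groups.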
The control bits can be extracted from the Cartesian permutation $a$ by simply selecting certain bits of $a$. Actually, the $n - 1 = 7$ control bits can be seen as a special representation of the group elements of $H_3$, which we call the compact representation. From the fact that $|H_s| = 2^{n-1}$, it is seen that the compact representation is non-redundant and hence optimal. Expanding the compact form to the Cartesian form is similarly simple: it can be achieved by reproducing (copying) certain bits of the compact permutation. Note that the ease of the conversions is not a general feature but specific to the instance of $H_s$ induced by the use of the uibc network.
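A possible conversion between the two representations for $n = 8$ is sketched below. It assumes a tree-shaped instance of $H_3$ (root split on the MSB), which need not coincide with the instance used in the paper; in this assumed instance the expansion copies a control bit into each tag bit and complements it where the corresponding index bit is 1. The function names and the ordering of the seven bits are hypothetical.

```python
from itertools import product


def to_cartesian(c):
    """Expand 7 control bits into a Cartesian permutation of degree 8 (assumed instance)."""
    return [i ^ (c[0] << 2) ^ (c[1 + (i >> 2)] << 1) ^ c[3 + (i >> 1)] for i in range(8)]


def to_compact(a):
    """Select the 7 control bits from a Cartesian permutation of the assumed instance."""
    return [
        (a[0] >> 2) & 1,                            # one bit taken from bit 2 of a[0]
        (a[0] >> 1) & 1, (a[4] >> 1) & 1,           # two bits taken from bit 1
        a[0] & 1, a[2] & 1, a[4] & 1, a[6] & 1,     # four bits taken from bit 0
    ]


def multiply_compact(ca, cb):
    """Product of two compact permutations, computed here via the Cartesian form."""
    a, b = to_cartesian(ca), to_cartesian(cb)
    return to_compact([b[a[i]] for i in range(8)])


c = [1, 0, 1, 1, 0, 0, 1]
print(to_cartesian(c), to_compact(to_cartesian(c)))
```

The sketch stores 7 bits per element instead of the $8 \cdot 3 = 24$ bits of the Cartesian form; `multiply_compact` takes a detour through the Cartesian form purely for illustration, whereas the architectures in the text operate on the compact form directly.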
A great advantage of the compact representation is that it allows space-efficient storage of elements of $H_s$. Note furthermore that, since the Cartesian form of $b \in H_s$ is redundant, more bits than actually necessary are transferred by the uibc network while multiplying. By removing from the network those links and switching components which convey redundant bits of $b$, the complexity of the scheme can be significantly reduced. The optimized scheme transmits merely the compact form of $b$. The resulting multiplier network, called mulaib, is shown in Fig. 10 for $n = 8$. Figure 10 illustrates further multiplier and inverter architectures deduced from the ibc network and from its inverse, the generalized cube network, respectively. All architectures work directly on the compact form of operands.
Above we followed an illustrative approach to introduce the arithmetic for binary groups. An accurate description of the construction of the group $H_s$ underlying the arithmetic, a formal definition of the compact representation, proofs of the multiplication algorithms, further multiplier and inverter schemes, as well as a generalization of the theory to a large class of binary groups of arbitrary degree $n$ have been omitted here for lack of space, but can be found in [14].
Conclusions
In the following, we give complexity and performance estimations for the presented multiplier architectures. The examined multipliers operate in the symmetric group $S_{32}$ and in the binary group $H_7$ (degree $n = 128$), which have comparable orders: $|S_{32}| \approx 2^{117}$ and $|H_7| = 2^{127}$, respectively. Estimations of complexity and delays have been made for the 0.7 µm ES2 standard-cell CMOS technology of European Silicon Structures. The complexity of the typically extensive wiring of switching networks, indicated also by the measure wiring width, has been taken into account. The throughput of the networks has been calculated for a purely combinational, fully parallel, non-pipelined implementation. The estimation methodology as well as other implementation styles are detailed in [14]. Table 1 below summarizes the results. Among the multipliers for $S_{32}$, the crossbar architectures are very fast and cost-effective too. The bitonic separator network and the rotation separator network perform quite similarly, and are slightly smaller and slower than the well-known bitonic sorter. All multipliers for $H_7$ have extremely low gate complexity, whereas about 60 % of the total area is spent on global wiring in all designs. The reason that mulab performs significantly worse than mulaib is that the control signals to be distributed at a particular stage are produced by the preceding stage. Therefore, the delay of signal distribution adds to the total delay at each stage, a rather undesirable phenomenon.
To summarize, the multipliers for the binary groups outperform those for the symmetric group and, because of their $O(n \log n)$ complexity, the gain becomes even more striking for larger groups. We stress again that the fundamental invention which allows both space-efficient storage and efficient computation in binary groups is that of the compact representation.
