Abstract. A symmetric key cryptosystem, called PGM, based on logarithmic signatures for finite permutation groups was invented by S. Magliveras in the late 1970's. PGM is intended to be used in cryptosystems with high data rates. This requires exploitation of the potential parallelism in composition of permutations. As a first step towards a full VLSI implementation, a parallel multiplier has been designed and implemented on an FPGA (Field Programmable Gate Array) chip. The chip works as a co-processor in a DSP system. This paper explains the principles of the architecture, reports on implementation details and concludes by giving an estimate of the expected performance in VLSI.
Introduction
A symmetric key cryptosystem PGM based on logarithmic signatures for finite permutation groups was invented by S. Magliveras in the late 1970's. The system was described in [1], and its statistical and algebraic properties were studied in [2], [3], [4]. Recent significant results have been obtained on closely related material by S.A. Vanstone and by M. Qu [5]. Here we include only a short description of PGM. The corresponding decryption transformation is obtained by reversing the order of the pair of logarithmic signatures, i.e. $D_{\alpha,\beta} = E_{\alpha,\beta}^{-1} = E_{\beta,\alpha} = \hat{\beta}\hat{\alpha}^{-1}$.
To effect the fastest possible PGM encryption and decryption operations, one must efficiently compute products of permutations as in equation (1). Unlike multiplication of integers, composition of two permutations is inherently parallelizable. Hence, we can achieve fast computation of $\hat{\alpha}$ and its inverse by designing a permutation multiplier which takes advantage of this property of permutation composition. In this paper we describe a design for such a permutation multiplier, as a first step towards a full VLSI implementation of PGM.
For easy understanding, we shall explain the principles by means of a simple example. We consider permutations of degree 4, and represent them in Cartesian form, i.e. as the vector of images of the points $0, 1, \ldots, n-1$.
This form is particularly convenient for representing permutations in hardware, where a vector register of length $n$ is used to represent a permutation of degree $n$. For example, $\pi = [3,2,1,0]$ is our notation for the permutation $\pi = (0\,3)(1\,2)$ written as a product of disjoint cycles. In general, this representation needs $n \log_2 n$ bits to store a permutation of degree $n$. Throughout the example, we work with five input operands.
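The relation between the vector form and the cycle form, and the storage estimate, can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `to_cycles` and `storage_bits` are hypothetical and not part of the design, and the bit count assumes $\lceil \log_2 n \rceil$ bits per element.

```python
from math import ceil, log2


def to_cycles(perm):
    """Return the non-trivial disjoint cycles of a permutation given as a vector of images."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        i = start
        # Follow start -> perm[start] -> ... until the cycle closes.
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = perm[i]
        if len(cycle) > 1:  # fixed points are omitted, as in the usual cycle notation
            cycles.append(tuple(cycle))
    return cycles


def storage_bits(n):
    """Bits needed for one permutation of degree n (assumes ceil(log2 n) bits per element)."""
    return n * ceil(log2(n))


pi = [3, 2, 1, 0]
print(to_cycles(pi))     # [(0, 3), (1, 2)]
print(storage_bits(4))   # 8
print(storage_bits(32))  # 160
```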
The multiplication unit is in essence a crossbar switching network. A 4x4 switching matrix is depicted in Figure 1. The matrix has three input ports, labeled A, B and C respectively, and one output port named Q.

We remark here that the partial product $\kappa = \alpha \circ \beta^{-1}$ is implicitly stored in the state of the transfer gates, and can be retrieved by passing $C = \iota$, the identity permutation, through the matrix. Furthermore, it is possible to compute several products with the same first operand $\pi$ without setting up the matrix again. We call this kind of operation continuous mode. By dedicating separate lines to A, B, C and Q respectively, it becomes possible to overlap in time the pass-through phase of one multiplication and the setup phase of the next. This two-stage pipelining is shown in Figure 2. The state of the gates is always changed at the end of a phase, so pass-through operations can take place using the previous setup.
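A small software model may help to make the setup and pass-through phases concrete. It is a sketch under stated assumptions, not a description of the actual circuit: we assume that the gate at row $i$, column $j$ conducts when element $i$ of A equals element $j$ of B, and that Q then receives in row $i$ the element of C found in the conducting column. The class name `Crossbar` and the sample operands `alpha` and `beta` are hypothetical and are not the operands of the paper's example.

```python
class Crossbar:
    """Behavioural model of an n x n crossbar (assumed gate rule: A[i] == B[j])."""

    def __init__(self, n):
        self.n = n
        self.gates = [[False] * n for _ in range(n)]

    def setup(self, a, b):
        # Setup phase: gate (i, j) is closed exactly when A[i] equals B[j].
        self.gates = [[a[i] == b[j] for j in range(self.n)] for i in range(self.n)]

    def pass_through(self, c):
        # Pass-through phase: each row has exactly one closed gate when A and B
        # are permutations, so Q[i] is the element of C in that column.
        return [c[row.index(True)] for row in self.gates]


identity = [0, 1, 2, 3]
alpha = [1, 2, 3, 0]  # hypothetical operand
beta = [2, 0, 3, 1]   # hypothetical operand

xbar = Crossbar(4)

# The partial product alpha o beta^-1 is held in the gate state and is read
# out by passing the identity through the matrix.
xbar.setup(alpha, beta)
print(xbar.pass_through(identity))  # [3, 0, 2, 1]

# Continuous mode: one setup with first operand pi, several pass-throughs.
pi = [3, 2, 1, 0]
xbar.setup(pi, identity)
print(xbar.pass_through(alpha))  # [0, 3, 2, 1]
print(xbar.pass_through(beta))   # [1, 3, 0, 2]
```

In this model every output element is obtained independently of the others, which is the parallelism the hardware exploits.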
Implementation details
As a first step towards a VLSI implementation of PGM, a hybrid hardware-software prototype has been developed based on a Texas Instruments 320C30 DSP processor. Multiplication of permutations is effected in the permutation co-processor chip, which is connected to the DSP system via a 16 bit peripheral bus, called DSPLINK. The DSP accesses the co-processor through I/O instructions. The co-processor is an XC3190 FPGA (Field Programmable Gate Array), a product of the Xilinx Corporation. The FPGA is a perfect prototyping tool, in view of the flexibility it affords for design changes. However, the achievable complexity is rather low, only a few thousand gate equivalents. This constraint limits the degree $n$ of permutations that are processed on the chip to $n = 16$. In order to be able to carry out one setup or pass-through operation in each cycle, the operands have to be led through the crossbar network in parallel, i.e. needing $\log_2 n$ lines per operand element. For practical applications $n$ should be at least 32, thus requiring at least $5 \cdot 2^5 = 160$ lines. Although a fully parallel implementation may still be feasible on a VLSI chip, we follow a different approach. The vectors of first, second, etc. bits of the $n$ elements in the permutation are sent through the crossbar serially, in $\log_2 n$ cycles. This principle dramatically reduces the total number of lines needed, the complexity of the cells, and hence the overall chip area. Due to shorter lines, propagation delays shorten considerably, too.
We estimate the performance of a serialized multiplier to be about 50% that of a fully parallel one. This seems to be a good trade-off between price and speed.
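The serial transfer scheme can be illustrated by splitting a permutation into bit vectors ("bit planes"), least significant plane first, and reassembling it. The sketch below is illustrative only; the function names are hypothetical, and the line counts assume that a fully parallel operand needs $n \log_2 n$ lines while a serialized one needs $n$ lines used over $\log_2 n$ cycles.

```python
def bit_planes(perm, width):
    """Split a permutation into `width` bit vectors, least significant plane first."""
    return [[(x >> k) & 1 for x in perm] for k in range(width)]


def from_bit_planes(planes):
    """Reassemble the elements from bit vectors given least significant plane first."""
    n = len(planes[0])
    return [sum(planes[k][i] << k for k in range(len(planes))) for i in range(n)]


def lines_per_operand(n):
    """Lines per operand for parallel and serial transfer (assumed model, n a power of 2)."""
    width = (n - 1).bit_length()  # equals log2(n) when n is a power of two
    return {"parallel": n * width, "serial": n, "cycles_serial": width}


pi = [3, 2, 1, 0]
planes = bit_planes(pi, 2)
print(planes)                   # [[1, 0, 1, 0], [1, 1, 0, 0]]
print(from_bit_planes(planes))  # [3, 2, 1, 0]
print(lines_per_operand(32))    # {'parallel': 160, 'serial': 32, 'cycles_serial': 5}
```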
Let us now take a closer look at the FPGA multiplier. The circuitry belonging to one cell is depicted in Figure 4. As a convention, the vector of least significant bits (LSBs) is processed first, followed by the other bit layers in order of significance. The cells function as follows:
- Essentially, the XOR gate compares the corresponding bits of the operands A and B, received from the neighboring horizontal and vertical lines, respectively.
- The result of a bit comparison is AND-ed with the accumulated result of previous comparisons, which is reflected by the state of ACCUFF. The output of the AND function becomes the new accumulated result, and is written into ACCUFF at the end of the cycle by a low-high transition on the global ACCU clock net.
- During the first cycle of the setup phase a global signal, called INIT, is activated. This makes the cells ignore the accumulated result and simply enter the output of the XOR gate into ACCUFF.
- At the end of the last cycle a transition occurs on FIRE, the second global clock net, which causes the final result to be entered into FIREFF. The output of this flip-flop controls GATE, a tri-state buffer, so a new setup also comes into effect at this moment.

The data-path of the permutation chip, reduced to 4 bit vector length, can be seen in Figure 5. Ports A and Q of the multiplier array are unified as port AQ on the left edge. Similarly, ports B and C are fused to form port BC on the top edge. The external pins of AQ and BC are also connected together on the embedding card to form one data bus for connecting to DSPLINK. The identity operand I is hard-wired on the chip in units ICODE. All signals controlling the ports and cells are generated in unit CTRLOGIC. The 4 bit address bus of DSPLINK presents the instruction code to the chip, so instruction and data are transferred at the same time. An instruction set of six instructions has been defined to control the assembly. Because of space limitations we cannot go into their semantics here.
Conclusions
A permutation multiplier chip has been developed, verified by simulation, attached to a DSP system and successfully tested by means of a simple DSP program. The processing speed is satisfactory, 100 ns for one cycle. In our implementation the degree of processed permutations was set to $n = 16$. This is of course too small for practical applications. Nevertheless, we consider this prototyping work an important step towards a full VLSI implementation of PGM. Our multiplier architecture can be easily extended to larger $n$, and be quickly transferred to larger scale technology.
For future work we plan to complete the DSP implementation so as to gain more insight into the actual processing and storage requirements of the PGM algorithm. Afterwards we intend to augment the permutation matrix with other hardware units to embrace the entire algorithm with fast, special-purpose hardware.
