Encryption performance, in terms of bits per second encrypted, has not scaled well as network performance has increased. The authors felt that multiple encryption modules operating in parallel would be the cornerstone of scalable encryption. One of the major problems with parallelizing encryption is ensuring that each encryption module receives the proper portion of the key sequence at the correct point in the encryption or decryption of the message. Many encryption schemes use linear recurring sequences, which may be generated by a linear feedback shift register (LFSR). Instead of using a linear feedback shift register, the authors describe a method to generate the linear recurring sequence by using parallel decimated sequences, one per encryption module. Computing decimated sequences can be time-consuming, so the authors also describe a way to compute these sequences with logic gates rather than arithmetic operations.
Introduction
End-to-end encryption can protect proprietary information as it passes from one end of a complex computer network to another through untrusted intermediate systems. Encryption performance, in terms of bits per second encrypted, has not scaled well as network performance has increased. Encryption performance in terms of long-term secrecy also suffers as computer performance scales, because faster computers make it feasible to crack previously secure algorithms.
The overall problem addressed in the authors' research is:
How can end-to-end encryption technology be scaled for high performance in the gigabit-per-second networking arena? The authors, along with other members of the project team, identified and analyzed current research efforts in the scalability of encryption and the interoperability of scaled and unscaled encryptors.
Approach
The authors found that multiple encryption modules operating in parallel would provide an alternative for creating a scalable encryption architecture able to keep pace with the needs of modern, high performance communication networks.
Varying numbers of encryption modules could be combined with control circuitry to produce an encryption unit for a specific computer system or class of systems with similar capabilities. The number of encryption modules making up a parallel encryption unit would be determined by the speed at which the computer system is expected to communicate with networks. A supercomputer may require 8 or 16 encryption modules operating in parallel to keep up with its network communications demand, while a workstation may need only 2 or 4 encryption modules in parallel to accommodate its workload. A PC may require only 1 encryption module.

To decimate a sequence is to replace an integer sequence $\bar{a} = \{a_1, a_2, a_3, a_4, \ldots\}$ with the sequence $\bar{a}_k = \{a_k, a_{2k \pmod{2^n}}, a_{3k \pmod{2^n}}, a_{4k \pmod{2^n}}, \ldots\}$, thereby producing a sequence of every kth element of the main sequence. Using decimated sequences in parallel, one can reduce the amount of time it takes to generate the main sequence, $\bar{a}$, by a factor of k. For k parallel encryptors fed by k decimated sequences, it is now possible to send the first element of the sequence $\bar{a}$ to the first encryptor, the second sequence element to the second encryptor, and so on, wrapping around and feeding elements k+1, 2k+1, 3k+1, ... to the first encryptor, k+2, 2k+2, 3k+2, ... to the second encryptor, and so on. By doing this, the same portion of the linear recurring sequence is matched with the same portion of plaintext/ciphertext in the encryption/decryption units without regard to the degree of parallelism within communicating units.

In general, a companion matrix for an n-cell, left-shift register would be a zero-filled $n \times n$ matrix, with 1's along the diagonal just below the main diagonal, and binary coefficients (representing shift register tap locations) down the last column.
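The companion matrix construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the 4-cell register with tap coefficients [1, 1, 0, 0] are assumptions chosen for the example. The state is treated as a 1 by n row vector, so one shift of the register corresponds to multiplying the state by the companion matrix (mod 2).

```python
def companion_matrix(taps):
    """Build the n x n companion matrix (over GF(2)) for a left-shift register.

    taps[i] is the binary coefficient placed in row i of the last column.
    The tap values used below are illustrative assumptions.
    """
    n = len(taps)
    m = [[0] * n for _ in range(n)]
    for i in range(1, n):
        m[i][i - 1] = 1        # 1's on the diagonal just below the main diagonal
    for i in range(n):
        m[i][n - 1] = taps[i]  # tap coefficients down the last column
    return m


def vec_mat_mod2(v, m):
    """Multiply a 1 x n row vector by an n x n matrix, reducing mod 2."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) % 2 for j in range(n)]


def lfsr_shift(state, taps):
    """One direct left shift: drop the first cell, append the feedback bit."""
    feedback = 0
    for s, c in zip(state, taps):
        feedback ^= s & c
    return state[1:] + [feedback]


TAPS = [1, 1, 0, 0]            # hypothetical 4-cell register
M = companion_matrix(TAPS)
state = [1, 0, 0, 0]           # arbitrary nonzero initial state
for _ in range(5):
    # The matrix product and the direct shift give the same next state.
    assert vec_mat_mod2(state, M) == lfsr_shift(state, TAPS)
    state = lfsr_shift(state, TAPS)
```

Under these assumptions, the matrix form and the shift register are interchangeable, which is what allows the register to be manipulated with matrix arithmetic.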
Companion matrices allow manipulation of linear feedback shift registers using matrix arithmetic. To get the second decimation, or a matrix that would produce every other element in the sequence (2 shifts of the LFSR), square the companion matrix, $M$ (mod 2). To get the fourth decimation, a matrix that would produce every fourth element in the sequence (shifting the LFSR 4 times), raise the companion matrix to the 4th power (mod 2).
Multiplying the resulting decimation matrix (as raised to the 4th power) by the initial state vector ($I_0$), mod 2, gives the vector that would be obtained after shifting the LFSR 4 times ($I_4$). Now every 4th vector can be produced by multiplying the current vector ($I$) by $M^4$ mod 2.
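As a hedged illustration of the decimation step, the sketch below raises a companion matrix to the 4th power (mod 2) and checks that one multiplication by the result matches four single shifts. The matrix literal corresponds to a hypothetical 4-cell register (tap coefficients [1, 1, 0, 0] in the last column), and the function names are assumptions introduced only for this example.

```python
def mat_mul_mod2(a, b):
    """Multiply two n x n binary matrices, reducing every entry mod 2."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]


def mat_pow_mod2(m, exponent):
    """Raise a binary matrix to a non-negative integer power (mod 2)."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(exponent):
        result = mat_mul_mod2(result, m)
    return result


def vec_mat_mod2(v, m):
    """Multiply a 1 x n row vector by an n x n matrix, reducing mod 2."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) % 2 for j in range(n)]


# Companion matrix of a hypothetical 4-cell register (illustrative taps).
M = [[0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

M2 = mat_pow_mod2(M, 2)   # second decimation: 2 shifts per multiplication
M4 = mat_pow_mod2(M, 4)   # fourth decimation: 4 shifts per multiplication

I0 = [1, 0, 0, 0]         # arbitrary nonzero initial state vector
I4 = vec_mat_mod2(I0, M4)

# Reference: shift one step at a time, four times.
stepped = I0
for _ in range(4):
    stepped = vec_mat_mod2(stepped, M)
assert I4 == stepped
```

The same `mat_pow_mod2` call with a different exponent would give the decimation matrix for any other degree of parallelism.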
The application of this is that we can now seed 4 of these matrix multiplies, all operating in parallel, with 4 different initial state vectors. As shown in Table 2, the first sequence generator is seeded with vector 0 and produces vectors 2, 4, 6, 8, 10, 12, ... The second generator, seeded with vector 1, produces vectors 3, 5, 7, 9, 11, 13, ... By using the first 4 states of Table 1 as our initial states, $I$, we can generate the next states by multiplying $I M^4$ mod 2, as shown in Table 3.
As illustrated in Table 3, the first sequence generator is seeded with vector 0 and produces vectors 4, 8, 12, ... The second generator, seeded with vector 1, produces vectors 5, 9, 13, ..., and so on. This can be repeated k times (all operating in parallel) to attain the desired encryption rate.
At this point we should note that the sequences in Tables 1, 2, and 3 are identical. This shows that as long as the blocks are sent through the network in the proper order, units having differing degrees of parallelism will be able to interoperate.
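The interoperability claim can be checked with a small simulation. The sketch below is illustrative only: it assumes a hypothetical 4-cell register, runs the k generators one after another in software rather than truly in parallel, and uses function names invented for the example. Each generator is seeded with one of the first k state vectors and advanced with the k-th power of the companion matrix; the outputs are then interleaved in round-robin order.

```python
def mat_mul_mod2(a, b):
    """Multiply two n x n binary matrices, reducing every entry mod 2."""
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) % 2 for j in range(n)]
            for i in range(n)]


def vec_mat_mod2(v, m):
    """Multiply a 1 x n row vector by an n x n matrix, reducing mod 2."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) % 2 for j in range(n)]


def serial_states(i0, m, count):
    """State vectors 0, 1, 2, ... from a single generator (one shift each)."""
    states, s = [], list(i0)
    for _ in range(count):
        states.append(s)
        s = vec_mat_mod2(s, m)
    return states


def parallel_states(i0, m, k, count):
    """State vectors from k generators, each advancing k shifts at a time."""
    mk = m
    for _ in range(k - 1):
        mk = mat_mul_mod2(mk, m)          # decimation matrix m^k (mod 2)
    current = serial_states(i0, m, k)     # seeds: vectors 0 .. k-1
    states = []
    while len(states) < count:
        states.extend(current)            # generator g supplies g, g+k, g+2k, ...
        # Each update depends only on that generator's own state, so in
        # hardware these k multiplications could proceed at the same time.
        current = [vec_mat_mod2(s, mk) for s in current]
    return states[:count]


# Companion matrix of a hypothetical 4-cell register (illustrative taps).
M = [[0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]
I0 = [1, 0, 0, 0]

reference = serial_states(I0, M, 30)
for k in (1, 2, 4):
    # Units with different degrees of parallelism produce the same sequence.
    assert parallel_states(I0, M, k, 30) == reference
```

Because every degree of parallelism reproduces the same ordered sequence of vectors, a unit with 4 generators and a unit with 1 generator would stay in step, provided blocks arrive in order.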
Whereas this technique looks useful in concept, in reality it is not practical on most general-purpose computers because of the time necessary to carry out the multiplication (mod 2) of a 1 by n vector with an n by n matrix. To make this technique practical, logic gates rather than arithmetic operations can be used to implement the matrix multiplication. Since addition modulo 2 can be represented by a logical "Exclusive-or" and multiplication modulo 2 can be represented by a logical "And", the matrix multiplication reduces to a set of logic gates, many of which can be wired in parallel. The general case would have each element in a column of the decimation matrix (say $M^4$) wired to an "And" gate along with the corresponding element of the current state vector. The outputs of these "And" gates can be fed into a cascade of "Exclusive-or" gates. This effectively performs the multiplications (mod 2) concurrently and then sums the results (mod 2) to produce the corresponding element of the new state vector. This is repeated (in parallel) for each column of the decimation matrix. This entire group of logic gates will have to be replicated k times, once for each encryption module in the parallel encryption unit.
In each encryption module (of a unit containing k encryption modules operating in parallel), for any characteristic polynomial of order n, $n^2$ "And" gates could operate concurrently. There could also be n cascades of "Exclusive-or" gates operating in parallel. These logic gates will also be operating k times in parallel with the other encryption modules in the parallel encryption unit.
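A software analogue of the gate arrangement is sketched below, under stated assumptions: the state vector and each column of the decimation matrix are packed into machine words, a bitwise AND stands in for the row of "And" gates, and a parity loop stands in for the "Exclusive-or" cascade. The 4-cell register, the matrix literal, and all function names are hypothetical and serve only to illustrate the idea.

```python
def mat_mul_mod2(a, b):
    """Multiply two n x n binary matrices, reducing every entry mod 2."""
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) % 2 for j in range(n)]
            for i in range(n)]


def vec_mat_mod2(v, m):
    """Arithmetic reference: row vector times matrix, reduced mod 2."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) % 2 for j in range(n)]


def column_masks(matrix):
    """Pack each matrix column into an integer; bit i holds element (i, j)."""
    n = len(matrix)
    masks = []
    for j in range(n):
        mask = 0
        for i in range(n):
            mask |= matrix[i][j] << i
        masks.append(mask)
    return masks


def xor_cascade(word):
    """Exclusive-or all bits of a word together (sum mod 2)."""
    parity = 0
    while word:
        parity ^= word & 1
        word >>= 1
    return parity


def gate_step(state_word, masks):
    """Compute the next state word using only AND and XOR operations."""
    new_word = 0
    for j, mask in enumerate(masks):
        anded = state_word & mask             # the n "And" gates for column j
        new_word |= xor_cascade(anded) << j   # the "Exclusive-or" cascade
    return new_word


# Companion matrix of a hypothetical 4-cell register (illustrative taps).
M = [[0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]
M2 = mat_mul_mod2(M, M)
M4 = mat_mul_mod2(M2, M2)       # fourth-decimation matrix (mod 2)
MASKS = column_masks(M4)

state_word = 0b0001             # bit i of the word is cell i of the state
next_word = gate_step(state_word, MASKS)
```

In hardware the n columns would be evaluated simultaneously rather than in a loop; the loop here only mirrors the wiring one column at a time.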
Conclusion
This method of generating linear recurring sequences in parallel could be applied anywhere LFSRs are currently used, in order to achieve increased performance. As applied to increasing encryption rates, one could now design an encryption unit using k decimated linear feedback shift register sequences feeding k nonlinear encryption functions operating in parallel. Since these units will interoperate with each other, as higher encryption rates are needed, the number of encryption modules operating in parallel (k) within a unit can be scaled without rendering previous versions obsolete. These slower units can still be used in portions of the network where top encryption speeds are not necessary.
