Abstract. MMX is a new technology to accelerate multimedia applications on Pentium processors. We report an implementation of IDEA on a Pentium MMX that is 1.65 times faster than any previously known implementation on the Pentium. By parallelizing four IDEAs we reach an unprecedented 78 Mbits/s throughput per output block on a 166 MHz MMX. In the light of the rapidly increasing popularity of multimedia applications, which causes more dedicated hardware to be built, and observing that most current block ciphers do not benefit from MMX, we raise the problem of designing block ciphers and encryption modes that fully utilize the basic operations of multimedia.
Introduction
Besides security, the second main objective in designing cryptographic primitives is speed: even a 10% difference in speed at the same security level may bias industry to prefer one cipher to another. Still, it is not an easy task to compare ciphers by speed. The reasons are manifold: they depend on the human factor (the best known implementation may not be the best possible implementation) but also on the hardware available: ciphers optimized for 32-bit processors may not be optimal on 64-bit processors and vice versa. The application of new microprocessor techniques (DSP: Digital Signal Processing, VLIW: Very Long Instruction Word, SIMD: Single Instruction Multiple Data) in current general-purpose microprocessors will significantly sway our beliefs about the speed ratio of available ciphers [Cla97].
Because of the quickly increasing importance of multimedia, dedicated hardware will be commonplace tomorrow. Today's multimedia extensions (to name a few: Intel's MMX, Sun's VIS, HP's MAX-2, Cyrix's MMX, AMD's 3DNow!) are just the first flowers. New generations of multimedia-enhanced processors will change our judgment of what it means to be "software" optimized even more.
MMX, incorporated in every new Intel processor (e.g., in the Pentium with MMX and the Pentium II), is a relatively new extension made to accelerate multimedia applications. Considering the worldwide spread of MMX-capable computers, the design and implementation of cryptographic primitives utilizing the basic operations of multimedia applications should be considered very seriously. Some work in this area has already been done by designing new hash functions and stream ciphers [HK97, Cla97, DC98]. Biham viewed a 64-bit processor as a SIMD parallel computer that can compute 64 one-bit operations simultaneously, obtaining a significant acceleration of DES [Bih97]. Using the same method ("bit-slicing"), several papers [SAM97, Kwa98] have later improved Biham's results.
There is a wide variety of block ciphers in more or less general use. The popularity of some of those ciphers is based on trust in the design of the cipher; the popularity of others is based on high throughput in combination with reasonable security. In particular, the block cipher IDEA [LM90, LMM91] is believed to be very secure due to the proper interaction between three different group operations. Although, apart from DES, IDEA seems to be the most studied block cipher, no currently known attack (e.g., [BKR97], [DGV94] or [Haw98]) against the full IDEA performs better than exhaustive search. The interaction between three different group operations adds confidence in IDEA's security, but the frequent use of multiplication does not allow fast software implementations on common microprocessors.
We describe an implementation of IDEA on MMX that is significantly faster than the best possible implementation of IDEA on the standard Pentium. One attempt to optimize IDEA on MMX has already been made: Masayasu Kumagai's implementation of non-standard IDEA [Kum97] encrypts three IDEA blocks in parallel, achieving 45.6 Mbits/s per individual encryption on a 200 MHz Pentium MMX. Our implementation includes a fast version of standard IDEA and a parallel version that is about twice as fast as Kumagai's.
The MMX architecture was chosen because it is the de facto standard; IDEA was chosen because no other current "industry-standard" block cipher seems to benefit from the Pentium MMX, and because of its practical importance. Moreover, in the following we demonstrate that IDEA utilizes only about one third of the Pentium MMX and is, additionally, easily parallelized without a significant parallelization overhead. The resulting parallel "4-way IDEA" is faster than any of the 64-bit block ciphers in Table 1; by doing this we transform a relatively slow and, as generally believed, very secure cipher into a very fast and still very secure cipher. Observing that, we raise the question of designing new, multimedia-optimized block ciphers.
Section 2 gives a background to MMX and multimedia extensions. Section 3 outlines the basics of the IDEA algorithm. Section 4 describes our implementation of IDEA on MMX. Section 5 briefly describes the fast parallel implementation of IDEA. Section 6 takes a broader view of multimedia architectures, and Sect. 7 gives a short description of "why most of the block ciphers cannot be parallelized on the MMX" and raises the problem of designing new block ciphers constituted like multimedia applications. In Sect. 8 we outline the results and finally, Sect. 9 acknowledges the people who have to be acknowledged.
Introduction to MMX
At the time of writing this paper, Intel's Pentium was the most widely used general-purpose processor. We shall not present a detailed outline of the Intel Pentium's architecture; an interested reader may turn to [Int97b] or [BGV96].
MMX (MultiMedia eXtensions) is a relatively new technology to enhance the performance of advanced media and communication applications. The MMX technology introduces new general-purpose instructions that operate in parallel on multiple data elements packed into 64-bit quantities (the 'SWAR', SIMD Within A Register, architecture [Die97]). These instructions accelerate the performance of multimedia applications such as motion video, combined graphics with video, image processing, audio synthesis, speech synthesis and compression, telephony, video conferencing, 2D graphics, and 3D graphics. These applications were broken down to identify the most compute-intensive routines, which were then analyzed in detail using advanced computer-aided engineering tools. The results of this extensive analysis showed many common, fundamental characteristics across these diverse software categories. The key attributes of these applications were:
Small integer data types (for example: 8-bit graphics pixels, 16-bit audio samples).
Small, highly repetitive loops.
Frequent multiplies and accumulates.
Compute-intensive algorithms.
Highly parallel operations.
The new MMX instructions work on 8 new 64-bit registers called mm0, ..., mm7. Some of the instructions have an 8-way parallel 8-bit, a 4-way parallel 16-bit, a 2-way parallel 32-bit and a 64-bit version, but most of the operations (like multiplication and addition) have only versions corresponding to some subset of these possibilities. There are more operations for 8-bit and 16-bit data than for larger data types (the "small data types" paradigm).
All microprocessors in the Pentium family have another level of parallelism, called super-scalar parallelism. In particular, most of the MMX instructions can be executed in both the U and V pipelines in parallel with any other instruction, with the following exceptions:
Multiplication requires three cycles (has latency 3) but can be pipelined, resulting in one multiplication operation every clock cycle (has throughput 1). Multiplication instructions cannot pair with other multiplication instructions.
Shift, pack and unpack instructions cannot pair with each other.
MMX instructions that access memory or integer registers can only execute in the U-pipe and cannot be paired with any instructions that are not MMX instructions.
After updating an MMX register, one additional clock cycle must pass before that MMX register can be moved to either memory or an integer register.
Throughput is 1 for every operation; latency is 1 for every operation but multiplication.
It is important to understand the difference between the SIMD parallelism provided by the MMX technology and the super-scalar parallelism. The first makes it possible to execute the same operation on up to eight different data entities as one instruction; the second makes it possible to execute two possibly different instructions during the same machine cycle. Hence, the total level of parallelism inside a Pentium MMX can be up to 16.
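The SIMD half of this picture can be illustrated in a few lines. The following Python sketch models a 4-way parallel 16-bit wrapping addition inside one 64-bit value, in the spirit of the MMX instruction paddw. The helper names pack4, unpack4 and paddw_model are illustrative assumptions, and the masking trick only imitates in software what the hardware performs as a single instruction.

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8000800080008000  # top bit of each 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF   # remaining 15 bits of each lane


def pack4(words):
    """Pack four 16-bit words into one 64-bit integer (word 0 in the low bits)."""
    assert len(words) == 4
    value = 0
    for i, w in enumerate(words):
        value |= (w & 0xFFFF) << (16 * i)
    return value


def unpack4(value):
    """Split a 64-bit integer back into four 16-bit words."""
    return [(value >> (16 * i)) & 0xFFFF for i in range(4)]


def paddw_model(x, y):
    """Software model of a 4-way wrapping 16-bit addition.

    The low 15 bits of every lane are added normally; the top bit of each
    lane is then fixed up with XOR, so no carry crosses a lane boundary.
    """
    return (((x & LOW_BITS) + (y & LOW_BITS)) ^ ((x ^ y) & HIGH_BITS)) & MASK64
```

With 8-bit lanes the same idea yields eight operations per instruction; combined with two instructions per cycle from the U and V pipelines, this gives the factor 8 x 2 = 16 mentioned above.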
Still, most applications do not benefit from MMX. Some of the limitations of MMX and the Pentium family in general are outlined below (cf. [Int97b, Int97a] for more information):
Maximum two operands. Pentium MMX instructions have at most two operands, causing a high frequency of move (movq) instructions in Pentium MMX programs.
Lack of registers. There are only 8 MMX registers, which is rather insufficient for most compute-intensive applications.
Slow interaction with integer registers and memory. Data in memory has to be aligned to 64-bit boundaries (misalignment costs three cycles on the Pentium processor family) and arranged in a way that minimizes the number of cache misses. Correct data alignment may significantly expand the data structures; in the worst case, the expanded data will not fit into the cache. The delay for a cache miss is at least eight internal clock cycles. Pairing limitations were already mentioned.
Limited number of instructions. MMX has only a limited set of specific operations. Because of the slow interaction between the integer and MMX register sets, small programs that intensively use both integer and MMX instructions will generally not benefit from MMX.
No flags register. The MMX instruction set does not change the flags register, and therefore the wide variety of branch instructions available on the Pentium is not usable. The only two comparison operators on MMX (pcmpgt* and pcmpeq*; greater than, equal to) act on signed data and set the corresponding bits of the destination register to 1 (true) or 0 (false). Emulating different comparisons, especially unsigned ones, takes additional time.
No instructions with immediate operands. Immediate operands have to be loaded from memory or generated by other means (e.g., by xoring a register with itself or comparing it to itself).
Only 16-bit signed multiplication. Applications that intensively use unsigned multiplication may become significantly slower. IDEA multiplication (Section 3), which is expensive to emulate using unsigned multiplication, is even more expensive to emulate using only the signed multiplication (see Section 4). Emulating the unsigned multiplication with the available MMX instructions needs two multiplications: one to calculate the higher 16 bits of the result (pmulhw) and another to calculate the lower 16 bits (pmullw).
The standard reference for MMX optimization is [Int97a].
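The remark above about missing unsigned comparisons can be made concrete. The following Python sketch models one 16-bit lane of pcmpgtw, pcmpeqw and psubusw and shows two well-known ways of obtaining an unsigned "greater than" from them. These are generic techniques chosen for illustration, not necessarily the emulation used in our implementation, and all function names are assumptions.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed two's complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pcmpgtw(a, b):
    """Model of one lane of pcmpgtw: signed a > b gives all ones, else zero."""
    return 0xFFFF if to_signed16(a) > to_signed16(b) else 0


def pcmpeqw(a, b):
    """Model of one lane of pcmpeqw: equality gives all ones, else zero."""
    return 0xFFFF if (a & 0xFFFF) == (b & 0xFFFF) else 0


def psubusw(a, b):
    """Model of one lane of psubusw: unsigned subtraction saturating at 0."""
    return max((a & 0xFFFF) - (b & 0xFFFF), 0)


def cmpgt_unsigned_bias(a, b):
    """Unsigned a > b: flipping the top bit of both operands maps the
    unsigned order onto the signed order, so a signed compare suffices."""
    return pcmpgtw(a ^ 0x8000, b ^ 0x8000)


def cmpgt_unsigned_subus(a, b):
    """Unsigned a > b: the saturating difference a - b is nonzero
    exactly when a is strictly greater than b."""
    return pcmpeqw(psubusw(a, b), 0) ^ 0xFFFF
```

Either variant costs several instructions where a native unsigned comparison would cost one, which is the overhead referred to in the list above.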
Definition 1. Let the subscript $s$ (resp. $u$) under a binary operator denote signedness (resp. unsignedness) of the corresponding operation. Let $\cdot_s$ and $\cdot_u$ be respectively the signed and unsigned multiplication operations from $\mathbb{Z}_{2^{16}}^2$ to $\mathbb{Z}_{2^{32}}$ ($\cdot_u$ is the standard multiplication, expandable to $\mathbb{Z}_{2^{32}}^2$). Let $\mathrm{True}(\phi)$ be $2^{16}-1$ if $\phi$ is true and $0$ otherwise. Next we define several basic operators corresponding one-to-one to the instructions of MMX. Actually the correspondence is 4-way, i.e., the MMX instructions execute four such operations in parallel. Let ...
IDEA satisfies most of the key attributes of multimedia applications that were used in designing MMX, and is therefore an almost ideal candidate cipher to benefit from MMX:
IDEA has small integer data types (all the operations work on 16-bit data). Having only small data values makes it possible to pack several of them into one register and thereafter process multiple plaintext blocks in parallel (one of the main factors in effective parallelization).
IDEA processes the same data over and over without requiring random memory accesses, therefore needing less interaction with the slow memory. Additionally, IDEA lacks operations necessitating expensive, non-parallelizable table lookups (another main factor in effective parallelization).
IDEA is based on two 16-bit operations that are common in multimedia applications (16-bit multiplication and addition) and on exclusive or, which is a primitive instruction in almost every microprocessor. Although IDEA's multiplication is not trivial to implement on MMX, MMX still provides some speedup compared to the Pentium for every multiplication (an important factor in getting an overall speedup).
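To make the cost of IDEA's multiplication tangible, recall that it is multiplication modulo 2^16 + 1 in which the all-zero 16-bit word represents 2^16. The Python sketch below shows the usual "low minus high" reduction of the 32-bit unsigned product next to a direct reference computation. It is an illustration of the arithmetic only, not the MMX code of our implementation, and the function names are assumptions.

```python
def idea_mul(a, b):
    """IDEA multiplication of two 16-bit words modulo 2**16 + 1.

    The word 0 stands for 2**16, which is congruent to -1, so a zero
    operand turns the product into a negation: 2**16 + 1 - other.
    """
    if a == 0:
        return (1 - b) & 0xFFFF
    if b == 0:
        return (1 - a) & 0xFFFF
    p = a * b                      # 32-bit unsigned product
    lo, hi = p & 0xFFFF, p >> 16
    # 2**16 is congruent to -1, hence p is congruent to lo - hi;
    # add 1 (i.e. 2**16 + 1 modulo 2**16) when the difference is negative.
    return (lo - hi + (1 if lo < hi else 0)) & 0xFFFF


def idea_mul_reference(a, b):
    """Direct definition, used only to cross-check idea_mul."""
    x = a if a else 0x10000
    y = b if b else 0x10000
    return (x * y) % 0x10001 & 0xFFFF
```

The reduction needs the full unsigned product and an unsigned comparison of its two halves, which are exactly the two operations that MMX does not offer directly.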
Fast Implementation
We have addressed all problems mentioned above and completed a fast implementation of IDEA on a Pentium MMX. Some of the tasks we had to solve are outlined below. We assume the plaintext to be in an MMX register and the pointer to the key schedule in an integer register. The ciphertext can be read afterwards from the same MMX register.
General optimization. Optimal use of registers, with a minimized number of move instructions. Minimized use of memory: only constants and subkeys are read from memory. Subkeys and constants are correctly aligned to avoid time ...
Results obtained by analyzing the four cases can be generalized by simple means to complete the proof.
As already mentioned, MMX lacks unsigned comparison instructions. Our implementation needs one of them, which will be emulated using existing in- ...
These formulas give a direct way to break down the operation into basic operations, corresponding one-to-one to MMX instructions. For example, Cmpeq(h, l) corresponds to the instruction pcmpeqw, Cmpgt(h, l) to pcmpgtw, Subus(a, b) to psubusw, Mull(a, b) to pmullw, and Mulh(a, b) to pmulhw. The given formula for the emulation of 16-bit unsigned multiplication is, as far as we know, faster than any previously published algorithm for MMX and therefore interesting in itself.
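For orientation, the following Python sketch shows a textbook way to recover the unsigned 32-bit product of two 16-bit words from the signed operations Mulh and Mull. It is a generic identity given for illustration under stated assumptions, not the faster formula referred to above, and the function names are hypothetical.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a signed two's complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mulh(a, b):
    """Model of one lane of pmulhw: high 16 bits of the signed product."""
    return ((to_signed16(a) * to_signed16(b)) >> 16) & 0xFFFF


def mull(a, b):
    """Model of one lane of pmullw: low 16 bits of the product."""
    return (a * b) & 0xFFFF


def mul_unsigned(a, b):
    """Unsigned 16 x 16 -> 32 bit product built from mulh and mull.

    Writing a = a_s + 2**16 * [a >= 2**15] (and likewise for b) shows that
    the unsigned high word equals the signed high word plus b if the top
    bit of a is set, plus a if the top bit of b is set, modulo 2**16.
    The low word is the same for signed and unsigned multiplication.
    """
    mask_a = 0xFFFF if a & 0x8000 else 0   # all-ones mask when a is "negative"
    mask_b = 0xFFFF if b & 0x8000 else 0
    hi = (mulh(a, b) + (b & mask_a) + (a & mask_b)) & 0xFFFF
    return (hi << 16) | mull(a, b)
```

On MMX the two masks would themselves cost extra instructions (for example an arithmetic shift and a logical and), which is why the instruction count of the emulation matters so much.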
Including also the necessary move instructions, the minimal number of MMX instructions needed to emulate the operation by the procedure given above is 26. Additional highly processor- and algorithm-dependent mechanisms make it possible to get rid of three more instructions per IDEA multiplication, therefore resulting in 69 in- ...
Table 2. Test data. The "real life" throughput of IDEA-ECB on the Pentium MMX and on the Pentium II. Seconds: the time to encrypt four million 64-bit blocks.
Different Multimedia Extensions
If MMX had an unsigned multiplication instruction, the number of instructions per IDEA multiplication would decrease by 6. If MMX had the unsigned comparison instruction pcmpgtuw, the number of instructions per IDEA multiplication would decrease by 2. In the presence of both of these instructions, IDEA encryption on MMX machines could be done much faster than DES (we estimate 250 to 255 cycles); 4-way IDEA would be faster than Square [DKR97] or any of the recently proposed AES candidate ciphers (we estimate 95 to 100 cycles).
Conditional move instructions, present in Cyrix's but not in Intel's version of MMX, would further speed up IDEA. If even such minor changes speed up a cipher significantly, what about multimedia extensions that differ from MMX in major aspects?
Recently, in May 1998, Motorola unveiled their new multimedia architecture called AltiVec [Mot98], claimed to be much more powerful than any of the previously mentioned architectures. In particular, AltiVec has increased parallelism (128-bit vector registers) and a family of instructions to perform up to eight 16-bit unsigned multiplications with accumulate in parallel. Additionally, AltiVec has a special inter-element byte permutation instruction and several vector rotation instructions, and therefore makes it possible to implement new fast ciphers using data-dependent rotations and byte permutations. One of the goals of AltiVec (unlike MMX) was to accelerate data encryption algorithms [Mot98, page 1-4]. A short comparison between MMX and AltiVec is given in ...
Block Cipher Parallelization
Ciphers using S-boxes and/or lookup tables (e.g., DES, alleged RC4, SEAL, Blowfish, Khufu) do not take major advantage of the multimedia extensions of MMX (though they could benefit from the larger cache or word size), as the MMX registers cannot be used as memory pointers. Parallelization of these ciphers would require accessing several "randomly" chosen memory cells simultaneously. RC5 [Riv95], which does not use S-boxes, does not benefit from MMX either, because of the expensive non-parallelizable variable rotation involved.
It is interesting to note that some of the newest block ciphers, including the AES candidates MARS [BCD+98] and RC6 [RRSY98], rely on 32-bit unsigned multiplication. The reasoning of the authors is that such multiplication is very cheap on today's common microprocessors. This claim is indeed true, but MMX technology cannot be used to accelerate these ciphers, and neither can AltiVec, because of the lack of a 32-bit parallel multiplication. There is a certain tradeoff and even a contradiction here. MARS and RC6 are optimized for the new 32-bit processors (mainly for the Pentium II), fully utilizing the 32-bit operations provided by such processors. At the same time, these ciphers ignore the multimedia extensions existing in the very same processors.
Further work can be done in trying to optimize different conventional ciphers for the Pentium MMX, but as was pointed out, most of the commonly known block ciphers do not benefit from MMX. Still, in some cases interleaving Pentium integer and MMX instructions may result in some speedup. In particular, bitslice MMX implementations of different block ciphers should be more than twice as fast because of the longer word size and additional logical operations.
One could think that MMX was designed "especially" to accelerate IDEA, but it would be more correct to say that IDEA is a cipher with key attributes very similar to those of multimedia applications (cf. Sect. 3), by a loose definition of multimedia applications as applications benefiting from the Pentium MMX (different vendors have optimized their processors for different subsets of multimedia applications).
A family of new block ciphers can be designed to take full advantage of MMX. A straightforward way would be to iteratively execute four copies of the IDEA round function in parallel and then mix their outputs in a suitable way. Would it be sufficient to apply a well-chosen 8 16-bit word permutation to the 256-bit output of every round of this 4-way IDEA to get a secure cipher? A way providing more efficient diffusion would be to use Pseudo-Hadamard Transforms [Mas94, SKW+98]. Further research in this area is deferred to future work. An interested reader may turn to [Cla97], where parallelized versions of the stream cipher Wake were proposed.
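As a reminder of what a Pseudo-Hadamard Transform does, the Python sketch below applies the transform (a, b) -> (2a + b, a + b) to a pair of words. The choice of 16-bit words and of this particular variant is an assumption made here to match IDEA's data size; it is a sketch of the primitive only, not a proposed cipher component.

```python
WORD_MASK = 0xFFFF  # assumption: 16-bit words, as in IDEA


def pht16(a, b):
    """Pseudo-Hadamard Transform on two 16-bit words: (2a + b, a + b)."""
    return (2 * a + b) & WORD_MASK, (a + b) & WORD_MASK


def inverse_pht16(x, y):
    """Inverse transform: a = x - y and b = 2y - x, both modulo 2**16."""
    return (x - y) & WORD_MASK, (2 * y - x) & WORD_MASK
```

The transform uses only additions, so it maps directly onto packed 16-bit additions, and every output word depends on both input words.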
A more general task is to study the design principles of secure ciphers based on the same basic operations (e.g., massively parallel 16-bit multiplication and addition of sequential data) as the existing multimedia applications. Such ciphers would perform well on today's microprocessors, therefore reducing the need for separate encryption and multimedia hardware (this can be compared to the approach of [BP97], which uses the same hardware for RSA and IDEA). Efficient confusion in such ciphers may be achieved by using 16-bit multiplication mixed with other 8-bit and 16-bit operations; diffusion may be achieved by additionally using 32-bit and 64-bit operations (e.g., shifts), but remember the "small data type" paradigm.
Yet another task is to study encryption modes allowing fast parallel encryption and decryption. The ECB mode can be used for both parallel encryption and decryption, but it has limited security in real-life situations. The CBC mode can be used for parallel decryption but not for parallel encryption. The resulting throughput of IDEA on a 233 MHz Pentium II would be 32 to 33 Mbits/s for encryption and 105 to 107 Mbits/s for decryption in standard CBC mode (Table 2). Encryption modes allowing both fast parallel encryption and decryption are needed. Note that such encryption modes are important not only for software but also for hardware architectures. The hardware solution mentioned before provides a throughput of 300 Mbits/sec in ECB mode and a throughput of 100 Mbits/sec in the other modes. An example candidate is the counter mode [MOV96, Sect. 7.2.2], which allows parallel encryption and decryption while providing almost ideal security in the random oracle model [BDJR97], but which is not suited for use with differentially weak ciphers [BK98].
One could also see the problem from the viewpoint of a processor designer and ask what minimal extensions should be added to an existing general-purpose processor to achieve a significant speedup of industry-standard cryptographic primitives. While the general answer seems to be out of our reach due to the diversity of cryptographic primitives, suggestions can be given to accelerate any fixed primitive (see the discussion in the beginning of Sect. 6).
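The difference between the modes can be seen in their data dependencies. In the Python sketch below, toy_encrypt_block is a hypothetical keyed 64-bit permutation that merely stands in for IDEA and offers no security; the nonce handling is likewise simplified. The point is only that every CBC ciphertext block needs the previous one, whereas every counter-mode block can be computed independently.

```python
MASK64 = (1 << 64) - 1


def toy_encrypt_block(key, block):
    """Hypothetical stand-in for a 64-bit block cipher. NOT secure."""
    return (((block ^ key) * 0x9E3779B97F4A7C15) + key) & MASK64


def cbc_encrypt(key, iv, blocks):
    """CBC encryption: block i cannot start before block i - 1 is finished."""
    out, prev = [], iv
    for p in blocks:
        prev = toy_encrypt_block(key, p ^ prev)
        out.append(prev)
    return out


def ctr_keystream_block(key, nonce, i):
    """Counter mode keystream block i depends only on the key and nonce + i."""
    return toy_encrypt_block(key, (nonce + i) & MASK64)


def ctr_crypt(key, nonce, blocks):
    """Counter mode: encryption and decryption are the same operation, and
    the blocks may be processed in any order, e.g. four at a time."""
    return [b ^ ctr_keystream_block(key, nonce, i) for i, b in enumerate(blocks)]
```

A 4-way parallel cipher could therefore fill all four lanes with consecutive counter values, while CBC encryption leaves three of the four lanes idle unless several independent streams are interleaved.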
Conclusion
We have shown that it is possible to speed up the IDEA block cipher significantly by using the MMX extensions of Intel's Pentium processor. This is remarkable when taking into account the unfriendliness of the MMX instruction set. Our fast implementation is 1.65 times faster than the best known assembler implementation on the Pentium (by Antoon Bosselaers) and 2.55 times faster than the C version on the Pentium in the popular library SSLeay v0.90b, when compiled with egcs 1.0.2 and full optimization. By parallelizing four IDEAs, the encryption speed is increased by a factor of about 2.64, giving a total acceleration of 4.35 times compared to the implementation of Bosselaers. The implications (including the massively parallel key search) of using such parallel versions of conventional ciphers were already described in [Bih97] and are not repeated in this paper.
By noting that most of today's "industry-standard" block ciphers do not benefit from MMX, we raise the problem of designing block ciphers and encryption modes that fully utilize the basic operations of multimedia.
