Abstract. To enhance system performance, computer architectures tend to incorporate an increasing number of parallel execution units. This paper shows that the new generation of MD4-based customized hash functions (RIPEMD-128, RIPEMD-160, SHA-1) contains much more software parallelism than any of these computer architectures is currently able to provide. It is conjectured that the parallelism found in SHA-1 is a design principle. The critical path of SHA-1 is half as long as that of its closest contender RIPEMD-160, but realizing it would require a 7-way multiple-issue architecture. It will also be shown that, due to the organization of RIPEMD-160 in two independent lines, it will probably be easier for future architectures to exploit its software parallelism.
Introduction
The current trend in computer design is to incorporate more and more parallel execution units, with the aim of increasing system performance. However, available hardware parallelism only leads to increased software performance if the executed code contains enough software parallelism to exploit the potential benefits of the multiple-issue architecture.
Cryptographic algorithms are often organized as an iteration of a common sequence of operations, called a round. Typical examples of this technique are iterated block ciphers and customized hash functions based on MD4. In many applications, encryption and/or hashing forms a computational bottleneck, and an increased performance of these basic cryptographic primitives is often directly reflected in an overall improvement of the system performance.
To increase the performance of round-organized cryptographic primitives it suffices to concentrate the optimization effort on the round function, knowing that each gain in the round function is reflected in the overall performance of the primitive multiplied by the number of rounds. Typical values for the number of rounds are between 8 and 32.
This paper confronts one class of cryptographic primitives, namely the customized hash functions based on MD4, with the most popular computer architectures in use today or in the near future. Although only the MD4-like hash functions are considered in the sequel, much of it also applies to other classes of iterated cryptographic primitives. Our main aim is to investigate the amount of software parallelism in the different members of the MD4 hash family, and the extent to which today's RISC and CISC processors are able to exploit this parallelism. This approach differs from the one in [BGV96] in that we now take the hashing algorithms as a starting point and investigate the amount of inherently available parallelism, while previously we took a particular superscalar processor as a starting point and investigated to what extent an implementation of the hashing algorithms could take advantage of that architecture.
The next section considers the basic requirements a processor has to meet to enable efficient implementations of MD4-like hash functions. Section 3 gives an overview of currently available processor architectures, and lists the characteristics that are of interest for our purposes. Section 4 introduces the notion of a critical path. The available amount of instruction-level parallelism in the MD4-like algorithms is determined in Section 5, and confronted with the available hardware of Section 3. Finally, Section 6 formulates the conclusions.
Basic hardware requirements
The customized hash functions based on MD4 include MD4 [Riv92a], MD5 [Riv92b], SHA-1 [FIPS180-1], RIPEMD [RIPE95], RIPEMD-128 and RIPEMD-160 [DBP96]. These are all iterative hash functions using a compression function as their basic building block, the input to which consists of a 128-bit or 160-bit chaining variable and a 512-bit message block. The output is an update of the chaining variable. Internally, the compression function operates on 32-bit words. The conversion from external bit strings to internal word arrays uses a big-endian convention for SHA-1 and a little-endian convention for all the other hash functions. Depending on the algorithm, the compression function consists of 3 to 5, possibly parallel, rounds, each made up of 20 (SHA-1) or 16 (all others) steps. Finally, a feedforward adds the initial value of the chaining variable to the updated value. Every round uses a particular non-linear function, and every step modifies one word of the chaining variable and possibly rotates another. Definitions of the round and step functions can be found in Tables 1 and 2.

This short overview allows us to conclude that an implementation of MD4-like hash functions will benefit from a processor that
1. supports 32-bit operations.
2. can handle both little-endian and big-endian memory addressing.
3. has a rotate instruction, and, in addition to the standard logical instructions and, or, and xor, instructions like nand, nor, nxor, and-not, and or-not, where the latter two are defined as, respectively, the and and the or of the first operand and the complement of the second. Note that xor-not would be the same as nxor.
4. is able to keep all local variables in registers: 16 message words, 5 chaining words, and 2 auxiliary words. The RIPEMD family, having two parallel lines, requires two copies of the last two items. So in total up to 30 registers are required.
5. supports parallel execution of arithmetic or logical (ALU) operations. This item will be further investigated in the next section.
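Requirement 3 can be made concrete with a minimal Python sketch of the 32-bit rotate and of the and-not and or-not operations. The names `rotl32`, `and_not`, `or_not` and `MASK32` are chosen here for illustration only and do not come from any of the cited specifications. Python integers are unbounded, so every result is masked to 32 bits explicitly, something a processor with native 32-bit registers does implicitly.

```python
MASK32 = 0xFFFFFFFF  # keeps every result within one 32-bit word


def rotl32(x, s):
    """Rotate the 32-bit word x to the left over s bit positions (0 <= s < 32)."""
    x &= MASK32
    return ((x << s) | (x >> (32 - s))) & MASK32


def and_not(x, y):
    """and of the first operand and the complement of the second."""
    return x & ~y & MASK32


def or_not(x, y):
    """or of the first operand and the complement of the second."""
    return (x | ~y) & MASK32


print(hex(rotl32(0x80000001, 1)))   # 0x3
print(hex(and_not(0xF0F0, 0xFF00))) # 0xf0
```

On a processor lacking these instructions, each call above costs an extra shift or complement, which is why their presence shortens the instruction sequence of a step.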
Hardware parallelism
The basic implementation technique, applied by all current processors, to improve CPU performance is pipelining. A pipeline is organized in a number of stages, each of which executes part of a CPU instruction. Multiple instructions can overlap in execution by letting each stage in the pipeline complete a part of a different instruction. Hence, this technique allows different parts of consecutive instructions to be executed in parallel. As a consequence, pipelining increases the CPU instruction throughput. The execution time of each instruction usually increases slightly due to pipeline control overhead, but this is more than compensated for by the increase in instruction throughput. The net effect is a substantial decrease in the number of clock cycles per instruction, ideally resulting in a speedup equal to the number of pipeline stages.

To enhance performance even further two approaches are available: increase the number of pipeline stages, or use a number of parallel pipelines. The former architecture is called superpipelined and emphasizes temporal parallelism, while the latter relies on spatial parallelism and comes in two flavors: superscalar or very long instruction word (VLIW). The aim of these techniques is to further increase the throughput. A superpipelined architecture achieves this by reducing the clock cycle time, while a superscalar/VLIW architecture tries to issue more than 1 instruction per clock cycle. However, there is a limit to what can be gained in terms of performance. This limit is determined by two factors: a software one and a hardware one. The software factor is the amount of parallelism in the instruction stream, i.e., the amount of data dependencies between the instructions. In the next section the available instruction-level parallelism in an instruction stream will be characterized by its critical path. The hardware factor is the impact of the increase in the number of pipeline stages or pipelines on the clock cycle time.
In the case of a superpipelined architecture, limited parallelism in the instruction stream will eventually lead to so-called pipeline stalls due to data dependencies: the execution of an instruction has to be stalled until the data needed to complete it become available. But even in the absence of dependencies superpipelining will eventually run out of steam. The clock cycle time can never be lower than the overhead pipelining incurs on each stage: clock skew and pipeline register overhead [HePa96]. Therefore, increasing the number of pipeline stages beyond a critical point will result in performance degradation rather than performance gain.
Further increase in performance can then only be obtained by either going superscalar or using VLIWs.
- A superscalar processor has dynamic issue capability: a varying number of instructions is issued every clock cycle. The hardware dynamically decides which instructions are simultaneously issued and to which pipelines, based on issue criteria and possible data dependencies.
- A VLIW processor has fixed issue capability: every clock cycle a fixed number of instructions is issued, formatted as one large instruction (hence the name). The software (i.e., the compiler) is completely responsible for creating a package of instructions that can be simultaneously issued. No decisions about multiple issue are dynamically taken by the hardware.

An advantage of a VLIW over a superscalar is that the amount of required hardware can be reduced: choosing the instructions to be issued simultaneously is done at compile-time, and not at run-time. However, the superscalar has two major advantages: its code density is little affected by the available parallelism in the instruction stream, and it can be object-code compatible with a large family of non-parallel processors. The major challenge in the design of a superscalar processor will be to limit the impact on the clock cycle time of issuing and executing multiple instructions per cycle. This is illustrated by the fact that to date a factor of 1.5 to 2 in clock rate has consistently separated the highest clock rate processors and the most sophisticated multiple-issue processors [HePa96].
A final uniprocessor technique to exploit the parallelism inherent in many algorithms is single-instruction, multiple-data (SIMD) processing, a term originally only used in the context of multiprocessor environments [Fly66]. A SIMD instruction performs the same operation in parallel on multiple data elements, packed into a single processor word. Tuned to accelerate multimedia and communications software, these instructions can be found in an increasing number of general-purpose processor architectures.

CPUs can be differentiated based on the type of their internal storage: a stack, an accumulator, or a set of registers. Only the latter class of CPUs will be considered in the sequel, since virtually every processor designed after 1980 uses that architecture, called a (general-purpose) register architecture. A further division of this class can be made based on the way instructions can access memory and on the operands for a typical ALU instruction.
- In a register-memory architecture memory can be accessed as part of any instruction, while in a register-register architecture memory can only be accessed with load and store instructions, for which reason the latter is also called a load-store architecture.
- The maximum number of operands of an ALU instruction is either two or three. A three-operand instruction contains a destination and two source operands, while in a two-operand instruction one of the operands is both a source and a destination for the operation.
- The number of memory operands of an ALU instruction can vary from none to the maximum number of operands (2 or 3).

It turns out that two combinations suffice to classify all the CPUs that will be considered (see footnote 1):
class 1 - a three-operand load-store architecture (no memory operands in ALU instruction): MIPS, Precision Architecture (PA-RISC), PowerPC, SPARC, Alpha.
class 2 - a two-operand register-memory architecture (at most one memory operand in ALU instruction): 80x86 (including Pentium and PentiumPro), 680x0.

Note that the same division also distinguishes between RISC processors (class 1) and CISC processors (class 2).
Footnote 1: Three suffice to classify nearly all existing machines, see [HePa96, Section 2.2].

Table 3 summarizes the characteristics of these architectures with respect to the requirements formulated at the end of the previous section, including the available hardware parallelism for ALU instructions [Sta96, HePa96, Bha96]. The figures are for the most recent processors of each architecture. As far as RISC processors are concerned, these are all 64-bit, although compatibility with their 32-bit predecessors is retained. Since Alpha was designed as a 64-bit device, its support for 32-bit operations is limited. All RISC architectures include support for both little-endian and big-endian addressing, but, especially for the PA-RISC and Alpha architectures, an implementation is not required to implement both addressing modes. An Alpha implementation is not even required to support changing the convention during program execution, but only at boot time [Dig96]. The other RISCs can use either format, selectable in either software or hardware. Some architectures are more than 2-way superscalar, but none can issue in parallel more than 2 instructions of the ALU subset that interests us: add, logical operations, rotate/shift.

[Table 3; surviving column labels: 620, SPARC, PPro]

Notes to Table 3:
a. The PA-RISC 2.0 instruction shrpw r1,r2,x,t shifts the concatenation of r1 and r2 to the right over x bits, and puts the result in t. By taking r1 = r2 = t it is in effect a rotate.
b. The R4000 is superpipelined (but not superscalar) and its pipeline clock is twice the external clock frequency, so that 2 instructions can be issued per clock cycle.
c. The R10000 is superscalar, but not superpipelined.
d. The Alpha architecture has just three 32-bit integer operations: add, subtract, multiply. In addition, it has a set of in-register manipulation instructions on 32-bit quantities, such as extract, insert, and mask.
Critical path length
To determine the amount of available instruction-level parallelism in the MD4-like hash functions, a critical path analysis is applied. To that end the algorithms are represented as a so-called activity-on-edge network, which is a directed graph with weighted edges.
Geometrically a graph $G$ is defined as a set $V(G)$ of vertices $v_i$ interconnected by a set $E(G)$ of edges $e_i$. In a directed graph or digraph an edge $e_i$ is a directed pair $\langle v_i, v_j \rangle$ and is represented by an arrow from the tail $v_i$ to the head $v_j$. A directed path from $v_p$ to $v_q$ is a sequence of vertices $v_p, v_{i_1}, v_{i_2}, \ldots, v_{i_n}, v_q$ such that $\langle v_p, v_{i_1} \rangle, \langle v_{i_1}, v_{i_2} \rangle, \ldots, \langle v_{i_n}, v_q \rangle$ are edges in $E(G)$.
A network is a graph with weighted edges, i.e., to each edge $e$ a weight $w(e)$ is assigned. In an activity-on-edge network (AOE network) tasks to be performed are represented by directed edges. The vertices in the network represent events, signaling the completion of certain activities. Activities represented by edges leaving a vertex cannot be started until the event at that vertex has occurred. An event occurs only when all activities entering it have been completed. The weight $w(e)$ assigned to an edge $e$ represents the time required to complete the activity associated with $e$.
The length of a path is then defined as $\sum_e w(e)$, where $e$ runs over all edges on the path. It is the time it takes to complete the task represented by the path. Assuming the activities in an AOE network can be carried out in parallel, the minimum time to complete the overall task is the length of the longest path from the start vertex to the termination vertex. Such a path is called a critical path.
The evaluation of an arithmetic expression can be modeled as an AOE network. The start vertex corresponds to the availability of the input data, the activities represented by the edges correspond to the arithmetic operations constituting the expression, and the termination vertex corresponds to the result of the expression. The weight of an edge represents the time it takes to complete the corresponding arithmetic operation. Maximum performance in evaluating an arithmetic expression will therefore be obtained by making its critical path as short as possible, using, as much as possible, parallel execution of individual arithmetic operations. However, we must take into account that eventually the evaluation of the expression will take place on a multiple-issue architecture of the kind described in the previous section, i.e., all parallel execution units are pipelined, and all advance at the same rate. Unless out-of-order execution is supported, operations executed in parallel all deliver their result at the same moment, and therefore not faster than the time of the slowest operation. For this reason the critical path length will be expressed in terms of required pipeline stages, rather than in clock cycles. A measure similar to critical path length is depth, as used in the analysis of parallel algorithms [Ble96].
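The definition of a critical path can be illustrated with a minimal Python sketch that computes the longest start-to-end path of an acyclic AOE network. The function name `critical_path_length`, the edge representation as `(tail, head, weight)` triples, and the two example networks are assumptions made for this illustration; each addition is given weight 1, i.e., one pipeline stage.

```python
from collections import defaultdict
from functools import lru_cache


def critical_path_length(edges, start, end):
    """Length of the longest path from start to end in an acyclic AOE network.

    edges is an iterable of (tail, head, weight) triples, one per activity.
    """
    successors = defaultdict(list)
    for tail, head, weight in edges:
        successors[tail].append((head, weight))

    @lru_cache(maxsize=None)
    def longest_from(vertex):
        if vertex == end:
            return 0
        candidates = [weight + longest_from(head)
                      for head, weight in successors[vertex]]
        # A vertex without outgoing edges (other than end) cannot reach end.
        return max(candidates) if candidates else float("-inf")

    return longest_from(start)


# ((a + b) + c) + d: three additions, each waiting for the previous one.
sequential = [("start", "e1", 1), ("e1", "e2", 1), ("e2", "end", 1)]
# (a + b) + (c + d): two independent additions, then one combining them.
balanced = [("start", "sums", 1), ("start", "sums", 1), ("sums", "end", 1)]

print(critical_path_length(sequential, "start", "end"))  # 3
print(critical_path_length(balanced, "start", "end"))    # 2
```

Both networks contain three additions, but regrouping the expression shortens the critical path from 3 stages to 2, provided two additions can be executed in parallel.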
CPL analysis of the MD4-family
The critical path length (CPL) of the MD4-like compression functions is mainly determined by the CPL of the individual rounds: the CPL of the feedforward is at most 2. The CPL of each round is equal to the sum of the CPLs of its steps, so that the CPL of the compression function is easily derived from the CPL of a step. Each step updates one of the chaining words, and this updated word is then input to the next step. It is this basic dependency between steps that will determine their CPL. An inspection of two consecutive steps of every MD4-family member (see Appendix A) shows that, except for SHA-1, the chaining word updated in one step is input to the Boolean function of the next step. The chaining word updated in that step only becomes available after adding in the Boolean result, rotating the resulting sum, and, in the case of MD5 and RIPEMD-160, adding in another chaining word. SHA-1, in contrast, inputs the updated chaining word to a simple rotate, and the next chaining word becomes available after only 1 more addition. These lower bounds on a step's CPL are summarized in Table 4.

SHA-1 uses exactly the same kind and number of operations as MD5 and RIPEMD-160 to update a chaining variable: 1 application of a Boolean function, 4 additions, and a rotate. However, the lower bound on a step's CPL is only half that of MD5 and RIPEMD-160. This is due to the fundamentally different way SHA-1's step function is organized compared to all the others:
1. The rotate is not applied to a sum of intermediate results, but to an individual chaining variable.
2. None of the arguments of the Boolean function are, except for a rotate, updated in the previous step, but in the step before that.

This in itself might be a coincidence, but it turns out that the lower bound is also the actual CPL of each SHA-1 step, while this is not the case for any of the other hash functions, as will be shown in the sequel. This seeming coincidence might well be a design principle.
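The difference in organization can be made visible with a sketch of the two step functions as they are commonly published for MD5 and SHA-1. This is an illustration under stated assumptions, not the notation of Appendix A: the function names, the convention of renaming the chaining words after every step, and passing the Boolean function `f` as a parameter are all choices made here.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, s):
    """Rotate the 32-bit word x to the left over s bit positions."""
    return ((x << s) | (x >> (32 - s))) & MASK32


def md5_step(state, f, x, t, s):
    """One MD5 step; b is the word updated by the previous step."""
    a, b, c, d = state
    # Chain starting at b: Boolean function -> add -> rotate -> add.
    total = (a + f(b, c, d) + x + t) & MASK32
    new_b = (b + rotl32(total, s)) & MASK32
    return (d, new_b, b, c)


def sha1_step(state, f, w, k):
    """One SHA-1 step; a is the word updated by the previous step."""
    a, b, c, d, e = state
    # This sum does not involve a, so it can be prepared in parallel
    # with the previous step.
    rest = (f(b, c, d) + e + w + k) & MASK32
    # Chain starting at a: rotate -> add.
    new_a = (rotl32(a, 5) + rest) & MASK32
    return (new_a, a, rotl32(b, 30), c, d)


def parity(x, y, z):
    return x ^ y ^ z


print(md5_step((1, 0, 0, 0), parity, 0, 0, 3))  # (0, 8, 0, 0)
print(sha1_step((0, 1, 2, 4, 0), parity, 0, 0))
```

In `md5_step` the freshly updated word passes through four dependent operations before the next word is ready; in `sha1_step` it passes through only two, which matches the lower bounds discussed above.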
For the other hash functions the Boolean function is part of the critical path. This results in an increase of the CPL if the result cannot be delivered within the 1 stage assumed for the lower bound. This is, e.g., the case for the multiplexer $(x \wedge y) \vee (\bar{x} \wedge z)$ used in all MD4-like hash functions. It would seem that from the moment $x$ becomes available, and only using and, or, and xor, it takes three more stages to deliver the multiplexer result [Tou95]. However, using the mathematically equivalent expression $((y \oplus z) \wedge x) \oplus z$ [McC94, NMVR95], it only takes two more stages. Since this is still 1 more than the value assumed in the lower bound, this multiplexer lengthens the CPL of all steps using it by 1, except for SHA-1, where the Boolean function is not necessarily part of the critical path. Note that, as far as CPL is concerned, it does not always pay off to use the equivalent multiplexer expression. Consider the alternative multiplexer $(x \wedge z) \vee (y \wedge \bar{z})$ used in MD5, RIPEMD-128, and RIPEMD-160, where the critical path runs through $y$. Without rewriting it only takes 2 stages to deliver the result from the point $y$ becomes available, but using the equivalent expression $((x \oplus y) \wedge z) \oplus y$ the CPL increases to 3.
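The equivalence of the two forms of each multiplexer can be checked exhaustively with a short Python sketch. The function names are hypothetical, and the three 8-bit constants are chosen so that their bit columns enumerate all eight input combinations, which turns a single evaluation into a complete truth table.

```python
MASK = 0xFF
# Bit column i of (X, Y, Z) is one of the 8 possible input combinations.
X, Y, Z = 0b11110000, 0b11001100, 0b10101010


def mux(x, y, z):
    """(x and y) or (not-x and z): select y where x is 1, else z."""
    return ((x & y) | (~x & z)) & MASK


def mux_rewritten(x, y, z):
    """((y xor z) and x) xor z: same function, shorter path from x."""
    return (((y ^ z) & x) ^ z) & MASK


def alt_mux(x, y, z):
    """(x and z) or (y and not-z): select x where z is 1, else y."""
    return ((x & z) | (y & ~z)) & MASK


def alt_mux_rewritten(x, y, z):
    """((x xor y) and z) xor y: same function, longer path from y."""
    return (((x ^ y) & z) ^ y) & MASK


print(mux(X, Y, Z) == mux_rewritten(X, Y, Z))          # True
print(alt_mux(X, Y, Z) == alt_mux_rewritten(X, Y, Z))  # True
```

The check only establishes functional equivalence; which form gives the shorter critical path depends on which input arrives last, as argued above.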
The results of this CPL analysis for the MD4 family of hash functions are given in Table 5. The analysis is done using both 3-operand and 2-operand instructions. With the exception of the first and third round steps of SHA-1, the shortest possible critical path is the same for both operand formats. However, for the same CPL a realization on a 2-operand architecture requires more parallel execution units than on a 3-operand one. This information can be derived from the last 4 columns, where for both formats the required number of parallel units and their efficiency is given. The efficiency is defined as

$$\text{efficiency} = \frac{\text{number of instructions in a step}}{\text{CPL} \times \text{number of execution units}}$$

and is a measure of the average usage of the parallel execution units. The closer the value is to 1, the higher the degree of occupancy of the parallel units.

Table 5 also shows that if 3-operand instructions are used the shortest possible critical path of all SHA-1 steps is equal to the lower bound of Table 4: 2 stages. This is illustrated for the most involved case in Figure 1: the step function of the third round using the majority function. As a result the CPL of SHA-1's compression function is the shortest of all the MD4-like hash functions, as shown in Table 6. To realize a CPL of 2 in rounds 1 and 3 of SHA-1, two parallel rotates of the same variable are required, see Figure 1. However, the rotate instruction is a unary operation, and hence its 2-operand format has equal source and destination, making a parallel execution on the same variable impossible. Comparing the requirements of Table 6 with the resources of Table 3 shows that current superscalar architectures are only able to exploit all the available instruction-level parallelism of MD4 and MD5, two algorithms that can no longer be considered secure as collision-resistant hash functions [Dob96a, Dob96b, Rob96].
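The efficiency measure defined above is simple enough to evaluate by hand, but a small Python sketch makes the roles of the three quantities explicit. The function name `efficiency` and the instruction counts in the example are hypothetical and are not taken from Table 5.

```python
def efficiency(instructions, cpl, units):
    """Average occupancy of the parallel execution units during one step.

    instructions: number of instructions in the step
    cpl:          critical path length of the step, in pipeline stages
    units:        number of parallel execution units
    """
    if cpl <= 0 or units <= 0:
        raise ValueError("cpl and units must be positive")
    return instructions / (cpl * units)


# Hypothetical step: 8 instructions finished in 4 stages on 2 units,
# so every unit is busy in every stage.
print(efficiency(8, 4, 2))  # 1.0
# The same CPL with 9 instructions needs a third unit, which then
# sits idle for part of the time.
print(efficiency(9, 4, 3))  # 0.75
```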
The natural question to ask is: how realistic are the prospects for a general-purpose processor one day issuing 7 ALU instructions in parallel? Issuing many instructions per clock is difficult due to an increasingly complex issuing logic having a negative impact on the clock cycle time. Therefore, a high issuing rate will only pay off if the parallel execution units are kept sufficiently busy, so that the increase in cycle time will be more than compensated for by an enhanced throughput. The CPL analysis of SHA-1 shows that some algorithms certainly contain enough instruction-level parallelism to sustain such an increased issuing rate, but it is doubtful whether this will be the case for an average instruction sequence.
The RIPEMD family has, in contrast to SHA-1, two completely independent lines, leaving room for exploiting parallelism on a different level: the use of a multiprocessor system where the multiple-issue capability of each processor is limited, rather than a uniprocessor system with a single, very sophisticated processor capable of offering all the required parallelism on its own. In this respect [HePa96, Section 4.10] states that "to date, computer architects do not know how to design processors that can effectively exploit instruction-level parallelism in a multiprocessor configuration." The capability of placing two fully configured processors on a single die, which should be possible around the turn of the century, might result in a new type of architecture allowing processors to be more tightly coupled than before, and at the same time allowing them to achieve very high performance individually. Therefore, exploiting the instruction-level parallelism of the RIPEMD family in the near future seems much more likely, since each of the independent lines only requires a two-way superscalar architecture, which is already a standard feature of most processors today.

Algorithms with more instruction-level parallelism than the hardware they are executed on can provide will inevitably see their CPL increase. This is illustrated by means of the first step of MD4's round 2. Using a 3-operand instruction format, two parallel units suffice to exploit all available instruction-level parallelism, as illustrated in the left diagram of Figure 2. Note that the efficiency is 100%. Using a 2-operand instruction format will increase the number of instructions, as operations of the form A ← B op C will require two instructions: A ← B and A ← A op C. Since the 3-operand instruction stream already has an efficiency of 100%, 3 parallel units are now required to realize the same CPL of 4. Therefore, an implementation using only 2 parallel units will inevitably have a longer critical path.
This is illustrated in the right diagram of the same figure, showing an increase in CPL of 1 stage. The left diagram is expected to be found on, e.g., a PowerPC 604 [SDC94] or a PA 7100LC [BKQW95], while the right diagram resembles the situation on a Pentium processor, except that a Pentium cannot execute a rotate over more than 1 bit in parallel with any other instruction, resulting in a further increase of the CPL.
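The effect of too few execution units on the realized critical path can be reproduced with a minimal Python sketch of a greedy scheduler. It assumes that every operation takes one pipeline stage and that ready operations are issued in alphabetical order. The function name `schedule_length` and the small dependency graph are hypothetical; the graph is not the one shown in Figure 2.

```python
def schedule_length(deps, units):
    """Number of stages a greedy scheduler needs on `units` parallel units.

    deps maps every operation to the set of operations it depends on.
    """
    done = set()
    stages = 0
    while len(done) < len(deps):
        # An operation is ready once all operations it depends on are done.
        ready = [op for op in sorted(deps)
                 if op not in done and deps[op] <= done]
        if not ready:
            raise ValueError("dependency cycle or unknown operation")
        done.update(ready[:units])  # issue at most `units` operations
        stages += 1
    return stages


# Hypothetical step: three independent operations feeding a fourth one.
deps = {"a": set(), "b": set(), "c": set(), "d": {"a", "b", "c"}}

print(schedule_length(deps, 3))  # 2: the critical path of the graph
print(schedule_length(deps, 2))  # 3: one stage longer with only 2 units
```

With 3 units the schedule length equals the critical path of the dependency graph; removing one unit forces an operation into a later stage and lengthens the realized path by 1 stage, the same kind of increase the text describes for MD4 on a 2-unit, 2-operand machine.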
Conclusion
The new generation of customized hash functions based on MD4 (RIPEMD-128, RIPEMD-160, SHA-1) contains more instruction-level parallelism than current general-purpose computer architectures are able to provide. The critical path of SHA-1 is shorter than that of any of the other MD4-like hash functions, but exploiting it would require a 7-way multiple-issue architecture. Exploiting the instruction-level parallelism of the RIPEMD family in the near future seems more likely, due to its organization in two independent lines, each of which only requires a 2-way superscalar architecture. Opening up new perspectives is the recent introduction of a new 5-way VLIW processor, primarily intended for multimedia processing.
