Abstract-This paper presents a high throughput ApplicationSpecific Instruction-set Processor (ASIP) for cryptographic hash functions. The processor is obtained via hardware/software codesign methodology and accelerates SHA (Secure Hash Algorithm) and MD5 hash functions. The proposed design occupies 0.28 mm 2 (66 kgates) in 65 nm CMOS process including 4.5 KB single port memory and 52 kgates logic. The throughput of the proposed design reaches 15.8 Gb/s, 12.5 Gb/s, 12.2 Gb/s, and 19.9 Gb/s for MD5, SHA-1, SHA-512, and SHA3-512, respectively under the clock frequency of 1.0 GHz. The proposed design is evaluated with state-of-the-art VLSI designs, which reveals its high performance, low silicon cost, and full programmability.
I. INTRODUCTION
Cryptographic hash functions play an important role in network security protocols and infrastructures, such as TLS, SSL, SET, IPSec, and PKI [1]. Hash functions can be used to verify the integrity of data in transit. They can also be used to build message authentication codes, e.g., the Hash Message Authentication Code (HMAC) [1]. The cryptographic hash algorithms MD5 (Message Digest Algorithm 5), SHA-1 (Secure Hash Algorithm 1), and SHA-2 (Secure Hash Algorithm 2) are widely adopted nowadays. The SHA-3 (Secure Hash Algorithm 3) standard was released by NIST on August 5, 2015. In the next few years, SHA-3 is expected to become a mandatory or optional cryptographic hash algorithm for all mainstream and future network security protocols and standards [1].
Circuits for MD5 and SHA cryptographic hash functions should, in general, support multiple algorithms with high performance. On the one hand, the protocols proposed for different applications require hash algorithms of different security levels [2]. On the other hand, MD5 and SHA workloads are among the most performance/power-critical workloads due to the iterative nature of hash computation and its high computational complexity. The increasing speeds of wired and wireless data networks require high-throughput hardware implementations of the cryptographic hash algorithms. The IPv6 network stack and its mandatory IPSec security protocols will further push the requirement for high-performance implementations [1]. Several hardware techniques have been adopted to accelerate single or multiple cryptographic hash functions. The techniques include the use of parallel counters and Carry Save Adders (CSA) [3]-[5], loop unrolling to mitigate the serial dependence of hash computation [6], delay balancing [4], embedded memories [7], and pipelining [1]-[4], [8], [9]. These techniques require significant extra hardware, resulting in higher area. Up to now, there have been successful VLSI designs for multiple hash functions that achieve significant area savings via hardware sharing. For example, Cao et al. [2] and Wang et al. [10] propose reconfigurable hardware designs for the SHA-1 and MD5 hash algorithms. Ramanarayanan et al. [11], Michail et al. [9], and Chaves et al. [5] present SHA accelerators for five SHA algorithms (i.e., SHA-1/224/256/384/512). However, these designs are implemented as ASICs/FPGAs targeting a small predefined set of hash algorithms. This paper presents a hash processor (HP-ASIP) for MD5, SHA-1, SHA-2, and SHA-3. Table I lists the functions implemented in this paper.
HP-ASIP is obtained via hardware/software co-design and achieves ASIC-like performance and full programmability with an area of 0.28 mm² (65 nm). Thanks to its programmability, HP-ASIP can extend chip lifetime: when one of the implemented algorithms is cracked, the algorithm can be changed via software programming.

TABLE I. Functions implemented in this paper (all values in bits)

Function      Output size   Collision   Preimage   2nd preimage
MD5           128           < 64        NA         NA
SHA-1         160           < 80        160        105-160
SHA-224       224           112         224        201-224
SHA-512/224   224           112         224        224
SHA-256       256           128         256        201-256
SHA-512/256   256           128         256        256
SHA-384       384           192         384        384
SHA-512       512           256         512        394-512
SHA3-224      224           112         224        224
SHA3-256      256           128         256        256
SHA3-384      384           192         384        384
SHA3-512      512           256         512        512

The rest of this paper is organized as follows. Section II provides the design of HP-ASIP. Section III presents the top-level architecture, the datapath, the memory subsystem, the pipeline scheduling, and the instruction set of HP-ASIP. Section IV describes the area and power consumption of HP-ASIP. Section V evaluates HP-ASIP. Section VI concludes the paper.
II. DESIGN OF HP-ASIP
We propose a methodology to design HP-ASIP. Fig. 1 depicts the design flow, a hardware/software (HW/SW) co-design flow that optimizes the partition of HW and SW functions cooperatively during the design of HP-ASIP. First, the scope of algorithms (e.g., MD5, SHA-1, SHA-256, SHA-512, and SHA3-512), the throughput, and other requirements are specified according to the application. Then, a datapath, a memory subsystem, and a processor architecture are designed to accelerate the algorithms within this scope.
A. Design and Optimization of HP-ASIP
The datapath of HP-ASIP is designed first, as the other parts of HP-ASIP can be designed only once the datapath is fixed. Multi-mode hash accelerators exist for SHA [11] and MD5 [2]. In this work, a preliminary datapath for SHA-1/224/256/384/512 is first proposed according to [11]. The datapath can process two independent data streams in parallel when performing SHA-1/224/256. Then, we map one step of MD5 [2] onto the datapath, optimizing the degree of hardware sharing between MD5 and SHA-1/224/256. Finally, we map one round of SHA-3 [1] onto the datapath incrementally.
Our analysis shows that the critical path of the datapath for SHA-3 is much shorter than that for SHA-1/224/256/384/512 and MD5. To achieve a high clock frequency, we partition the datapath into two pipeline stages. Taking the datapath for SHA-1 (Fig. 2(a)) and the datapath for SHA-256 (Fig. 2(b)) as examples, we explain how the datapath is partitioned. First, we identify the critical paths of the datapath for SHA-1 and SHA-256, as shown in Fig. 2. Then, we distribute the operations on the critical path over two pipeline stages, as shown in Fig. 3. Fig. 3(a) describes the implementation of SHA-1 on HP-ASIP and Fig. 3(b) describes the acceleration of SHA-256 on HP-ASIP. The remaining algorithms are implemented in a similar way. In this design, SHA-3 requires only one pipeline stage of the datapath, while SHA-1/224/256/384/512 and MD5 need two pipeline stages.

To fully utilize the two-stage pipelined datapath, we introduce odd and even register contexts for hash values and message schedulers. First, we analyze the parameters of the targeted algorithms. Table II lists the size of the internal state, the size of each message block, the size of the iteration constants, and the number of iterations of the targeted algorithms. The maximum size of the internal state for MD5 and SHA-1/224/256/384/512 is 512b and the maximum size of the message block is 1024b. We thus introduce a 512b register and a 1024b register for each of the odd and even register contexts (to be discussed). When performing SHA-3, we use 1600b of the odd and even register contexts for the internal state of SHA-3.
In this work, we introduce a data memory (to be discussed) for the iteration constants ($K_t$) [11] and the indexes for the message schedule ($W_t$) of the targeted algorithms, so that they can be supplied via software programming when performing MD5 and SHA message digest computation. In addition, we propose instructions that operate on the odd and even register contexts (to be discussed). HP-ASIP can thus process two independent data streams simultaneously when performing SHA-384/512 and four independent data streams simultaneously when performing MD5 and SHA-1/224/256.
In this design, common operations among the targeted algorithms are implemented by shared functional blocks to achieve low silicon cost. After mapping all the algorithms targeted in this paper onto the datapath incrementally, we fix the datapath. Then, we extract and represent the control signals for the processing routines in the datapath by a group of control indications and propose a specific instruction set for the targeted algorithms. As the rounds of message digest computation for SHA-1 and MD5 can be divided into 4 parts [2] , we introduce 4 instructions to fulfill the hash computation of SHA-1 and MD5, respectively (to be discussed).
Afterwards, pseudocode for the targeted algorithms is developed using the instruction set. We extract the addressing and control information from the pseudocode and propose a specific memory subsystem, a control path, and a top-level architecture for HP-ASIP. Based on the specified functional blocks and the instruction set, we develop the RTL (Register Transfer Level) description of HP-ASIP. Then, the correctness and performance of the functional design and the silicon layout are verified. Based on the pseudocode, we develop the assembly codes of the algorithms, supporting all algorithms targeted in this paper. Finally, the hardware (assembly instruction set) and software (assembly codes) are integrated, completing the design of HP-ASIP.
As shown in Fig. 1, the design and optimization flow of HP-ASIP is iterative. Any of the previously mentioned essential requirements that is not fulfilled may cause substantial redesign work. With the HW/SW co-design methodology, HP-ASIP achieves low silicon cost by optimizing the degree of hardware sharing among the targeted algorithms. This method results in an ASIP for the targeted cryptographic hash algorithms that satisfies all the previously mentioned essential requirements.
B. Data Block Expansion for SHA Function
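The SHA-1 computation steps described in Fig. 2(a) are performed 80 times (rounds). Each round uses a 32-bit word obtained from the current input data block. As each input data block contains only 16 32-bit words (512 bits), the remaining 64 32-bit words are obtained via data expansion. The data expansion is performed via the computation described in (1),

$$W_t = \begin{cases} M_t^{(i)}, & 0 \le t \le 15 \\ \mathrm{ROTL}^{1}\left(W_{t-3} \oplus W_{t-8} \oplus W_{t-14} \oplus W_{t-16}\right), & 16 \le t \le 79 \end{cases} \tag{1}$$

where $M_t^{(i)}$ denotes the $t$-th 32-bit word of the $i$-th data block, $\mathrm{ROTL}^{n}(x)$ denotes the left rotation of $x$ by $n$ bits, and $\oplus$ denotes the bitwise XOR operation.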
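As a minimal software sketch of the data expansion referred to in (1), the following Python code expands one 512-bit block into the 80 words of the SHA-1 message schedule, following the recurrence defined in FIPS 180-4. It also runs the 80 steps in four groups of 20 so that the schedule can be checked against a reference digest. The function names (`sha1_expand`, `sha1_compress`) are illustrative assumptions and are not part of HP-ASIP, which performs the same recurrence with registers rather than a list.

```python
import hashlib
import struct

MASK32 = 0xFFFFFFFF
SHA1_IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def sha1_expand(block: bytes) -> list:
    """Expand a 64-byte block into the 80-word SHA-1 message schedule."""
    w = list(struct.unpack(">16I", block))  # W_0..W_15 come from the block
    for t in range(16, 80):                 # W_16..W_79 come from the recurrence
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w


def sha1_compress(state: tuple, block: bytes) -> tuple:
    """Process one block; the 80 steps fall into four groups of 20."""
    a, b, c, d, e = state
    for t, wt in enumerate(sha1_expand(block)):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        temp = (rotl32(a, 5) + f + e + k + wt) & MASK32
        a, b, c, d, e = temp, a, rotl32(b, 30), c, d
    return tuple((s + v) & MASK32 for s, v in zip(state, (a, b, c, d, e)))


# Pre-padded single block for the 3-byte message "abc" (24 bits).
block = b"abc" + b"\x80" + bytes(52) + (24).to_bytes(8, "big")
digest = struct.pack(">5I", *sha1_compress(SHA1_IV, block))
print(digest == hashlib.sha1(b"abc").digest())
```

The comparison with `hashlib.sha1` only exercises a single-block message; longer messages would chain `sha1_compress` over successive blocks.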
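For the SHA-2 algorithms, the computation steps shown in Fig. 2(b) are performed for 64 rounds (80 rounds for SHA-512). In each round, a 32-bit word (64-bit for SHA-512) obtained from the current input data block is used. As the input data block contains only 16 words of 32 bits (64 bits for SHA-512), the initial data block is expanded to obtain the remaining words. The expansion is performed via the computation described in (2),

$$W_t = \begin{cases} M_t^{(i)}, & 0 \le t \le 15 \\ \sigma_1(W_{t-2}) + W_{t-7} + \sigma_0(W_{t-15}) + W_{t-16}, & 16 \le t \le 63 \ (79 \text{ for SHA-512}) \end{cases} \tag{2}$$

where $M_t^{(i)}$ denotes the $t$-th word of the $i$-th data block and the additions are performed modulo $2^{32}$ (modulo $2^{64}$ for SHA-512). For SHA-224/256, $\sigma_0(x) = \mathrm{ROTR}^{7}(x) \oplus \mathrm{ROTR}^{18}(x) \oplus \mathrm{SHR}^{3}(x)$ and $\sigma_1(x) = \mathrm{ROTR}^{17}(x) \oplus \mathrm{ROTR}^{19}(x) \oplus \mathrm{SHR}^{10}(x)$. For SHA-384/512, $\sigma_0(x) = \mathrm{ROTR}^{1}(x) \oplus \mathrm{ROTR}^{8}(x) \oplus \mathrm{SHR}^{7}(x)$ and $\sigma_1(x) = \mathrm{ROTR}^{19}(x) \oplus \mathrm{ROTR}^{61}(x) \oplus \mathrm{SHR}^{6}(x)$. $\mathrm{ROTR}^{n}(x)$ and $\mathrm{SHR}^{n}(x)$ denote the right rotation and the right shift of $x$ by $n$ bits, respectively.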
For efficiency reasons, this work accelerates data block expansion in hardware. Taking SHA-256 as an example, we expand the 512 bits of each data block in hardware. The input data block expansion described in (2) can be implemented with registers and ADD operations. The output value is selected between the original data block (for the first 16 rounds) and the computed values (for the remaining rounds). Fig. 4 depicts the implemented structure for SHA-256. As the datapath for MD5 and SHA-1/224/256/384/512 is two-stage pipelined, we implement the data block expansion with two pipeline stages so that the datapath and the data block expansion circuit can work synchronously.
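The register-based structure described above can be mimicked in software with a 16-word shift register. The sketch below is an illustrative model only: it assumes the SHA-256 functions $\sigma_0$ and $\sigma_1$ from FIPS 180-4, ignores the two-stage pipelining of the actual circuit, and uses hypothetical names (`schedule_stream`, `schedule_direct`). It emits one word per round, selecting the input word for the first 16 rounds and the computed word afterwards.

```python
import struct
from collections import deque

MASK32 = 0xFFFFFFFF


def rotr32(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def sigma0(x: int) -> int:
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3)


def sigma1(x: int) -> int:
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10)


def schedule_stream(block: bytes):
    """Yield W_0..W_63 one per round, keeping only the last 16 words."""
    words = struct.unpack(">16I", block)
    window = deque(maxlen=16)  # holds W_{t-16} .. W_{t-1}
    for t in range(64):
        if t < 16:
            w = words[t]  # first 16 rounds: take the word from the block
        else:
            # window[0]=W_{t-16}, [1]=W_{t-15}, [9]=W_{t-7}, [14]=W_{t-2}
            w = (sigma1(window[14]) + window[9]
                 + sigma0(window[1]) + window[0]) & MASK32
        window.append(w)  # shift: the oldest word drops out
        yield w


def schedule_direct(block: bytes) -> list:
    """Reference expansion that stores all 64 words."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7]
                  + sigma0(w[t - 15]) + w[t - 16]) & MASK32)
    return w


block = b"abc" + b"\x80" + bytes(52) + (24).to_bytes(8, "big")
print(list(schedule_stream(block)) == schedule_direct(block))
```

Keeping only 16 words is what makes a register implementation feasible: no round ever reads further back than $W_{t-16}$.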
C. Message Padding
To ensure that the length of the input data is a multiple of 512 bits, as required by the MD5 and SHA-1/224/256 specifications (1024 bits for SHA-384/512), the original message needs to be padded. Taking the padding procedure for 512-bit input data blocks as an example, it is performed as follows: for an original message composed of n bits, the bit "1" is appended at the end of the message, followed by k zero bits, where k is the smallest non-negative solution to the equation $n + 1 + k \equiv 448 \pmod{512}$. The last 64 bits of the final 512-bit input data block are filled with the binary representation of n. For the SHA-512 message padding, 1024-bit data blocks are used and the last 128 bits, not 64, are reserved for the binary representation of the length of the original message [5]. The message padding operations can be efficiently implemented in software.
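The padding rule can be sketched in a few lines of Python. This is a hedged illustration rather than the HP-ASIP software: the function name `md_pad` and its parameters are assumptions, messages are taken to be whole bytes, and the length field is written big-endian as in the SHA family (MD5 stores the length little-endian, which the `byteorder` argument accommodates).

```python
def md_pad(message: bytes, block_bytes: int = 64, byteorder: str = "big") -> bytes:
    """Pad a byte message to a multiple of the block size.

    block_bytes = 64 gives the 512-bit rule (64-bit length field);
    block_bytes = 128 gives the 1024-bit rule (128-bit length field).
    """
    length_bytes = 8 if block_bytes == 64 else 16
    n_bits = 8 * len(message)
    # The byte 0x80 carries the single "1" bit followed by seven zero bits.
    padded = message + b"\x80"
    # Add the fewest zero bytes so that the length field ends the last block.
    zero_bytes = (-(len(padded) + length_bytes)) % block_bytes
    padded += bytes(zero_bytes)
    padded += n_bits.to_bytes(length_bytes, byteorder)
    return padded


padded = md_pad(b"abc")
# k = 7 zero bits inside the 0x80 byte plus the explicit zero bytes.
k = 7 + 8 * (len(padded) - len(b"abc") - 1 - 8)
print(len(padded), k, (24 + 1 + k) % 512)
```

For the 24-bit message "abc" this yields k = 423, which satisfies $n + 1 + k \equiv 448 \pmod{512}$. A 56-byte message no longer leaves room for the length field in its own block, so the padded result spans two blocks.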
III. THE PROPOSED PROCESSOR
This work adopts SIMD (Single Instruction Multiple Data) architecture to meet the requirements on the computational complexity. A two-stage pipelined datapath and a SIMD instruction set are proposed for the targeted algorithms. Multiple memory banks are designed to fulfill the bandwidth requirements of the datapath. We also introduce a variable depth pipeline to approach the efficiency limit.
A. Top-level Architecture
The top-level architecture (Fig. 5) of HP-ASIP is made up of three parts: control logic, memory subsystem, and datapath. The control logic includes the PC FSM, the PM (program memory), the ID (instruction decoder), the DMA, and status registers. The control logic reads one instruction per clock cycle from the PM, a 256 × 80b SRAM, and decodes the machine code (i.e., the fetched instruction) into control signals. The control logic also performs loop acceleration. The memory subsystem is composed of the AGU, the RPN (read permutation network), the WPN (write permutation network), and the DM (data memory). The AGU generates the addresses of the operands according to the machine code and passes them to the DM. The DM contains four memory blocks. Each memory block contains sixteen 32 × 8b SRAMs and can provide 16 bytes of data per clock cycle. The outputs of these memory blocks are passed to the RPN and then to the datapath. The datapath accelerates the hash algorithms implemented in this paper. The outputs of the datapath are passed to the WPN. The RPN and WPN are introduced for data shuffling to ensure that the vector data can be accessed in parallel without access conflicts. The outputs of the WPN are written to a memory block.
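The memory sizes quoted above can be cross-checked against the 4.5 KB of memory stated in the abstract. The short calculation below uses only the SRAM dimensions given in the text; the variable names are illustrative.

```python
# Program memory: one 256 x 80b SRAM.
PM_WORDS, PM_WIDTH_BITS = 256, 80

# Data memory: four blocks, each with sixteen 32 x 8b SRAMs.
DM_BLOCKS, SRAMS_PER_BLOCK = 4, 16
SRAM_WORDS, SRAM_WIDTH_BITS = 32, 8

pm_bytes = PM_WORDS * PM_WIDTH_BITS // 8
dm_bytes = DM_BLOCKS * SRAMS_PER_BLOCK * SRAM_WORDS * SRAM_WIDTH_BITS // 8
total_kb = (pm_bytes + dm_bytes) / 1024

# One byte from each of the sixteen SRAMs of a block per clock cycle.
bytes_per_cycle_per_block = SRAMS_PER_BLOCK * SRAM_WIDTH_BITS // 8

print(pm_bytes, dm_bytes, total_kb, bytes_per_cycle_per_block)
```

The PM contributes 2.5 KB and the DM 2 KB, which together give the 4.5 KB total, and each memory block delivers the 16 bytes per cycle that the datapath consumes.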
B. Datapath
The datapath of HP-ASIP contains 2 pipeline stages and 5 blocks (Fig. 6). Among these blocks, the block ...

To fully utilize the two-stage pipelined datapath, we process two independent messages at a time. Taking SHA-512 as an example, during hash computation two 512-bit hash values can be stored in the Hash register and two 1024-bit messages can be stored in the Message Scheduler. Therefore, we can process two independent data streams simultaneously (i.e., interleaved) via software programming.
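The benefit of interleaving two independent messages can be illustrated with a simplified timing model. The sketch below is an assumption-laden abstraction, not a model of the actual HP-ASIP control logic: it supposes that one step can be issued per cycle, that each step of a stream needs the result of that stream's previous step, and that a result becomes available two cycles after issue. The name `cycles_to_finish` is hypothetical.

```python
def cycles_to_finish(steps_per_stream: int, streams: int, latency: int = 2) -> int:
    """Cycles until the last result is available, with one issue slot per cycle."""
    remaining = [steps_per_stream] * streams
    ready_at = [0] * streams  # first cycle at which each stream may issue again
    cycle = 0
    finish = 0
    while any(remaining):
        for s in range(streams):
            if remaining[s] and ready_at[s] <= cycle:
                remaining[s] -= 1
                ready_at[s] = cycle + latency  # next step waits for this result
                finish = cycle + latency
                break  # only one step can be issued in a cycle
        cycle += 1
    return finish


single = cycles_to_finish(80, streams=1)       # one SHA-512 message, 80 steps
interleaved = cycles_to_finish(80, streams=2)  # two messages, 160 steps in total
print(single, interleaved)
```

Under these assumptions a single 80-step message occupies the pipeline for 160 cycles with every second issue slot idle, while two interleaved messages complete 160 steps in 161 cycles.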
C. Memory Subsystem
The operands of HP-ASIP are 16-byte vector data. The 16-byte vector data should be obtained within one clock cycle to ensure that the datapath of HP-ASIP can work efficiently. We thus propose a parallel memory subsystem and specific addressing patterns for HP-ASIP. Five addressing patterns are proposed for HP-ASIP as shown in Table III . All the algorithms targeted in this paper can thus be supported adopting the 5 addressing patterns and the parallel memory subsystem.
To ensure that the vector data can be obtained in parallel, we introduce the RPN and WPN. Fig. 7 describes an example of the RPN. Without the RPN, the 16 bytes of data stored in addresses 19 to 34 cannot be obtained simultaneously in sequential order. With the RPN performing the shuffling, the vector data can be delivered in parallel to the datapath. The WPN works in a similar way.
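The example with addresses 19 to 34 can be reproduced with a small model of one memory block. The sketch assumes low-order interleaving, i.e., byte address `a` is stored in bank `a % 16` at row `a // 16`; this mapping and the function name `read_vector` are assumptions made for illustration, as the text does not spell out the address mapping. Under this mapping an unaligned 16-byte vector touches every bank exactly once, but the bytes leave the banks in rotated order, and a rotation restores the sequential order.

```python
BANKS = 16  # sixteen byte-wide SRAMs per memory block
ROWS = 32   # 32 entries per SRAM

# Fill the block so that each byte location stores its own address.
banks = [[row * BANKS + b for row in range(ROWS)] for b in range(BANKS)]


def read_vector(banks, start: int):
    """Read 16 consecutive bytes starting at an arbitrary address.

    Returns (raw, ordered): the data as it leaves the banks, and the data
    after the permutation step that restores address order.
    """
    offset = start % BANKS
    raw = []
    for b in range(BANKS):
        # The single address in [start, start + 16) that maps to bank b.
        addr = start + (b - offset) % BANKS
        raw.append(banks[b][addr // BANKS])
    # Permutation: rotate so that the byte at address `start` comes first.
    ordered = raw[offset:] + raw[:offset]
    return raw, ordered


raw, ordered = read_vector(banks, 19)
print(raw)
print(ordered)
```

For a start address of 19, banks 0 to 2 supply addresses 32 to 34 from one row while banks 3 to 15 supply addresses 19 to 31 from the previous row, so the raw output begins with 32, 33, 34 and needs a rotation by three positions.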
D. Pipeline Scheduling
To approach the efficiency limit of the datapath, the instructions of HP-ASIP are executed in a pipeline. HP-ASIP contains 7 pipeline stages, as shown in Fig. 8. First, an instruction is read from the PM during IF and decoded into control signals during ID. During ID, the addresses of the operands are also generated and passed to the DM. The source operands are read from the DM during Mem. The operands are then passed to the RPN and permuted if necessary during Perm. Afterwards, the outputs of the RPN are passed to the datapath, which takes 1 or 2 pipeline stages depending on the instruction. Additional logic buffers the control signals and the address of the destination operand so that the datapath works properly in both cases. The block Out Ctrl selects which pipeline stage of the datapath outputs the results. Finally, the results are stored during WB.
E. Instruction Set
To support multiple algorithms targeted in this paper, we propose an instruction set for HP-ASIP. The instruction set of HP-ASIP consists of 24 SIMD instructions. Among these instructions, 2 are for SHA-3, 3 are for SHA-384/512, 3 are for SHA-224/256, 8 are for SHA-1, and 8 are for MD5. Table IV lists selected typical instructions of HP-ASIP. Column 1 shows the instruction mnemonics in assembly. Column 2 shows functions of the instructions. Adopting the instruction set introduced, all the algorithms targeted in this paper can be efficiently accelerated. 
TABLE IV. Selected instructions of HP-ASIP

Mnemonic   Function
SHA5120    One step of SHA-384/512 (adopting the odd register context)
SHA5121    One step of SHA-384/512 (adopting the even register context)
SHA2560    2x one step of SHA-224/256 (adopting the odd register context)
SHA2561    2x one step of SHA-224/256 (adopting the even register context)
SHA110     2x one step of the second round of SHA-1 [2] (adopting the odd register context)
MD500      2x one step of the first round of MD5 [2] (adopting the odd register context)
MD530      2x one step of the fourth round of MD5 [2] (adopting the odd register context)
SHA3       One round of SHA-3 [1]

As loop control in software consumes considerable resources, we introduce efficient branch-cost-free loop acceleration in hardware. In this design, the instruction REPEAT and nested loops are hardware accelerated, so all loops are performed with no branch cost. We achieve this with two mechanisms. First, the microcode of each HP-ASIP instruction contains a special field indicating how many times the instruction is to be repeated. This field is configured by appending the option "-i Imm" to the instruction in assembly code. For example, to repeat an instruction 24 times, the option "-i 24" is used. Second, we propose the instruction REPEAT to repeat a block of instructions several times.

Fig. 9 shows how to perform SHA-1 on HP-ASIP with the proposed instructions. Assembly coding that avoids data hazards is adopted to enhance performance. The indexes for the message schedule are stored in dm0 before performing hash computation. The REPEAT instruction repeats the following 2 instructions 20 times. Fig. 10 shows how to perform SHA-3 on HP-ASIP with the proposed instructions. C code of SHA-3 and the corresponding assembly code are presented. The round constants are stored in dm0 before performing hash computation. To process one data block, the instruction SHA3 is repeated 24 times with the option "-i 24".

Among all the modules of HP-ASIP, the datapath, the DM, and the PM consume most of the area. The datapath accounts for 69.1% of the area. The DM with 4 memory blocks consumes 11.6%. The PM and the AGU account for 9.6% and 2.5%, respectively.
The permutation networks, including the read and write permutation networks, cost ...

[2] and [10] are successful SHA-1/MD5 processor cores that support both the SHA-1 and MD5 hash functions; [1] is one of the most efficient SHA-3 methods proposed recently; [11] accelerates SHA-1/224/256/384/512, and it is the most comparable method to ours for SHA functions.
Note that the gate counts of [2] and [10] listed in Table VI exclude the cost of memory [2]. The area of [11] is obtained in 45 nm CMOS technology, while HP-ASIP is synthesized in 65 nm CMOS technology. To compare with [11] more fairly, we scale the area provided by [11] ... Table VI is figured out in a similar way.
In terms of throughput, Ramanarayanan's design [11] achieves better throughput than HP-ASIP for SHA-1/224/256/384/512 because it operates at a higher clock frequency. However, HP-ASIP is programmable and additionally supports MD5 and SHA-3, thanks to its programmable architecture and application-specific instruction set.
Compared with state-of-the-art ASICs/FPGAs, our design achieves competitive throughput for MD5 and SHA functions with full programmability. Owing to this programmability, the algorithms implemented on HP-ASIP can be changed to extend chip lifetime. For example, when one of the implemented cryptographic hash algorithms is cracked, HP-ASIP can be kept in service by updating its software.
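The reported throughput figures can be put in context with a back-of-the-envelope upper bound. The calculation below is an interpretation and not taken from the paper: it assumes one instruction is issued per cycle at 1.0 GHz, reads the "2x one step" entries of Table IV as two steps completed per instruction for MD5 and SHA-1, and ignores all load, store, and control overhead. Block sizes and step counts are those of the respective standards (the SHA3-512 rate is 576 bits per 24-round permutation).

```python
CLOCK_GHZ = 1.0  # clock frequency stated in the paper

# name: (block bits, steps per block, steps per cycle [assumed], reported Gb/s)
ALGORITHMS = {
    "MD5": (512, 64, 2, 15.8),
    "SHA-1": (512, 80, 2, 12.5),
    "SHA-512": (1024, 80, 1, 12.2),
    "SHA3-512": (576, 24, 1, 19.9),
}


def ideal_gbps(block_bits: int, steps: int, steps_per_cycle: int,
               clock_ghz: float = CLOCK_GHZ) -> float:
    """Aggregate throughput if every cycle completes steps_per_cycle steps."""
    cycles_per_block = steps / steps_per_cycle
    return block_bits / cycles_per_block * clock_ghz


for name, (bits, steps, per_cycle, reported) in ALGORITHMS.items():
    bound = ideal_gbps(bits, steps, per_cycle)
    print(f"{name:9s} bound {bound:5.1f} Gb/s, reported {reported:5.1f} Gb/s, "
          f"ratio {reported / bound:.2f}")
```

Under these assumptions the reported figures for MD5, SHA-1, and SHA-512 lie within about 5% of the bound, while SHA3-512 reaches roughly 83% of it. The paper does not state how its throughput figures were derived.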
VI. CONCLUSIONS
This paper presents a SIMD ASIP for cryptographic hash functions that accelerates MD5, SHA-1, SHA-2, and SHA-3. Adopting a processor architecture, we map the hash algorithms onto a two-stage pipelined datapath, optimizing the degree of hardware sharing among the algorithms. This approach results in a hash processor that achieves a throughput of 15.8 Gb/s for MD5, 12.5 Gb/s for SHA-1, 12.2 Gb/s for SHA-512, and 19.9 Gb/s for SHA3-512, occupying 0.28 mm² in 65 nm CMOS. Compared with state-of-the-art VLSI designs, our design achieves ASIC-like performance, full programmability, and low silicon cost.
