This work proposes Application-Specific System Processor (ASSP) hardware for the Secure Hash Algorithm 1 (SHA-1). The proposed hardware was implemented in a Xilinx Virtex 6 xc6vlx240t-1ff1156 Field Programmable Gate Array (FPGA). Throughput and occupied area were analyzed for implementations with several parallel instances of the hash algorithm. The results showed that the proposed SHA-1 hardware achieved a throughput of 0.644 Gbps for a single instance and slightly more than 28 Gbps for 48 instances in a single FPGA. Applications such as password recovery, password validation and integrity checking of high volumes of data can be performed quickly and efficiently with an ASSP for SHA-1.
Introduction
The Secure Hash Algorithm version one, SHA-1, is an algorithm used to verify the integrity of variable-length data streams through an operation called hashing.
A hash function produces a code of fixed length C from an input message of variable length K. It may be said that the output of the hash function, also called […] ASIC for IoT applications or used in the FPGA itself, aiming to accelerate hash code calculation in applications such as password recovery, password validation and integrity checking of large volumes of data.
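A quick software check of the fixed-length property, using Python's standard `hashlib` module purely as a reference model (it is not part of the proposed hardware): inputs of 0, 24 and 8,000,000 bits all produce a digest of C = 160 bits.

```python
import hashlib

# Three messages with very different lengths.
messages = [b"", b"abc", b"x" * 1_000_000]

for m in messages:
    digest = hashlib.sha1(m).digest()
    # The SHA-1 digest always has C = 160 bits (20 bytes), whatever the input length.
    print(f"K = {8 * len(m):>8} bits -> C = {8 * len(digest)} bits: {digest.hex()}")
```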
Related Work
The work presented in Jarvinen (2004) used a Xilinx Virtex-II XC2V2000-6 FPGA to implement SHA-1 with Iterative Looping (IL). This implementation occupied around 1,275 Logic Cells (LC) and achieved a throughput of 734 Mbps. The works of Michail et al. (2005) and Kakarountas et al. (2006) also used a Xilinx Virtex-II XC2V2000-6 FPGA to implement SHA-1. These proposals use a Full Pipeline (FL) scheme that consumed around 3,519 LC for a throughput of 2.5267 Gbps. Compared with Jarvinen (2004), the throughput is about 3.4× higher, owing to the use of 4 pipelined SHA-1 modules; however, the occupied area is also about 2.8× larger. Lee et al. (2009) also presented an FL-based proposal that reached a throughput of 5.9 Gbps.
In Iyer & Mandal (2013), SHA-1 was implemented on a Xilinx Virtex 5 XC5VLX50T FPGA using the Verilog hardware description language. The IL implementation was similar to that presented in Jarvinen (2004), but with a slightly larger area, around 1,351 LC, and a slightly higher throughput, around 786 Mbps.
The work presented by Khan et al. (2014) proposed a low-power FPGA implementation of SHA-1 for devices with limited power capacity, with high throughput and a small area compared with similar implementations. For this, the authors built on the work of Michail et al. (2005) and Kakarountas et al. (2006). The number of LC was reduced by adopting a more serial implementation, which also reduced the throughput.
Another approach based on Michail et al. (2005) and Kakarountas et al. (2006) was shown in Michail et al. (2016), which proposed an implementation in a TSMC 90 nm ASIC with a throughput of around 15 Gbps.
A comparison of SHA-1 implementations on several Xilinx FPGA platforms was presented in Michail et al. (2014). The implementation was based on the proposal of Michail et al. (2005) and Kakarountas et al. (2006), and a maximum throughput of about 14.3 Gbps was observed on a Xilinx Virtex 7 FPGA.
SHA-1 implementations on other hardware platforms can be found in Marks & Niewiadomska-Szynkiewicz (2014) and Al-Kiswany et al. (2009), which compare Graphics Processing Units (GPUs) and CPUs. The NVIDIA Tesla M2050 GPU, with 448 CUDA cores, and the AMD FirePro V7800, with 1440 stream processors, achieved throughput peaks of up to 1.5 Gbps.
The proposal developed here targets a Xilinx Virtex 6 xc6vlx240t-1ff1156 FPGA, and the results showed a throughput of 652 Mbps for a single SHA-1 module. The implementation uses the Iterative Looping strategy, which occupies less circuit area than the Full Pipeline strategy of Michail et al. (2005) and Kakarountas et al. (2006). Unlike the results presented in the literature, it was possible to synthesize up to 48 SHA-1 modules in a single FPGA device, yielding a throughput of 28.160 Gbps.
Given an input message $m_i$ of $K_i$ bits, the SHA-1 algorithm generates an output message, $h_i$, called a hash code, of fixed size $C = 160$ bits, characterized as
$$h_i = [h_0 \; h_1 \; \cdots \; h_{159}].$$
[…] bits that can be divided into $L_i$ blocks of length $M = 512$ bits, that is,
Algorithm 1 presents the pseudo-code with the sequence of steps required to generate the hash code; these steps are described in detail in the following subsections. The number of padding bits, $P_i$, can be expressed by
where $(a \bmod b)$ returns the remainder of the division of $a$ by $b$. Thus, the padding vector $p_i$ can be expressed as
where $p_0 = 1$ and $p_i = 0$ for $i = 1, \ldots, P_i - 1$.
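The closed form for $P_i$ is not reproduced in the text above. A minimal sketch, assuming the standard FIPS 180-4 rule (one 1 bit followed by the smallest number of 0 bits such that $K_i + P_i + 64$ is a multiple of 512), is shown below; the function names `padding_length` and `padding_vector` are hypothetical.

```python
M = 512  # block length in bits
T = 64   # length field in bits

def padding_length(K: int) -> int:
    """Number of padding bits P_i for a K-bit message (FIPS 180-4 rule)."""
    # One '1' bit plus the smallest number of '0' bits that makes
    # K + P_i + T a multiple of M.
    zeros = (M - T - (K + 1)) % M
    return zeros + 1

def padding_vector(K: int) -> list[int]:
    """p_i = [p_0 ... p_{P_i - 1}] with p_0 = 1 and all other bits 0."""
    P = padding_length(K)
    return [1] + [0] * (P - 1)
```

Note that $P_i$ ranges from 1 to 512 bits: a message whose length already leaves exactly 64 free bits in its last block still needs an extra full block.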
Algorithm 1 (excerpt):
9: n ← −1
10: …
11: for n ← 0 until 79 do
12: …
13: …
14: …
15: end for
16: …
17: end for
Length Insertion
In this step (lines 4 and 5 of Algorithm 1), the message $v_i$, a binary word of $T = 64$ bits, is appended; it is expressed as
The message length is generated by the function LengthGeneration($K_i$) in line 4 of Algorithm 1. The message $v_i$ stores the length of the i-th incoming message $m_i$, that is,
$$v_i = \mathrm{Binary}(K_i, T),$$
where Binary(a, b) is a function that returns a vector of b bits holding the big-endian binary representation of the decimal number a.
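As an illustration, Binary(a, b) can be sketched as follows; the Python name `binary` is hypothetical and the bit order (most significant bit first) follows the big-endian convention stated above.

```python
def binary(a: int, b: int) -> list[int]:
    """Big-endian b-bit representation of the non-negative integer a."""
    if not 0 <= a < 2 ** b:
        raise ValueError("a does not fit in b bits")
    # Most significant bit first (big-endian bit order).
    return [(a >> (b - 1 - k)) & 1 for k in range(b)]

# Length field v_i for a 24-bit message (e.g. the ASCII string "abc").
v = binary(24, 64)
```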
The FIPS 180-4 standard, NIST (2015), limits the message length $K_i$ to values that can be represented with 64 bits, that is, $0 \le K_i < 2^{64}$.
Finally, at the end of the second step, the message $z_i$, an extension of the i-th original input message $m_i$, is generated (line 5 of Algorithm 1).
In this work, the message $z_i$ is identified as a vector of $Z_i$ bits, expressed as
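A byte-level sketch of how $z_i$ can be assembled from $m_i$, the padding $p_i$ and the 64-bit length $v_i$. It assumes the message length is a whole number of bytes (typical for passwords), so $p_i$ starts with the byte 0x80; the name `build_z` is hypothetical.

```python
def build_z(m: bytes) -> bytes:
    """Extended message z_i = m_i || p_i || v_i (byte-aligned input assumed)."""
    K = 8 * len(m)                       # message length K_i in bits
    zero_bytes = (55 - len(m)) % 64      # zero bytes after the 0x80 byte
    p = b"\x80" + b"\x00" * zero_bytes   # p_0 = 1 followed by zero bits
    v = K.to_bytes(8, "big")             # v_i: 64-bit big-endian length
    return m + p + v
```

For any input, `len(build_z(m))` is a multiple of 64 bytes, so $Z_i$ is a multiple of $M = 512$ bits.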
Hash Code Initialization
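The hash code initialization (line 6 of Algorithm 1) is standardized by FIPS 180-4, NIST (2015), according to the following expressions:
$$ha = h_0 \ldots h_{31} = \mathrm{Binary}(1732584193, 32),$$
$$hb = h_{32} \ldots h_{63} = \mathrm{Binary}(4023233417, 32),$$
$$hc = h_{64} \ldots h_{95} = \mathrm{Binary}(2562383102, 32),$$
$$hd = h_{96} \ldots h_{127} = \mathrm{Binary}(271733878, 32),$$
and
$$he = h_{128} \ldots h_{159} = \mathrm{Binary}(3285377520, 32),$$
where $h_i = [ha \; hb \; hc \; hd \; he]$.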
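The initial words can be cross-checked against their usual hexadecimal form in FIPS 180-4. The sketch below keeps the decimal notation used in Binary(3285377520, 32) and concatenates the five words into the 160-bit value $h_i$; the names `H_INIT` and `h_init` are illustrative only.

```python
# Initial hash code words from FIPS 180-4, in decimal with their hex form.
H_INIT = {
    "ha": 1732584193,  # 0x67452301
    "hb": 4023233417,  # 0xEFCDAB89
    "hc": 2562383102,  # 0x98BADCFE
    "hd": 271733878,   # 0x10325476
    "he": 3285377520,  # 0xC3D2E1F0
}

# h_i = ha || hb || hc || hd || he as a single 160-bit integer.
h_init = 0
for name in ("ha", "hb", "hc", "hd", "he"):
    h_init = (h_init << 32) | H_INIT[name]
```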
Message Split
In this step (line 8 of Algorithm 1), the message $z_i$ is split into $L_i = Z_i / M$ blocks of $M = 512$ bits, that is,
$$z_i = [b_0 \; b_1 \; \cdots \; b_{L_i - 1}],$$
where each j-th block is associated with the i-th message. The j-th block, $b_j$, can also be represented as
$$b_j = [u_j[0] \; u_j[1] \; \cdots \; u_j[15]],$$
where each $u_j[k]$, for $k = 0, \ldots, 15$, is a 32-bit message.
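A minimal sketch of the split, assuming $z_i$ is held as a byte string whose length is a multiple of 64 bytes; each block is returned as its 16 big-endian 32-bit words $u_j[0..15]$. The name `split_blocks` is hypothetical.

```python
def split_blocks(z: bytes) -> list[list[int]]:
    """Split z_i into L_i blocks b_j of 16 big-endian 32-bit words u_j[k]."""
    if len(z) % 64:
        raise ValueError("z_i must be a multiple of 512 bits")
    blocks = []
    for j in range(len(z) // 64):               # L_i = Z_i / M blocks
        b = z[64 * j : 64 * (j + 1)]            # b_j: 512 bits
        u = [int.from_bytes(b[4 * k : 4 * k + 4], "big") for k in range(16)]
        blocks.append(u)
    return blocks
```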
H(n) Hash Variables Initialization
The SHA-1 algorithm has five 32-bit variables, called A(n), B(n), C(n), D(n) and E(n), that are updated during the iterations of the algorithm; each one is identified in this work as a 32-bit vector.
Together, these five variables form a 160-bit vector identified as
$$H(n) = [A(n) \; B(n) \; C(n) \; D(n) \; E(n)].$$
These variables are initialized at instant $n = -1$ (line 10 of Algorithm 1) with the current hash code, that is, $H(-1) = h_i$.
w(n) Variable Calculation
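In SHA-1, 80 iterations are executed for each block $b_j$ (Algorithm 1, line 11), so a valid output $h_i$ for the i-th message is only available after all of its blocks have been processed. In each n-th iteration, a variable $w(n)$ is calculated, expressed as
$$w(n) = \begin{cases} u_j[n], & 0 \le n \le 15,\\ \mathrm{lr}\big(w(n-3) \oplus w(n-8) \oplus w(n-14) \oplus w(n-16),\, 1\big), & 16 \le n \le 79, \end{cases}$$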
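where ⊕ is the exclusive OR operation and lr(r, s) is the left-rotate function, which rotates the 32-bit word r by s positions and is expressed as
$$\mathrm{lr}(r, s) = (r \ll s) \vee (r \gg (32 - s)),$$
where ∨, ≪ and ≫ are the bitwise OR, left shift and right shift operations, respectively.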
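The recurrence for $w(n)$ and the left rotation can be sketched in software as below; results are masked to 32 bits to mimic the fixed-width hardware words. The names `lr` and `message_schedule` are illustrative.

```python
MASK32 = 0xFFFFFFFF

def lr(r: int, s: int) -> int:
    """32-bit left rotation of r by s positions: (r << s) | (r >> (32 - s))."""
    return ((r << s) | (r >> (32 - s))) & MASK32

def message_schedule(u: list[int]) -> list[int]:
    """w(0..79) for one block b_j given its 16 words u_j[0..15]."""
    w = list(u)                     # w(n) = u_j[n] for 0 <= n <= 15
    for n in range(16, 80):
        w.append(lr(w[n - 3] ^ w[n - 8] ^ w[n - 14] ^ w[n - 16], 1))
    return w
```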
f(·) Function Calculation
In each n-th iteration of each j-th block $b_j$, a nonlinear function, f(·), is calculated from the hash variables B(n−1), C(n−1) and D(n−1).
The output of f(·) is stored in the vector f(n) (line 13 of Algorithm 1), expressed as
$$\gamma(n) = \big(B(n-1) \wedge C(n-1)\big) \vee \big(B(n-1) \wedge D(n-1)\big) \vee \big(C(n-1) \wedge D(n-1)\big) \tag{27}$$
and
where ¬ and ∧ denote the bitwise negation and bitwise AND operations, respectively.
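Only the majority function γ(n) survives in the text above, so the sketch below takes the per-interval choice of f(n) from FIPS 180-4 (choice for 0 to 19, parity for 20 to 39 and 60 to 79, majority for 40 to 59). The single function `f` is a hypothetical software stand-in for the selection done in hardware.

```python
MASK32 = 0xFFFFFFFF

def f(n: int, b: int, c: int, d: int) -> int:
    """SHA-1 nonlinear function for iteration n; b, c, d are B(n-1), C(n-1), D(n-1)."""
    if 0 <= n <= 19:                      # choice: (B AND C) OR (NOT B AND D)
        return (b & c) | (~b & MASK32 & d)
    if 20 <= n <= 39 or 60 <= n <= 79:    # parity: B XOR C XOR D
        return b ^ c ^ d
    if 40 <= n <= 59:                     # majority, gamma(n) in the text
        return (b & c) | (b & d) | (c & d)
    raise ValueError("n must be in 0..79")
```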
Hash Variables Update
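Also, in each n-th iteration of each j-th block $b_j$, the variables A(n), B(n), C(n), D(n) and E(n) are updated after the calculation of f(n) (line 14 of Algorithm 1). The update of these variables is represented by the following equations:
$$A(n) = \mathrm{lr}(A(n-1), 5) + f(n) + E(n-1) + k(n) + w(n),$$
$$B(n) = A(n-1), \quad C(n) = \mathrm{lr}(B(n-1), 30), \quad D(n) = C(n-1), \quad E(n) = D(n-1),$$
in which + denotes addition modulo $2^{32}$.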
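SHA-1 uses four 32-bit constants k(n), one for each interval of 20 iterations of each j-th block $b_j$, as specified by FIPS 180-4, NIST (2015):
$$k(n) = \begin{cases} \mathrm{5A827999}_{16}, & 0 \le n \le 19,\\ \mathrm{6ED9EBA1}_{16}, & 20 \le n \le 39,\\ \mathrm{8F1BBCDC}_{16}, & 40 \le n \le 59,\\ \mathrm{CA62C1D6}_{16}, & 60 \le n \le 79. \end{cases}$$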
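One iteration of the hash-variable update can be sketched as a pure function from H(n−1) to H(n), using the FIPS 180-4 constants k(n); the names `k`, `lr` and `iterate` are hypothetical, and f(n) and w(n) are passed in already computed, as they arrive from the other modules.

```python
MASK32 = 0xFFFFFFFF

def k(n: int) -> int:
    """SHA-1 constant k(n) for iteration n = 0..79 (FIPS 180-4)."""
    return (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)[n // 20]

def lr(r: int, s: int) -> int:
    """32-bit left rotation of r by s positions."""
    return ((r << s) | (r >> (32 - s))) & MASK32

def iterate(H: tuple, f_n: int, w_n: int, n: int) -> tuple:
    """One update of H(n) = (A, B, C, D, E) from H(n-1); + is modulo 2**32."""
    A, B, C, D, E = H
    A_new = (lr(A, 5) + f_n + E + k(n) + w_n) & MASK32
    return (A_new, A, lr(B, 30), C, D)
```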
Hash Code Update
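For each j-th block, $b_j$, SHA-1 executes 80 iterations, and at the end of every block the hash code is updated by modular addition, following the expressions:
$$ha \leftarrow ha + A(79), \quad hb \leftarrow hb + B(79), \quad hc \leftarrow hc + C(79), \quad hd \leftarrow hd + D(79), \quad he \leftarrow he + E(79).$$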
So, for every i-th message, $m_i$, the value of the associated hash code, $h_i$, is found after
$$N_i = 80 \, L_i$$
iterations, where $N_i$ is defined in this work as the total number of iterations for the calculation of the hash associated with the message $m_i$.
[…] and 6 of Algorithm 1, the control of the two loops (lines 7 and 11) and the initialization of the hash variables (A(n), B(n), C(n), D(n) and E(n)) for each j-th block, $b_j$, through the h0 signal.
Proposed Implementation
The CJ and CN blocks are $\log_2(L)$-bit and 7-bit counters, respectively. The CN counter controls the loop of line 11 of Algorithm 1, generating the signal n. The CJ counter is incremented by the counter CN
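The description of how CN drives CJ is cut off above. A cycle-level sketch under the assumption that CJ is incremented when CN wraps from 79 back to 0 is shown below; `run_counters` is a hypothetical name, and it returns the (j, n) pair seen at each step.

```python
def run_counters(L: int) -> list[tuple[int, int]]:
    """Cycle-level sketch of CN (7-bit, n = 0..79) driving CJ (block index j)."""
    cn, cj = 0, 0
    trace = []
    while cj < L:
        trace.append((cj, cn))
        if cn == 79:          # last iteration of the loop in line 11
            cn = 0
            cj += 1           # assumed: CN wrap-around increments CJ
        else:
            cn += 1
    return trace
```

Seven bits suffice for CN because n never exceeds 79 < 2^7.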
GF Module
The GF module implements the function f(·) described in subsection 3. The function selection in the GF-MUX multiplexer is controlled by the GV module through binary logic, with comparators and logic gates corresponding to each interval of n, having the following outputs, […] each one selecting a function f(n) based on the 7-bit counter of the CN module.
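The GV output list is missing above. Assuming one select line per 20-iteration interval (four one-hot outputs), the comparator logic can be modeled as follows; `gv_select` and the four-output layout are hypothetical.

```python
def gv_select(n: int) -> tuple[int, int, int, int]:
    """Hypothetical one-hot GV outputs for the GF-MUX, from the 7-bit counter n."""
    if not 0 <= n <= 79:
        raise ValueError("n must be in 0..79")
    # Comparator-style interval tests, one output per 20-iteration interval.
    return (int(n <= 19), int(20 <= n <= 39), int(40 <= n <= 59), int(60 <= n <= 79))
```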
GW Module
The GW module receives the 16 32-bit messages $u_j[n]$ at its input,
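The rest of the GW description is missing. One common way to compute $w(n)$ with only the 16 input words in storage is a circular buffer in which $w(n)$ overwrites $w(n-16)$; the sketch below models that organization as an assumption, not as the authors' circuit. The names `lr1` and `gw_stream` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def lr1(x: int) -> int:
    """1-bit left rotation of a 32-bit word."""
    return ((x << 1) | (x >> 31)) & MASK32

def gw_stream(u: list[int]):
    """Yield w(0..79) using a 16-word circular buffer (hypothetical GW organization)."""
    buf = list(u)              # before step n >= 16, buf[n % 16] holds w(n - 16)
    for n in range(80):
        if n >= 16:
            s = n % 16
            # w(n-3), w(n-8), w(n-14) are still in the buffer at their index mod 16.
            buf[s] = lr1(buf[(n - 3) % 16] ^ buf[(n - 8) % 16] ^ buf[(n - 14) % 16] ^ buf[s])
        yield buf[n % 16]
```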
$h_i$ Hash Processing
After the signals w(n), k(n) and f(n) are generated in each n-th iteration, together with the value E(n − 1), the 32-bit signals Z(n) and V(n) are calculated in parallel by the sum modules S1 and S2, and the results are then combined by S3 and S4. All sum modules used in the implementation are dedicated 32-bit circuits, which reduces both the processing time and the area occupied by the whole circuit. […] 42), the final value of the hash code, $h_i$, associated with the i-th message is obtained.
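The text does not state which operands feed S1 and S2. Assuming Z(n) = w(n) + k(n) and V(n) = f(n) + E(n − 1), with S3 and S4 adding Z(n), V(n) and lr(A(n − 1), 5), the adder tree can be sketched as below (all names hypothetical). Because addition modulo $2^{32}$ is associative, any pairing gives the same A(n); the tree only changes the depth from four chained adders to three levels.

```python
MASK32 = 0xFFFFFFFF

def add32(x: int, y: int) -> int:
    """32-bit adder: sum modulo 2**32 (carry out discarded)."""
    return (x + y) & MASK32

def adder_tree(w_n: int, k_n: int, f_n: int, e_prev: int, a_rot5: int) -> int:
    """Hypothetical pairing of the S1..S4 sum modules producing A(n)."""
    z = add32(w_n, k_n)        # S1: Z(n) (assumed operands)
    v = add32(f_n, e_prev)     # S2: V(n), evaluated in parallel with S1
    s3 = add32(z, v)           # S3
    return add32(s3, a_rot5)   # S4: adds lr(A(n-1), 5)
```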
The CO module concatenates the five 32-bit buses carrying the signals ha, hb, hc, hd and he and generates a serial signal with the hash code $h_i$.
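A behavioral sketch of the CO module: concatenating the five buses with ha in the most significant position, then emitting $h_i$ one bit at a time. The most-significant-bit-first serial order and the names `co_concat` and `co_serial` are assumptions.

```python
def co_concat(ha: int, hb: int, hc: int, hd: int, he: int) -> int:
    """Concatenate five 32-bit buses into the 160-bit hash code h_i (ha first)."""
    h = 0
    for word in (ha, hb, hc, hd, he):
        h = (h << 32) | (word & 0xFFFFFFFF)
    return h

def co_serial(h: int):
    """Emit h_i one bit per step, most significant bit first (assumed order)."""
    for k in range(159, -1, -1):
        yield (h >> k) & 1
```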
Results
Table 1 […] The proposal presented here uses several parallel SHA-1 modules to accelerate throughput, which is especially useful in brute-force password recovery, where a large number of hash codes must be generated. […] only an increment of less than 1 ns in $T_s$, which represents an increase of almost 32× in hash throughput.
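Based on Algorithm 1 and the architecture presented in Figure 1, 80 iterations are executed for every j-th block $b_j$ of $M = 512$ bits (Equation 42), so the throughput of the proposed hardware with $N_I$ parallel instances can be calculated as
$$R = \frac{N_I \, M}{80 \, T_s},$$
where $T_s$ is the time taken by one iteration.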
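Assuming the throughput relation $R = N_I M / (80\,T_s)$, with $T_s$ the duration of one iteration, the formula can be evaluated and inverted to recover the iteration time implied by a reported throughput. The sketch uses the figures quoted in this document (0.644 Gbps for one instance, 28.16 Gbps for 48 instances); the function names are hypothetical.

```python
M = 512        # bits per block b_j
ITER = 80      # iterations per block

def throughput_gbps(n_instances: int, ts_ns: float) -> float:
    """R = N_I * M / (80 * T_s); bits per ns is the same as Gbps."""
    return n_instances * M / (ITER * ts_ns)

def implied_ts_ns(n_instances: int, r_gbps: float) -> float:
    """Invert the formula: iteration time T_s (ns) for a reported throughput."""
    return n_instances * M / (ITER * r_gbps)

ts_1 = implied_ts_ns(1, 0.644)     # single instance
ts_48 = implied_ts_ns(48, 28.16)   # 48 instances
```

This gives about 9.94 ns for one instance and about 10.91 ns for 48 instances, an increase in $T_s$ just under 1 ns.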
Note that throughput values above 15 Gbps ($N_I = 32$ and $N_I = 48$) had not been published in the literature. A throughput of 28.16 Gbps is equivalent to recovering a completely unknown 6-digit numeric password by brute force in at most 20 ms, or a 6-character alphanumeric password (62 possibilities per character) from its hash code in at most 17.4 minutes.
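The brute-force figures can be checked with simple arithmetic, assuming each candidate password fits in a single 512-bit block, so the hardware evaluates $R / 512$ hashes per second. The function name `worst_case_seconds` is hypothetical.

```python
M = 512  # bits hashed per candidate (one block per short password, assumed)

def worst_case_seconds(alphabet: int, length: int, rate_gbps: float) -> float:
    """Time to try every password of the given length at the given hash throughput."""
    hashes_per_second = rate_gbps * 1e9 / M
    return alphabet ** length / hashes_per_second

numeric = worst_case_seconds(10, 6, 28.16)        # seconds, 10**6 candidates
alnum = worst_case_seconds(62, 6, 28.16) / 60     # minutes, 62**6 candidates
```

Under this assumption the arithmetic gives roughly 18 ms and 17.2 minutes, both within the maximum times quoted above.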
Conclusion
This work presented a hardware implementation of SHA-1. The proposed structure, an ASSP, was synthesized in an FPGA to validate the implemented circuit. All implementation details were presented and analyzed with respect to occupied area and processing time. The results, with a throughput of slightly more than 28 Gbps for 48 instances in a single FPGA, point to new possibilities for using hash algorithms in dedicated hardware for real-time and high-volume applications.
Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Finance Code 001.
