ABSTRACT In this paper, we propose a method for the efficient software implementation of the cryptographic hash function LSH with single instruction multiple data (SIMD). The method is based on word-wise permutations of LSH. Using the modified functions $\mathrm{Step}'_j = P \circ \mathrm{Step}_j \circ P^{-1}$ and $\mathrm{MsgExp}'$ instead of the original step function $\mathrm{Step}_j$ and message expansion function MsgExp, where P is a permutation and $P^{-1}$ is the inverse permutation of P, we show that the number of SIMD instructions needed to implement LSH is reduced. For efficient implementation of LSH in other environments (e.g., MIMD), various types of word permutations are listed.
I. INTRODUCTION
Cryptographic hash functions are necessary for constructing information security systems. Generally, they are used for authentication, providing both data integrity and entity integrity [1]-[4]. A cryptographic hash function is a function that maps an input to a fixed-length output and satisfies the following cryptographic resistance properties [5].
1. Preimage resistance: for essentially all pre-specified outputs, it is computationally infeasible to find an input that hashes to that output.
2. 2nd preimage resistance: it is computationally infeasible to find another input that hashes to the same output as a specified input.
3. Collision resistance: it is computationally infeasible to find any two distinct inputs that hash to the same output.
From the output of a cryptographic hash function, it should be computationally difficult to find the corresponding input. Additionally, for a given input, it should be difficult to find another input that hashes to the same output. Because of these properties, cryptographic hash functions are used in various constructions, such as message authentication codes (MACs), key derivation functions (KDFs), and pseudo-random number generators. LSH [6] is a cryptographic hash function that was designed by the National Security Research Institute (NSRI) [7]. SIMD [8] is a class of parallel computing in which the same operation is performed on multiple data simultaneously. A core element of SIMD is the register. SIMD registers come in various lengths, such as 128, 256, and 512 bits. A 128-bit register holds four 32-bit or two 64-bit data elements. Thus, one operation on a 128-bit register is equivalent to four 32-bit operations or two 64-bit operations. When implementing cryptographic algorithms, SIMD is used for various purposes such as resistance to side-channel attacks [9], [10] and efficient implementations [11]-[16]. There has not been any research on overcoming weakness using SIMD with a cipher. For these reasons, BLAKE2 [17] and LSH are cryptographic algorithms that are advantageous to implement with SIMD. BLAKE2, SIMON/SPECK [18], and LEA [19] were implemented using SIMD in [11]-[13]. However, to the best of our knowledge, our research is the first to obtain an efficient SIMD implementation by changing the representation of a cryptographic algorithm. In this paper, we show how to implement the cryptographic hash function LSH efficiently with SIMD by representing LSH using P and $P^{-1}$, where P is a permutation and $P^{-1}$ is the inverse of P. Note that complexity is measured as the number of SIMD instructions and their latency, not as the number of XOR operations and modular additions. This metric is necessary for finding conditions that reduce the number of SIMD instructions and favor SIMD instructions with low latency when an algorithm is implemented.

The associate editor coordinating the review of this manuscript and approving it for publication was Xiangxue Li.
For example, consider a permutation within a register composed of four 32-bit words. If a word-wise permutation is applied within a register, then only a single SIMD instruction is needed, ''_mm_shuffle_epi32''. However, if the word-wise permutation is the identity, then no SIMD instruction is needed. In another example, assuming that two 64-bit words compose a register, if a word-wise permutation mixes the words of two registers, then three SIMD instructions are needed: ''_mm_unpacklo_epi64'', ''_mm_unpackhi_epi64'', and ''_mm_shuffle_epi32''. The instruction ''_mm_unpacklo_epi64'' extracts the left (low) 64-bit word of each of the two registers, and ''_mm_unpackhi_epi64'' extracts the right (high) 64-bit word of each of the two registers. Additionally, the instruction ''_mm_shuffle_epi32'' is used to permute in 32-bit units.
If each of the two 64-bit words remains in its original register, then ''_mm_unpacklo_epi64'' and ''_mm_unpackhi_epi64'' are not needed for a word-wise permutation. In this paper, we find conditions that reduce the number of SIMD instructions and allow SIMD instructions with lower latency to be used.
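The instruction counts in the two examples above can be checked against a small software model. The following Python sketch is only an illustration: the helper functions model the lane semantics of the three intrinsics on Python lists (index 0 is the lowest lane), the helper `swap64` stands for ''_mm_shuffle_epi32'' with the immediate 0x4E, and the register contents are made-up labels.

```python
def shuffle_epi32(a, imm8):
    """Model of _mm_shuffle_epi32 on four 32-bit lanes:
    dst[i] = a[(imm8 >> 2*i) & 3]."""
    return [a[(imm8 >> (2 * i)) & 3] for i in range(4)]

def unpacklo_epi64(a, b):
    """Model of _mm_unpacklo_epi64 on two 64-bit lanes: low words of a and b."""
    return [a[0], b[0]]

def unpackhi_epi64(a, b):
    """Model of _mm_unpackhi_epi64 on two 64-bit lanes: high words of a and b."""
    return [a[1], b[1]]

def swap64(a):
    """Swap the two 64-bit lanes (what _mm_shuffle_epi32(a, 0x4E) does)."""
    return [a[1], a[0]]

# Case 1: four 32-bit words in one register, word permutation (2 0 1 3).
# One instruction is enough.
r = ["w0", "w1", "w2", "w3"]
imm = 2 | (0 << 2) | (1 << 4) | (3 << 6)      # = 0xD2
print(shuffle_epi32(r, imm))                  # ['w2', 'w0', 'w1', 'w3']

# Case 2: two 64-bit words per register, permutation (0 3 | 1 2) mixes the
# words of two registers. Three instructions are used.
r0, r1 = ["w0", "w1"], ["w2", "w3"]
s = swap64(r1)                                # shuffle
out0 = unpacklo_epi64(r0, s)                  # unpacklo
out1 = unpackhi_epi64(r0, s)                  # unpackhi
print(out0, out1)                             # ['w0', 'w3'] ['w1', 'w2']
```

In the second case, a target such as (0 1 | 2 3) or (2 3 | 0 1) would need none of these calls, since each output register is already an input register.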
This paper is organized as follows. Section II shows the specifications of LSH. Section III provides an efficient implementation method via SIMD for LSH. We demonstrate the method and permutation conditions needed for efficient implementation. All permutations are categorized considering those conditions. In Section IV, we show an optimal permutation for the best performance with LSH. Concluding remarks on the implementation performance with an optimal permutation are given in Section V.
II. SPECIFICATION OF LSH
In 2014, the hash function LSH was published by D. Kim et al. at the International Conference on Information Security and Cryptology; it was designed specifically to enhance software efficiency [6]. LSH was designed using the wide-pipe Merkle-Damgård construction (wide-pipe MD construction) [20]. The design of the compression function for LSH is based on ARX (Addition (⊞), Rotation (≪), and XOR (⊕)) [21]. The following describes the wide-pipe MD construction and compression function of LSH.
As shown in Fig. 1, the length of an internal state in a wide-pipe MD construction is 2n bits, which is twice the length of the n-bit output. Let w be the number of bits in a word. LSH-8w-n represents any of the following LSHs: LSH-256-224, LSH-256-256, LSH-512-224, LSH-512-256, LSH-512-384, and LSH-512-512. Each has a different initialization value IV. The method for generating IV is given in [6].
The structure of the compression function f in an LSH is ARX-based. The number of bits for the input of f is 48w, and that of the output is 16w. The compression function f transforms 16- 
$M^{(i)}$: The i-th 32-word array message block.
$M^{(i)}_j = (M^{(i)}_j[0], \ldots, M^{(i)}_j[15])$: The j-th 16-word array sub-message generated from the i-th message $M^{(i)}$.
$SC_j = (SC_j[0], \ldots, SC_j[7])$: The j-th 8-word array step constant.
$T = (T[0], \ldots, T[15])$: The 16-word array temporary variable used in a step function.
$P$: A word-wise permutation on 16 words.
$P_i$: A word-wise permutation on 4 words.
Notice that we define a permutation P that has the same format for the input and output. If the input is an index i, then P(i) is also an index. Similarly, if the input is a word
B. MsgExp FUNCTION
The first two sub-messages M 
Here, the permutation τ is defined by Table 1 . 
C. Step j FUNCTION
Step j is used N s times repeatedly in the compression function f .
Step j is composed of three functions MsgAdd, Mix j , and σ as
MsgAdd: Mix j,l : W 2 → W 2 for inputs X and Y is defined by (5). Here, the bit rotational amounts α j , β j , γ j used in Mix j,l are shown in Table 2 .
The word permutation function σ permutes 16 words. The permutation σ is defined in Table 3, and the permutation of the internal state is as follows:
III. A FAST IMPLEMENTATION METHOD
In this section, we demonstrate the method in [22] of how to implement LSH efficiently using a word permutation and its inverse with repeated step functions.
A. OVERVIEW OF WORD-WISE PERMUTATION
In this subsection, we investigate the relation between a permutation P and the compression function of LSH. We also show how to construct $\mathrm{LSH}'$, which is a modified representation of LSH using the permutation P. To represent $\mathrm{LSH}'$, we investigate the relation between the permutation P and the operations in the step function of LSH. For a permutation P and its inverse $P^{-1}$, we consider the function $\mathrm{LSH}'$ as in Fig. 2. Note that $\mathrm{LSH}'$ takes a message permuted by P as an input and outputs the hash value of LSH permuted by P. Thus, the output of LSH is equal to the output of $\mathrm{LSH}'$ permuted by $P^{-1}$.
We define $\mathrm{LSH}'$ as the following:
Our goal is to implement $\mathrm{LSH}'$ efficiently using fewer SIMD instructions with low latency. The following are the basic criteria on P for $\mathrm{LSH}'$.
Criterion 1: P is a word-wise permutation.
If P is not a word-wise permutation, then it is necessary to consider implementing modular addition, which increases the number of SIMD instructions. Thus, we do not need to consider permutations other than word-wise permutations.
Criterion 2: 
2) Left rotation by α j , β j and a permutation P are commutative with respect to each other.
3) The following relation holds for the left rotation by γ j and permutation P.
In LSH , there are steps:
Then, for $T[P(i)] \boxplus T[P(i + 8)]$, the following equality holds:
For ⊕, the equation can be proved similarly.
Proof 2):
Proof 3): For γ l , l = i − 8, thus when i is permuted to P(i), l is also permuted to P(l). Then the following holds:
By Theorem 1, a word-wise permutation P commutes with the operations of LSH. That is, considering the rotation by γ and letting $\gamma' := (\gamma_{P^{-1}(0)}, \ldots, \gamma_{P^{-1}(7)})$, a permutation by P after a left rotation by γ is equal to a left rotation by $\gamma'$ after a permutation by P. Let the compression function $f'$ used in $\mathrm{LSH}'$ be defined as follows:
Then by substituting (14) into (7), we have
The compression function f consists of the message expansion MsgExp and the step function $\mathrm{Step}_j$. Similarly to (14), $\mathrm{MsgExp}'$ and $\mathrm{Step}'_j$ are represented as follows:
Similarly to (15) ,
Step
As in (15), (18), and (19), all permutations P and $P^{-1}$ cancel out to the identity permutation except for the first $P^{-1}$ and the last P. The following provides the details of constructing $\mathrm{MsgExp}'$ and $\mathrm{Step}'_j$. Using (1), the function 
Let $\tau'$ be
The function $\mathrm{MsgExp}'$ is represented with $\tau'$ as follows:
By using (2), the function $\mathrm{Step}'_j$ in (17) is represented as follows:
• P)
where
In (23), $\mathrm{MsgAdd}' = \mathrm{MsgAdd}$. In $\mathrm{MsgAdd}'$, the words are permuted by $P^{-1}$, and the added message is also permuted by $P^{-1}$ in $\mathrm{MsgExp}'$. Thus, $\mathrm{MsgAdd}'$ is a function of modular addition between words in the same ordering as in the function MsgAdd. In $\mathrm{Mix}'_j$, the modular additions of step constants and the left rotation by $\gamma_l$ are affected by the permutation P. Therefore, step constants word-wise permuted by $P^{-1}$ and a left rotation by $\gamma_{P^{-1}(l)}$ should be used in $\mathrm{Mix}'_j$. Further, $\sigma'$ is a word-wise permutation, since P, $P^{-1}$, and σ are all word-wise permutations. Consequently, $\mathrm{MsgAdd}'$ is the same as MsgAdd, and $\mathrm{Mix}'_j$ is the same as $\mathrm{Mix}_j$ except for the values of the left rotations and step constants. This implies that they do not affect the performance when they are implemented with SIMD. However, the word-wise permutations $\tau'$ in $\mathrm{MsgExp}'$ and $\sigma'$ in $\mathrm{Step}'_j$ do affect the performance when implemented by SIMD instructions.
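The cancellation of P and $P^{-1}$ across repeated steps can be illustrated numerically. The sketch below is a toy model under stated assumptions, not the LSH step function: `toy_step` is a hypothetical ARX-style step with 16 step constants and 16 rotation amounts, all values are random, and a permutation is stored as an index list with the convention out[k] = x[p[k]]. In this model the primed message, constants, and rotation amounts are obtained by pushing the corresponding arrays through the same word permutation as the state; whether P or $P^{-1}$ appears in an index depends on the indexing convention.

```python
import random

W = 32                      # toy word size in bits
MASK = (1 << W) - 1

def rotl(x, r):
    """Rotate a W-bit word left by r bits."""
    r %= W
    return ((x << r) | (x >> (W - r))) & MASK

def apply(p, x):
    """Word-wise permutation, gather convention: out[k] = x[p[k]]."""
    return [x[i] for i in p]

def inverse(p):
    q = [0] * len(p)
    for k, i in enumerate(p):
        q[i] = k
    return q

def toy_step(x, m, sc, gamma, sigma):
    """Toy step: word-wise message addition, constant addition with a
    per-word rotation, then a word permutation."""
    t = [(a + b) & MASK for a, b in zip(x, m)]
    t = [rotl((a + c) & MASK, g) for a, c, g in zip(t, sc, gamma)]
    return apply(sigma, t)

def check_conjugation(rounds=4, seed=1, n=16):
    rng = random.Random(seed)
    p = rng.sample(range(n), n)                       # random word permutation P
    p_inv = inverse(p)
    sigma = rng.sample(range(n), n)
    sigma_c = [p_inv[sigma[p[k]]] for k in range(n)]  # index list of P o sigma o P^-1
    gamma = [rng.randrange(W) for _ in range(n)]
    x = [rng.getrandbits(W) for _ in range(n)]        # original state
    y = apply(p, x)                                   # permuted state
    for _ in range(rounds):
        m = [rng.getrandbits(W) for _ in range(n)]
        sc = [rng.getrandbits(W) for _ in range(n)]
        x = toy_step(x, m, sc, gamma, sigma)
        y = toy_step(y, apply(p, m), apply(p, sc), apply(p, gamma), sigma_c)
    # Only the outer permutation survives: P(state) equals the primed state.
    return apply(p, x) == y

print(check_conjugation())  # True
```

Only the word permutation changes shape under conjugation; the additions and rotations keep their form and merely use reordered operands.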
B. TYPES OF PERMUTATIONS FOR LSH
In this subsection, we investigate the conditions on P that improve the SIMD performance of $\tau'$ and $\sigma'$. Both of the permutations τ and σ permute groups of four words simultaneously, in the same order, as follows. The permutations τ and σ are represented by the composition of two permutations: internal permutations of four words and an external permutation of four-word arrays. Note that, in Fig. 3 and 4, four-word arrays are permuted to four-word arrays, regardless of the ordering of the four words within them.
To represent simply a permutation $P = (p_0\, p_1 \cdots p_{15})$ that acts on a word array $(w_0\, w_1 \cdots w_{15})$, we use an external permutation of the four-word arrays together with internal permutations of the words inside them. An internal permutation $P_i = (b_0\, b_1\, b_2\, b_3)$, $b_i \in \{0, 1, 2, 3\}$, is a permutation that permutes 4 words $(w_{4j}, w_{4j+1}, w_{4j+2}, w_{4j+3})$ into 4 words $(w_{4j+b_0}, w_{4j+b_1}, w_{4j+b_2}, w_{4j+b_3})$. Using an external permutation $(a_0, a_1, a_2, a_3)$ and four internal permutations, a permutation P is represented as $(P_{a_0}, P_{a_1}, P_{a_2}, P_{a_3})$. As an example of the notation, the permutation σ of $\mathrm{Step}_j$ is shown below.
In Fig. 5, the above four internal permutations are represented by $\sigma_0 = \sigma_1 = (2013)$ and $\sigma_2 = \sigma_3 = (0321)$. The external permutation is represented by (1, 3, 0, 2). Therefore, σ is represented by $\sigma = (\sigma_1, \sigma_3, \sigma_0, \sigma_2)$. Similarly, $\tau = (\tau_0, \tau_1, \tau_2, \tau_3)$, where $\tau_0 = \tau_2 = (3201)$ and $\tau_1 = \tau_3 = (3012)$.

Recall that SIMD instructions operate on a register or among registers rather than on individual words. Thus, the words in a register should be processed by the same instructions. If only some of the words in a register need to be processed, then SIMD instructions cannot be applied to the register as a whole, and the register must be split. Composing and decomposing registers is done through SIMD instructions such as ''load'' and ''store''. Therefore, to reduce the number of SIMD instructions, the words in a register need to be permuted simultaneously. Note that consecutive groups of four words are permuted by τ and σ simultaneously, and the same operations are applied to those consecutive groups of four words. It is therefore enough to consider permutations consisting of external and internal permutations.

An external permutation in P provides no advantage for reducing the number of SIMD instructions; thus, we fix the external permutation of P as the identity permutation. By Criterion 2, the permutation of the last eight words is determined by the permutation of the first eight words. Therefore, the total number of permutations we should consider is $(4!)^2 = 576$, for two internal permutations. To find the optimal permutation P among the 576 internal permutation candidates, we define five types of permutations with respect to the following forms. Four are defined for a single internal permutation, and the other is defined for a pair of internal permutations.
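The external/internal notation can be made concrete with a short sketch. The helper `compose16` below is hypothetical and encodes one reading of the notation $(P_{a_0}, P_{a_1}, P_{a_2}, P_{a_3})$, namely that output block j is taken from input block $a_j$ and reordered by the internal permutation attached to that input block. The internal permutations are the values quoted in the text; the resulting 16-word index lists use the convention out[k] = in[perm[k]].

```python
from math import factorial

def compose16(external, internals):
    """Build the 16-word index list of P = (P_a0, P_a1, P_a2, P_a3).

    external  : (a0, a1, a2, a3), source block of each output block
    internals : internals[i] is the internal permutation P_i of source block i
    """
    out = []
    for a in external:
        out.extend(4 * a + b for b in internals[a])
    return out

# sigma = (sigma_1, sigma_3, sigma_0, sigma_2), external permutation (1, 3, 0, 2)
sigma = compose16((1, 3, 0, 2),
                  [(2, 0, 1, 3), (2, 0, 1, 3), (0, 3, 2, 1), (0, 3, 2, 1)])

# tau = (tau_0, tau_1, tau_2, tau_3), identity external permutation
tau = compose16((0, 1, 2, 3),
                [(3, 2, 0, 1), (3, 0, 1, 2), (3, 2, 0, 1), (3, 0, 1, 2)])

print(sigma)  # [6, 4, 5, 7, 12, 15, 14, 13, 2, 0, 1, 3, 8, 11, 10, 9]
print(tau)    # [3, 2, 0, 1, 7, 4, 5, 6, 11, 10, 8, 9, 15, 12, 13, 14]

# Size of the search space: two free internal permutations of four words each.
print(factorial(4) ** 2)  # 576
```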
Definition 1: We define TYPE i for i = 1, . . . , 5 as a form of permutation in S i as follows. Note that * is an integer in {0, 1, 2, 3}. 
The following examples are helpful for understanding the above TYPEs. For example, $P_i = (0132)$ has one TYPE1 form, (01 * *) in $S_1$, and one TYPE2 form, (* * 32) in $S_2$. Additionally, $P_i = (3102)$ has two TYPE3 forms, (* * 02) and (31 * *) in $S_3$, and none of TYPE1, TYPE2, or TYPE4. For $(P_1, P_2) = ((1032), (1032))$, $P_1$ and $P_2$ are the same internal permutation (1032); thus, the pair is TYPE5.
We find a good permutation by counting the number of TYPEs in $\tau'$ and $\sigma'$. Note that an internal permutation always contains exactly two forms among TYPE1 to TYPE4 in total. TYPE1 describes an internal permutation in which two consecutive words keep the same ordering. In LSH-512, since the word size is 64 bits and the register size is 128 bits, a register holds two 64-bit words. If $\tau'$ or $\sigma'$ has the form of TYPE1, then no SIMD instruction is needed to permute the register positions '01' or '23'. For TYPE2, one SIMD instruction, ''_mm_shuffle_epi32'', is needed to change the word positions within a register. For TYPE3, '02', '20', '13', and '31' are composed of words in different registers. Thus, those groupings require the SIMD instruction ''_mm_unpacklo_epi64'' or ''_mm_unpackhi_epi64''. For TYPE4, '03', '30', '12', and '21' are composed of words in different registers and different word positions (left or right). Thus, two SIMD instructions are needed: ''_mm_shuffle_epi32'' to change word positions, and one of ''_mm_unpacklo_epi64'' and ''_mm_unpackhi_epi64''. Note that TYPE1 and TYPE2 do not share an internal permutation. Comparing the numbers of SIMD instructions for internal permutations, TYPE1 and TYPE2 are better than TYPE3 and TYPE4, and TYPE1 is the best.
TYPE5 is used for 256-bit registers such as those of AVX2. For LSH-256, a register holds eight words. If the two internal permutations in a register are equal, the permutation can be performed using ''_mm256_shuffle_epi32'' instead of ''_mm256_permutevar8x32_ps''. The latter SIMD instruction has three times the latency of the former. Thus, using the former gives better latency.
TYPE1-5 are defined with an SIMD instruction set in mind. Because there are many SIMD environments, anyone who wants to use LSH with a specific SIMD instruction set can use the above TYPEs to choose a permutation. All types of permutations considering TYPE1-5 are listed in Appendix A.
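The pair classification described above can be written as a lookup table. This is a sketch under assumptions: the table is inferred from the examples and the '01', '02', '03' listings in the text (the formal sets $S_i$ of Definition 1 are not reproduced here), the forms '10' and '32' are taken to be TYPE2 by symmetry with (* * 32), and `COST` records the instruction counts stated for 128-bit registers holding two 64-bit words.

```python
# TYPE of each ordered pair of word indices inside a four-word group.
TYPE_OF_PAIR = {
    "01": 1, "23": 1,                        # same register, same order
    "10": 2, "32": 2,                        # same register, swapped
    "02": 3, "20": 3, "13": 3, "31": 3,      # two registers, same word position
    "03": 4, "30": 4, "12": 4, "21": 4,      # two registers, different positions
}

# SIMD instructions per pair: none, shuffle, unpack, shuffle + unpack.
COST = {1: 0, 2: 1, 3: 1, 4: 2}

def pair_types(internal):
    """Return the TYPEs of the left pair and the right pair of (b0 b1 b2 b3)."""
    s = "".join(str(b) for b in internal)
    return TYPE_OF_PAIR[s[:2]], TYPE_OF_PAIR[s[2:]]

def instruction_count(internal):
    """Instructions needed for one internal permutation under the COST table."""
    return sum(COST[t] for t in pair_types(internal))

print(pair_types((0, 1, 3, 2)))         # (1, 2)
print(pair_types((3, 1, 0, 2)))         # (3, 3)
print(instruction_count((0, 1, 2, 3)))  # 0
print(instruction_count((0, 3, 2, 1)))  # 4
```

Every internal permutation yields exactly two entries, which matches the remark that the TYPE1 to TYPE4 counts of an internal permutation always sum to two.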
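The TYPE5 condition can be seen in a list model of the two AVX2 intrinsics. The functions below are illustrative models only (index 0 is the lowest 32-bit lane), and `is_type5` is a hypothetical helper name: ''_mm256_shuffle_epi32'' applies one four-word selector to both 128-bit halves, so it can realize an eight-word permutation only when the two internal permutations coincide.

```python
def shuffle_epi32_256(a, imm8):
    """Model of _mm256_shuffle_epi32: the same 4-word selector is applied
    inside each 128-bit half of the 8-word register."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [a[4 * half + s] for half in range(2) for s in sel]

def permutevar8x32(a, idx):
    """Model of _mm256_permutevar8x32_ps: dst[i] = a[idx[i] & 7]."""
    return [a[i & 7] for i in idx]

def is_type5(p_lo, p_hi):
    """TYPE5: both internal permutations held in one register are equal."""
    return tuple(p_lo) == tuple(p_hi)

def imm_of(internal):
    """Encode an internal permutation (b0 b1 b2 b3) as a shuffle immediate."""
    return sum(b << (2 * i) for i, b in enumerate(internal))

reg = [f"w{i}" for i in range(8)]
p1 = p2 = (1, 0, 3, 2)                       # the TYPE5 example from the text
idx = list(p1) + [4 + b for b in p2]         # general 8-word index vector

print(is_type5(p1, p2))                      # True
print(hex(imm_of(p1)))                       # 0xb1
print(shuffle_epi32_256(reg, imm_of(p1)) == permutevar8x32(reg, idx))  # True
```

When the two internal permutations differ, no single immediate reproduces the index vector, and the general cross-lane permute is the fallback.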
IV. PERMUTATION WITH THE BEST PERFORMANCE FOR LSH USING SIMD
In the previous sections, we have shown how to find optimal permutations $\tau'$ and $\sigma'$ using TYPEs. The optimal permutation is selected by TYPE1-4, except for LSH-256 with AVX2. For the case of LSH-256 with AVX2, since there are two internal permutations in a register, the optimal permutation is selected by TYPE5.
A. THE PERMUTATION P FOR LSH WITH SSE2, SSSE3, XOP, AND LSH-512 WITH AVX2
The permutation P and its inverse $P^{-1}$ for SSE2, SSSE3, XOP, and AVX2 are described in Fig. 6. For AVX2, the permutations in Fig. 6 are applied only to LSH-512, by the above argument.

VOLUME 7, 2019
The permutation P is (0123, 2013, 0123, 2013), which is represented by the symbol (A,M) in Appendix A. If P is used for τ, then $\tau'$ has four TYPE1 and four TYPE2. If P is used for σ, then $\sigma'$ has three TYPE1, one TYPE2, and four TYPE3. Note that τ has two TYPE1, two TYPE2, and four TYPE4, and σ has four TYPE3 and four TYPE4. $\tau' = P \circ \tau \circ P^{-1}$ is described in Fig. 7, and $\sigma'$ is described in Fig. 8.
Four words are assigned to one register, as shown in Fig. 7 and Fig. 8. In this case, the number of SIMD operations for $\tau'$ is not reduced. However, there is a reduction for $\sigma'$: its second internal permutation is the identity permutation, so no SIMD instruction is needed for it.
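The TYPE counts quoted above can be reproduced with a few lines of Python. This is a sketch under assumptions: permutations are stored as index lists with out[k] = in[perm[k]], the 16-word lists for τ and σ are written out from the internal and external permutations given in Section III-B, the pair table is inferred from the TYPE descriptions there, and the conjugate $P \circ q \circ P^{-1}$ is computed as `p_inv[q[p[k]]]` under that convention. The function names are hypothetical.

```python
from collections import Counter

PAIR_TYPE = {(0, 1): 1, (2, 3): 1, (1, 0): 2, (3, 2): 2,
             (0, 2): 3, (2, 0): 3, (1, 3): 3, (3, 1): 3,
             (0, 3): 4, (3, 0): 4, (1, 2): 4, (2, 1): 4}

def expand(internals):
    """16-word index list for internal permutations with identity external."""
    return [4 * j + b for j, blk in enumerate(internals) for b in blk]

def inverse(p):
    q = [0] * len(p)
    for k, i in enumerate(p):
        q[i] = k
    return q

def conjugate(p, q):
    """Index list of P o q o P^-1 (apply P^-1 first, P last)."""
    p_inv = inverse(p)
    return [p_inv[q[p[k]]] for k in range(16)]

def type_counts(perm):
    """Count TYPE1..TYPE4 over the eight word pairs of a 16-word permutation."""
    c = Counter(PAIR_TYPE[(perm[k] % 4, perm[k + 1] % 4)]
                for k in range(0, 16, 2))
    return dict(sorted(c.items()))

tau = [3, 2, 0, 1, 7, 4, 5, 6, 11, 10, 8, 9, 15, 12, 13, 14]
sigma = [6, 4, 5, 7, 12, 15, 14, 13, 2, 0, 1, 3, 8, 11, 10, 9]
P = expand([(0, 1, 2, 3), (2, 0, 1, 3), (0, 1, 2, 3), (2, 0, 1, 3)])  # (A,M)

print(type_counts(tau))                   # {1: 2, 2: 2, 4: 4}
print(type_counts(sigma))                 # {3: 4, 4: 4}
print(type_counts(conjugate(P, tau)))     # {1: 4, 2: 4}
print(type_counts(conjugate(P, sigma)))   # {1: 3, 2: 1, 3: 4}
```

Under the per-pair instruction counts for 128-bit registers with two 64-bit words, this moves pairs of σ from the two-instruction TYPE4 into the cheaper types.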
B. THE PERMUTATION P FOR LSH-256 WITH AVX2
AVX2 uses 256-bit registers. Thus, a register contains eight words. Under $\sigma'$, the left half and the right half of a register are sent to two different registers. Thus, there is no advantage to using $\sigma'$ instead of σ. However, if $\tau'$ is of TYPE5, then we can implement $\tau'$ with ''_mm256_shuffle_epi32'' instead of ''_mm256_permutevar8x32_ps''. This reduces the latency to one third. The permutations P and $P^{-1}$ used to implement LSH-256 using AVX2 are shown in Fig. 9. This permutation P = (0132, 3120, 0132, 3120) is described as (B,V) in Appendix A. $\tau'$ and $\sigma'$ are described in Fig. 10 and Fig. 11. Fig. 10 shows that the permutations of the eight words in a register are of TYPE5. Source code for a modified LSH implementation using (A,M) and (B,V) is available on GitHub [23].
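The TYPE5 property of $\tau'$ for (B,V) can be checked in the same style. As before, this is a sketch with assumed conventions: index lists use out[k] = in[perm[k]], τ is written out from its internal permutations (3201) and (3012), and the helper names are hypothetical.

```python
def expand(internals):
    """16-word index list for internal permutations with identity external."""
    return [4 * j + b for j, blk in enumerate(internals) for b in blk]

def inverse(p):
    q = [0] * len(p)
    for k, i in enumerate(p):
        q[i] = k
    return q

def conjugate(p, q):
    """Index list of P o q o P^-1 (apply P^-1 first, P last)."""
    p_inv = inverse(p)
    return [p_inv[q[p[k]]] for k in range(16)]

def internals_of(perm):
    """Split a block-local 16-word permutation into four internal permutations."""
    assert all(perm[k] // 4 == k // 4 for k in range(16)), "not block-local"
    return [tuple(v % 4 for v in perm[4 * j:4 * j + 4]) for j in range(4)]

def is_type5_per_register(perm):
    """With eight words per 256-bit register, blocks (0,1) and (2,3) must match."""
    b = internals_of(perm)
    return b[0] == b[1] and b[2] == b[3]

tau = [3, 2, 0, 1, 7, 4, 5, 6, 11, 10, 8, 9, 15, 12, 13, 14]
P = expand([(0, 1, 3, 2), (3, 1, 2, 0), (0, 1, 3, 2), (3, 1, 2, 0)])  # (B,V)
tau_c = conjugate(P, tau)

print(internals_of(tau))             # [(3, 2, 0, 1), (3, 0, 1, 2), (3, 2, 0, 1), (3, 0, 1, 2)]
print(is_type5_per_register(tau))    # False
print(internals_of(tau_c))           # [(2, 3, 1, 0), (2, 3, 1, 0), (2, 3, 1, 0), (2, 3, 1, 0)]
print(is_type5_per_register(tau_c))  # True
```

Since all four internal permutations of the conjugated τ coincide, one shuffle immediate serves both halves of each register.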
V. CONCLUSION
We have shown how to implement LSH efficiently with SIMD using permutations. The performance results are summarized in Tables 4 and 5. On average, there is a 5% performance improvement when using our permutation method. Note that there is an additional 5% performance improvement after applying other optimization methods, such as the arrangement of SIMD instructions; the gain is limited because the code in [7] has already been optimized by the creators of LSH. No security vulnerability is introduced by the proposed method, since the only changes are in the ordering of words, the rotation amounts, and the step constants in registers. Furthermore, we have defined five permutation types for LSH and SIMD, which classify all permutations of LSH. This classification can be used to implement LSH with a new SIMD instruction set for various register sizes or platforms.
APPENDIX

A. PERMUTATIONS
An internal permutation is represented using the English alphabet for readability. This representation is as shown in Table 6 .
The different types of permutations are defined in Section III-B. In Table 7, each pair in the first and second columns represents the number of TYPE1 and TYPE2 permutations. Similarly, in Table 8, each pair in the first and second columns represents the number of TYPE3 and TYPE4 permutations. In Table 9, the first column represents the number of TYPE5 permutations in $\tau'$ and $\sigma'$, respectively.

DONGYEONG KIM received the B.S. degree in mathematics from Hanyang University, Seoul, South Korea, in 2013, where he is currently pursuing the Ph.D. degree in mathematics, under the supervision of Prof. J. Song. His research interests include cryptanalysis of symmetric-key cryptography, channel coding theory, and post-quantum cryptography.
