Abstract-This paper describes a software optimization of the stream cipher ChaCha. We leverage the wide vectorization capabilities of the new AVX2 architecture to speed up ChaCha encryption (and decryption) on the latest x86_64 processors. In addition, we show how to apply vectorization for the future AVX512 architecture to obtain a further speedup. For example, on the latest Intel Haswell microarchitecture, our AVX2 implementation performs at 1.43 cycles per byte (on a 4 KB message), which is ~2x faster than the current implementation in the Chromium project.
I. INTRODUCTION
Secure communication on the internet requires that the communication endpoints use different cryptographic primitives to establish a protected channel, and a common protocol to apply these primitives. The leading protocol for secure communication specifications is TLS [1] . TLS supports a diversity of public key algorithms for establishing a symmetric session key for two communication endpoints, and a variety of symmetric ciphers and MAC algorithms for the subsequent encrypted and authenticated communication. The performance of these primitives is crucial for efficient communication.
Currently [2], the most popular ciphers of the TLS protocol are RC4 and AES-CBC (with some hash-based authentication), but both have been affected by attacks ([3], [4]). The AES-CBC issues have been fixed in TLS 1.1, but the fix is complex. The RC4 cipher is perceived as insecure. These problems with the existing ciphers create a strong motivation for developing and using new cipher suites.
A very promising alternative is the AES-GCM [5] authenticated cipher, whose software implementation has been extensively optimized [6]. This, however, implies that two secure cipher alternatives (in TLS) are based on the same cryptographic primitive, AES, and past experience shows that this can be problematic. Consequently, it is useful to have a choice between different cryptographic primitives for the same purpose, and this is where ChaCha [7] and Poly1305 [8] become important. They are secure, relatively fast, and already have high quality public domain implementations. They are also naturally "constant time", and have nearly perfect key agility. These properties led to the newly proposed TLS draft [9], which includes ChaCha20 as a stream cipher and Poly1305 as the authenticator.
ChaCha has naturally good performance, and already has implementations that use 128-bit vectorization. In this paper, we show how to achieve higher performance by using wider vectorization: 256-bit AVX2 [10] instructions that are available on the new Haswell architecture, and 512-bit AVX512 [11] on future architectures.
II. PRELIMINARIES
ChaCha is a 256-bit stream cipher, based on the Salsa20 [12] stream cipher. Compared to Salsa20, ChaCha has better diffusion per round and, conjecturally, increased resistance to cryptanalysis. The core of the Salsa20 (and ChaCha) function is a hash function which maps 64 input bytes to a unique and irreversible 64-byte block of output keystream. Its 64-bit block counter restricts the maximum number of blocks of the output keystream to 2^64 (i.e., a maximum keystream of 2^64 x 64 bytes = 2^70 bytes, or 2^40 GB). Encryption and decryption are done by xor'ing the keystream with the input data. Two useful features of ChaCha are the possibility of generating output blocks at random positions, and the naturally constant time for processing stream blocks.
A. ChaCha's Matrix
The input to the ChaCha function is a 256-bit key, a 64-bit nonce and a 64-bit block counter. They are all treated as arrays of 32-bit integers in little-endian format. The input values and four 32-bit constants are arranged in a 4 x 4 matrix. This matrix is the initial state on which the round function operates.
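As a minimal sketch of this arrangement (not the authors' code), the following Python function builds the 16-word state as a flat, row-major list. The word order (constants, key, block counter, nonce) and the constant string "expand 32-byte k" follow Bernstein's reference layout of ChaCha; this orientation is an assumption here, and the function name `initial_state` is hypothetical.

```python
import struct

# The four 32-bit constants: "expand 32-byte k" read as little-endian words.
SIGMA = struct.unpack("<4I", b"expand 32-byte k")

def initial_state(key, counter, nonce):
    """Return the 16-word initial state as a flat list (4 x 4, row-major).

    Assumed layout: row 0 = constants, rows 1-2 = key,
    row 3 = 64-bit block counter followed by the 64-bit nonce.
    """
    if len(key) != 32 or len(nonce) != 8:
        raise ValueError("expected a 32-byte key and an 8-byte nonce")
    if not 0 <= counter < 2**64:
        raise ValueError("block counter must fit in 64 bits")
    key_words = struct.unpack("<8I", key)                # little-endian words
    counter_words = (counter & 0xFFFFFFFF, counter >> 32)
    nonce_words = struct.unpack("<2I", nonce)
    return [*SIGMA, *key_words, *counter_words, *nonce_words]
```

For example, `initial_state(bytes(range(32)), 1, bytes(8))` yields a list whose first four entries are the constants and whose thirteenth entry (index 12) is the low counter word.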
Quarter-Round Function
The quarter-round function (Fig. 3) reversibly updates one row of the state matrix. It consists of 4 additions, 4 xors and 4 rotations, which are applied to the four 32-bit values of the row.
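The quarter-round can be sketched in a few lines of Python. The order of operations and the rotation distances (16, 12, 8, 7) are taken from Bernstein's ChaCha specification rather than from Fig. 3, which is not reproduced in this text. The inverse function is included only to illustrate that the update is reversible; both function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotr32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    """One ChaCha quarter-round: 4 additions, 4 xors, 4 rotations."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d

def inverse_quarter_round(a, b, c, d):
    """Undo quarter_round by reversing each step in the opposite order."""
    b = rotr32(b, 7) ^ c;  c = (c - d) & MASK32
    d = rotr32(d, 8) ^ a;  a = (a - b) & MASK32
    b = rotr32(b, 12) ^ c; c = (c - d) & MASK32
    d = rotr32(d, 16) ^ a; a = (a - b) & MASK32
    return a, b, c, d
```

Applying `inverse_quarter_round` to the result of `quarter_round` returns the original four words.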
Row-Round and Column-Round Function
The row-round function right-rotates the rows in the first step, and up-rotates the columns in the second step. The rotation count equals the position of the row or the column. After both rotations, the row vectors are fed into the quarter-round function, as illustrated in Fig. 4. The column-round function rotates the columns in the state matrix to the top by their position count. After the rotation, the row vectors are fed into the quarter-round function (see Fig. 5). On microarchitectures with a word size of 32 bits or less, both round permutations are achieved at almost zero performance cost by using pointer arithmetic. On vector-based architectures, the permutations require actual rotation instructions.
C. ChaCha's Double-Round Function
The combination of the row-round function and the column-round function is called the double-round function (see Fig. 6).
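To make the round structure concrete, the following scalar Python sketch computes one 64-byte ChaCha20 keystream block with ten double-rounds and xors the keystream with a message. The two half-rounds are written in the index form commonly used for ChaCha (four quarter-rounds over the columns, then four over the diagonals of a row-major state); how these map onto the row-round and column-round of Figs. 4 to 6 depends on the matrix orientation, which is an assumption here. The function names and the `first_block` parameter are illustrative and not taken from any of the cited implementations.

```python
import struct

MASK32 = 0xFFFFFFFF
SIGMA = struct.unpack("<4I", b"expand 32-byte k")  # the four constants

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(s, a, b, c, d):
    """Apply one quarter-round in place to state words s[a], s[b], s[c], s[d]."""
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)

def double_round(s):
    """Two consecutive rounds: four column and four diagonal quarter-rounds."""
    for idx in ((0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)):
        quarter_round(s, *idx)
    for idx in ((0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)):
        quarter_round(s, *idx)

def chacha20_block(key, counter, nonce):
    """Return the 64-byte keystream block for a 64-bit counter and 64-bit nonce."""
    init = [*SIGMA, *struct.unpack("<8I", key),
            counter & MASK32, (counter >> 32) & MASK32,
            *struct.unpack("<2I", nonce)]
    s = list(init)
    for _ in range(10):                      # 10 double-rounds = 20 rounds
        double_round(s)
    # Feed-forward: add the initial state word by word.
    return struct.pack("<16I", *((x + y) & MASK32 for x, y in zip(s, init)))

def chacha20_xor(key, nonce, data, first_block=0):
    """Encrypt or decrypt by xor'ing data with the keystream, block by block."""
    out = bytearray()
    for i in range(0, len(data), 64):
        ks = chacha20_block(key, first_block + i // 64, nonce)
        out += bytes(x ^ y for x, y in zip(data[i:i + 64], ks))
    return bytes(out)
```

Because each block depends only on the key, the nonce and its own counter value, calling `chacha20_xor` with `first_block=1` on the tail of a message reproduces the corresponding part of the ciphertext, which is the random-access property mentioned in Section II.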
D. Existing ChaCha Implementations
A variety of public domain ChaCha implementations are available. Some of them can be found on the author's webpage [13], and others appear on the eBACS project webpage [14]. The NSS [15] and OpenSSL [16] libraries in the Chromium [17] project also include two implementations. We briefly summarize the main optimizations in these implementations.
128-Bit Vectorization
The row-round and the column-round functions each execute the same quarter-round function four times, with four independent inputs. The input to a quarter-round is a vector of four 32-bit elements. This is used to calculate the four quarter-rounds in parallel, with four 128-bit vectors as input (instead of four 32-bit values). This is illustrated in Fig. 7. With the aggregated calculation of the four quarter-round functions, only one double-quarter-round per two rounds is left, and the load/store operations for the single 32-bit values in each quarter-round function are replaced by one quarter as many load and store operations on 128-bit vectors. This comes at the cost of additional rotations that are needed for shuffling the order of the single elements in the vectors, so that they suit either the row-round calculation or the column-round calculation.
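The idea can be modeled without SIMD hardware by treating each 128-bit vector as a Python list of four 32-bit lanes. This is only a sketch under assumptions: the row-major state layout and the lane-rotation amounts (1, 2, 3) follow the common SSE-style formulation of ChaCha, and all function names are illustrative. A scalar double-round is included so that the two formulations can be compared.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def vec_quarter_round(a, b, c, d):
    """Lane-wise quarter-round: lane i of (a, b, c, d) is one independent quarter-round."""
    a = [(x + y) & MASK32 for x, y in zip(a, b)]
    d = [rotl32(x ^ y, 16) for x, y in zip(d, a)]
    c = [(x + y) & MASK32 for x, y in zip(c, d)]
    b = [rotl32(x ^ y, 12) for x, y in zip(b, c)]
    a = [(x + y) & MASK32 for x, y in zip(a, b)]
    d = [rotl32(x ^ y, 8) for x, y in zip(d, a)]
    c = [(x + y) & MASK32 for x, y in zip(c, d)]
    b = [rotl32(x ^ y, 7) for x, y in zip(b, c)]
    return a, b, c, d

def lane_rotate(v, n):
    """Rotate the lanes of a vector left by n positions (models a lane shuffle)."""
    return v[n:] + v[:n]

def vec_double_round(a, b, c, d):
    """One double-round on four row vectors."""
    a, b, c, d = vec_quarter_round(a, b, c, d)           # four quarter-rounds at once
    b, c, d = lane_rotate(b, 1), lane_rotate(c, 2), lane_rotate(d, 3)
    a, b, c, d = vec_quarter_round(a, b, c, d)           # four more, on shuffled lanes
    b, c, d = lane_rotate(b, 3), lane_rotate(c, 2), lane_rotate(d, 1)  # undo shuffle
    return a, b, c, d

def scalar_double_round(state):
    """Reference: the same double-round as eight scalar quarter-rounds."""
    s = list(state)
    def qr(i, j, k, l):
        s[i], s[j], s[k], s[l] = (
            v[0] for v in vec_quarter_round([s[i]], [s[j]], [s[k]], [s[l]]))
    for idx in ((0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)):
        qr(*idx)
    for idx in ((0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)):
        qr(*idx)
    return s
```

In this model the eight scalar quarter-rounds of a double-round collapse into two vector quarter-rounds plus six lane rotations, which mirrors the trade-off described above.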
III. WIDER VECTORIZATION
The purpose of this section is to show how to leverage the new AVX2 and the future AVX512 extensions to improve ChaCha's encryption/decryption performance.
A. 256-Bit Vectorization
Compared to a 128-bit register, a 256-bit register can hold twice as many 32-bit values. This can be leveraged to store two row vectors at the same time, and operate on both of them simultaneously. Therefore, the operational costs can be reduced from two load, two store and 2X process instructions for two 128-bit vectors, to one load, one store and X process instructions for a 256-bit vector.
To take advantage of the 256-bit vector registers, we chose to process two ChaCha stream blocks simultaneously. We do not further cut down the count of row vector operations in the double-quarter-round because the remaining operations on the row vectors are mutually dependent. This makes it difficult to further reduce the number of vector operations through simultaneous calculations.
The updated double-quarter-round algorithm, using 256-bit vectorization, is listed in Fig. 8 . The improved double-quarter-round algorithm requires a change in the initialization of the vectors. We still use the 128-bit unaligned load instructions to transfer the input data, key and constant to the 256-bit vector registers. Then, we broadcast the 128-bit vectors to both 128-bit halves of the 256-bit vector (the "broadcast" instruction duplicates an element in a vector register). This requires four broadcast instructions, but it is still faster than loading the unaligned data in 256-bit blocks. The second state matrix also needs an incremented block counter. This can be computed by one extra vector addition with a constant (0,0,0,1).
The incrementing of the block counter also needs to be adapted when more than two blocks are processed. This is done by changing the vector constant for the 64-bit integer addition from (0,1) for 128-bit vectors to (0,2,0,2) for 256-bit vectors. This does not involve extra vector additions with constants.
Finally, xor'ing and writing of the encrypted (or decrypted) stream to the target buffer needs to be adjusted due to the order of the 128-bit rows in the four 256-bit vectors. Every one of the four 256-bit vectors includes, in the lower 128-bit part, the row vectors of the first output block (bytes 1-64), and includes, in the higher 128-bit part, the second output block (bytes 65-128). Therefore, we need four extra vector permutations to rearrange the bytes in the 256-bit vectors to the right stream order. Subsequently, we can save half of the xor, load and store operations, because we can operate on 256-bit instead of 128-bit blocks.
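The two-block scheme can be modeled in Python by representing each 256-bit vector as a list of eight 32-bit lanes, where lanes 0-3 belong to the first block and lanes 4-7 to the second. This is a sketch under assumptions, not the implementation measured in this paper: the state layout follows Bernstein's reference ChaCha, the helper names are hypothetical, and `permute_halves` stands in for a 128-bit lane permutation (on AVX2, an instruction such as VPERM2I128 would plausibly play this role).

```python
import struct

MASK32 = 0xFFFFFFFF
SIGMA = struct.unpack("<4I", b"expand 32-byte k")

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def lane_rotate(v, n):
    """Rotate lanes left by n within each 128-bit (four-lane) half."""
    out = []
    for h in range(0, len(v), 4):
        half = v[h:h + 4]
        out += half[n:] + half[:n]
    return out

def quarter_round(a, b, c, d):
    a = [(x + y) & MASK32 for x, y in zip(a, b)]
    d = [rotl32(x ^ y, 16) for x, y in zip(d, a)]
    c = [(x + y) & MASK32 for x, y in zip(c, d)]
    b = [rotl32(x ^ y, 12) for x, y in zip(b, c)]
    a = [(x + y) & MASK32 for x, y in zip(a, b)]
    d = [rotl32(x ^ y, 8) for x, y in zip(d, a)]
    c = [(x + y) & MASK32 for x, y in zip(c, d)]
    b = [rotl32(x ^ y, 7) for x, y in zip(b, c)]
    return a, b, c, d

def double_round(a, b, c, d):
    a, b, c, d = quarter_round(a, b, c, d)
    b, c, d = lane_rotate(b, 1), lane_rotate(c, 2), lane_rotate(d, 3)
    a, b, c, d = quarter_round(a, b, c, d)
    b, c, d = lane_rotate(b, 3), lane_rotate(c, 2), lane_rotate(d, 1)
    return a, b, c, d

def permute_halves(x, y, high):
    """Combine the low (or high) 128-bit halves of x and y into one vector."""
    return x[4:] + y[4:] if high else x[:4] + y[:4]

def two_blocks(key, counter, nonce):
    """Return 128 keystream bytes for block counters `counter` and `counter + 1`."""
    k = struct.unpack("<8I", key)
    n = struct.unpack("<2I", nonce)
    def last_row(c):
        return [c & MASK32, (c >> 32) & MASK32, n[0], n[1]]
    # Broadcast each 128-bit row into both halves; only the counter differs.
    init = (list(SIGMA) * 2, list(k[:4]) * 2, list(k[4:]) * 2,
            last_row(counter) + last_row(counter + 1))
    a, b, c, d = init
    for _ in range(10):
        a, b, c, d = double_round(a, b, c, d)
    a, b, c, d = ([(x + y) & MASK32 for x, y in zip(v, v0)]
                  for v, v0 in zip((a, b, c, d), init))
    # Four permutations restore stream order: block n first, then block n + 1.
    words = (permute_halves(a, b, False) + permute_halves(c, d, False) +
             permute_halves(a, b, True) + permute_halves(c, d, True))
    return struct.pack("<32I", *words)
```

After the four permutations, each 256-bit vector holds 32 consecutive keystream bytes, so the xor with the input and the store to the target buffer can proceed in 256-bit units.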
Instruction-level parallelism is also leveraged by executing up to three independent double-quarter-round functions back to back per double-round iteration. This requires 12 AVX2 registers for simultaneously holding up to 3x2 independent blocks.
B. 512-Bit Vectorization
The 512-bit vectorization extends the 256-bit vectorization so that four state matrices can be processed in parallel. Therefore, the double-quarter-round has to be adapted for 512-bit vectors. Similarly, the vector initialization changes from adding (0,1,0,0) to the initial 256-bit vector to adding (0,3,0,2,0,1,0,0) to the initial 512-bit vector. In addition, incrementing the block counter changes from adding (0,2,0,2) to adding (0,4,0,4,0,4,0,4). Finally, the order of the row vectors in the 512-bit vectors needs to be adjusted again. As a result, we need 8 extra vector permutation instructions and 12 extra move instructions compared to the 256-bit implementation. However, working with 512-bit blocks saves half of the xor, load and store operations, and the overall performance is expected to improve.
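The pattern behind these constants can be written down generically. The sketch below assumes that the tuples above list 64-bit addends with the highest 128-bit lane first, and that within each lane the second element is applied to the block counter; this reading is an assumption that happens to reproduce the constants quoted in the text. The function names are hypothetical.

```python
def init_constant(blocks):
    """64-bit addends that give the k-th 128-bit lane the counter offset k.

    Listed highest lane first; each lane contributes (0, offset).
    """
    out = []
    for lane in reversed(range(blocks)):
        out += [0, lane]
    return tuple(out)

def step_constant(blocks):
    """64-bit addends that advance every lane's counter by `blocks`."""
    return (0, blocks) * blocks

def lane_counters(start, blocks, iterations):
    """Block counter held in each lane (lowest lane first) per loop iteration."""
    counters = [(start + lane) % 2**64 for lane in range(blocks)]
    history = []
    for _ in range(iterations):
        history.append(list(counters))
        counters = [(c + blocks) % 2**64 for c in counters]
    return history
```

With `blocks=2` this models the 256-bit case and with `blocks=4` the 512-bit case; in both, consecutive iterations cover consecutive block counters without gaps or overlaps.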
IV. RESULTS
This section shows the performance of our proposed optimizations. We compare our implementation to the vectorized ChaCha implementation of the NSS and OpenSSL sources from the latest development branch of the Chromium project (November 10, 2013; retrieved from "svn" repository of Chromium [18] ).
A. Performance Comparison on the Haswell Microarchitecture
Fig. 9 shows the performance of ChaCha20 encryption and decryption on a single Haswell core (Intel® Turbo Boost Technology, Intel® Hyper-Threading Technology, and Enhanced Intel Speedstep® Technology were disabled). The implementations were compiled with gcc version 4.8.1, with AVX2 support ('-mavx2'; this also optimizes the SSE2/SSE3 [10] source code of the 128-bit vectorization with AVX [10] instructions) and compile-time optimizations ('-O3' and '-fomit-frame-pointer'). We observe that the improvement from the 256-bit vectorization is only marginal for short messages of less than 128 bytes. For longer messages, however, our implementation almost doubles the performance.
B. Performance Estimation on Future Processors
A future Intel microarchitecture may introduce the AVX512 extension which could be used to further speed up ChaCha with the proposed 512-bit vectorization.
Since no processor with the AVX512 extension is available yet, we cannot measure the resulting performance at this point, and therefore use a different methodology. The code can be compiled with gcc version 4.9.0 [19], and the resulting binary can be executed on an emulator (SDE [19]). We used the SDE tool to count the number of executed instructions for encryption and decryption. Table 1 compares the instruction counts of the three implementations for different message sizes. The first implementation is the 128-bit vectorization of the modified NSS and OpenSSL libraries, the second is our 256-bit vectorization, and the third is our 512-bit vectorization. The AVX512-based implementation incurs no instruction overhead for small messages (less than 128 bytes), and for larger messages it reduces the instruction count by 16% to 56%, with the reduction growing with the message size. This indicates a high potential for a performance improvement on future processor generations.
V. DISCUSSION
We showed how to significantly speed up ChaCha by widening the vectorization from 128-bit to 256-bit/512-bit and by using algorithmic and software implementation improvements. To this end, we implemented the ChaCha algorithm using the new AVX2 extension of the Haswell microarchitecture and the future AVX512 extension. The evaluation shows that, when encrypting (or decrypting) more than 128 bytes, widening the vectorization to 256-bit nearly doubles the throughput of ChaCha. For less than 128 bytes, the widened vectorization has no real effect. In addition, the instruction count comparison shows that the 512-bit vectorization has the potential to double the throughput of ChaCha again on future microarchitectures. These optimizations make ChaCha even more attractive as a fast and secure alternative to AES in the TLS protocol.
Our source code is available from the eBACS project webpage [14] (in the SUPERCOP measurement system). 
