INTRODUCTION
At the end of phase one of the eSTREAM project, the software performance of the candidate ciphers had been analyzed comprehensively. However, hardware implementation results were far from sufficient. Compared with software implementations, hardware implementations are less flexible but can typically be more compact and faster. In addition, encryption hardware cores are usually isolated from other hardware in a system, so a hardware implementation of a cipher is often more secure.
This paper examines hardware designs of Salsa20 [2] and Phelix [3], both of which are claimed to be suitable for software and hardware implementation. Different applications in communication systems demand different hardware structures for cryptographic algorithms. For example, in wireless applications such as cell phones, compactness and low power consumption are critical because of battery limitations and portability, while for virtual private network (VPN) applications and secure e-commerce web servers, the demand for high-speed encryption is rapidly increasing.
Among implementation technologies, the ASIC approach normally provides better density and throughput, whereas an FPGA is reconfigurable and more flexible.
In our study, several schemes are used, catering to the features of the target technology.
PHELIX
Short Description of the Phelix Algorithm
Phelix is claimed to be a high-speed stream cipher.
The bitwise exclusive-or of two words, denoted as "⊕", is the sum of the words with carries suppressed. The symbol "<<<" represents left rotation, and "⊞" represents addition modulo 2^32. Further details of the algorithm can be found in [3].
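The three word operations above can be sketched in software as follows. This is only an illustrative model on 32-bit words; the helper names (xor32, rotl32, add32) are our own and do not appear in [3].

```python
MASK32 = 0xFFFFFFFF  # all results are truncated to one 32-bit word


def xor32(a: int, b: int) -> int:
    """Bitwise exclusive-or: a sum in which every carry is suppressed."""
    return (a ^ b) & MASK32


def rotl32(x: int, n: int) -> int:
    """Left rotation ("<<<") of a 32-bit word by n bit positions."""
    n %= 32
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def add32(a: int, b: int) -> int:
    """Addition modulo 2^32: the carry out of bit 31 is discarded."""
    return (a + b) & MASK32
```

In hardware, the rotation is pure wiring and the exclusive-or is one gate level per bit, so the adder, with its carry chain, is the costly operation of the three.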
Compact ASIC Structure of Phelix
The Phelix stream cipher can be implemented in many ways.
The proposed compact structure focuses on function sharing to optimize the area. Figure 1 illustrates a minimal ASIC implementation consisting of one round of encryption and a FIFO ("first in, first out") memory that stores the four old states.
To improve performance, the subkeys are generated on the fly: during each block, two subkey words are produced and used as parameters in the H function block, which later generates the 32-bit key stream.
To increase the speed of the encryption, we could design additional logic to perform the H function. This would require more adders and 32-bit exclusive-or function blocks that can work in parallel. However, it would dramatically increase the size of the H function circuit, since the adder is the largest component compared with other simple function blocks, such as a 32-bit register. So far, we have not investigated the exact area penalty incurred by this option.
Many multiplexers could also be removed to shorten the critical path, but the benefit is limited. In 0.18 µm CMOS technology, a multiplexer contributes 0.37 to 0.70 ns of delay, depending on its number of input ports, while the overall critical path, which contains only two multiplexers, is around 7 ns. Only about a 10% speed increase can be gained by this method.
Furthermore, it is possible to compute the subkeys off-chip and then download them into the circuit memory to save time. However, this may affect the security of the device.
If we consider the 64-byte input block of Salsa20 as a 4×4 matrix of 32-bit words, the four elements in each row and in each column are modified by the quarterround function ten times. After that, the output is added to the original values, producing a 16-word (64-byte) keystream block.
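The block transformation described above can be sketched in software as follows. This is a minimal reference model, not the hardware design of this paper: the rotation distances (7, 9, 13, 18) and the word orderings are taken from the Salsa20 specification [2], and the names COLUMNS, ROWS, apply_round, and salsa20_core are our own. The function returns the 16 words of the 64-byte output block.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n positions (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarterround(y0, y1, y2, y3):
    """Four sequential add-rotate-XOR steps, each updating one word."""
    z1 = y1 ^ rotl32((y0 + y3) & MASK32, 7)
    z2 = y2 ^ rotl32((z1 + y0) & MASK32, 9)
    z3 = y3 ^ rotl32((z2 + z1) & MASK32, 13)
    z0 = y0 ^ rotl32((z3 + z2) & MASK32, 18)
    return z0, z1, z2, z3


# Word indices of the 4x4 matrix (row-major) fed to each quarterround,
# in the order given by the Salsa20 specification [2].
COLUMNS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11))
ROWS = ((0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))


def apply_round(state, groups):
    """Apply quarterround to four disjoint groups of words (one round)."""
    out = list(state)
    for idx in groups:
        words = quarterround(*(state[i] for i in idx))
        for i, w in zip(idx, words):
            out[i] = w
    return out


def salsa20_core(block):
    """Ten column rounds and ten row rounds, then add the original input."""
    assert len(block) == 16
    x = list(block)
    for _ in range(10):
        x = apply_round(x, COLUMNS)
        x = apply_round(x, ROWS)
    return [(a + b) & MASK32 for a, b in zip(x, block)]
```

Because the four groups within a round are disjoint, the four quarterround evaluations of a round do not depend on one another, which is the property the hardware structures exploit.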
Compact ASIC Structure of Salsa20
The compact structure of Salsa20 mainly contains two 16 × 32-bit memory blocks, a controller, and one quarterround function block.
Figure 2. Datapath of Quarterround Block
The design of the quarterround block is straightforward. When the controller asserts the reset signal, all registers are cleared to zero. When the start signal is asserted, the inputs are loaded into the registers in parallel, and the core then performs addition, rotation, and XOR sequentially. After that, the modified data is loaded into the registers again. The components of the design do not all run at the same speed: the quarterround block is more than two times slower than the memory. Thus, we use a frequency divider in the circuit, so that the global frequency is 250 MHz and the frequency for the quarterround block is 125 MHz.
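With a single quarterround block, the controller has to feed the block one group of four words at a time. The following sketch lists one possible order of memory addresses for a 64-byte block. It is a hypothetical model of the sequencing only: the group orderings follow the Salsa20 specification [2], while the function name quarterround_schedule and the step numbering are our own assumptions and do not describe the actual controller.

```python
# Word addresses (row-major 4x4 matrix) read and written by each quarterround.
COLUMN_GROUPS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11))
ROW_GROUPS = ((0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))


def quarterround_schedule(rounds=20):
    """Yield (step, round_index, word_addresses) for one 64-byte block.

    Even-numbered rounds work on columns, odd-numbered rounds on rows.
    """
    step = 0
    for r in range(rounds):
        groups = COLUMN_GROUPS if r % 2 == 0 else ROW_GROUPS
        for addrs in groups:
            yield step, r, addrs
            step += 1


schedule = list(quarterround_schedule())
print(len(schedule))  # number of quarterround evaluations per block
```

Under this model the single block is used 80 times per 64-byte block (20 rounds of four quarterrounds each), which is why the compact structure trades throughput for area.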
Basic Iterative ASIC Structure of Salsa20
The datapath of an iterative structure consists of four quarterround function blocks, since within a round the four columns (or the four rows) are processed independently of one another. The control unit is simply a combination of a counter and a comparator.
Fast ASIC Structure of Salsa20
There is minimal serialization of blocks in Salsa20. This feature can be considered in two ways: (1) during a single round, there is no communication between columns or rows; (2) each 64-byte block is encrypted independently. This makes a parallel structure possible, which can dramatically increase speed by handling several blocks at the same time.
By employing the iterative structure of Salsa20 as a pipeline stage, it is easy to build a pipelined structure with a variable number of stages.
A fully pipelined structure consists of twenty stages arranged in sequence. Pipelining adds latency, but the added latency is small compared with the overall improvement in performance.
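The throughput benefit of pipelining can be illustrated with a simple step count. The sketch below is an idealized model under our own assumptions, not a result of this paper: one step is the time the iterative structure needs for one round, the 20 rounds are divided evenly among the stages, and the delay of the pipeline registers (the source of the extra latency in a real design) is ignored. The function name total_round_steps is hypothetical.

```python
ROUNDS = 20  # rounds applied to every 64-byte block


def total_round_steps(num_blocks: int, stages: int) -> int:
    """Idealized number of round-steps needed to process num_blocks blocks.

    Each stage iterates ROUNDS // stages rounds on the block it holds.
    The first block leaves the pipeline after ROUNDS steps; every later
    block follows ROUNDS // stages steps behind its predecessor.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if stages < 1 or ROUNDS % stages != 0:
        raise ValueError("stages must divide the number of rounds")
    rounds_per_stage = ROUNDS // stages
    return ROUNDS + (num_blocks - 1) * rounds_per_stage


for stages in (1, 4, 20):
    print(stages, total_round_steps(100, stages))
```

For 100 blocks, this model gives 2000 steps for the iterative structure (one stage), 515 steps for four stages, and 119 steps for the fully pipelined structure with twenty stages.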
SYNTHESIS RESULTS
All designs are simulated and synthesized using Synopsys CAD tools. Synthesis results for Phelix are given in Table 1, and Table 2 covers the different structures of Salsa20. To our knowledge, there are no published ASIC implementation results for Phelix, but a rough estimate from the authors of [3] is that the cipher can achieve a speed of at least 200 MBps with an area of 20,000 gates. The targeted technology is not specified. We are not aware of any published results on the hardware design of Salsa20.
CONCLUSIONS
In this paper, two stream cipher candidates of the eSTREAM project, Phelix and Salsa20, are implemented in hardware and compared in terms of performance and consumed area.
It is unsurprising that the compact structure of Salsa20 is slower and consumes more area than Phelix, since Salsa20 performs a large number of invertible modifications, each of which changes one word of the matrix in a sequential manner.
A microprogrammed control unit and a predefined memory were presented in a compact FPGA structure of Salsa20 to save area and improve performance. It should be noted that high-speed adders often consume more chip area. The choice of adder should therefore be made carefully, according to the requirements of the specific application.
