Introduction
RC4 is a widely used stream cipher with a very simple algorithm, and it has withstood the test of time in spite of that simplicity. RC4 was proposed by Ron Rivest in 1987 for RSA Data Security and was kept as a trade secret until 1994, when it was leaked [4]. Today RC4 is part of many network protocols, e.g. SSL, TLS, WEP and WPA. Many cryptanalytic studies have examined its key weaknesses [4, 5], and many new stream ciphers have followed [6, 7]. RC4 nevertheless remains a popular stream cipher because it executes fast and provides high security.
Hardware implementations of several stream ciphers exist in the literature [8-11]. Since about 2003, when FPGA technology matured enough to provide cost-effective solutions, many researchers have started implementing RC4 in hardware as a natural consequence [2, 3]. FPGA technology is attractive because it provides a soft-core processor with the design-specific functional capability of a main processor (MicroBlaze [12]), along with reconfigurable logic blocks that can be synthesized into a desired custom coprocessor, embedded memories and IP cores. One can implement the RC4 algorithm either entirely as executable code for the soft-core processor (the main processor) or in custom coprocessor hardware operated by the main processor. Because of system overhead, any single instruction executed in the main processor takes at least 3 clocks, while the identical operation executed in a coprocessor takes 1 clock, since the latter is customized for the specific task. Besides this clock advantage, the coprocessor-based design further increases system throughput because it runs in parallel with the main processor.
In this paper, the RC4 algorithm is taken as it is and, exploiting conventional VHDL features, a design methodology is proposed that processes 1 byte in 1 clock. The design is implemented in a custom coprocessor functioning in parallel with a main processor (Xilinx Spartan3E XC3S500e-FG320 FPGA architecture), followed by secured data communication between two FPGA boards through their respective Ethernet ports; each of the two boards runs RC4 encryption and decryption engines separately. The performance of our design in terms of number of clocks proved to be better than that of previous works [1-3]. Clock gating is introduced to save dynamic power. To assess the resilience of RC4, the battery of statistical tests described in the NIST document [14] is applied, and the randomness of its key streams is found to be reasonably good.
The paper is organized as follows. The RC4 algorithm is briefly described in Sec. 2. In Sec. 3 the 1-byte-1-clock design and its hardware implementations are described. The communication experiment setup, along with the results of relative comparisons, is presented in Sec. 4. Power optimization using clock gating is discussed in Sec. 5. The randomness tests of the RC4 algorithm, carried out with the NIST Statistical Test Suite, are discussed in Sec. 6 along with their results. Sec. 7 concludes the paper.
Hardware Implementation of 1-byte 1-clock design
RC4 Algorithm
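As a software point of reference, the two standard RC4 phases, key scheduling (KSA) and pseudo-random generation (PRGA), can be sketched as follows. This is a minimal Python model of the publicly known algorithm, not the VHDL design of this paper; the function names ksa, prga and rc4 are illustrative assumptions.

```python
from typing import List


def ksa(key: bytes) -> List[int]:
    """Key-scheduling algorithm: permute S = 0..255 under control of the key."""
    if not 1 <= len(key) <= 256:
        raise ValueError("RC4 key must be 1 to 256 bytes long")
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s


def prga(s: List[int], n: int) -> bytes:
    """Pseudo-random generation algorithm: emit n key-stream bytes."""
    s = s.copy()  # leave the caller's S-box untouched
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def rc4(key: bytes, data: bytes) -> bytes:
    """XOR data with the key stream; the same call encrypts and decrypts."""
    keystream = prga(ksa(key), len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))
```

With the widely published test vector (key "Key", plaintext "Plaintext") this model yields the ciphertext bbf316e8d940af0ad3, and applying rc4 twice with the same key returns the original data.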

Experimental Results
Power Optimization
A power optimization study is important in view of the application of this design in emerging embedded technology. In synchronous digital circuits, an effective way to reduce dynamic power dissipation is to dynamically disable the clock in those regions that are not active during a specific period of the data flow. Since most of the dynamic power consumption in an FPGA is directly related to the toggling of the system clock, temporarily disabling the clock in inactive regions is the most straightforward method of minimizing power consumption.
In experiments 1 and 2 described in Sec. 4 above, there was no clock management process. Fig. 7 shows a schematic diagram of experiment 2, exhibiting the KSA and PRGA blocks together.
Fig. 7. Circuit block diagram of Experiment 2.
In RC4 the KSA and PRGA processes are sequential, so no data is lost if the PRGA block is made active (i.e. prga_en is changed from its initial value '0' to '1') only after the KSA process has finished all its operations during the first 257 clocks. In Fig. 7, however, both clocks (ksa_clk and prga_clk) run for the entire computation. The clock gating circuit incorporated in experiment 2 is shown in Fig. 8. The prga_en signal is first initialized to '0', so ksa_en becomes '1' and only ksa_clk remains active for the first 257 clocks. After the 257th clock, prga_en becomes '1'; the KSA process is thereby disabled at once and prga_clk is activated, setting the PRGA block in operation. Table 3 shows the power consumed by various items as reported in simulation by the Xilinx XPower [13] analyzer tool. Note that the total power is the sum of quiescent and dynamic power; the dynamic power is the sum of clock, logic, IO and signal power; and the signal power is the sum of data-signal and control-signal power. Table 3 shows that, relative to the structural design, clock gating saves about 4.6% in dynamic power and about 1% in total power.
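The enable logic described above can be illustrated with a small behavioural model. The sketch below is written in Python rather than VHDL and only counts the clock edges that reach each block; it does not model power. It assumes that ksa_en is simply the complement of prga_en and that the switch happens after clock 257, as stated in the text. The function names are hypothetical.

```python
KSA_CLOCKS = 257  # number of clocks the KSA phase occupies, as stated in the text


def gated_clock_enables(total_clocks, ksa_clocks=KSA_CLOCKS):
    """Yield (clock_index, ksa_en, prga_en) for each system clock, 1-indexed."""
    for clk in range(1, total_clocks + 1):
        prga_en = 1 if clk > ksa_clocks else 0  # '0' during KSA, '1' afterwards
        ksa_en = 1 - prga_en                    # assumed complement of prga_en
        yield clk, ksa_en, prga_en


def count_active_edges(total_clocks, gated):
    """Return (ksa_clk edges, prga_clk edges) seen by the two blocks."""
    ksa_edges = prga_edges = 0
    for _, ksa_en, prga_en in gated_clock_enables(total_clocks):
        if gated:
            # Each block is clocked only while its enable is '1'.
            ksa_edges += ksa_en
            prga_edges += prga_en
        else:
            # Without gating both clocks toggle on every system clock.
            ksa_edges += 1
            prga_edges += 1
    return ksa_edges, prga_edges
```

For a run of 257 KSA clocks followed by 1000 PRGA clocks, count_active_edges(1257, gated=False) gives (1257, 1257), whereas count_active_edges(1257, gated=True) gives (257, 1000). The removed toggles are the source of the dynamic power saving; the size of that saving has to come from the power analysis tool, not from this count.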
Randomness Tests on RC4 following NIST Statistical Test Suite
Since RC4 is very simple, popular and has withstood many attacks, the randomness of its key stream is studied here using the 15 statistical tests consolidated by NIST in a Statistical Test Suite [14]. All these tests are run on 300 samples, each consisting of 1342400 bits produced by RC4. Test results are shown in Table 4 and in Fig. 9. The P-value in the NIST tests is a probability indicating the degree of non-randomness: the smaller the P-value, the higher the degree of non-randomness. If the P-value of a particular bit sequence for a particular test is less than 0.01, the sequence is considered to be non-random. Considering all the P-values of a particular test run on a number of samples (N) greater than 100, one can define a parameter P_pop as the proportion of passing P-values. The theoretical statistical estimate of the acceptable P_pop is 0.99 ± R, where R is inversely proportional to the square root of N; the larger the sample size, the smaller the value of R.
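The acceptance interval can be made concrete with the expression given in NIST SP 800-22, where the half-width is R = 3·sqrt(p(1 − p)/N) with p = 1 − α = 0.99. The sketch below is a minimal Python illustration under that assumption; the function names are hypothetical and are not part of the NIST software.

```python
import math


def acceptance_interval(n_samples, alpha=0.01):
    """Return (lower, upper) bounds of 0.99 +/- R for n_samples sequences."""
    p_hat = 1.0 - alpha
    # Half-width R shrinks as 1/sqrt(N).
    r = 3.0 * math.sqrt(p_hat * (1.0 - p_hat) / n_samples)
    return p_hat - r, p_hat + r


def observed_proportion(p_values, alpha=0.01):
    """Fraction of P-values that pass, i.e. are not below alpha."""
    passed = sum(1 for p in p_values if p >= alpha)
    return passed / len(p_values)
```

For N = 300 this expression evaluates to R of roughly 0.017, which is of the same order as the approximate value used in this section. A test is accepted when observed_proportion of its P-values falls inside the interval returned by acceptance_interval.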
For 300 samples of RC4 key bit sequences obtained from 300 different keys, the value of R is calculated as approximately 0.01. The observed P_pop is the ratio of the number of P-values lying above 0.01 to the total number of P-values. On this statistical basis, all 300 samples of RC4 key bit sequences are observed to pass all 15 tests, although 149 P-values are found to fail among 12300 (= 41 × 300) P-values. Note that for a particular sample, test nos. 1-10 and 12 have one P-value each, test nos. 11 and 13 have 2 P-values each, test 14 has 8 P-values and test 15 has 18 P-values, altogether 41 P-values. Table 4 shows the number of P-values lying in 11 ranges between 0 and 1 for all 15 tests. Fig. 9 depicts the observed proportion of passing (Y-axis) for all 15 tests (X-axis). Among the 15 tests, the lowest observed P_pop is 0.98 for tests 1 and 3, while the highest is 1.00 for test 7.
Fig. 9. Observed proportion of passing of RC4
The P-value of P-values (POP) for a particular test is another parameter, calculated from Table 4 following a statistical methodology given by NIST [14]. The distribution of P-values for a particular test over all the samples can be considered uniform if its POP is greater than 0.0001.
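A minimal sketch of this uniformity check is given below. It assumes the procedure of NIST SP 800-22, in which the P-values of one test are counted in 10 equal-width bins, a chi-square statistic is formed against the uniform expectation, and POP = igamc(9/2, χ²/2); the bin layout of Table 4 may differ from this assumption. The incomplete gamma function is evaluated with a textbook series and continued fraction in pure Python, and all function names are hypothetical.

```python
import math


def _gamma_p_series(a, x):
    """Regularized lower incomplete gamma P(a, x) by series; used for x < a + 1."""
    term = 1.0 / a
    total = term
    ap = a
    for _ in range(1000):
        ap += 1.0
        term *= x / ap
        total += term
        if abs(term) < abs(total) * 1e-16:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def _gamma_q_cf(a, x):
    """Regularized upper incomplete gamma Q(a, x) by continued fraction (Lentz)."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return math.exp(-x + a * math.log(x) - math.lgamma(a)) * h


def igamc(a, x):
    """Regularized upper incomplete gamma function Q(a, x)."""
    if x <= 0.0:
        return 1.0
    if x < a + 1.0:
        return 1.0 - _gamma_p_series(a, x)
    return _gamma_q_cf(a, x)


def uniformity_p_value(p_values, bins=10):
    """POP of one test: chi-square goodness of fit of its P-values to uniform."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1  # p == 1.0 goes in the last bin
    expected = len(p_values) / bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return igamc((bins - 1) / 2.0, chi2 / 2.0)
```

Under this sketch, a test whose 300 P-values give uniformity_p_value above 0.0001 would be regarded as having uniformly distributed P-values.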
From Table 5 it is seen that the POP values of all 15 tests are above 1e-4, so one can conclude that the P-values of all 15 tests are uniformly distributed. The POP value is highest for test 2 and lowest for test 14, indicating that test 2 produces the most uniformly distributed set of 300 P-values and test 14 the least -although Table 4 . One can thus conclude that, according to the NIST Statistical Test Suite, the RC4 key bit sequences can be considered fairly random.
Conclusion
The proposed 1-byte-1-clock RC4 design in FPGA is a coprocessor-based design functioning in parallel with a main processor. The encryption engine of the design, implemented on one board, successfully communicates through its Ethernet port with another board containing the decryption engine. The present 1-byte-1-clock processing exploits conventional VHDL features and is, circuit-wise, much simpler than processing 2 bytes together in 2 clocks [1], leading to a throughput slightly better than that presented in [1, 11]. Clock gating incorporated in the structural design is found to reduce dynamic power by about 5%. The statistical randomness studies show that RC4 produces reasonably random key bit sequences.
