Fault Detection for RC4 Algorithm and its Implementation on FPGA
  Platform by Paul, Rourab et al.
arXiv:1401.2732v1 [cs.AR] 13 Jan 2014
Fault Detection for RC4 Algorithm and its
Implementation on FPGA Platform
Rourab Paul1,∗, Amlan Chakrabarti2, Ranjan Ghosh3
Abstract
In a hardware implementation of a cryptographic algorithm, an attacker may
leak secret information by deliberately introducing controlled faulty bit(s),
even though the algorithm itself is mathematically secure. The technique is
very effective against crypto processors embedded in smart cards. In this
paper, a few fault-detecting architectures for the RC4 algorithm are designed
and implemented on a Virtex5 (ML505, LX110t) FPGA board. The results
indicate that the proposed architectures can handle most of the faults without
loss of throughput, while consuming only marginally more hardware and
power.
Keywords: RC4, FPGA, Fault Tolerance.
1. Introduction
The RC4 algorithm is very simple and is widely used as a stream cipher.
Today RC4 is part of many network protocols, e.g. SSL, TLS, WEP, WPA
and many others. Many cryptanalytic studies have looked into its key
weaknesses [1], followed by many new stream ciphers [2]. RC4 is still a
popular stream cipher since it executes fast and provides high security. It
is believed that a mathematically secure crypto algorithm becomes vulnerable
when it is implemented in hardware [3], since it becomes possible to extract
∗Corresponding author
1rourabpaul@gmail.com, A.K.C. School of I.T., University of Calcutta, Kolkata.
2Senior Member IEEE, A.K.C. School of I.T., University of Calcutta, Kolkata.
3Dumkal Institute of Engineering and Technology, Basantapur Education Society,
Murshidabad, after retirement from the Institute of Radio Physics and Electronics,
University of Calcutta, Kolkata.
Preprint submitted to ICCN-2013/ICDMW-2013/ICISP-2013 August 3, 2018
secret information by introducing faults in a controlled fashion. Fault
detection techniques therefore turn out to be a key issue in hardware
implementation. Moreover, shrinking device dimensions induce Single
Event Upsets (soft errors), i.e. changes of logic state caused by ions or
electromagnetic radiation striking the device. Dense devices use more
hardware components for faster processing, which in turn increases the
internal faults caused by ion beam radiation. This radiation causes state
changes in CLBs [4]. Usually two types of fault-tolerant circuits are in use:
Hardware Based Fault Tolerant (HBFT) circuits [5][6][7] and Algorithm
Based Fault Tolerant (ABFT) circuits [8][9][10][11]. For HBFT, faults are
detected either at the CLB level or at the LUT level.
A Hamming code based fault detecting and correcting scheme for stream
ciphers such as A5/1 (GSM), E0 (Bluetooth), RC4 (WEP) and W7 on a
hardware platform is proposed in [12]. Faults are not necessarily sourced
from the system itself. ABFT circuits at the communication level, with
specific reference to RC4, are proposed in [13] and [11]. In [13], a sequence
number is padded to each cipher character of RC4. In [11], data are stored
in a matrix and a 1-byte checksum is added to each row and column of the
matrix; for multiple error detection the authors use the Knight checksum.
Both methods detect the fault after the cipher text has been produced and
thus take some additional time.
There is a fair amount of literature on AES fault tolerance schemes [8][9].
The article [8] tries to identify the contagious sections of the AES algorithm,
i.e. the sections from which the probability of error spreading is maximum.
It observes how single-bit and multiple-bit errors spread over the data as
the algorithm iterates. The fault is located, followed by its detection.
The authors introduce a parity checker scheme (16 bits) on the input data
block (16 bytes), which can detect single-bit errors as well as odd
multiple-bit errors. The error detecting efficiency of this scheme is not high
enough for an efficient error-detecting crypto system. For the key scheduling
process, [8] proposes an inverse key scheduling module whose error detecting
efficiency is appreciable, but whose resource usage is twice that of the
original key scheduling module. In [9], three types of fault detection schemes
based on cyclic redundancy checks (CRC) are proposed.
In this paper an ABFT scheme is proposed for the RC4 stream cipher, in which
efficient fault-detecting hardware blocks are designed with the intention of
detecting the maximum number of errors using minimum resources. The proposed
scheme can detect faults at the very instant the ciphering is being executed.
 1.  N = 256;
 2.  for i = 0 to (N-1)              // Initialization module
 3.      S[i] = i;                   // Identity permutation
 4.      K[i] = key[i % l];
 5.  end for;
 6.  j = 0;                          // Storage module
 7.  for i = 0 to (N-1)
 8.      j = (j + S[i] + K[i]) % N;
 9.      swap(S[i], S[j]);
10.  end for;
                KSA Process

 1.  N = 256;
 2.  i = j = 0;
 3.  while (TRUE)                    // Generating key stream Z
 4.      i = (i + 1) % N;
 5.      j = (j + S[i]) % N;
 6.      swap(S[i], S[j]);
 7.      t = (S[i] + S[j]) % N;
 8.      output Z = S[t];
 9.  end while;
                PRGA Process
Figure 1: RC4 Algorithm
The fault blocks and the algorithm block are executed in parallel, so the
throughput remains unchanged. When a fault is detected, the system is
reset to alert the user. Here faults are only detected, not corrected. In the
absence of fault-detecting blocks, occurrences of faults would cause changes
in the power and timing parameters, which would give side channel attackers
information related to the secret key. The paper is organized as follows:
Section 2 gives an overview of the fault detection techniques. The
experimental results are summarized in Section 3. The conclusion and
references are given in Sections 4 and 5.
2. Fault detection techniques adopted for RC4
RC4 has two sequential algorithms, namely KSA (Key Scheduling Algorithm)
and PRGA (Pseudo Random Generation Algorithm), which are shown in
Figure 1. In RC4, a single-bit or multiple-bit error can change the value
of ′j′ (see line 8 of KSA and line 5 of PRGA in Figure 1) randomly, which
can address a wrong S-Box element in the subsequent swapping process.
So there is no correlation between the number of faulty bits and the number
of iterations of the algorithm. As seen in Figure 1, RC4 has an identity
S-Box S[N], N = 0 to 255, and a secret key, key[l], where l is typically
between 5 and 16, used to scramble the S-Box. The purpose of the KSA and
PRGA processes is to scramble the S-Box. As both processes have more or
less identical operations, the design of the fault tolerance modules is
discussed for one process only. The detailed hardware architecture of the
core algorithm has been described elsewhere [14].
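The two processes of Figure 1 can be sketched in software as follows. This is a minimal Python model for illustration only, not the hardware design of [14]; the function names ksa, prga and rc4_encrypt are assumptions introduced here.

```python
from typing import Iterator, List

N = 256  # S-Box size


def ksa(key: bytes) -> List[int]:
    """Key Scheduling Algorithm: scramble the identity S-Box with the key."""
    s = list(range(N))                      # identity permutation
    j = 0
    for i in range(N):
        j = (j + s[i] + key[i % len(key)]) % N
        s[i], s[j] = s[j], s[i]             # swap S[i] and S[j]
    return s


def prga(s: List[int]) -> Iterator[int]:
    """Pseudo random generation: yield one key stream byte Z per iteration."""
    s = s[:]                                # work on a copy of the S-Box
    i = j = 0
    while True:
        i = (i + 1) % N
        j = (j + s[i]) % N
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) % N]


def rc4_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the key stream; decryption is the same operation."""
    stream = prga(ksa(key))
    return bytes(b ^ next(stream) for b in data)
```

With the widely published test vector, rc4_encrypt(b"Key", b"Plaintext").hex() gives bbf316e8d940af0ad3, and applying rc4_encrypt again with the same key restores the plaintext.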
Arithmetic and logical operations are very fault-prone. The three common
operations of KSA and PRGA are the computation of ′i′, the computation of
′j′, and the retrieval of S[i] and S[j] for the swapping process. Any
malfunction in these operations may cause wrong encryption/decryption
results, which may be a clue for an attacker. ′i′ depends on a plain binary
up-counting process, whereas ′j′ depends on an addition operation. Any
abnormality in the computation of ′i′ or ′j′ can address the wrong S[i] and
S[j], resulting in a wrong swapping process. Any single or multiple fault in
the S-Box can spread quickly across the algorithm, hampering the randomness
of the algorithm as well as the authenticity of the cipher text.
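The spreading of a fault can be illustrated with a small software experiment. The sketch below is not part of the proposed hardware: it assumes a fault model in which one bit of ′j′ is flipped in a single PRGA round, and the names keystream, fault_round and fault_mask are hypothetical.

```python
from typing import List

N = 256


def ksa(key: bytes) -> List[int]:
    """Key scheduling as in Figure 1."""
    s = list(range(N))
    j = 0
    for i in range(N):
        j = (j + s[i] + key[i % len(key)]) % N
        s[i], s[j] = s[j], s[i]
    return s


def keystream(key: bytes, length: int,
              fault_round: int = -1, fault_mask: int = 0) -> List[int]:
    """Generate key stream bytes; optionally XOR fault_mask into j during
    the 0-based round fault_round to model an injected fault."""
    s = ksa(key)
    i = j = 0
    out = []
    for r in range(length):
        i = (i + 1) % N
        j = (j + s[i]) % N
        if r == fault_round:
            j ^= fault_mask                 # injected bit flip(s) in j
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % N])
    return out


good = keystream(b"Key", 64)
bad = keystream(b"Key", 64, fault_round=16, fault_mask=0x01)
# Number of key stream bytes that differ after the single-bit fault
differing = sum(g != b for g, b in zip(good, bad))
```

The bytes produced before the faulty round are identical in both runs; from the faulty round onwards the two S-Boxes no longer agree, so the remaining key stream bytes are effectively unrelated.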
To check that ′i′ behaves according to the algorithm, a new efficient
counter checker module is proposed. For the ′j′ computation, an efficient
addition checker is proposed. The correct retrieval of S[i] and S[j] is ensured
by a CRC checker on the S-Box. The three checkers are executed in parallel
with the main algorithm without hampering its throughput.
The three proposed fault blocks are shown in Figure 2 and the structure of
the CRC code is shown in Figure 3. Before the execution of each round, the
core algorithm checks the “no fault” signal contributed by the three
proposed fault checkers. Each of the three fault checkers feeds no_fault to
the algorithm block through an AND gate. Any fault detected by a particular
fault module can stop the execution of the algorithm at that clock
edge.
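The gating of the core by the three no_fault signals can be modelled in a few lines. This is only a behavioural sketch under the assumption that each checker is a function returning True for "no fault"; run_rounds and the checker callables are hypothetical names, not part of the design.

```python
from typing import Callable, Iterable, Sequence, TypeVar

State = TypeVar("State")


def run_rounds(rounds: Iterable[State],
               checkers: Sequence[Callable[[State], bool]]) -> int:
    """Execute rounds while every checker asserts no_fault.

    Returns the number of rounds that were executed before a fault
    (if any) stopped the algorithm.
    """
    executed = 0
    for state in rounds:
        # AND of the individual no_fault signals
        no_fault = all(check(state) for check in checkers)
        if not no_fault:
            break                           # stop at this round
        executed += 1
    return executed
```

For example, run_rounds(range(10), [lambda s: True, lambda s: s != 4, lambda s: True]) returns 4, because the second checker reports a fault when round 4 is about to execute.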
2.1. Error detection on S-Box: CRC Checker
To detect faults in the S-Box array, a standard CRC technique with a
degree-4 polynomial is used. A low-degree polynomial is used to reduce the
length of the redundant CRC bits. CRC has good efficiency in detecting
single-bit errors, double-bit errors and odd numbers of errors. A dedicated
hardware block to execute the CRC algorithm has not been designed, since that
would require huge hardware resources and large computation power, and might
cause synchronization problems with the main algorithm. Synchronization
is a sensitive issue because the main crypto core has a very high
throughput based on dual-edge sensitivity. To bypass this, a standard degree-4
divisor, X^4 + X^3 + 1, is chosen, and the four-bit residue is computed as the
CRC code and padded to each S-Box element; each element thus becomes
12-bit data instead of 8-bit, as shown in Figure 3. This new S-Box
is stored as two S-Boxes in the CRC hardware block (see Figure 4) as well
as in the S-Box of the main algorithm. The CRC block has two inputs, S[i]
and S[j]. In each
[Figure 2: block diagram. The Algorithm Core is connected to the CRC Block (S[i], S[j], Clk), the Addition Checker (AUGEND, ADDEND, SUMMATION, Clk) and the Counter Checker (i, Clk); each checker returns a NO FAULT signal.]
Figure 2: Fault Blocks
Table 1: CRC
# faulty bits   Possible combinations   CRC               CRC
                of faulty bits          detected faults   undetected faults
1               8C1 = 8                 8                 0
2               8C2 = 28                21                7
3               8C3 = 56                56                0
4               8C4 = 70                70                0
5               8C5 = 56                56                0
6               8C6 = 28                0                 28
7               8C7 = 8                 8                 0
8               8C8 = 1                 0                 1
Total faults    255                     219               36
Error detecting efficiency (%): 86
clock cycle, S[i] and S[j] are computed by the main algorithm block, and the
CRC module then checks whether each S-Box element is correct. If the CRC
block finds no error, the algorithm proceeds to the next clock cycle.
According to [15], a CRC checker can detect all odd numbers of bit errors
when the divisor polynomial is divisible by X + 1, and all isolated
double-bit errors when the divisor polynomial does not divide X^t + 1
(where t = 2 to 8). Table 1 shows the number of faulty bits, the number of
possible faulty-bit combinations, and the number of detected and undetected
faults for the CRC based method. As shown in Table 1, the efficiency
evaluates to 86%.
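The CRC encoding and check described in this section can be sketched in software. The snippet is a minimal model that assumes the 4-bit code is the remainder of the data byte multiplied by X^4 and divided by X^4 + X^3 + 1, with the code placed in the four low-order bits of the 12-bit word; the names crc4, encode, CRC_ARRAY and no_fault are introduced here for illustration.

```python
POLY = 0b11001          # divisor X^4 + X^3 + 1


def crc4(byte: int) -> int:
    """Remainder of byte * X^4 divided by POLY over GF(2)."""
    reg = byte << 4                         # append four zero bits
    for bit in range(11, 3, -1):            # message bits, MSB first
        if reg & (1 << bit):
            reg ^= POLY << (bit - 4)        # XOR the aligned divisor
    return reg & 0xF


def encode(byte: int) -> int:
    """Build the 12-bit word: 8 data bits followed by 4 CRC bits."""
    return (byte << 4) | crc4(byte)


# Precomputed 4-bit codes, one per possible S-Box value
CRC_ARRAY = [crc4(b) for b in range(256)]


def no_fault(word: int) -> bool:
    """Compare the CRC bits of a 12-bit word with the stored code."""
    return CRC_ARRAY[(word >> 4) & 0xFF] == (word & 0xF)
```

Using a precomputed table keeps the per-round work to one look-up and one comparison. In this model every single-bit flip in a 12-bit word makes no_fault return False, since the divisor does not divide any single power of X.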
2.1.1. Hardware design of CRC checker
The inputs of the CRC block are S[i] and S[j], each 12 bits wide: 8 bits of
data and 4 bits of CRC code. The CRC encoded data format is shown in Figure 3.
A look-up table of 4-bit width, called the CRC array, is already stored in a
′buffer′ with the appropriate CRC codes. The S[i] and S[j] ports of the
algorithm block are
connected to the S[i] and S[j] ports of the CRC block. On every rising edge
of the clock, the CRC encoded S[i] and S[j] data from the algorithm block are
checked against the CRC array by the CRC block. The CRC hardware
architecture is shown in Figure 4.
[Figure 3: a 12-bit CRC encoded word made of the 8-bit S-Box message followed by 4 CRC bits.]
Figure 3: CRC encoded data
[Figure 4: two 256:1 MUXes select entries 0 to 255 of the CRC array using S[i](11 down to 4) and S[j](11 down to 4); two clocked comparators compare the selected entries with S[i](3 down to 0) and S[j](3 down to 0) and produce no_fault.]
Figure 4: CRC hardware block
2.2. Error Detection on Addition Checker
There are a few standard error detecting techniques for arithmetic
operations [15], of which parity and residue techniques are the most popular
due to their low cost and high error detecting efficiency. The residue
technique is based on the modulus operation, which costs a huge hardware
footprint [16] on an FPGA based platform. For this reason the parity scheme
is adopted in this paper for the addition checker. The 8-bit data is split
into two 4-bit nibbles to increase the error detecting efficiency, as seen in
Table 2. It is now necessary to describe how the parity prediction scheme is
set up for the addition operation. If there are two
Table 2: Degree of Redundancy of proposed error detecting scheme
Type of EDC     Number of Redundant Bits   Degree of Redundancy
Byte parity     1                          1/8 = 12.5 percent
Nibble parity   2                          2/8 = 25 percent
numbers of 8-bit width, add and aug, then four parity bits, p(add lower),
p(add higher), p(aug lower) and p(aug higher), are computed in the following manner.
p(add lower) = add(0) xor add(1) xor add(2) xor add(3),
p(add higher) = add(4) xor add(5) xor add(6) xor add(7),
p(aug lower) = aug(0) xor aug(1) xor aug(2) xor aug(3),
p(aug higher) = aug(4) xor aug(5) xor aug(6) xor aug(7).
2.2.1. Prediction for arithmetic addition
It is well known that the parity of the sum of two natural numbers can be
obtained by XORing the parities of both summands and of all carries
propagated between any two adjacent bits, plus the possible carry-in into the
least significant position. Hence, with $\oplus$ denoting xor,
\[ p(add_{lower} + aug_{lower}) = p(add_{lower}) \oplus p(aug_{lower}) \oplus C_{in} \oplus \bigoplus_{i=0}^{3} C_{out}^{(i)} \]
\[ p(add_{higher} + aug_{higher}) = p(add_{higher}) \oplus p(aug_{higher}) \oplus C_{in} \oplus \bigoplus_{i=4}^{7} C_{out}^{(i)} \]
2.2.2. Hardware strategy of addition checker
In the KSA process, the inputs of the addition, namely ′j′, S[i] and K[i],
and the summation result are passed to the addition checker block. Using the
parity prediction technique, the addition checker fault block checks
whether the summation is right or wrong. The efficiency is about 75%, as
shown in Table 3. The same addition checker module is used for the
Z computation in lines 5 and 7 of the PRGA process.
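A software model of nibble parity prediction makes the 75% figure easy to reproduce. The sketch below handles a two-operand addition modulo 256 and XORs the carry entering each bit position of a nibble; this carry indexing is one reading of the prediction equations in Section 2.2.1 and is an assumption, as are the names predicted_nibble_parities, no_fault and detected_faults.

```python
from itertools import combinations
from typing import List


def parity(x: int) -> int:
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def predicted_nibble_parities(addend: int, augend: int,
                              carry_in: int = 0) -> List[int]:
    """Predict the parity of each nibble of (addend + augend) mod 256 from
    the operand parities and the carry entering every bit position."""
    predicted = []
    carry = carry_in
    for low in (0, 4):
        p = parity((addend >> low) & 0xF) ^ parity((augend >> low) & 0xF)
        for i in range(low, low + 4):
            p ^= carry                                  # carry entering bit i
            a, b = (addend >> i) & 1, (augend >> i) & 1
            carry = (a & b) | (carry & (a ^ b))         # carry leaving bit i
        predicted.append(p)
    return predicted


def no_fault(addend: int, augend: int, summation: int) -> bool:
    """True when both nibble parities of the sum match the prediction."""
    actual = [parity(summation & 0xF), parity((summation >> 4) & 0xF)]
    return actual == predicted_nibble_parities(addend, augend)


def detected_faults(addend: int, augend: int, weight: int) -> int:
    """Count the patterns of `weight` flipped sum bits that are caught."""
    correct = (addend + augend) % 256
    count = 0
    for bits in combinations(range(8), weight):
        mask = sum(1 << b for b in bits)
        if not no_fault(addend, augend, correct ^ mask):
            count += 1
    return count
```

Summing detected_faults over weights 1 to 8 gives 192 of the 255 possible error patterns, i.e. about 75%, with the per-weight counts 8, 16, 56, 32, 56, 16, 8 and 0 of Table 3. An error pattern escapes exactly when it flips an even number of bits in each nibble.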
2.3. Error Detection on i counter
Several techniques [17] have already been developed to improve the
reliability of binary counters. A completely new technique is proposed here,
which consumes very few resources and exhibits very high error detecting
efficiency. An interesting pattern has been observed in binary counting. If
the parity of the data at even bit positions is computed, the parity of the
first 4 sets of data is the complement of that of the next 4 sets. Similarly,
if the parity of the data at odd bit positions is computed, the parity of the
first 4 sets of data will also be
Table 3: Addition Checker
# faulty bits   Possible combinations   Addition Checker   Addition Checker
                of faulty bits          detected faults    undetected faults
1               8C1 = 8                 8                  0
2               8C2 = 28                16                 12
3               8C3 = 56                56                 0
4               8C4 = 70                32                 38
5               8C5 = 56                56                 0
6               8C6 = 28                16                 12
7               8C7 = 8                 8                  0
8               8C8 = 1                 0                  1
Total faults    255                     192                63
Error detecting efficiency (%): 75
Table 4: Binary counting pattern
Counting number   Parity of MSBs    Even parity   Odd parity
                  of two nibbles
00000000          0                 0             0
00000001          0                 1             0
00000010          0                 0             1
00000011          0                 1             1
00000100          0                 1             1
00000101          0                 0             1
00000110          0                 1             0
00000111          0                 0             0
00001000          1                 0             1
00001001          1                 1             1
...               ...               ...           ...
11111110          1                 1             0
11111111          0                 0             0
the complement of that of the next 4 sets. Another pattern exists in the
parity of the MSB of the upper nibble and the MSB of the lower nibble. The
parity patterns are shown in Table 4. Following these patterns, we have
designed a fault checker for the ′i′ counter which stores 8 consecutive
counts of ′i′ and decides whether the counting is right or wrong based on the
pattern prediction described in Table 4.
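The even-position rule can be illustrated with a short sketch. It models only that one rule, not the odd-position or MSB patterns of Table 4, and it rests on two assumptions made here: bit 0 is the least significant bit, and each block of 8 latched counts starts at a multiple of 8. The names EVEN_MASK, even_parity and no_fault are hypothetical.

```python
from typing import Sequence

EVEN_MASK = 0b01010101   # bit positions 0, 2, 4, 6 (bit 0 = LSB, assumed)


def parity(x: int) -> int:
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def even_parity(count: int) -> int:
    """Parity of the bits of count at even bit positions."""
    return parity(count & EVEN_MASK)


def no_fault(block: Sequence[int]) -> bool:
    """Check 8 latched counter values: the even-position parity of each of
    the first 4 values must be the complement of that of the value 4 later."""
    if len(block) != 8:
        raise ValueError("expected 8 latched counter values")
    first = [even_parity(c) for c in block[:4]]
    second = [even_parity(c) for c in block[4:]]
    return all(a ^ b == 1 for a, b in zip(first, second))
```

Under these assumptions a value and the value 4 counts later differ only in bit 2, which is why their even-position parities are complementary. A flip at an odd bit position leaves this particular check unchanged, so the other patterns of Table 4 are needed in addition.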
2.3.1. Hardware overview of counter checker
The main RC4 algorithm core increments ′i′ in every clock cycle. Every
8 sets of data are buffered into an array in the counter checker fault block.
The fault block separates the 8 sets of data into two parts of 4 sets each.
The fault checking algorithm checks the pattern given in Table 4 after
every 8 clock cycles and decides whether a fault has occurred; this
decision is fed to the main algorithm block. The error detecting
Table 5: Counter Checker
# faulty bits   Possible combinations   Counter Checker   Counter Checker
                of faulty bits          detected faults   undetected faults
1               8C1 = 8                 8                 0
2               8C2 = 28                20                8
3               8C3 = 56                56                0
4               8C4 = 70                55                15
5               8C5 = 56                56                0
6               8C6 = 28                20                8
7               8C7 = 8                 8                 0
8               8C8 = 1                 1                 0
Total faults    255                     224               31
Error detecting efficiency (%): 88
[Figure 5: timing diagram. After initialization (i=0, j=0) the counter starts; in each CLK cycle the core computes j=(j+S[i])%256, swaps S[i] & S[j] and produces one key stream byte Z. The Addition Checker checks j and the CRC Checker checks S[i] & S[j] in every cycle; the Counter Checker latches i1, i2, i3, ... and, when i8 is latched, decides whether a fault has occurred.]
Figure 5: Timing diagram of proposed fault modules with respect to main algorithm
efficiency of the proposed counter checker is shown in Table 5.
3. Results and discussion
The individual fault blocks have been implemented on a Xilinx Virtex5
FPGA board. The resource consumption of the 3 fault blocks is small
compared to the main architecture. Of the proposed fault blocks, the counter
checker and addition checker sub-blocks take very few resources compared
to the main architecture (0.09% & 0.31%), while the CRC checker sub-block
(50%) is resource hungry and takes considerably more. In addition, the
3 blocks use 45%, 0.2% & 0.26% of the LUTs compared to the main
architecture. The detailed estimation of resource usage is given in Table 7.
The Xilinx XPower tool is used to measure system power consumption [18].
The three fault blocks, CRC Checker, Counter Checker & Addition Checker,
consume 7%, 1.2% & 4.3% of the power of the main architecture. Resource
utilization Table
7 and power consumption Table 6 show that the additional fault blocks have
very low resource utilization and low power consumption, which is the
desirable goal for this kind of fault detection application on an FPGA based
platform. In an earlier paper [14] the RC4 algorithm was implemented in hardware
Table 6: Power consumption
Power (milliwatt)   Main Core   CRC Checker   Counter Checker   Addition Checker
Total Power         994.72      70.58         12.81             43.25
Quiescent Power     914.87      30.71         0.67              2.29
Dynamic Power       52.86       67.1          11.9              40.96
Clock Power         47.33       19.03         7.11              0.19
Logic Power         0.60        4.47          0.06              0.13
IOs Power           0.60        4.47          0.06              0.13
Signal Power        4.74        43.39         4.94              41.66
Table 7: Resource utilization
Logic Usage (#)             Main Core   CRC Checker   Counter Checker   Addition Checker
Slice Register              4139        2042          4                 13
Slice LUT                   12560       5653          26                33
Fully used LUT-FF pairs     4132        2034          52                78
using a Virtex 5 FPGA, in which an execution speed of approximately 1 byte
per clock was achieved by carrying out the addition (line 5 of the PRGA
process) on the rising edge of a clock pulse and the swapping and key stream
generation (lines 6 and 7 of the PRGA process) on the falling edge of the
same clock pulse, with a loss of one initial clock pulse. The timing
diagrams of the three proposed fault modules with respect to the main
algorithm clock are shown in Figure 5. At the falling edge of every 8th
consecutive clock, ′i′ is checked; the addition checker executes at every
rising edge and the CRC checker at every falling edge.
It is evident that the fault modules are designed in hardware such
that the throughput of the main RC4 algorithm remains unchanged.
4. Conclusion
In this paper, three low cost fault blocks are designed for RC4 and
implemented on an FPGA. They operate concurrently with the main algorithm,
consume little power and few resources, and provide run time fault
detection without affecting its throughput. On detection of even
one fault, the algorithm ceases execution. Had the main algorithm and fault
blocks been executed sequentially, the throughput would have been reduced
and the attacker would have been able to obtain the secrets of the algorithm
by observing power and timing parameters.
5. References
[1] S. Paul and B. Preneel, “A new weakness in the rc4 keystream generator
and an approach to improve the security of the cipher,” in FSE, 2004,
pp. 245–259.
[2] T. Good and M. Benaissa, “Hardware results for selected stream cipher
candidates,” in of Stream Ciphers 2007 (SASC 2007), Workshop Record,
2007, pp. 191–204.
[3] S. Ghosh, D. Mukhopadhyay, and D. R. Chowdhury, “Petrel: Power
and timing attack resistant elliptic curve scalar multiplier based on pro-
grammable gf(p) arithmetic unit,” IEEE Trans. on Circuits and Sys-
tems, vol. 58-I, no. 8, pp. 1798–1812, 2011.
[4] M. Nikodem, “Error prevention, detection and diffusion algorithms for
cryptographic hardware,” in Dependability of Computer Systems, 2007.
DepCoS-RELCOMEX ’07. 2nd International Conference on, june 2007,
pp. 127 –134.
[5] M. P. John Lach, William H. Mangione-Smith, in Efficiently Supporting
Fault-Tolerance in FPGAs, August 31 ACM Digital Library. 1998, pp.
138 –148.
[6] N. G. K. Paul, in Hardware Controlled and Software Independent Fault
Tolerant FPGA Architecture, August 31 15th International Conference
on Advanced Computing and Communications.IEEE 2007, pp. 138 –148.
[7] K. K. et al, in A NOVEL SRAM-BASED FPGA ARCHITECTURE
FOR EFFICIENT TMR FAULT TOLERANCE SUPPORT, August 31
IEEE Xplore 2009, pp. 138 –148.
[8] G. B. et al, in Error Analysis and Detection Procedures for a Hardware
Implementation of the Advanced Encryption Standard, vol. 52, no. 4,
April IEEE TRANSACTIONS ON COMPUTERS,2003.
[9] C.-H. Y. et al, in Simple Error Detection Methods for Hardware Im-
plementation of Advanced Encryption Standard, vol. 55, no. 6, IEEE
TRANSACTIONS ON COMPUTERS, 2006.
[10] J. B. et al, in ERROR CORRECTION PROCEDURES FOR A HARD-
WARE IMPLEMENTATION OF THE ADVANCED ENCRYPTION
STANDARD, vol. 55, no. 6, June ELSEVIER, 2006.
[11] C. N. Z. et al, in Multiple Dimensional Fault Tolerant Schemes for
Crypto Stream Ciphers, vol. 2, no. 3, July International Journal of Net-
work Security & Its Applications (IJNSA).
[12] J. M. et al, in On the Synthesis of Attack Tolerant Cryptographic
Hardware, IEEE, 2010.
[13] M. Ali, M. Eltabakh, and C. Nita-rotaru, “Ft-rc4: A robust security
mechanism for data stream systems,” 2005.
[14] R. Paul, S. Saha, J. K. M. S. U. Zaman, S. Das, A. Chakrabarti, and
R. Ghosh, “A simple 1-byte 1-clock rc4 design and its efficient implemen-
tation in fpga coprocessor for secured ethernet communication,” CoRR,
vol. abs/1205.1737, 2012.
[15] E. W. W. Peterson, in Error Correcting Codes, 2nd ed., MIT press,
1972.
[16] R. Paul, S. Saha, C. Pal, and S. Sau, “Novel architecture of modular ex-
ponent on reconfigurable system,” in Engineering and Systems (SCES),
2012 Students Conference on, march 2012, pp. 1 –6.
[17] W. N. TOY, in A Novel Parallel Binary Counter Design with Parity
Prediction and Error Detection Scheme, vol. C-20., no. 1, IEEE TRANS-
ACTIONS ON COMPUTERS, JANUARY 1971.
[18] Xpower. http://www.xilinx.com/support/documentation/sw_manuals/xilinx11/ug733.pdf.
