Energy, Performance, Area versus Security Trade-offs for Stream Ciphers by Batina, L. et al.
PDF hosted at the Radboud Repository of the Radboud University
Nijmegen
 
 
 
 
The following full text is a preprint version which may differ from the publisher's version.
 
 
For additional information about this publication click this link.
http://repository.ubn.ru.nl/handle/2066/127475
 
 
 
Please be advised that this information was generated on 2017-03-09 and may be subject to
change.
Energy, performance, area versus security trade-offs for stream
ciphers∗
Lejla Batina, Joseph Lano, Nele Mentens, Sıddıka Berna Örs,
Bart Preneel, Ingrid Verbauwhede
Katholieke Universiteit Leuven, ESAT/SCD-COSIC
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium
{Lejla.Batina,Joseph.Lano,Nele.Mentens,Siddika.BernaOrs,
Bart.Preneel,Ingrid.Verbauwhede}@esat.kuleuven.ac.be
Abstract
The goal of this submission is to provide a framework and platform to compare stream
ciphers not only on their security level but also based on their energy consumption, per-
formance and area cost. We describe the basic hardware assumptions, give the area, delay
and power consumption values of some existing stream ciphers, and give guidelines for the
design of future algorithms.
Keywords: E0, A5/1, RC4, hardware implementation, power consumption
1 Introduction
(Existing) stream ciphers have a reputation for being very area-efficient when implemented in
dedicated hardware, yet at the same time they have a sequential nature and might consume
a lot of energy in software. The goal of this contribution is to quantify these claims ("Are
they true?") and to provide algorithm developers with a framework that enables them not
only to assess the security of newly proposed algorithms, but also to make a security versus
area, energy or performance trade-off. The area can be expressed in the number of gates or
the amount of memory required to run the algorithm. Performance can be measured in the
number of encryption bits versus the clock frequency, which is technology dependent. The
energy consumption is measured instead of power consumption because many new applications
will end up in battery-operated devices. For these devices, the instantaneous power
consumption is less important, but the number of encryptions per unit of energy (joule) will
determine the lifetime of the device. In this paper we look at hardware implementations or
co-processors for stream ciphers. Software implementations on existing processors are another
interesting topic, but are only discussed as a reference point. The main reason is that software
implementations of encryption algorithms are usually orders of magnitude slower and consume
more energy than dedicated HW co-processors (as will be illustrated in Sections 3, 4 and 5).
We expect similar or even more pronounced differences for stream ciphers. In the next section,
we give a short overview of stream ciphers. In the subsequent sections, we discuss
different issues when evaluating area, performance and energy.
∗ Lejla Batina, Nele Mentens and Sıddıka Berna Örs are funded by research grants of the Katholieke
Universiteit Leuven, Belgium, Joseph Lano by IWT-Vlaanderen. This work was supported by FWO projects
G.0141.03 and G.0450.04.
2 Stream ciphers
In the 70s and 80s, research on stream ciphers was mainly focused on the development of
efficient stream ciphers in hardware. Linear Feedback Shift Registers (LFSRs) are the main
building blocks of these ciphers, because of their good statistical properties and their efficiency
in hardware. To make these LFSRs cryptographically secure, the two most common practices
are the use of a Boolean function (sometimes complemented with some nonlinear memory bits)
and the irregular clocking of LFSRs. An extensive overview of the design and the cryptanalysis
of such designs can be found in [13]. Two widely used stream ciphers based on this research
are A5 [1] used in GSM mobile phones and E0 used in the Bluetooth standard [15].
In the 90s, many stream ciphers were proposed that achieve a high performance in soft-
ware, such as LEVIATHAN (Cisco), MUGI (Hitachi-K.U. Leuven), RC4 (R. Rivest), SNOW
(Lund University), SOBER (Qualcomm) and SEAL (IBM). An overview can be found in [12].
Because LFSRs have been very well-studied, some of these software-based stream ciphers use
word-oriented versions of LFSRs. Other designs introduce new approaches, based on concepts
coming from various fields such as block cipher design, chaos theory and random shuffles. No
unification has been made in this field.
Recently, algebraic attacks [4] have emerged as a new powerful class of attacks against
LFSR-based stream ciphers. Though the impact of these attacks is not yet entirely clear,
they have incited researchers to search for new methodologies that can replace the LFSRs
and that are immune to the many classes of attacks that can be applied to LFSRs. Two
proposals are the counter-assisted number generators [14] and t-functions [10].
3 Area evaluation
In CMOS standard-cell-based hardware, one NAND gate is taken as the unit of area.
On this basis, Table 1 lists the operations used in E0, A5/1 and RC4 together with their
gate counts, and Table 2 gives the resulting total area of E0, A5/1 and RC4.
The gate count numbers given in Table 2 are lower bounds. When the circuits are im-
plemented in practice some overhead is introduced such as buffers, wiring area, etc. Table 1
can be used as a guideline to find the gate count of the operations used in existing and newly
proposed stream ciphers. This method is also useful to compare different designs. We now
discuss the stream ciphers E0, RC4 and A5/1 in more detail.
• The only published hardware implementation of the E0 stream cipher was given by
Kitsos et al. in [8]. It is synthesized, placed and routed using the XILINX field-
programmable gate array (FPGA) (Virtex-E V2600E-FG1156). The system clock fre-
quency is 15 MHz. It uses 895 configurable logic blocks (CLBs), 1 508 function genera-
tors and 300 D flip-flops.
We have also implemented the E0 stream cipher using the XILINX FPGA (Virtex-HQ800).
The system clock frequency is 93.36 MHz. It uses 132 slice flip-flops and 140
4-input look-up tables (LUTs). The total equivalent gate count is 1 902.
Table 1: Gate count of the operations used in E0, A5/1 and RC4
Operation        Gate count
n-bit LFSR       n × 12
4-bit adder      1 FA + 2 HA = 5 AND + 2 OR + 4 XOR = 20.5
FF               12
XOR gate         2.5
AND gate         1.5
OR gate          1.5
n-byte RAM       n × 12
n-bit register   n × 12
8-bit adder      1 HA + 7 FA = 1 XOR + 1 AND + 7 × (3 AND + 2 OR + 2 XOR) = 91.5
2-bit MUX        2 AND + 1 OR + INV = 5
Table 2: Operations used in E0, A5/1 and RC4 and the total area using Table 1.
Stream cipher    Operations                                                    Gate count
E0               128-bit LFSR, 4-bit adder, 4 FFs, 13 XOR gates                1 637
A5/1             61-bit LFSR, 5 XOR gates, 3 AND gates, 2 OR gates             752
RC4              4 256-byte RAMs, 4 8-bit registers, 3 8-bit adders, 2-bit MUX 12 951.5
• A hardware implementation of RC4 is given by Kitsos et al. in [7]. It consists of
a control and a storage unit. The storage unit is responsible for the key setup and
key stream generation. The storage unit contains memory elements for the S-Box and
K-Box, along with 8-bit registers, adders and one multiplexer. The whole design was
synthesized, placed and routed using a XILINX FPGA device. It uses 139 I/Os, 256 function
generators, 138 CLB slices, 279 D flip-flops or latches, and 768 bytes of RAM blocks.
The clock frequency is 64 MHz and the throughput after setup is reported as 22 MB/s.
For the full execution of the algorithm the proposed implementation needs 768 + 3 ∗ n
clock cycles (n is the number of bytes of the plaintext/ciphertext).
• A5 consists of three Fibonacci LFSRs of sizes 19, 22, and 23 respectively, which are
initially loaded with the contents of the 64-bit key. The middle bits of all three LFSRs
are examined at each clock cycle to determine which registers shift and which do not
(at least two of the three registers shift in each clock cycle). The parity of the high bits
of the LFSRs is output after each shift, and this output bitstream is XOR’d with the
ciphertext to recover the original message. The resource requirements for A5 are quite
minimal; they consist mainly of the 64 flip-flops that make up the three LFSRs.
We have implemented the A5/1 stream cipher using the XILINX FPGA (Virtex-HQ800).
The system clock frequency is 90.85 MHz. It uses 64 slice flip-flops and 70 4-input look-up
tables (LUTs). The total equivalent gate count is 932.
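The totals in Table 2 follow mechanically from the unit costs in Table 1. The short Python sketch below recomputes them as a cross-check. The function and variable names are ours, and the inverter cost of 0.5 is not listed in Table 1 but is implied by its 2-bit MUX row (2 AND + 1 OR + INV = 5).

```python
# Unit costs from Table 1, in NAND-gate equivalents.
XOR, AND, OR, FF = 2.5, 1.5, 1.5, 12.0
INV = 0.5  # assumption: implied by "2 AND + 1 OR + INV = 5" in Table 1


def storage(n):
    """n-bit LFSR, n-byte RAM and n-bit register are all counted as n * 12 in Table 1."""
    return n * 12.0


HALF_ADDER = XOR + AND                   # 1 XOR + 1 AND
FULL_ADDER = 3 * AND + 2 * OR + 2 * XOR  # 3 AND + 2 OR + 2 XOR
ADDER_4BIT = FULL_ADDER + 2 * HALF_ADDER  # 1 FA + 2 HA
ADDER_8BIT = HALF_ADDER + 7 * FULL_ADDER  # 1 HA + 7 FA
MUX_2BIT = 2 * AND + OR + INV

# Operation lists taken from Table 2.
AREA = {
    "E0": storage(128) + ADDER_4BIT + 4 * FF + 13 * XOR,
    "A5/1": storage(61) + 5 * XOR + 3 * AND + 2 * OR,
    "RC4": 4 * storage(256) + 4 * storage(8) + 3 * ADDER_8BIT + MUX_2BIT,
}

if __name__ == "__main__":
    for cipher, gates in AREA.items():
        print(f"{cipher}: {gates} gate equivalents")
```

The same helper costs can be reused to estimate a lower bound on the gate count of a new design once its list of operations is known.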
4 Performance evaluation
4.1 Performance for bulk encryption
Stream ciphers are sequential in nature, which means they cannot be pipelined. LFSRs are
examples of bit serial design [11]. To obtain high performance, such designs have to be
clocked at a very high frequency. In general the throughput is given by Eq. (1):
\[ \text{Throughput} = N \cdot \text{clock frequency}, \qquad (1) \]
where N is the number of bits produced in every clock cycle.
In the case of E0 and A5/1, there is only one bit produced at each clock (N = 1), hence
the throughput of the circuit is equal to the clock frequency. As the critical path of these
two designs is very short, we still can achieve a good HW performance because a very fast
clock can be used. RC4 on the other hand produces 8 bits at a time, and it needs 3 clocks
to produce 8 bits of output, thus N = 8/3 for RC4. The implementation of RC4 in HW is
much more complicated than that of the LFSR-based designs: its critical path is longer and its
clock frequency lower. The lower clock frequency of RC4 is, however, to some extent compensated by
the higher number of bits N produced at each clock. The throughputs of the E0, A5/1 and RC4
implementations described in this paper are given in Table 3. A comparison of these figures
is not fair as the implementations have a varying degree of optimization and do not use the
same FPGA.
Table 3: Throughput of E0, A5/1 and RC4 implementations in HW
Stream cipher    Hardware throughput
E0               93 Mbits/s
A5/1             90 Mbits/s
RC4              171 Mbits/s
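As a quick check of Eq. (1) against Table 3, the sketch below multiplies the number of bits per clock by the clock frequencies reported in Section 3. The function name and dictionary layout are ours; the numeric inputs are those quoted in the text (N = 1 for E0 and A5/1, N = 8/3 for RC4).

```python
def throughput_mbps(bits_per_clock, clock_mhz):
    """Eq. (1): throughput = N * clock frequency (MHz in, Mbits/s out)."""
    return bits_per_clock * clock_mhz


# (N, clock frequency in MHz) as reported for each implementation.
IMPLEMENTATIONS = {
    "E0": (1, 93.36),
    "A5/1": (1, 90.85),
    "RC4": (8 / 3, 64.0),  # 8 bits every 3 clocks
}

THROUGHPUT = {
    name: throughput_mbps(n, f_mhz) for name, (n, f_mhz) in IMPLEMENTATIONS.items()
}

if __name__ == "__main__":
    for name, mbps in THROUGHPUT.items():
        print(f"{name}: {mbps:.2f} Mbits/s")
```

The computed values agree with Table 3 up to rounding of the table entries.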
In software, the reasoning is different. The clock frequency is now a constant for a given
processor. So increasing the performance of the stream cipher should be achieved by making
N as high as possible and the number of instructions to generate the N bits as low as possible.
This can be done by trying to achieve as much as possible per clock cycle, by an efficient usage
of the processor capabilities and its word size. Bit serial designs, such as E0 and A5/1, are
evidently not efficient on current processors with 32-bit or 64-bit word lengths. RC4 is much more
efficient with its byte-oriented approach, though it still does not make optimal use of the word
sizes (but this is largely compensated by the simplicity of the cipher in software).
Because of these limitations on the throughput, designers have moved away from bit serial
design and have adopted a “word serial” approach, where a whole word (8, 32 or 64 bits)
is output instead of a single bit. The throughput then increases with both the clock
frequency and the word length:
\[ \text{Throughput} = N \cdot \text{Wordlength} \cdot \text{clock frequency} \qquad (2) \]
where N is the number of words produced in every clock cycle.
As can be seen from Eq. (1) and Eq. (2), producing more than one bit at a time will
increase the throughput of future stream ciphers. To increase the “word length”,
the operations within one algorithm can be executed in parallel. Using a pipeline structure
will also increase the clock frequency. However, stream ciphers with tight feedback loops
(meaning that the current input depends on recent outputs) cannot be pipelined. Thus it is
important to avoid tight feedback loops in stream cipher algorithms.
4.2 Performance in actual applications
Most stream ciphers used in practice are so-called synchronous stream ciphers. In these
stream ciphers the key stream is generated independently from the plaintext. The advantage
of such designs is that there is no error propagation. A drawback is that sender and receiver
need to remain synchronized.
In many applications (GSM, Bluetooth, Internet, . . . ), this problem is overcome by fre-
quent resynchronization of the stream cipher: the ciphertext is divided into small packets
which are encrypted and sent separately. In GSM such a packet is 224 bits, in Bluetooth it
is at most 2745 bits. Half of the packets circulating on the Internet are only 512 bits long.
To prevent the reuse of key stream, at every resynchronization the secret key is combined
with a counter unique to every packet. The easiest way would be to do so by linearly loading
both values into the state of the stream cipher. However, it has been shown that this leads to
cryptanalytic attacks and thus this resynchronization has to be done in a sufficiently nonlinear
way, see [5, 2].
Resynchronizing thus induces some overhead: the algorithm has to perform several clocks
before it can start outputting information. The shorter the packets, the more important this
overhead becomes.
To quantify this overhead, we define the Resynchronization Factor (RF) as follows:
\[ \mathrm{RF} = \frac{\text{total number of clocks performed}}{\text{number of clocks performed to generate key stream}}. \qquad (3) \]
The RF will increase the clock frequency required from the design to achieve a certain
throughput in the following way compared to Eq. (2):
\[ \text{clock frequency} = \frac{\text{Throughput}}{N \cdot \text{Wordlength}} \cdot \mathrm{RF} \qquad (4) \]
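Eq. (3) and Eq. (4) can be evaluated with a few lines of Python. The sketch below assumes that the total number of clocks is the sum of the initialization clocks and the key stream clocks. The function names and the example figure of 100 initialization clocks are hypothetical and not taken from any of the ciphers discussed here.

```python
def resync_factor(init_clocks, keystream_clocks):
    """Eq. (3), assuming total clocks = initialization clocks + key stream clocks."""
    return (init_clocks + keystream_clocks) / keystream_clocks


def required_clock_hz(throughput_bps, words_per_clock, word_length, rf):
    """Eq. (4): clock frequency needed to sustain a throughput, given the RF."""
    return throughput_bps / (words_per_clock * word_length) * rf


# Hypothetical bit-serial cipher: 100 initialization clocks (illustrative value),
# one bit per clock, 224-bit packets as in GSM.
RF_SHORT = resync_factor(init_clocks=100, keystream_clocks=224)
# The same cipher with a very long packet: the overhead becomes negligible.
RF_LONG = resync_factor(init_clocks=100, keystream_clocks=10**6)

if __name__ == "__main__":
    print(f"RF for 224-bit packets: {RF_SHORT:.3f}")
    print(f"RF for 10^6-bit packets: {RF_LONG:.5f}")
    f_clk = required_clock_hz(1e6, words_per_clock=1, word_length=1, rf=RF_SHORT)
    print(f"clock needed for 1 Mbit/s: {f_clk / 1e6:.3f} MHz")
```

The example shows the general trend: the shorter the packet, the further the RF rises above 1 and the higher the clock frequency needed for a given throughput.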
To be usable in actual applications, stream ciphers should achieve a good tradeoff resulting
in a secure and fast resynchronization mechanism. In Fig. 1, we present the RF as a function
of the length of the packets for the three commonly used stream ciphers A5/1, E0 and RC4¹.
As we can see, the overhead of RC4 is significantly larger than for the other algorithms, and
becomes a problem for small packets. This can be explained by the fact that the state of RC4
is much bigger, and initializing this state requires more work.
[Figure 1: RF as a function of the packet length for A5/1, E0 and RC4. Horizontal axis: length of the packet in bits (10^2 to 10^4); vertical axis: Resynchronization Factor (RF), 0 to 10.]
¹ For RC4 we have assumed that the first 256 bytes of key stream are discarded. This is done in most actual
applications to overcome some weaknesses of the RC4 algorithm.
5 Energy evaluation
The power consumption in CMOS circuits is primarily attributed to the voltage changes on
load capacitors and is given by
\[ P_{SWITCHING} = \alpha_{0\leftarrow 1} C_L V_{dd}^2 f_{clk} \qquad (5) \]
where $f_{clk}$ is the clock frequency, $V_{dd}$ is the operating supply voltage, $C_L$ is the output load
capacitance and $\alpha_{0\leftarrow 1}$ is the node transition activity factor, i.e. the fraction of time the
node makes a power-consuming transition (0 ← 1 and 1 ← 0 transitions) inside the clock
period.
The energy consumed by a circuit can be divided into two main categories. The first is the
fundamental energy, which is the energy necessary to calculate the intended result. The second is the
“overhead” energy, which is the energy associated with all overhead functions, e.g. moving data
around, logic that switches but does not create outputs, etc. This overhead is huge in software
implementations because there is a large mismatch between the cryptographic operations
and the data paths of the general purpose processors. Hence in embedded devices, crypto
algorithms are usually implemented in hardware accelerators, or dedicated co-processors.
To illustrate this point, we have shown that a dedicated co-processor for AES has an
energy efficiency which is multiple orders of magnitude better than assembly, C and JAVA
implementations. A non-pipelined version of AES can run at 2 to 3 Gbits/s at around 50
mW. This results in around 40 Gbits/Joule encryption performance. The same AES, assembly
optimized on a Pentium XX processor, reaches around 800 Mbits/s (same order as the HW
coprocessor) but requires 40 W to run. This results in only 15 Mbits/Joule. The performance
of compiled software on so-called low power embedded processors is even worse. For instance
an AES written in C and compiled on an embedded 32-bit Sparc processor will achieve 150
Kbits/s at 120 mW, which results in 1.1 Mbits/Joule. The reason is that there are two layers of
inefficiency, one is the general purpose compiler, the other is the general purpose architecture.
These inefficiencies need to be taken into account when developing new stream ciphers.
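The energy efficiency figures above are throughput divided by power. The sketch below applies that ratio to the rounded throughput and power values quoted in this section, so the results match the quoted efficiencies only approximately (same order of magnitude). The function name is ours, and the lower end of the 2 to 3 Gbits/s range is assumed for the co-processor.

```python
def bits_per_joule(throughput_bps, power_w):
    """Energy efficiency: (bits/s) / (J/s) = bits per joule."""
    return throughput_bps / power_w


# Rounded figures quoted in the text.
EFFICIENCY = {
    "AES co-processor": bits_per_joule(2e9, 50e-3),      # 2 Gbits/s at 50 mW
    "AES assembly, Pentium": bits_per_joule(800e6, 40.0),  # 800 Mbits/s at 40 W
    "AES C, embedded Sparc": bits_per_joule(150e3, 120e-3),  # 150 Kbits/s at 120 mW
}

if __name__ == "__main__":
    for platform, eff in EFFICIENCY.items():
        print(f"{platform}: {eff / 1e6:.2f} Mbits/Joule")
```

With these inputs, the dedicated co-processor is more than three orders of magnitude more energy efficient than the assembly implementation, and more than four orders of magnitude more efficient than the compiled C implementation.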
5.1 LFSR designs for low power consumption
Some research has been done on low-power LFSR design, albeit not for cryptographic applications.
Hamid and Chen proposed to use polynomials with two coefficients having the following
format; P (x) = 1 + x1/n + xn, where n is the order of the polynomial in [6]. Equations for
power dissipation of the conventional serial LFSR and of their LFSR are shown below. Here
it is assumed that the power dissipated by the clock is included in the power dissipation of the
D flip-flop, that the activity of the circuit is 50%, and that parallel LFSRs do not have any input
switches.
\[ P_{serial} = N (P_{dff}/2) + (M - 1) (P_{xor}/2). \qquad (6) \]
\[ P_{Hamid-Chen} = 3 P_{dff} + 4 P_{and} + 2 P_{or} + (P_{xor}/2). \qquad (7) \]
Here N is the number of stages, M is the number of taps, and $P_{dff}$, $P_{xor}$, $P_{or}$ and $P_{and}$ are the power
dissipation of a D flip-flop, an XOR gate, an OR gate and an AND gate, respectively, each with one output
capacitance as load.
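Eq. (6) and Eq. (7) are simple enough to evaluate directly. The sketch below does so for the (N, M − 1) pairs that Table 4 lists for E0 and A5/1. The per-cell power values are arbitrary placeholder units chosen only for illustration; they are not measurements, and the function names are ours.

```python
def p_serial(n_stages, n_taps, p_dff, p_xor):
    """Eq. (6): power of a conventional serial LFSR."""
    return n_stages * (p_dff / 2) + (n_taps - 1) * (p_xor / 2)


def p_hamid_chen(p_dff, p_and, p_or, p_xor):
    """Eq. (7): power of the Hamid-Chen LFSR, independent of N and M."""
    return 3 * p_dff + 4 * p_and + 2 * p_or + p_xor / 2


# Hypothetical cell powers in arbitrary units (assumption, not from the paper).
P_DFF, P_XOR, P_AND, P_OR = 1.0, 0.25, 0.125, 0.125

# E0: N = 128 stages, M - 1 = 12; A5/1: N = 64 stages, M - 1 = 5 (as in Table 4).
POWER_E0 = p_serial(128, 13, P_DFF, P_XOR)
POWER_A51 = p_serial(64, 6, P_DFF, P_XOR)
POWER_HC = p_hamid_chen(P_DFF, P_AND, P_OR, P_XOR)

if __name__ == "__main__":
    print("E0 (serial):", POWER_E0)
    print("A5/1 (serial):", POWER_A51)
    print("Hamid-Chen LFSR:", POWER_HC)
```

Whatever cell powers are plugged in, the flip-flop term dominates Eq. (6) for long registers, while Eq. (7) does not grow with the register length.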
Brazzarola and Fummi analyze the problem of selecting the set of primitive polynomial
LFSRs that minimize their switching activity when working in isolation [3]. All latches of an LFSR
make the same number of switches in one period. The behavior of all latches is characterized
by the behavior of the first latch translated in time.
They give the following corollary: the number of switches of an LFSR latch is $2^{n-1} - 1$ for the
first and the last latch and $2^{n-1}$ for the other latches, without considering the transition
to the initial state.
Theorem 1 and its corollary given in [3] imply that for a given size n different values of
power consumption between LFSRs do not depend on the switching activity of the latches
since it is the same, but only on the XOR gates.
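The switching behavior described above can be observed on a small example. The sketch below simulates a 4-bit Fibonacci LFSR whose feedback is the XOR of its last two stages (a choice that gives the maximal period of 15) and counts the switches of each latch over one period. The register size, taps and initial state are illustrative assumptions and are not taken from [3].

```python
def lfsr_states(taps, seed):
    """Return the states of a Fibonacci LFSR over one full period, starting at seed."""
    seed = tuple(seed)
    state = seed
    states = []
    while True:
        states.append(state)
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        # Shift right by one position and insert the feedback bit at the front.
        state = (feedback,) + state[:-1]
        if state == seed:
            return states


def switch_counts(states, wrap):
    """Count bit changes per latch; wrap=True also counts the step back to the seed."""
    pairs = list(zip(states, states[1:]))
    if wrap:
        pairs.append((states[-1], states[0]))
    width = len(states[0])
    return [sum(a[i] != b[i] for a, b in pairs) for i in range(width)]


STATES = lfsr_states(taps=(3, 2), seed=(1, 0, 0, 0))
WITHOUT_WRAP = switch_counts(STATES, wrap=False)
WITH_WRAP = switch_counts(STATES, wrap=True)

if __name__ == "__main__":
    print("period:", len(STATES))
    print("switches per latch, without return to initial state:", WITHOUT_WRAP)
    print("switches per latch, full cycle:", WITH_WRAP)
```

For this register and initial state, every latch switches 2^(n-1) = 8 times over the full cycle, and leaving out the transition back to the initial state gives 7 switches for the first and last latch and 8 for the others, in line with the corollary.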
Kitsos et al. proposed a reconfigurable LFSR design consisting of two basic compo-
nents [9]. The first is the LFSR Data Component, which contains a collection of 16 linear
feedback polynomials. The polynomial degrees vary from 8 to 128 according to the Bluetooth
system. The second is the LFSR Clock Component which distributes the clock signal in the
LFSR F/Fs. So, this component determines how many F/Fs are active according to the
proper polynomial degree. This is achieved by using the LFSR Control Signal (LCS).
The critical path of the proposed LFSR is 2.912 ns and the bit rate is 340.7 Mbits/s.
The number of library cells is 884. Table 4 in [9] reports the LFSR power consumption
measurements. The average is 0.77 mW.
According to Eq. (6), the power consumption of the stream ciphers E0 and A5/1 can be
calculated as given in Table 4.
Table 4: Power consumption of E0 and A5/1
Stream cipher    Power consumption (hardware)
E0               128 (Pdff/2) + 12 (Pxor/2)
A5/1             64 (Pdff/2) + 5 (Pxor/2)
6 Conclusions
The main purpose of this paper is to link the major design criteria, area, throughput and
power consumption, to the basic components of stream ciphers. We have also formulated
guidelines for the stream cipher designer to improve these design goals.
References
[1] R. Anderson. A5 (was: Hacking digital phones). sci.crypt post, June 1994.
[2] F. Armknecht, J. Lano and B. Preneel. Extending the Resynchronization Attack (extended
version). Cryptology ePrint Archive, Report 2004/232. http://eprint.iacr.org/.
[3] M. Brazzarola and F. Fummi. Power characterization of LFSRs. In Proceedings of the
14th International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT),
pages 139–147, Albuquerque, NM, USA, November 1-3 1999. IEEE Computer Society.
[4] N. Courtois and W. Meier. Algebraic attacks on stream ciphers with linear feedback.
In E. Biham, editor, Advances in Cryptology: Proceedings of EUROCRYPT’03, volume
2656 of Lecture Notes in Computer Science, pages 345–359. Springer-Verlag, 2003.
[5] J. Daemen, R. Govaerts and J. Vandewalle. Resynchronization Weaknesses in Synchronous
Stream Ciphers. In T. Helleseth, editor, Advances in Cryptology: Proceedings
of EUROCRYPT’93, volume 765 of Lecture Notes in Computer Science, pages 159–167.
Springer-Verlag, 1993.
[6] M.E. Hamid and C.-I. H. Chen. A note to low-power linear feedback shift registers.
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing,
45(9):1304–1307, September 1998.
[7] P. Kitsos, G. Kostopoulos, N. Sklavos, and O. Koufopavlou. Hardware implementation of
the RC4 stream cipher. In Proceedings of the 46th IEEE Midwest Symposium on Circuits
& Systems, pages 27–30, Cairo, Egypt, December 2003.
[8] P. Kitsos, N. Sklavos, K. Papadomanolakis, and O. Koufopavlou. Hardware implemen-
tation of Bluetooth security. IEEE Pervasive Computing, 2(1):21–29, January-March
2003.
[9] P. Kitsos, N. Sklavos, N. Zervas, and O. Koufopavlou. A reconfigurable linear feedback
shift register (LFSR) for the Bluetooth system. In Proceedings of the IEEE International
Conference on Electronics, Circuits and Systems (ICECS), Malta, September 2-5 2001.
[10] A. Klimov and A. Shamir. New cryptographic primitives based on multiword t-functions.
In B. Roy and W. Meier, editors, Proceedings of FSE, volume 3017 of Lecture Notes in
Computer Science, pages 1–15. Springer-Verlag, 2004.
[11] K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation, chapter
Bit Level Arithmetic Architectures. Wiley & Sons, 1999.
[12] B. Preneel, V. Rijmen, and A. Bosselaers. Recent developments in the design of con-
ventional cryptographic algorithms. In B. Preneel and V. Rijmen, editors, State of the
Art in Applied Cryptography, volume 1528 of Lecture Notes in Computer Science, pages
106–131. Springer-Verlag, 1998.
[13] R. Rueppel. Stream ciphers. In G.J. Simmons, editor, Contemporary Cryptology: The
Science of Information Integrity, pages 65–134. IEEE Press.
[14] A. Shamir and B. Tsaban. Guaranteeing the diversity of number generators. Inf Comput.,
171(2):350–363, 2001.
[15] Bluetooth S.I.G. Specification of the Bluetooth System, Version 1.2, 2003.
www.bluetooth.org/spec.
