A5 is the basic cryptographic algorithm used in GSM cellphones to protect user communication against eavesdropping. The A5/1 version was developed in 1987 and has been under attack ever since. The most recent attack on A5/1 is the "A5/1 Security Project", led by Karsten Nohl, which consists of creating rainbow tables that map the internal state of the algorithm to the keystream. Rainbow tables are efficient structures that allow a tradeoff between run-time (the computations performed to crack a conversation) and space (the memory that holds pre-computed information). In this paper we describe a very effective parallel architecture for the creation of the A5/1 rainbow tables in reconfigurable hardware. Rainbow table creation is the most expensive part of cracking a particular encrypted information exchange. Our approach achieves almost 3000x speedup over a single processor and 2.5x speedup compared to GPUs. This performance is achieved with less than 5 Watts of power consumption, giving an energy efficiency on the order of 150x better than the GPU approach.
INTRODUCTION
Cell-phone security and call privacy are an everyday issue for many people. To protect user communication against eavesdroppers, GSM cell-phones employ a set of cryptographic algorithms and protocols that provide authentication and encryption. The encryption algorithm used in GSM phones to protect the actual communication transmission is A5/1. As is usually the case with encryption algorithms, A5/1 has been under attack since its introduction. The actual algorithm has never been officially published, and the first pieces of information appeared in 1994 [16] [2]. In 1999 the algorithm was reverse engineered from actual GSM equipment by Briceno, Goldberg, Wagner [6]. Since then several attacks have been published.
The general idea of an attack is to determine the encryption key for a particular piece of encoded information (a transmission in this case). Once the key is known, the progress of the encryption process can be followed, and effectively all information after that point can be decrypted.
Breaking a cryptographic algorithm basically amounts to inverting the one-way function used in that particular algorithm. If a cryptographic function produces an n-bit result, there are two straightforward methods: a) an exhaustive search can be performed, computing an average of 2^(n-1) values until the target is reached; b) 2^n input and output pairs can be pre-computed and stored in a table. In the second case, to invert a particular value we just look up the pre-computed image in the table, so the inversion requires only a single lookup.
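The two extremes can be illustrated with a minimal Python sketch. The function `f` below is a hypothetical toy one-way function built from SHA-256 and truncated to 12 bits so that the full table fits in memory; it is a stand-in and has nothing to do with A5/1 itself.

```python
import hashlib

N_BITS = 12  # toy size (assumption) so that all 2^n pairs fit in memory


def f(x: int) -> int:
    """Toy one-way function on N_BITS-bit values (stand-in, not A5/1)."""
    digest = hashlib.sha256(x.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << N_BITS)


def invert_by_search(y: int):
    """Method (a): no memory, try the inputs one by one."""
    for x in range(1 << N_BITS):
        if f(x) == y:
            return x
    return None  # y has no preimage under the toy function


def build_table() -> dict:
    """Method (b): pre-compute every input/output pair once."""
    table = {}
    for x in range(1 << N_BITS):
        table.setdefault(f(x), x)  # keep the first preimage of each output
    return table


TABLE = build_table()


def invert_by_lookup(y: int):
    """After pre-computation, inversion is a single table lookup."""
    return TABLE.get(y)
```

Method (a) pays the full cost at attack time and needs no storage, while method (b) pays it once in advance and needs storage for every pair; the time-memory tradeoff sits between the two.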
A space-time tradeoff exists between these two extremes. For example, if n is large, both computing and storing a table with 2^n entries is impractical. The storage requirements can be reduced if we either accept that not all instances of encrypted texts can be decrypted, or add more computation to "generate" additional entries that for practical reasons cannot actually be stored in the table. Hellman [11] was the first to explore this tradeoff, followed by Oechslin [3] [14], who extended Hellman's work and proposed rainbow tables, which are used in the pre-computation task of the attack.
In this paper we exploit (a) the parallelism in the creation of rainbow tables and (b) the structure of reconfigurable logic to propose an architecture that provides high bandwidth in generating rainbow table entries. We implement our architecture on a high-end FPGA device; our implementation results demonstrate that our approach is almost 3000 times faster than the corresponding software. Furthermore, our highly parallel architecture is very power and energy efficient compared to both microprocessors and GPUs, scales very well, and can be used efficiently on future generations of even larger reconfigurable devices.
DESCRIPTION OF THE A5/1 STREAM CIPHER
A5/1 is a stream cipher whose keystream generator takes a key and a frame counter and produces a sequence of pseudo-random bits, the keystream. The 64-bit key, denoted Kc, is produced by the A8 algorithm, and its 10 least significant bits are usually set to zero. The 22-bit frame counter, denoted Fn, is derived from the frame number that is assigned to every frame in GSM. A frame comprises 8 bursts, each lasting 0.5769 ms, and therefore the frame counter is renewed approximately every 4.615 ms. Internally the generator consists of three Linear Feedback Shift Registers (LFSRs) R1, R2, R3, which are clocked either regularly or according to a majority-based clock rule. All three registers are maximum length (Figure 1). The first 114 bits of the keystream form BLOCK1 and the subsequent 114 bits form BLOCK2. BLOCK1 and BLOCK2 are combined by XOR with the 114-bit plaintext and the 114-bit ciphertext as follows: on the mobile station side BLOCK1 is used for encrypting on the uplink and BLOCK2 for decrypting on the downlink, and on the network side BLOCK1 is used for decrypting on the uplink and BLOCK2 for encrypting on the downlink. Plaintext is organized into blocks of 114 bits as a result of the TDMA technique.
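The irregular clocking of the three registers can be sketched in a few lines of Python. The register lengths (19, 22 and 23 bits), feedback taps and clocking-bit positions below follow the widely circulated reference description of A5/1 that resulted from the reverse engineering cited as [6]; they are not taken from the text of this section, and the function names are illustrative. Key and frame-counter loading is omitted.

```python
# (length, feedback tap positions, clocking bit position); bit 0 is the LSB.
# Values follow the public reference description of A5/1 (assumption here).
R1_SPEC = (19, (13, 16, 17, 18), 8)
R2_SPEC = (22, (20, 21), 10)
R3_SPEC = (23, (7, 20, 21, 22), 10)
SPECS = (R1_SPEC, R2_SPEC, R3_SPEC)


def clock_one(reg: int, spec) -> int:
    """Shift one register left by one bit, feeding back the XOR of its taps."""
    length, taps, _ = spec
    feedback = 0
    for t in taps:
        feedback ^= (reg >> t) & 1
    return ((reg << 1) & ((1 << length) - 1)) | feedback


def majority_clock(state):
    """Clock only the registers whose clocking bit agrees with the majority."""
    bits = [(reg >> spec[2]) & 1 for reg, spec in zip(state, SPECS)]
    majority = 1 if sum(bits) >= 2 else 0
    return tuple(
        clock_one(reg, spec) if bit == majority else reg
        for reg, bit, spec in zip(state, bits, SPECS)
    )


def output_bit(state) -> int:
    """Keystream bit: XOR of the most significant bit of each register."""
    out = 0
    for reg, spec in zip(state, SPECS):
        out ^= (reg >> (spec[0] - 1)) & 1
    return out


def keystream(state, n_bits: int):
    """Produce n_bits by clocking with the majority rule, then reading a bit."""
    bits = []
    for _ in range(n_bits):
        state = majority_clock(state)
        bits.append(output_bit(state))
    return bits
```

With the majority rule, at least two of the three registers advance at every step, so each register moves with probability 3/4 for uniformly distributed clocking bits.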
RELATED WORK
Cryptanalysis is an indispensable part of establishing the strength of a cryptographic algorithm. Although the design of A5/1 was never published, the first information about it appeared in 1994 [16] [2], together with an attack on the alleged A5/1 [9], followed by the algorithm being reverse engineered from actual GSM equipment by Briceno, Goldberg, Wagner in 1999 [6]. Since then several attacks have been published, for example those by Eli Biham, Orr Dunkelman [4], by Biryukov, Shamir, Wagner [5], by Keller, Seitz [12], and by Gendrullis, Novotny, Rupp [8]. The most recent attack, on which our implementation is based, is Karsten Nohl's A5/1 Security Project [1]. The project aimed at generating a set of rainbow tables that can crack any conversation. The tables were constructed on specialized processors such as GPUs and the PS3 Cell, and they have a total size of approximately 2 Terabytes.
COPACOBANA, the "Cost-Optimized Parallel COde Breaker", is a large scale FPGA-based parallel machine optimized for running cryptanalytical algorithms [10]. It uses 120 FPGA chips and has been used for DES cracking with a brute-force, exhaustive key search. It has also been used for space-time tradeoff approaches, but as a complete system, with little information revealed on individual functions. [10] offers a short discussion of an attack on A5/1 with rainbow tables that uses a similar approach to parallelism. However, that system is based on multiple FPGA devices and the authors report the entire breaking process while varying the quality (coverage) of the produced tables, so it is difficult to extract the creation cost and compare it directly with our approach and results.
TIME-MEMORY TRADE-OFF AND RAINBOW TABLES
The time-memory tradeoff (TMTO) attack was first introduced in 1980 by Hellman [11] and has since gone through many improvements, the most significant being the ones proposed by Rivest (as reported in [7]) and Oechslin [14]. In 1982 Rivest [7] suggested the use of distinguished points (DP): only points that satisfy a specific condition, such as the last bits being zero, are stored. This improvement significantly reduces the number of lookup operations required for a hit.
In 2003 Philippe Oechslin [14] introduced rainbow tables, a variation of the original time-memory tradeoff by Hellman.
The basic idea is the use of multiple round functions R. Each step i of a chain uses a different round function R_i, and consequently a different function f_i, with 1 ≤ i ≤ t-1, where t is the length of the chains. As a result the probability of merges decreases: two chains that collide merge only if the collision occurs in the same column, which happens with probability 1/t. The number of table lookups also decreases (by a factor of t), as does the number of calculations required to find a key (by a factor of 2).
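A toy rainbow table makes the role of the per-column functions concrete. In the sketch below, `f` is a hypothetical 16-bit one-way function built from SHA-256, `R(i, y)` is an illustrative column-dependent round function (XOR with a per-column constant), and the chain length `T` is kept tiny; none of these choices come from the cited work.

```python
import hashlib

N_BITS = 16
MASK = (1 << N_BITS) - 1
T = 8  # chain length, i.e. number of columns (toy value)


def f(x: int) -> int:
    """Toy one-way function (stand-in, not A5/1)."""
    digest = hashlib.sha256(x.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big") & MASK


def R(i: int, y: int) -> int:
    """Column-dependent round function: XOR with a per-column constant."""
    return y ^ ((0x9E37 * (i + 1)) & MASK)


def chain_end(x: int, start_col: int = 0) -> int:
    """Walk a chain from column start_col to the endpoint."""
    for i in range(start_col, T):
        x = R(i, f(x))
    return x


def build(startpoints) -> dict:
    """Store only endpoints, each mapped to the startpoints that reach it."""
    table = {}
    for sp in startpoints:
        table.setdefault(chain_end(sp), []).append(sp)
    return table


def lookup(table: dict, y: int):
    """Return some x with f(x) == y if y is covered by the table, else None."""
    for col in range(T - 1, -1, -1):          # guess the column y came from
        ep = chain_end(R(col, y), col + 1)    # finish the chain from there
        for sp in table.get(ep, []):
            x = sp
            for i in range(col):              # regenerate the chain up to col
                x = R(i, f(x))
            if f(x) == y:                     # rule out false alarms
                return x
    return None
```

A lookup first assumes the target sits in the last column, then in the second to last, and so on; each guess costs at most one partial chain walk plus one endpoint search, and a match is always verified by regenerating the chain from its startpoint.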
In the A5/1 Security Project the time-memory tradeoff method is a rainbow table structure. In this paper the implementation is based on the theoretical background provided by the A5/1 Security Project. The structure of the tables is based on computing chains and storing their startpoints (SP) and endpoints (EP) as pairs. The fundamental parts needed for the computation of the chains are the A5/1 machine and the functions R. The A5/1 machine basically operates by clocking the LFSRs R1, R2, R3 forward for 64 clock cycles with the clock rule enabled. The functions R consist of an XOR with different 64-bit pseudo-random values, produced by clocking a 64-bit wide LFSR 64*N times, where N is the current round of the procedure (1 ≤ N ≤ 32). The functions R are round functions rather than reduction functions: no resizing is needed, because the output of the A5/1 machine and the internal state of the algorithm are both 64 bits long. The chains are constructed as follows: each SP is fed to the A5/1 machine and the 64-bit output is XORed with the first value of R. If the result is a distinguished point, it forms the first chain link after the SP and the process moves on to find the next link. If it does not satisfy the DP condition, the result of the XOR is reloaded into the A5/1 machine and the same procedure is followed until a DP is found. The chain is completed once the DP that forms the endpoint is obtained after the last R. Intermediate links are then discarded and only the SP-EP pairs are stored in the table.
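The chain construction just described can be sketched as follows. To keep the example self-contained, the A5/1 machine is replaced by a hash-based stand-in (`a51_machine_stub`), and the round values come from a hypothetical helper (`toy_round_values`) instead of the 64-bit LFSR; the `max_steps` guard is likewise an addition for the sketch. Only the control flow (repeat the same round until a DP appears, then move to the next round) mirrors the text.

```python
import hashlib

MASK64 = (1 << 64) - 1


def a51_machine_stub(state: int) -> int:
    """Stand-in for the A5/1 machine (64 bits in, 64 bits out); NOT the cipher."""
    digest = hashlib.sha256(state.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def toy_round_values(count: int):
    """Hypothetical round values; the paper derives them from a 64-bit LFSR."""
    return [a51_machine_stub(MASK64 - n) for n in range(count)]


def is_dp(value: int, dp_bits: int) -> bool:
    """Distinguished point: the dp_bits least significant bits are zero."""
    return value & ((1 << dp_bits) - 1) == 0


def build_chain(sp: int, round_values, dp_bits: int = 15, max_steps: int = 1 << 22):
    """Return the endpoint of the chain that starts at sp, or None on give-up."""
    x = sp
    for r in round_values:               # one rainbow "color" per round value
        for _ in range(max_steps):
            x = a51_machine_stub(x) ^ r  # machine output XOR round value
            if is_dp(x, dp_bits):
                break                    # chain link found, go to next round
        else:
            return None                  # no DP within max_steps (guard)
    return x
```

For example, `build_chain(1, toy_round_values(32), dp_bits=4)` walks 32 rounds with a relaxed 4-bit DP condition; with `dp_bits=15` each round needs about 2^15 steps on average, so chain lengths vary widely from one startpoint to another.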
ARCHITECTURE AND IMPLEMENTATION
The procedure followed for the construction of a chain is shown concisely in Figure 3 and is described below. The SP enters the A5/1 module, where it is divided into 3 parts that are fed to the LFSRs: bits 63-45 are loaded into R1, bits 44-23 into R2 and bits 22-0 into R3. R1, R2, R3 are Fibonacci LFSRs and their feedback gate type is XOR. A finite state machine (FSM) inside the A5/1 module performs the 64 clockings with the clock rule enabled, and the bits produced by XORing the most significant bits of the three registers make up a 64-bit signal.
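The bit ranges above correspond to register widths of 19, 22 and 23 bits. A short Python sketch of the split and its inverse (the function names are illustrative):

```python
def split_sp(sp: int):
    """Split a 64-bit startpoint into the R1, R2 and R3 register contents."""
    r1 = (sp >> 45) & ((1 << 19) - 1)  # bits 63-45, 19 bits
    r2 = (sp >> 23) & ((1 << 22) - 1)  # bits 44-23, 22 bits
    r3 = sp & ((1 << 23) - 1)          # bits 22-0, 23 bits
    return r1, r2, r3


def join_state(r1: int, r2: int, r3: int) -> int:
    """Inverse of split_sp: rebuild the 64-bit value from the three registers."""
    return (r1 << 45) | (r2 << 23) | r3
```

Since 19 + 22 + 23 = 64, the three registers together hold exactly one 64-bit state, which is why a chain link can be fed back into the module without any resizing.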
This signal is then XORed with the 64-bit value coming from the RF module, which implements the round function. The values that are XORed at each round are produced by a maximum-length 64-bit Fibonacci LFSR with XNOR feedback. To reduce the required resources and runtime, the values are precomputed into an array and selected by a counter that tracks the current round.
The result of XORing the output of the A5/1 module with the RF value for the first round is passed to a comparator, which checks whether it satisfies the DP condition, that is, whether the last 15 bits are zero. The FSM reads the comparator's output and decides the next step of the execution. If the result is not a DP, it is fed back to the A5/1 module like a 'new' SP, and the same process is repeated with the same round function value until the result is a DP. As soon as this DP is found, it constitutes the first chain link, and the procedure is iterated with the round function value changed to the next rainbow table color, until all 32 values have been used and all DP chain links have been calculated. At the end of all the rounds a done signal is asserted and the last result is stored as the EP.
Due to the parallelization supported in hardware, multiple chains are calculated simultaneously to form a table. The overall architecture is depicted below.
A demultiplexer is used to work around the limited number of I/O blocks (IOBs): at each cycle it takes one 64-bit SP as input, and it outputs an array of 64-bit entries whose size equals the number of instances. The cycles spent in the demultiplexer are negligible compared to the total number of cycles of the implementation. The array is then loaded into registers, and all values are loaded into the chain subunits at the same time. Apart from their timing role, the registers also reduce the time spent routing the signals during synthesis. The completion of each chain asserts its done signal, and when all chains are finished the doneall signal asserts the we (write enable) signal of the memory. The memory used is a Block RAM of depth 2^16 = 65536 and width 128 bits, since every address holds the concatenation of an SP (64 bits) and an EP (64 bits). Block RAM was selected over Distributed RAM because it is more suitable and efficient for large memories. The SPs for each table are inserted incrementally as terms of an arithmetic progression. This helps the tables cover as wide a range as possible of all possible initial states.
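The memory layout and the startpoint schedule can be modeled in a few lines of Python. The sketch assumes that the SP occupies the upper 64 bits of each 128-bit word, and the first term and step of the arithmetic progression are free parameters; neither detail is specified in the text.

```python
MASK64 = (1 << 64) - 1
DEPTH = 1 << 16        # number of Block RAM addresses
WIDTH_BITS = 128       # SP (64 bits) concatenated with EP (64 bits)


def pack_entry(sp: int, ep: int) -> int:
    """Concatenate SP and EP into one 128-bit word (SP in the upper half)."""
    return ((sp & MASK64) << 64) | (ep & MASK64)


def unpack_entry(word: int):
    """Recover the (SP, EP) pair from a 128-bit memory word."""
    return (word >> 64) & MASK64, word & MASK64


def startpoints(first: int, step: int, count: int):
    """Startpoints as an arithmetic progression, wrapped to 64 bits."""
    return [(first + k * step) & MASK64 for k in range(count)]


# Total capacity of the table memory in bytes: 65536 words of 16 bytes.
MEMORY_BYTES = DEPTH * WIDTH_BITS // 8
```

With these parameters the memory holds 65536 SP-EP pairs in 1 MiB; a call such as `startpoints(0, 1 << 40, 345)` would yield one batch of 345 evenly spaced startpoints for the parallel chain units.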
EXPERIMENTAL EVALUATION
The FPGA used for the implementation is a Virtex-5 XC5VLX330T. The numbers of parallel chains that were implemented are 10, 128, 181, 200, 250, 300, with the maximum, according to the resources occupied in the particular FPGA, being 345. The implementation operates at a maximum frequency of 178 MHz for one instance, about 150 MHz for 256 instances, and 146 MHz for 345 instances. Tables 2, 3 show the speedup and the runtime of the table calculation in hardware and in software for the numbers of parallel instances implemented, up to 345. For the hardware part, time was calculated from the results of simulation and place and route. For the software part, Intel VTune Performance Analyzer 9.1 was used in a Windows XP environment on a 3.01 GHz Pentium 4 processor. The code is written in C++ and does not include any printing commands. For the hardware implementations with multiple parallel instances, the total number of cycles needed to complete the procedure is equal to the number of cycles needed to calculate the longest chain. The clock period increases only slightly as the number of chains increases, and consequently performance increases rapidly with the degree of parallelism. At 345 parallel instances, the FPGA implementation of the rainbow table is 2824x faster than the software implementation for the same startpoints.

Comparison with GPU performance

Our full-sized architecture with 345 modules is able to produce 415 chains/second (we compute 345 chains every 830 ms). Published results for the A5/1 Security Project show that the original rainbow table generation code runs on a GTX 260 at a speed of 162 chains per second [15]. The current state-of-the-art approach is to use modified software that runs much faster but produces lower quality tables (i.e. trades coverage for speed). With the modified code, more powerful GPUs (GTX 280 and ATI HD5870) can produce up to 500 chains per second, while a PS3 can produce 120 chains per second [13].
According to these figures, our approach is about 2.5 times faster than a GTX 260 on the same code and result quality, and roughly comparable with faster GPUs running a faster but lower quality software approach. Furthermore, the power consumption of a GTX 280 or an HD5870 card at full load is about 250 Watts, while the power consumption of our FPGA is only 4.2 Watts. Combining the advantage in speed with the advantage in power consumption, we find that our approach requires about 150 times less energy than a GPU for results of equal quality.
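The arithmetic behind these ratios can be checked with a short Python calculation that uses only figures quoted in this section. Note that it pairs the GTX 260 throughput with the 250 Watt full-load figure quoted for the GTX 280 and HD5870, as the text does; the variable names are illustrative.

```python
# Figures quoted in the text.
fpga_chains_per_batch = 345
fpga_batch_time_s = 0.830
gpu_chains_per_s = 162.0      # GTX 260, original generation code
gpu_power_w = 250.0           # full-load figure quoted for GTX 280 / HD5870
fpga_power_w = 4.2

# Throughput of the FPGA design in chains per second.
fpga_chains_per_s = fpga_chains_per_batch / fpga_batch_time_s

# Speed and power advantages.
speed_ratio = fpga_chains_per_s / gpu_chains_per_s
power_ratio = gpu_power_w / fpga_power_w

# Energy per chain is power divided by throughput, so the energy advantage
# is the product of the speed ratio and the power ratio.
energy_ratio = speed_ratio * power_ratio

print(f"FPGA throughput: {fpga_chains_per_s:.1f} chains/s")
print(f"speed ratio:     {speed_ratio:.2f}x")
print(f"power ratio:     {power_ratio:.1f}x")
print(f"energy ratio:    {energy_ratio:.0f}x")
```

The calculation gives a throughput of roughly 415 chains per second, a speed ratio of about 2.6, a power ratio of about 60 and an energy ratio slightly above 150, consistent with the rounded values reported in the text.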
CONCLUSIONS
An FPGA architecture was presented for the implementation of rainbow tables aimed at breaking the A5/1 algorithm. Given 64 consecutive bits of a ciphertext, the time-memory tradeoff attack retrieves the initial state of the algorithm and allows deciphering of a message, and furthermore of a whole conversation. We found that the creation of rainbow tables in hardware is, as expected, far faster than their creation on a PC. We achieved the parallel calculation of 345 specific chains, which requires 39.1 minutes on a workstation, whereas only 830 ms are needed in this hardware implementation. In other words, a single FPGA device achieves a speedup of almost 3000 times. Larger FPGA devices can achieve even greater throughput, and the processing can be split across several FPGA boards (much like the COPACOBANA approach) to achieve even greater performance. Our approach is also 2.5 times faster than an optimized GPU implementation of the same code and result quality, while it consumes about sixty times less power. The overall energy efficiency of our approach is about 150 times better than that of a GPU.
