We propose an efficient hardware architecture for the Blowfish algorithm [1]. The throughput is up to 4 bits per clock cycle, which is 9 times faster than a Pentium. By applying an operator-rescheduling method, the critical path delay is reduced by 21.7%. We have implemented the design using the Compass cell library targeted at a 0.6 µm TSMC SPTM CMOS process. The die size is 5.7 × 6.1 mm² and the maximum frequency is 50 MHz.
II. BLOWFISH ALGORITHM
The elementary operators of the Blowfish algorithm are table lookup, addition, and XOR. The tables comprise four S-boxes (256 × 32 bits each) and a P-array (18 × 32 bits).
The Blowfish algorithm consists of four steps: table initialization, key initialization, data encryption, and data decryption. Fig. 1 shows the Blowfish encryption algorithm.
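The way the three elementary operators combine in the encryption and decryption steps can be sketched in software. The following Python sketch follows the published Blowfish round structure [1]; the helper name `make_placeholder_tables` and its pseudo-random table contents are assumptions made for illustration only. Real Blowfish fills the tables during table and key initialization, which is omitted here, so the outputs are not real Blowfish ciphertext.

```python
import random

MASK32 = 0xFFFFFFFF


def make_placeholder_tables(seed=1):
    """Return (P-array, S-boxes) filled with PLACEHOLDER words.

    Assumption for illustration: real Blowfish derives these tables in the
    table-initialization and key-initialization steps, which are skipped.
    """
    rng = random.Random(seed)
    p = [rng.getrandbits(32) for _ in range(18)]                       # 18 x 32 bits
    s = [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]  # 4 x (256 x 32 bits)
    return p, s


def f(s, x):
    """Round function: four table lookups, two additions mod 2^32, one XOR."""
    a, b, c, d = (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, x & 0xFF
    h = (s[0][a] + s[1][b]) & MASK32
    h ^= s[2][c]
    return (h + s[3][d]) & MASK32


def encrypt_block(p, s, left, right):
    """Encrypt one 64-bit block given as two 32-bit halves."""
    for i in range(16):
        left ^= p[i]
        right ^= f(s, left)
        left, right = right, left
    left, right = right, left  # undo the swap of the last round
    right ^= p[16]
    left ^= p[17]
    return left, right


def decrypt_block(p, s, left, right):
    """Same structure as encryption with the P-array used in reverse order."""
    for i in range(17, 1, -1):
        left ^= p[i]
        right ^= f(s, left)
        left, right = right, left
    left, right = right, left
    right ^= p[1]
    left ^= p[0]
    return left, right
```

With any table contents, `decrypt_block` inverts `encrypt_block`, which is the property the usage below would check: `decrypt_block(p, s, *encrypt_block(p, s, L, R)) == (L, R)`.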
III. THE PROPOSED ARCHITECTURE

A. Operator Rescheduling
When calculating $s = a + b$, the $i$-th bit of $s$ is $s_i = a_i \oplus b_i \oplus c_i$, where $c_i$ is the carry-in of the $i$-th bit. The operators therefore consist only of carry generators (CG) and XOR gates, so we can apply an operator-rescheduling method to reduce the critical path delay. Fig. 3 shows the result of operator rescheduling.
The gray line in these figures shows the critical path. The original critical path delay is two CG delays plus five XOR delays. After rescheduling, it is reduced to two CG delays plus two XOR delays, so three 2-input XOR delays are hidden. According to the synthesizer's report, the critical path delay improves by about 21.7%.
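The identity behind the rescheduling can be checked with a short Python sketch. It is only an illustration of the principle, not the exact schedule of Fig. 3: the function names and the operand `x` (a word XORed with the sum) are assumptions. Because each sum bit is $a_i \oplus b_i \oplus c_i$, any XOR that does not depend on the carries can be evaluated while the carry generator is still working, leaving a single XOR after it.

```python
MASK32 = 0xFFFFFFFF


def carry_vector(a, b, width=32):
    """Carry-in of every bit position of a + b, packed into one integer."""
    carries = 0
    c = 0  # carry-in of bit 0
    for i in range(width):
        carries |= c << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | (c & (ai ^ bi))  # carry-out of bit i
    return carries


def add_then_xor(a, b, x):
    """Original order: complete the addition, then XOR the result with x."""
    return ((a + b) & MASK32) ^ x


def rescheduled(a, b, x):
    """Rescheduled order: the carry-independent XORs come first."""
    t = a ^ b ^ x                  # can overlap with carry generation
    return t ^ carry_vector(a, b)  # one XOR remains after the carry generator
```

Both functions return the same word for all inputs; only the position of the XOR operators relative to the carry generator changes.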
B. Fast Carry Generator
The fast carry generator is based on a carry-lookahead adder [3]. We construct it from hierarchical 4-bit carry generators.
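As a behavioral illustration of hierarchical 4-bit carry generation, the Python sketch below computes the carries of a 32-bit addition with a tree of 4-wide lookahead units. The function names and the exact grouping (8 groups of 4 bits, then 2 groups, then 1) are assumptions; the source does not give the hierarchy of the implemented circuit. The loops evaluate the same Boolean functions as the flattened lookahead equations, which hardware would realize as parallel logic rather than sequentially.

```python
def group_gp(g, p):
    """Group generate/propagate of up to four adjacent slices (LSB first)."""
    big_g, big_p = 0, 1
    for gi, pi in zip(g, p):
        big_g = gi | (pi & big_g)
        big_p = pi & big_p
    return big_g, big_p


def carries(g, p, c_in):
    """Carry-in of every slice, computed through 4-wide lookahead levels."""
    if len(g) == 1:
        return [c_in]
    chunks = [(g[i:i + 4], p[i:i + 4]) for i in range(0, len(g), 4)]
    gp = [group_gp(cg, cp) for cg, cp in chunks]
    # The next level up supplies the carry-in of each 4-wide chunk.
    chunk_c = carries([x[0] for x in gp], [x[1] for x in gp], c_in)
    out = []
    for (cg, cp), c in zip(chunks, chunk_c):
        for gi, pi in zip(cg, cp):
            out.append(c)
            c = gi | (pi & c)
    return out


def add32(a, b):
    """32-bit addition: s_i = p_i XOR c_i with p_i = a_i XOR b_i."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(32)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(32)]
    c = carries(g, p, 0)
    s = 0
    for i in range(32):
        s |= (p[i] ^ c[i]) << i
    return s
```

The result of `add32(a, b)` should equal `(a + b) mod 2^32` for any pair of 32-bit operands.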
C. The System Configuration
Controller
The controller is implemented as a finite state machine described by a behavioral Verilog model (see Fig. 4).
Datapath
The datapath includes ROM modules, SRAM modules, and the main arithmetic units of Blowfish. Fig. 5 shows the datapath architecture.
Because the size of an SRAM module must be $2^n$ words, P1 and P18 are implemented as registers, and the remaining 16 P-array entries are mapped to a 16 × 32-bit SRAM. A shift register below DataIn expands the 4-bit input to a 64-bit block, and a shift register above DataOut reduces the 64-bit output to 4-bit words.
CORE implements the 16-round iteration loop. A pipeline stage is added at the output of the SRAM modules; it doubles the performance of the Blowfish hardware at the cost of additional area.
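The 4-bit to 64-bit conversion at the chip boundary can be modeled in a few lines of Python. This is a behavioral sketch only: the class and function names are hypothetical, and the most-significant-nibble-first ordering is an assumption, since the source does not state the bit order.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


class ShiftIn64:
    """Input shift register: collects sixteen 4-bit words into one 64-bit block."""

    def __init__(self):
        self.value = 0
        self.count = 0

    def clock(self, nibble):
        """Shift in one 4-bit word; return True when a full block is ready."""
        self.value = ((self.value << 4) | (nibble & 0xF)) & MASK64
        self.count += 1
        return self.count % 16 == 0


def shift_out(block):
    """Output shift register: emit a 64-bit block as sixteen 4-bit words."""
    return [(block >> (60 - 4 * i)) & 0xF for i in range(16)]
```

Feeding the words produced by `shift_out(block)` into `ShiftIn64.clock` one per clock reassembles `block` after 16 clocks.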
D. DFT Consideration
The controller is made testable by adding scan registers that store the controller signals and scan out their contents in test mode.
The datapath is described by a Verilog RTL model. All flip-flops in the datapath are replaced by scan flip-flops. Table 1 lists the features of this chip. The maximum frequency of this Blowfish cipher chip is 50 MHz. Fig. 6 shows the photomicrograph.
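The capture-and-shift behavior of such scan registers can be illustrated with a minimal Python model. The class name, the chain length, and the scan order are assumptions; the source does not describe how the scan path of the chip is organized.

```python
class ScanChain:
    """Behavioral model of scan registers connected as one chain."""

    def __init__(self, length):
        self.ffs = [0] * length

    def capture(self, signals):
        """Functional mode: every register stores its own signal."""
        if len(signals) != len(self.ffs):
            raise ValueError("one signal per scan register expected")
        self.ffs = [s & 1 for s in signals]

    def shift(self, scan_in=0):
        """Test mode: shift by one position and return the bit leaving the chain."""
        out = self.ffs[-1]
        self.ffs = [scan_in & 1] + self.ffs[:-1]
        return out

    def scan_out(self):
        """Shift the whole chain out; the last register is observed first."""
        return [self.shift() for _ in range(len(self.ffs))]
```

After `capture([1, 0, 1, 1])`, calling `scan_out()` yields the captured bits in reverse register order, which is how the stored controller signals become observable at a single pin.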
IV. EXPERIMENTAL RESULTS
V. CONCLUSION
The proposed hardware architecture of the Blowfish algorithm achieves a data throughput of up to 4 bits per clock cycle, which is 9 times faster than a Pentium. By applying an operator-rescheduling method, the critical path delay is reduced by about 21.7%. DFT is also taken into consideration. In particular, the chip is cascadable: if two chips are used, the performance doubles. The test results show that the maximum frequency of this Blowfish cipher chip is 50 MHz. The proposed architecture satisfies the need for high-speed data transfer and can be applied to the security devices of a system.
Table 1. The chip features
REFERENCE
