Abstract-This paper focuses on the design and analysis of a versatile Field Programmable Gate Array (FPGA) hardware architecture for the Skein hashing algorithm. A single design capable of processing individual messages sequentially, multiple messages using a pipelined architecture, or Skein's tree hashing mode using the same pipelined architecture was developed for the Skein-256 version of the algorithm. Emphasis was placed on efficient use of FPGA resources and detailed performance analysis of pipelined tree hashing. The design is compared with current sequential and tree hashing FPGA implementations. The post place-and-route results show that our design achieves a maximum throughput of 1.4 Gbps in sequential mode, 6.6 Gbps in multiple message mode, and 6.6 Gbps in tree hashing mode on a Virtex-5 FPGA, and 1.5 Gbps, 7.7 Gbps, and 7.7 Gbps, respectively, on a Virtex-6.
I. INTRODUCTION
Hashing functions play a fundamental role in modern cryptographic systems. They are used to produce a small fixed-size output, known as the digest or hash, from an arbitrary-length input [1], [2]. Typical applications of these functions include data integrity verification and message authentication schemes. Finding collisions for the most commonly used Secure Hash Algorithm, SHA-1, is a likely possibility as cryptanalysis techniques continue to develop over time [3], [4]. Although no weaknesses have been found in any of the functions belonging to the SHA-2 family, they are not expected to have much support in the future as they share the main design concepts with SHA-1. As a result, the National Institute of Standards and Technology (NIST) initiated a competition to develop a new hashing algorithm to be named SHA-3 [5]. Since the initial submissions on October 31st, 2008, the field of remaining algorithms has been narrowed to five finalists: Keccak, Grøstl, JH, BLAKE, and Skein. Each of these algorithms is being carefully scrutinized by the cryptographic community and NIST for security and for both software and hardware performance. Selection of the new standard is expected to occur in 2012.
This work focuses on a new, versatile FPGA architecture for the Skein hashing algorithm. Developed by a team led by Bruce Schneier and Niels Ferguson [6], Skein is based around the Threefish block cipher used in a Matyas-Meyer-Oseas (MMO) construction, with an additional wrapper around MMO called Unique Block Iteration (UBI). The Skein algorithm has many modes of operation, including a tree hashing mode that allows for parallelization of processing at the message level. As part of this work, a single architecture capable of operating in sequential, pipelined multiple message, or pipelined tree mode was developed for the Skein-256 version of the algorithm. Performance analysis was carried out to determine the theoretical throughput of all three modes.
II. SKEIN HARDWARE OVERVIEW
The most common hardware architectures of Skein implement the sequential version of the algorithm. The iterative architecture, first introduced in [7], implements a single round in hardware that is reused for every round of Threefish. The dynamic rotations in Threefish require large multiplexers, which severely limit performance. Subsequently, in [8], 8 rounds of Threefish were unrolled so as to remove these dynamic rotations (the rotation constants repeat every eight rounds, so each unrolled round uses fixed rotations), which improved performance greatly. A 4-round variation of the unrolled architecture was then presented in [9] and [10]. The 8-round unrolled Threefish core was the basic starting point for our previous [11] and current work. Table I is a short summary of results for sequential Skein-256 from [7], [8], and [11].
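TABLE I. Sequential Skein-256 results

| Design | Device | Slices | Throughput (Mbps) |
|--------|--------|--------|-------------------|
| [7] | XC5VLX30-3 | 1001 | 408.68 |
| [8] | XC5VLX110-3 | 937 | 1751.00 |
| [11] | XC5VLX110-3 | 1281 | 1603.31 |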
It is important to note that the results from [8] are post synthesis, whereas the remaining two are post place-and-route. Also, the design from [11] implements more complex control logic (capable of processing data in the tree mode), which accounts for the increase in resource utilization. No additional resources, such as BRAM or DSP blocks, are reported for any of the implementations presented in this paper.
In [13], Walker, Sheikh, Mathew and Krishnamurthy explore the benefits of a pipelined Skein architecture. Their Application Specific Integrated Circuit (ASIC) architecture inserts additional pipeline registers between Threefish rounds. Inserting these pipeline registers significantly shortens the critical path, which increases the clock frequency at which the circuit can operate. However, due to the data dependencies, throughput can be improved only when hashing multiple messages simultaneously. Maximum performance is achieved when the number of messages being processed equals the number of pipeline registers. Although pipelined FPGA performance data is not available for Skein-256 (which was the focus of this work), 3 different architectures for Skein-512 were developed in [12]. The designs are as follows: a 4-round unrolled architecture with 2 pipeline registers, a 4-round unrolled architecture with 5 pipeline registers, and an 8-round unrolled design with 10 pipeline registers, which is similar to this work. A summary of the results is shown in Table II.

The tree hashing mode of Skein allows for multiple blocks of a single message to be processed simultaneously [6]. The original message is split up into smaller chunks, which are subsequently processed by separate UBI elements. Each UBI is considered a node of the tree. The results of these UBIs are concatenated into a new message, and this new message is split up again and processed in the same fashion at the next level of the tree. This continues until only a single UBI is needed to complete the message processing, i.e., the root node of the tree is reached. Skein's specification uses three key parameters that determine the structure of the tree and how the message is split up. These parameters are the leaf size (Y_L), node fan-out (Y_F), and maximum tree height (Y_M). Each leaf-level UBI processes up to 2^{Y_L} blocks of a message, and each UBI above the leaf level processes up to 2^{Y_F} previous UBI results.
If a maximum tree height Y_M is specified, a single node at this level will process all UBI results from the previous level regardless of Y_F. In our previous work [11], the standard unrolled core was duplicated in hardware in order to process the individual tree nodes (UBIs) at the same time. The hardware results are shown in Table III. As the number of cores increases, the expected speedup over the standard sequential architecture is achieved.
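The tree structure described above can be illustrated with a minimal Python sketch that counts the UBI nodes at each level. The function name `ubi_levels` and its argument names are hypothetical, and the sketch assumes that each UBI result occupies exactly one block at the next level and that the height cap collapses the remaining results into a single node; it is an illustration of the text, not the specification's reference procedure.

```python
from typing import List, Optional


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, exact for arbitrarily large inputs."""
    return -(-a // b)


def ubi_levels(msg_blocks: int, y_l: int, y_f: int,
               y_m: Optional[int] = None) -> List[int]:
    """Return the number of UBI nodes per tree level, leaf level first."""
    if msg_blocks < 1 or y_l < 1 or y_f < 1:
        raise ValueError("msg_blocks, y_l and y_f must all be at least 1")
    # Leaf level: each UBI consumes up to 2**y_l message blocks.
    levels = [ceil_div(msg_blocks, 2 ** y_l)]
    while levels[-1] > 1:
        next_level = len(levels) + 1  # levels are numbered from 1
        if y_m is not None and next_level >= y_m:
            # Assumed height cap: one node absorbs all remaining results.
            levels.append(1)
        else:
            # Each node consumes up to 2**y_f results from the level below.
            levels.append(ceil_div(levels[-1], 2 ** y_f))
    return levels


if __name__ == "__main__":
    print(ubi_levels(16, 1, 1))          # binary tree over 16 blocks
    print(ubi_levels(16, 1, 1, y_m=2))   # same message, height capped at 2
```

For a 16-block message with Y_L = Y_F = 1 the sketch yields 8 leaf nodes followed by 4, 2, and 1 nodes; capping the height at 2 replaces the upper levels with a single node.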
III. VERSATILE SKEIN ARCHITECTURE
The architecture developed in this work incorporates aspects of the 8-round unrolled architecture in [8], the pipelined architecture in [13], and the tree hashing architecture in [11]. The complete high-level block diagram of the versatile Skein-256 architecture is shown in Figure 1. This architecture comprises two main components: the control logic and the Skein hashing core. The Skein-256 hashing core contains all the logic necessary to execute the Threefish block cipher. The main aspect of this design is the versatile round architecture shown in Figure 2. In addition to the standard Threefish MIX and PERMUTE functions in each round, a register is placed at the output of each round. To accommodate both sequential and multiple message hashing modes, multiplexers are placed at the input of each round. This allows either the registered or the unregistered output of the previous round or subkey addition to be selected as the input for the round. This is accomplished without adding any slice overhead in the FPGA, as the multiplexers are integrated into the look-up tables (LUTs) and the registers are already present in each of the slices, as shown in Figure 3. To ensure that the synthesis tools properly integrate the multiplexers into LUTs, the Xilinx LUT6_2 primitive was instantiated in the VHDL [14]. To further reduce the critical path of the design, a pipeline register is also placed at the output of each subkey addition. These additional registers reduce the critical path to only a single 64-bit addition. The total number of pipeline registers in the 8-round unrolled Threefish function is therefore 10: one per round plus one per subkey addition. The second major change needed for the design to operate in the three aforementioned modes is the redesign of the subkey generator. As with the sequential 8-round unrolled architecture, two subkeys must be generated each cycle.
However, since up to 10 messages may be hashed simultaneously with this design, the subkey generator must be capable of storing the key and tweak words for each of the messages. In this design, LUT-based FIFOs are used to store these words.
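As a software illustration of the MIX and PERMUTE functions that each round of the core implements, the following Python sketch models one Threefish-256 round on four 64-bit words. The helper names (`rotl64`, `mix`, `round_256`) are hypothetical, and the rotation amounts are passed in as parameters rather than taken from the cipher's constant table. The pipeline register and input multiplexer of the versatile round are not modeled, since they affect timing rather than the computed values.

```python
from typing import List, Sequence, Tuple

MASK64 = (1 << 64) - 1

# Word permutation for the four-word (256-bit) state: output word i is
# taken from input word PERMUTE_256[i], i.e. words 1 and 3 swap.
PERMUTE_256 = (0, 3, 2, 1)


def rotl64(x: int, r: int) -> int:
    """Rotate a 64-bit word left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def mix(x0: int, x1: int, r: int) -> Tuple[int, int]:
    """MIX: one 64-bit addition, one fixed rotation, one XOR."""
    y0 = (x0 + x1) & MASK64
    y1 = rotl64(x1, r) ^ y0
    return y0, y1


def round_256(words: Sequence[int], rotations: Sequence[int]) -> List[int]:
    """Apply two MIX operations followed by the word permutation."""
    y0, y1 = mix(words[0], words[1], rotations[0])
    y2, y3 = mix(words[2], words[3], rotations[1])
    mixed = (y0, y1, y2, y3)
    return [mixed[i] for i in PERMUTE_256]


if __name__ == "__main__":
    # Rotation amounts here are example values supplied by the caller.
    print(round_256([1, 2, 3, 4], (14, 16)))
```

Because each round contains only one 64-bit addition on its data path, registering every round output leaves a single adder as the dominant delay, which matches the critical-path argument in the text.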
The concept of pipelined tree hashing comes from the fact that the pipelined core is capable of hashing multiple messages simultaneously, which means it is also capable of processing multiple nodes of a tree. The advantages are reduced resource utilization, because the core does not need to be duplicated, and improved performance, because the core can be run at a higher clock frequency.
Control logic was implemented to coordinate processing with the Skein core and to support reading message blocks from memory. The control logic is also responsible for switching between sequential, multiple message, and tree hashing modes based on user input. The final design was modeled in VHDL and tested using the KAT message input files provided with Skein's NIST submission [3]. Functional verification was performed using ModelSim simulations, including post place-and-route simulations. The actual hardware design was tested on the Xilinx ML605 development board containing a Virtex-6 LX240T FPGA. A MicroBlaze soft processor was used to run the tests in hardware. Post place-and-route results for this design on Virtex-6 and Virtex-5 FPGAs are shown in Table IV. These results are for a design optimized for the pipelined modes. The same architecture can be optimized for the sequential datapath to attain throughputs of 1.56 Gbps, 1.43 Gbps, and 1.82 Gbps for the XC5VLX110-3, XC6VLX240T-1, and XC6VLX240T-3 devices, respectively, which are similar to the results of previous sequential designs.
The ATHENa scripts from [10] were used to automate and optimize the synthesis and implementation of the VHDL models. The reported clock frequencies are explained in more detail in the following section.
IV. PERFORMANCE ANALYSIS
In order to accurately compare this design against previous work, equations have been developed to calculate the throughput of each of the modes this architecture supports. In this design the system is clocked at a frequency based on the critical path of the pipelined data path, CLK_PIPE, thus optimizing the design for multiple message and tree hashing. Therefore, for the sequential mode, a multicycle path is introduced. The number of cycles for the sequential data path, CYCLES_EN, is calculated using Equation 1, where CLK_SEQ is the maximum possible operating frequency of the sequential data path. The total latency for the sequential mode is then given by Equation 2, which stems from the fact that the standard 8-round unrolled architecture of Skein-256 has a latency of 10 cycles. Lastly, the throughput of the sequential mode, TP_SEQ, is given by Equation 3, where msgBlocks is the number of blocks in a message, N_b is the number of bytes per block, and OH is the number of overhead cycles incurred in the output stage of Skein or in any pre-processing. All calculations in this work assume OH = 0 and an ideal memory interface that is capable of delivering a full message block every cycle.
(Equations 1-3: CYCLES_EN, LAT_SEQ, and TP_SEQ.)
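One way to write the three quantities defined above is sketched below. This is a reconstruction under stated assumptions rather than a quotation of Equations 1 to 3: it assumes that the multicycle count is the ratio of the two clock frequencies rounded up, that every block costs one full sequential latency, and that OH is incurred once per message.

```latex
\begin{align*}
CYCLES_{EN} &= \left\lceil \frac{CLK_{PIPE}}{CLK_{SEQ}} \right\rceil \\
LAT_{SEQ}   &= 10 \cdot CYCLES_{EN} \\
TP_{SEQ}    &= \frac{msgBlocks \cdot N_b \cdot 8 \cdot CLK_{PIPE}}
                    {msgBlocks \cdot LAT_{SEQ} + OH}
\end{align*}
```

Under these assumptions and with OH = 0, msgBlocks cancels and the sequential throughput reduces to 8 N_b CLK_PIPE / LAT_SEQ bits per second, independent of message length.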
The next set of equations developed in this work is used to determine the throughput of the versatile architecture when operating in multiple message mode, TP_MM. TP_MM depends on the total number of messages being hashed and the size of the largest message. If fewer than 10 messages are being hashed simultaneously, the wasted pipeline stages reduce the effective throughput of the system. The same effect can be observed when one of the messages is longer than the others: pipeline stages are wasted once the shorter messages are complete. The general equation for TP_MM is given in Equation 4, and the maximum TP_MM, where 10 messages are all of the same length, in Equation 5. The latency of the pipelined data path used in multiple message mode is 102 cycles. The first result is produced in 91 cycles, which accounts for the 72 Threefish rounds and 19 subkey additions. The remaining 9 messages increase the latency to 100. Two additional cycles are required at the beginning of each Threefish iteration to prime the subkey generator FIFOs.
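The latency breakdown above, and the effect of unused pipeline slots, can be checked with a short Python sketch. The function names and the throughput model are assumptions made for illustration, not the paper's Equations 4 and 5: the sketch assumes every Threefish iteration occupies the full 102-cycle latency regardless of how many messages are in flight, and the 263 MHz clock in the usage example is a hypothetical value.

```python
from typing import Sequence

PIPELINE_REGS = 10      # messages that can be in flight at once
ROUNDS = 72             # Threefish-256 rounds per block
SUBKEY_ADDS = 19        # subkey additions per block
FIFO_PRIME_CYCLES = 2   # cycles to prime the subkey generator FIFOs
BLOCK_BITS = 256        # Skein-256 block size in bits


def pipeline_latency() -> int:
    """Cycles for one Threefish iteration over a full pipeline."""
    first_result = ROUNDS + SUBKEY_ADDS            # 91 cycles
    drain = PIPELINE_REGS - 1                      # remaining 9 messages
    return first_result + drain + FIFO_PRIME_CYCLES


def mm_throughput(msg_blocks: Sequence[int], clk_hz: float) -> float:
    """Assumed multiple-message throughput in bits per second.

    msg_blocks holds the block count of each message being hashed.
    The longest message sets the number of iterations; slots left
    empty by shorter or missing messages are simply wasted.
    """
    if not 1 <= len(msg_blocks) <= PIPELINE_REGS:
        raise ValueError("between 1 and 10 messages are supported")
    total_bits = sum(msg_blocks) * BLOCK_BITS
    total_cycles = max(msg_blocks) * pipeline_latency()
    return total_bits * clk_hz / total_cycles


if __name__ == "__main__":
    clk = 263e6  # hypothetical pipelined clock frequency in Hz
    print(pipeline_latency())
    print(mm_throughput([4] * 10, clk) / 1e9)       # ten equal messages
    print(mm_throughput([4] + [1] * 9, clk) / 1e9)  # one long, nine short
```

In this model, halving the number of equal-length messages halves the throughput, and a single long message among short ones lowers it further, which is the behavior the text describes qualitatively.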
Before analyzing the throughput of the pipelined tree mode, an important factor to consider is the message processing overhead. When hashing a message in the tree mode, intermediate values formed through the concatenation of UBI results at the leaf level and above add to the total number of blocks that need to be processed. This message overhead can be calculated as a percentage by Equation 6, where UBI_L is the number of leaf-level UBIs and UBI_F is the number of remaining UBIs in the tree.
As Y_L and Y_F are increased, each UBI processes more blocks of a message or more UBI results from a previous level, resulting in fewer total UBIs in the tree. This decreased message overhead leads to an increase in throughput for a given message. For Y_F = Y_L, as the message size increases, the maximum overhead approaches 1/(2^{Y_F} − 1). A detailed comparison of the effect of tree parameter values on message overhead is shown in Figure 4.
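The overhead trend can be reproduced numerically with the following Python sketch. It is an assumed formulation rather than the paper's Equation 6: it takes each UBI result to occupy one block and counts every result except the root's as an extra block to be processed, returning the overhead as a fraction of the original message (multiply by 100 for a percentage). The function names are hypothetical.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def tree_node_count(msg_blocks: int, y_l: int, y_f: int) -> Tuple[int, int]:
    """Return (UBI_L, UBI_F): leaf-level UBIs and all remaining UBIs."""
    if msg_blocks < 1 or y_l < 1 or y_f < 1:
        raise ValueError("msg_blocks, y_l and y_f must all be at least 1")
    ubi_l = ceil_div(msg_blocks, 2 ** y_l)
    ubi_f = 0
    count = ubi_l
    while count > 1:
        count = ceil_div(count, 2 ** y_f)
        ubi_f += count
    return ubi_l, ubi_f


def message_overhead(msg_blocks: int, y_l: int, y_f: int) -> float:
    """Extra blocks processed in tree mode, as a fraction of msg_blocks."""
    ubi_l, ubi_f = tree_node_count(msg_blocks, y_l, y_f)
    # Every UBI result except the root's is fed into another UBI.
    extra_blocks = ubi_l + ubi_f - 1
    return extra_blocks / msg_blocks


if __name__ == "__main__":
    for y in (1, 2, 3):
        print(y, message_overhead(2 ** 12, y, y), 1 / (2 ** y - 1))
```

For Y_L = Y_F = 1 the computed fraction approaches 1 as the message grows, and for Y_L = Y_F = 2 it approaches 1/3, in line with the stated limit 1/(2^{Y_F} − 1).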
In [11], the throughput of the tree hashing mode was calculated assuming that all of the duplicated cores were utilized at all times. Algorithms 1 and 2 describe the latency and throughput of tree hashing more accurately, as they take a few real processing scenarios into consideration. The first is that when the leaf-level processing is complete, there may be open pipeline stages (or cores) available to begin processing the next level. These remaining pipeline stages are used to process the UBIs above the leaf level. Also, although mostly a concern for smaller messages, all but one pipeline stage (or core) is wasted when processing the root node. Additionally, these algorithms take into consideration UBIs that do not process exactly 2^{Y_F} or 2^{Y_L} message blocks. Because the overhead shrinks as message size increases, especially with larger values of Y_L and Y_F, an inverse relationship between this overhead and throughput is expected.

[Algorithm 2. Latency of fan-out nodes LAT_F (initialization: done = 0, pipelineRegs = 10)]

A theoretical maximum speedup of this design in tree mode over sequential mode is calculated using the total number of pipeline stages, PIPELINE_REGS (in this case 10), and the latencies of both the sequential mode, LAT_SEQ, and the pipelined mode, LAT_PIPE. The maximum theoretical speedup is then 10 · LAT_SEQ / LAT_PIPE. For this design the resulting maximum speedup is 4.9. A graph of the theoretical speedup for a tree with a given message size and tree parameters Y_L = Y_F is shown in Figure 5.
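The reported figure can be checked with a short worked calculation. The pipelined latency of 102 cycles is taken from the text; the sequential latency of 50 cycles is an assumption (10 cycles multiplied by an inferred multicycle count of 5) chosen because it reproduces the stated value of 4.9. The second line is likewise an assumed first-order model in which a message overhead fraction o scales the number of blocks processed in tree mode.

```latex
\begin{align*}
S_{max} &= \frac{PIPELINE_{REGS} \cdot LAT_{SEQ}}{LAT_{PIPE}}
         = \frac{10 \cdot 50}{102} \approx 4.9 \\
S(o)    &\approx \frac{S_{max}}{1 + o},
         \qquad o \to \frac{1}{2^{Y_F} - 1} \text{ for large messages with } Y_L = Y_F
\end{align*}
```

With Y_L = Y_F = 1 the overhead fraction approaches 1, so this model gives a ceiling of about 4.9/2 = 2.45.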
Examining this figure, one can connect the previously discussed message overhead and the maximum theoretical speedup to the theoretical speedups for particular tree parameters. In the case where Y_L = Y_F = 1, twice as many blocks as the original message are processed in tree mode. In Figure 5, a maximum speedup of 4.9/2 = 2.45 is seen for Y_L = Y_F = 1 due to the message overhead. On the opposite
V. CONCLUSION
This work developed a new, versatile FPGA architecture for the Skein-256 hashing algorithm capable of hashing in sequential, multiple message, and tree modes. The performance of the architecture in each of the modes was analyzed, showing that pipelining achieves a speedup of up to 4.9 in multiple message and tree modes over sequential mode. Also, with only 8% more resources, a 118% improvement in throughput over our previous two-core tree hashing design [11] was observed. Using a dual clock or dynamic clock reconfiguration are possible design improvements that would allow the versatile Skein system to operate at the maximum speed for all three modes. Future work may also include implementations of similar architectures for Skein-512 and Skein-1024.
