School of Computer Engineering, Nanyang Technological University (NTU), Singapore, Singapore

Abstract Block ciphers are the most prominent symmetric-key cryptography kernels, serving as fundamental building blocks for many other cryptographic functions. This work presents RunFein, a tool for rapid prototyping of a major class of block ciphers, namely product ciphers (including Feistel network and substitution-permutation network-based block ciphers). RunFein accepts the algorithmic configuration of an existing/new block cipher from the user through a GUI to generate a customized software implementation. The user may choose from various micro-architectural templates (unrolled, pipelined, sub-pipelined) for the HDL description of the cipher. Various modes of operation and the NIST test suite may also be included. This high-level design approach eliminates the laborious and repetitive development efforts for VLSI realizations of block ciphers. It allows quick design exploration, consequently enabling fast benchmarking in terms of critical resource estimation of various versions/configurations of a cipher that vary in terms of security, complexity and performance. Using RunFein, we have successfully implemented some well-known product ciphers and benchmarked their performance without significant degradation against their published hand-crafted implementations in literature.
Introduction and motivation
The world of cryptography is highly dynamic: newer cryptographic proposals and cryptanalytic attacks are frequently reported. Competitions inviting newer/better block ciphers, stream ciphers and hash functions attract active participation from an ever-increasing cryptographic community. A thorough evaluation of these proposals on software/hardware platforms against diverse parameters must follow. RunFein aids the cryptographer by expediting the traditional VLSI development cycle of a cryptographic system through design automation. The user conveniently specifies the algorithmic and micro-architectural specifications of a cipher at a high abstraction level through a GUI. The RunFein tool seamlessly integrates these sub-structures and algorithmic functions into a working cipher model to generate efficient software and hardware realizations without sacrificing performance compared to their hand-crafted counterparts.
We first discuss two major motivations for developing a rapid prototyping framework for cryptographic functions, followed by the relevant prior work and the major scientific contributions of RunFein in this context.
Cryptography is dynamic
Below, we list the major reasons fueling the ever-changing nature of cryptography (each supported by one example).
1. Cryptanalysis Successful cryptanalytic attempts render the attacked ciphers unsafe for further use and open doors for newer proposals. Countering cryptanalysis also often requires a modification of the original proposal, e.g., RC4+ [2].
2. Better machines The development of custom hardware aids cryptanalytic attacks by making even brute-force attacks feasible against proposals with small key sizes; DES can today be broken in less than a day [3]. Moreover, architectural updates in computing machines influence cryptographic schemes, e.g., BLAKE [4], a hash function, supports different word-size versions to cater to both 32-bit and 64-bit machines.
3. Newer applications The imminent ubiquitous computing era has initiated newer security applications, e.g., lightweight cryptography for resource-constrained devices. Consequently, lightweight cryptographic proposals aiming at a thrifty area-power budget with reasonable security are frequently proposed, e.g., PRESENT [5].
4. Design trade-off Most block cipher proposals support multiple modes of operation and versions with variable key size, block size, number of rounds, etc. These versions let the user choose a performance-security trade-off, e.g., varying the number of rounds in Salsa20 [6].
Tedious design space exploration
An increased interest in successive cryptographic competitions, including AES [7], NESSIE [8], CRYPTREC [9], eSTREAM [10], SHA-3 [11] and CAESAR [12], is evident from the growing number of candidate proposals submitted to them compared to their predecessors. Performance is a decisive factor in selecting a competition finalist; hence, all submitted proposals (that withstand cryptanalytic attacks) must be ranked on the basis of their performance. Computational efficiency, both in hardware and software implementations, was a major reason for the selection of Rijndael as AES [7] and Keccak as SHA-3 [13]. Evaluating a cipher's quality is an involved task, since there are multiple versions/modes of operation, evaluation parameters and implementation platforms to choose from. Table 1 gives a glimpse of the multiple parameters typically used to evaluate the suitability of a cipher for a particular implementation platform. Developing a custom computing architecture and mapping onto known processors are termed here the hardware and software implementation platforms, respectively. For software platforms, the throughput of a cipher is specified in terms of cycles/byte (stream ciphers, PRNGs), cycles/hash (hash functions) or cycles/block (block ciphers). For hardware platforms, the basic parameters mentioned in Table 1 are sometimes taken up as hybrid combinations, e.g., energy/bit, throughput per area ratio (TPAR), etc. Moreover, we may have multiple performance figures on the same computing platform, depending on the software optimizations or hardware configurations chosen for the cipher implementation.
Quantifying the performance of VLSI implementations requires benchmarking against diverse parameters such as area, power, throughput and latency. The traditional VLSI development methods are, firstly, time consuming, requiring manual steps such as architectural design, handwritten RTL, simulation, verification and debug; and secondly, repetitive, as nonconformity to often conflicting design requirements and design constraints after synthesis requires re-architecting the design at the RTL level again. The workload is further compounded by the numerous possible design options in a security-cost-performance trade-off [14, 15] that must be weighed against each other before reaching an optimal point in the entire design space. With a tedious and error-prone manual design methodology, this is hardly possible.
Previous works
The traditional VLSI design cycle can be expedited by automating its various steps with tool support. Various high-level synthesis (HLS) tools have been proposed in this context, both academically and commercially. Notable examples include Synopsys Synphony C Compiler [16], GAUT [17], Xilinx Vivado HLS [18], Mentor Graphics Catapult C [19], LegUp [20], etc. The user gives the untimed specification of the design in a high abstraction level language (HLL), which the tool transforms into fully timed digital hardware. The user may direct the tool to obey certain constraints including target platform, latency, throughput, area, frequency, etc. Based on the constraints, the tool explores various architectural trade-offs and optimizes across the design hierarchy and loop structures to come up with a hardware architecture. The tool develops high-quality architectures at the expense of giving the user only limited control over the choice of generated architecture, allowing limited design space exploration.
For the high-level hardware implementation of symmetric key cryptography, most of the reported efforts [21, 22] focused merely on the proof of concept of cryptographic workloads being viably implemented by an HLS tool chain, without competing in quality with handwritten RTL. Two case studies are worth mentioning in this context; they take up HLL code for modern cryptographic algorithms and generate HDL descriptions with a new-generation HLS tool (Vivado HLS by Xilinx [18]). In [23], all five round-3 SHA-3 candidates were put through the Vivado HLS tool and their performance was benchmarked for TPAR against manual RTL. In spite of various iterations of source code modifications by pragmas (constraints) to economize hardware resources, the TPAR for HLS remains between 62 and 85 % lower compared to manual RTL for various Altera devices. Similarly, a noticeable performance penalty is caused by the HLS tool when various configurations of AES are generated and performance profiled on different families of FPGAs [24]. On a Virtex-7 FPGA, the TPAR of the HLS-generated AES lags behind that of manual RTL by 28-42 %.

Table 1 Parameters and their respective units to evaluate the performance of a cipher on H/W and S/W platforms
The architectural optimizations of these HLS tools, however, remain generic. The user does not have the freedom to choose among hardware micro-architectures specific to the class of cryptographic functions in order to rapidly explore the performance-resources trade-off. Some of these tools have a steep learning curve, as they require learning a new language. Moreover, the quality of the generated HDL depends on the coding style of the designer. Consequently, their results remain suboptimal compared to hand-optimized cryptographic implementations.
The RunFein methodology
RunFein is a rapid prototyping tool, catering only to quick and efficient realizations of block ciphers from user-specified configurations into digital hardware. The user provides three sets of parameters to the tool flow, as shown in Fig. 1: firstly, the algorithmic specifications, comprising constructive elements drawn from a pool of representative elements to define any block cipher; secondly, the architectural specifications of the cipher for HDL generation, which include a mode of operation and one of the various micro-architectures such as unrolled, pipelined, sub-pipelined or bit-sliced implementations; thirdly and optionally, a set of test vectors, if already known, for the verification of the design. The software and hardware generation engines of the tool generate an optimized software implementation and a synthesizable HDL description. The design configuration given to the tool is validated for completeness and correctness at various stages of hardware/software generation. These rule checks detect functional and system-level problems much earlier in the design cycle, improving design reliability and shortening time to market. The tool infers the necessary interfaces and structures to implement optimized HDL along with verification environments and the necessary scripts. It provides seamless end-to-end verification from the configuration to the RTL validation/verification environments.
With a similar motivation as for RunFein, we earlier presented RAPID-FeinSPN [1], which caters to the rapid prototyping of block ciphers but covers only a simple loop-folded hardware implementation. RunFein is a step forward in the direction of hardware optimizations, offering the user various micro-architectural design implementation alternatives. Like any other high-level prototyping tool, RunFein boosts developer productivity by allowing quick hardware resource estimation, early functional validation and speedy exploration/selection of the design space. The features that distinguish it from other HLS tools are highlighted below.
1. HLS tools generally require learning a new high-level language/grammar for algorithm specification, along with synthesis constraints for design specification, making the learning curve of the tool steep. RunFein instead has a language-independent interface with the user and enables a sophisticated design specification capture via a GUI. This also eliminates the dependence of the design quality on the coding style of the programmer.
2. Since RunFein deals with a specific application domain, the possible configurations and optimizations in the target micro-architectures are reduced. Since the hardware architectures, interfaces and dependencies are better understood, RunFein guides the user by presenting a list of viable micro-architectural configurations to pick from, instead of the generic optimization goals of other HLS tools. RunFein consequently allows more control over design optimizations and a methodical design space exploration, requiring fewer iterations to reach a design target.
3. The algorithmic design configuration of a block cipher works as an executable specification for the complete architectural design space exploration (e.g., bit-slicing the design does not require re-configuring the algorithmic specification of the cipher).
4. Additional features, required specifically for the implementation/configuration of block ciphers, can automatically be added to the RunFein-generated software/hardware implementation. These features include the NIST-standardized modes of operation and an integrated NIST test suite for evaluating the statistical randomness of the encrypted data.
Original contributions
The noteworthy contributions of this work are listed below.
1. We surveyed a diverse and wide range of block ciphers to systematically build up a functionally complete set of constructive elements/architectural structures to define the configuration space of any block cipher.
2. The biggest technical challenge is to develop a tool capable of seamlessly integrating these sub-structures and functions into a working model, without sacrificing the performance of the implementation on both software and hardware platforms.
3. The completeness of the configuration model and the effectiveness of the RunFein tool are validated by implementing some prominent block ciphers and benchmarking their performance to rival their manual implementations.
The rest of the paper is organized as follows. Section 2 discusses the categorization of block ciphers as computational kernels. Section 3 gives RunFein tool flow and discusses the configuration space of the product block ciphers. The salient features of the software generation engine of RunFein are discussed in Sect. 4. Section 5 gives the hardware micro-architectures supported by the hardware generation engine. Section 6 explains the area, power and throughput results of two prominent ciphers in various hardware configurations along with a comparison with existing work. Section 7 concludes this paper and provides a future roadmap.
Dwarfs of cryptography
For rapid prototyping of a block cipher, RunFein employs a bottom-up design approach by piecing together elementary operations to form a complete system. The idea is similar in spirit to the 13 computational kernel classes, or so-called Berkeley dwarfs, each capturing the major functionality and data movement pattern across an entire class of important applications [25]. A similar idea is presented by the Intel Recognition-Mining-Synthesis (RMS) view [26]. This concept of design based on computational kernels has been exploited for rapid prototyping in cryptographic applications, e.g., fast hardware implementation of elliptic curve arithmetic operations [27], parameterized cryptanalytic tool flows [28, 29] and rapid prototyping frameworks for cryptographic protocols [30, 31]. Studying these basic kernels across the algorithms of an application class helps in a generic understanding as well as in an optimized implementation [32]. Classifying cryptography under computational dwarfs [25] makes it a subclass of the combinational logic dwarf, along with other computing subclasses.
Next, we first justify why the study of block ciphers out of all the symmetric key cryptography functions is more significant and then investigate the computation kernels of block ciphers.
Workhorses of symmetric key cryptography
By themselves, block ciphers provide secrecy for no more than a single block of data. However, under various modes of operation, they enable data transmission with the major services of information security (InfoSec), including authenticity, integrity and confidentiality. These modes transform block ciphers into other cryptographic primitives, making them the workhorses of symmetric key cryptography and consequently making their study imperative. Other than these operational modes, the basic deterministic transform functions of a block cipher serve as elementary kernels or building blocks for many symmetric key cryptographic protocols. Figure 2 highlights this constructive nature of block ciphers being used as other cryptographic functions, including stream ciphers, hash functions, message authentication codes (MAC) and cryptographically secure pseudo-random number generators (CSPRNG). We mention a few examples of cryptographic functions derived from block ciphers in this context.
1. Stream ciphers Block ciphers are transformed into stream ciphers under counter mode (CTR) and output feedback mode (OFB) [33]. SOSEMANUK [34], an eSTREAM finalist stream cipher, uses the block cipher SERPENT for its construction.
2. Hash functions Hash functions may be derived from a block cipher operating in schemes that make them non-invertible one-way compression functions. WHIRLPOOL is based on an AES-like block cipher operating under a Miyaguchi-Preneel hashing construction scheme [35]. More examples borrowing block cipher constructions include two SHA-3 finalists, BLAKE [4] and Skein [36].
3. MACs MACs may be derived from hash functions (in HMAC mode) or from block ciphers (in OMAC, PMAC and CBC-MAC mode).
4. CSPRNG A CSPRNG can be derived from a block cipher operating in counter mode. Also, running a stream cipher on a counter returns a CSPRNG, with its initial state kept secret.
5. Authenticated encryption Authenticated encryption is generically constructed by combining a block cipher and a MAC operating under a mode of operation, hence simultaneously providing confidentiality, integrity and authenticity assurances on the data. Various modes of authenticated encryption have been standardized by ISO [37].
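The counter-mode constructions in items 1 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not RunFein output: `toy_block_encrypt` is a hypothetical stand-in for E_k built from SHA-256 (it is not a real block cipher), and the 64-bit block size and the function names are chosen only for the example. Counter mode uses only the forward direction of the primitive, which is why such a stand-in suffices to show the wiring.

```python
import hashlib

BLOCK_BYTES = 8  # assumed 64-bit block, chosen only for illustration


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for E_k: NOT a real block cipher.

    CTR mode never inverts the primitive, so a keyed hash truncated
    to one block is enough to demonstrate the construction.
    """
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def ctr_keystream(key: bytes, nonce: int, nbytes: int) -> bytes:
    """Encrypt successive counter values and concatenate the outputs."""
    stream = bytearray()
    counter = nonce
    while len(stream) < nbytes:
        # Wrap the counter so it always fits into one block.
        value = counter % (1 << (8 * BLOCK_BYTES))
        stream += toy_block_encrypt(key, value.to_bytes(BLOCK_BYTES, "big"))
        counter += 1
    return bytes(stream[:nbytes])


def ctr_xor(key: bytes, nonce: int, data: bytes) -> bytes:
    """Stream-cipher use: XOR data with the keystream (same call decrypts)."""
    keystream = ctr_keystream(key, nonce, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))
```

Calling `ctr_keystream` alone corresponds to the CSPRNG use, while `ctr_xor` corresponds to the stream-cipher use; applying `ctr_xor` twice with the same key and nonce returns the original data.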
It is worth highlighting that though block ciphers may serve as the building blocks of many cryptographic functions, these functions may have other roots of origin. Most of the popular stream ciphers are constructed using LFSRs along with some non-linear combining functions and an FSM. Similarly, many CSPRNGs originate from number theory problems. Also worth mentioning is the fact that cryptographic functions take inspiration from each other too. SEAL, HC-128 and HC-256 are stream ciphers that make use of SHA family of hash functions for their key expansion phase, and SHACAL is a block cipher based on SHA-1. Many stream ciphers and CSPRNGs have common roots.
Ingredients of a block cipher
This section presents the classification and typical elements of construction of block ciphers. Since our goal is to define the configuration space of block ciphers for high-level synthesis, we strictly focus on their architectural/operational constructs. Their complexity and cryptanalytic properties are therefore skipped, but can be found in [38, Chapter 7].
A block cipher is a mapping of a plaintext data block of size S_B (blocksize) to an equal-sized ciphertext block under the parameterization of a key (of size S_K, keysize). This deterministic mapping (encryption) should be invertible. The inverse function (decryption) generates the original plaintext given the ciphertext under the same key. Classical/historical block ciphers include Caesar ciphers, affine ciphers, substitution ciphers, polyalphabetic substitutions, etc. These techniques have proven over time to be cryptanalytically vulnerable and are not suitable for practical use today [38, Chapter 7].
Product ciphers form the most popular class of block ciphers (and lightweight block ciphers) used today. A product cipher combines multiple data transformations so as to make the resulting cipher more secure than the individual transformations. These transformations may include permutations (adding diffusion), substitutions (adding confusion), translations (e.g., XOR), linear transformations (e.g., rotation), arithmetic operations, modular multiplication, transpositions, etc. An iterated product cipher involves sequential repetition of a set of transformations called a round function. The round function iterates N_r (roundcount) times during encryption/decryption. For the ith round, a subkey i (of size S_SK) is generated. Two major classes of iterated product ciphers are defined as follows [38
Computational building blocks of symmetric key cryptography
This section attempts to unconventionally classify the major functions of symmetric key cryptography (block ciphers, stream ciphers, hash functions) based on their underlying common computational elements. A small set of three operations, i.e., modular addition (A), bit rotation (R) and bitwise XOR-ing (X), makes a functionally complete set of operations for building any cryptographic function [39, Section 5], including block ciphers, stream ciphers and hash functions. The term AXR (later renamed ARX) was coined by Weinmann [40] in 2009; however, such designs had been proposed much earlier. This combination of linear (X, R) and nonlinear (A) operations, iterated over multiple rounds, achieves strong resistance against known cryptanalysis techniques [39]. Figure 3, a subset diagram, captures the computational kernels of symmetric key cryptography. The bitwise shift operation is added to the ARX pool of operations.
Feistel ciphers may use substitutions and permutations, in addition to the ARX operations, in their round functions. XOR is generally used for key whitening, i.e., combining the round values with the subkey of that round. The addition operation might not be explicitly used in round operations; however, an up/down counter is always required for the realization of the encryption/decryption block, respectively. DES has a Feistel structure but employs S-Boxes and P-Boxes for its round operation. AES [45] does not have any P-Boxes and instead uses Galois field multiplication. It is noteworthy that this computational categorization highlights only the commonalities as a trend in cryptographic functions. This categorization is neither complete nor, by definition, binding to a particular class of ciphers. Consequently, exceptions exist, e.g., the TEA [46] family of lightweight block ciphers (XTEA, XXTEA) are Feistel network ciphers by structure and use shift operations in addition to ARX. AURORA, a hash function submitted to the SHA-3 competition, has a structure that combines an SPN with a generalized Feistel structure [47].
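To make the ARX operation set concrete, the following minimal sketch combines modular addition, rotation and XOR on 32-bit words into one invertible mixing step. The step itself (`arx_step`) and its rotation amounts are hypothetical and not taken from any published cipher; the sketch only illustrates how the three operations compose and how each can be undone.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word size


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits (R)."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32


def rotr32(x: int, r: int) -> int:
    """Rotate a 32-bit word right by r bits."""
    return rotl32(x, (32 - r) % 32)


def arx_step(a: int, b: int, r: int):
    """One illustrative mixing step: add, then rotate, then XOR."""
    a = (a + b) & MASK32      # A: addition modulo 2^32 (nonlinear over GF(2))
    b = rotl32(b, r) ^ a      # R and X: rotation followed by bitwise XOR
    return a, b


def arx_step_inverse(a: int, b: int, r: int):
    """Undo arx_step by reversing the operations in the opposite order."""
    b = rotr32(b ^ a, r)
    a = (a - b) & MASK32
    return a, b


def arx_rounds(a: int, b: int, rotations):
    """Iterate the step once per rotation amount, as an iterated cipher would."""
    for r in rotations:
        a, b = arx_step(a, b, r)
    return a, b
```

Because every operation is individually invertible, a decryption path can be obtained by applying `arx_step_inverse` with the rotation amounts in reverse order.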
Classifying cryptographic functions on the basis of their primitive computational elements brings forward a surprisingly simple angle of viewing them, beneficial to their implementation on both hardware and software platforms. RunFein is developed around the concepts of modularity and extensibility. It supports the constructive composition of cryptographic building blocks for SPN/Feistel network-based block ciphers, which are the favored structures for block ciphers today. Additionally, stream ciphers based on block ciphers (e.g., Salsa20 [6]) can be realized using RunFein. It can also model Lai-Massey structure block ciphers and can be conveniently extended to support newer structures/components if/when the need arises.
RunFein tool flow
The tool flow of RunFein is graphically shown in Fig. 4 . The user populates the configuration space of a new block cipher to get customized software and hardware implementations. A sophisticated design capture is made possible via a GUI to let the user conveniently specify cipher design and implementation customization. The configuration for a cipher could be added, parameter by parameter, or could be saved and loaded later. A list of known ciphers is available to instantly load the configurations for easier manipulation. RunFein validates this design capture for completeness and correctness at various stages of the tool flow. It successfully abstracts away the diversity of the design space by translating the configuration to a generic block cipher template. The configurations undergo a set of design rule checks before generating the software and hardware implementations.
Cipher configuration space
A key challenge addressed in this work is to identify a complete set of algorithmic primitives and architectural substructures that is generic enough to configure a range of block ciphers and their implementations. After survey of diverse ciphers, we developed sub-structures and component lists to develop primitive libraries for software and hardware realizations. We propose a so-called layered architecture where each layer specifies a data transformation specified by the operation. To fully appreciate the concept of layers of operations, we consider the data flow graph of the cipher (and its key expansion) where data move from top to bottom. The layers are then the horizontal divisions of the data flow diagram.
The configuration parameter set is categorized into algorithmic parameters, modes of operation, micro-architectural parameters and test vectors (all parameterizable attributes that a user must populate are highlighted in the following discussion).
Algorithmic parameters
The parameters that define the algorithmic construction of a block cipher are as follows.
- Basic parameters The input plaintext to a block cipher (encryption) and its output ciphertext are of equal size, the blocksize S_B (all sizes are specified in bits). The size of the key is specified as S_K (for some operational modes, an initialization vector (IV) of size S_IV must also be specified). The granularity of the cipher is specified as its word size (S_W). Block ciphers iterate a deterministic combination of operations known as a round. The rounds operate N_r (roundcount) times during encryption/decryption.
- Round layers For most block ciphers, the data undergo an initial and/or final transformation, before and/or after the rounds processing, respectively. Since these transformations may differ from each other and from the central round transformation, we name them round_init and round_final, while the round transformation is referred to as round_middle. These three kinds of rounds are defined by a series of layers of operations. Every layer comprises at least one of the following operations, performed exclusively on the user-specified portions of the layer input (this list could be conveniently extended to accommodate newer operations).
1. Substitution or permutation boxes (S-Box, P-Box).
2. Galois field multiplication GF-mul with another polynomial (the primitive polynomial must be specified).
- Kround operation For each round, a sub-key (of size S_SK) is generated through key expansion. Like rounds, key expansion requires iterating the kround transformation N_r (roundcount) times to generate the sub-keys. kround may also have different definitions for kround_init, kround_middle and kround_final, and each of them is defined by layers of operations like the cipher rounds.
The layers of each round have a layernumber to specify their order of execution within that round. The input to and output from a layer may differ in size (bits) due to an expansion/contracting layer operation and is specified as S_lin and S_lout, respectively.
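The layered configuration described above can be pictured as a small data structure. The sketch below is a hypothetical model, assuming Python dataclasses; the class and field names (`Layer`, `Round`, `CipherConfig`) are illustrative and do not reflect RunFein's internal representation or its XML schema. The PRESENT-80 instance follows the PRESENT specification [5]; the empty round_init and the per-layer sizes of 64 and 80 bits are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Layer:
    layernumber: int  # order of execution within the round
    operation: str    # e.g. "ARK", "S-Box", "P-Box", "ROTATE"
    s_lin: int        # layer input size in bits
    s_lout: int       # layer output size in bits


@dataclass
class Round:
    layers: List[Layer] = field(default_factory=list)

    def ordered_operations(self) -> List[str]:
        """Operations sorted by layernumber, i.e. top to bottom in the data flow."""
        return [l.operation for l in sorted(self.layers, key=lambda l: l.layernumber)]


@dataclass
class CipherConfig:
    blocksize: int              # S_B
    keysize: int                # S_K
    rounds: Dict[str, Round]    # round_init / round_middle / round_final
    krounds: Dict[str, Round]   # kround_init / kround_middle / kround_final


# Illustrative instance for PRESENT-80 (layer sizes are assumed, see lead-in).
present80 = CipherConfig(
    blocksize=64,
    keysize=80,
    rounds={
        "round_init": Round(),
        "round_middle": Round([
            Layer(1, "S-Box", 64, 64),
            Layer(0, "ARK", 64, 64),
            Layer(2, "P-Box", 64, 64),
        ]),
        "round_final": Round([Layer(0, "ARK", 64, 64)]),
    },
    krounds={
        "kround_init": Round(),
        "kround_middle": Round([
            Layer(0, "ROTATE", 80, 80),
            Layer(1, "S-Box", 80, 80),
            Layer(2, "AddCounter", 80, 80),
        ]),
        "kround_final": Round(),
    },
)
```

Keeping `layernumber` explicit, rather than relying on list order, mirrors the text: the order of execution is a property of each layer, so layers may be entered in any order and sorted when the cipher model is built.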
Modes of operation
RunFein lets the user choose from a list of modes of operation to add chaining dependencies between adjacent blocks of data during encryption/decryption. Currently, any of the NIST-standardized modes of operation may be chosen for implementation [33], as listed in Table 2. Here, C_i represents the ciphertext of the ith plaintext block after the encryption function E_k, parameterized by the secret key k, while P_i represents the plaintext after decryption. Due to the chaining dependencies, multiple blocks of data cannot be subjected to encryption or decryption in a parallel fashion for some modes of operation, as indicated in Table 2. For all modes other than ECB, the user specifies the IV and any additional parameters required.
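The chaining dependency can be illustrated by contrasting ECB with CBC, two of the NIST-standardized modes. In CBC, C_i = E_k(P_i XOR C_{i-1}) with C_0 = IV, so block i cannot be encrypted before block i-1. The sketch below is a minimal illustration, assuming a toy invertible 64-bit function (`toy_encrypt`, hypothetical and cryptographically worthless) in place of a real cipher; blocks are represented as Python integers.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit block size


def toy_encrypt(k: int, p: int) -> int:
    """Hypothetical invertible stand-in for E_k (XOR, rotate, add): not secure."""
    x = (p ^ k) & MASK64
    x = ((x << 7) | (x >> 57)) & MASK64
    return (x + k) & MASK64


def toy_decrypt(k: int, c: int) -> int:
    """Inverse of toy_encrypt."""
    x = (c - k) & MASK64
    x = ((x >> 7) | (x << 57)) & MASK64
    return x ^ k


def ecb_encrypt(k, blocks):
    # No chaining: every block is independent and could be processed in parallel.
    return [toy_encrypt(k, p) for p in blocks]


def cbc_encrypt(k, iv, blocks):
    # Chaining: each input depends on the previous ciphertext block.
    out, prev = [], iv
    for p in blocks:
        prev = toy_encrypt(k, p ^ prev)
        out.append(prev)
    return out


def cbc_decrypt(k, iv, blocks):
    # Decryption needs only ciphertext blocks, so it has no serial dependency.
    prevs = [iv] + list(blocks[:-1])
    return [toy_decrypt(k, c) ^ prev for c, prev in zip(blocks, prevs)]
```

With identical plaintext blocks, ECB yields identical ciphertext blocks, whereas the CBC feedback generally makes them differ; this is the dependency that prevents parallel encryption in the chained modes.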
Putting things together
We take up two ciphers and work out their algorithmic configuration according to the layered architecture definition methodology of RunFein discussed above. These are AES-128 [45], due to its widespread usage, and PRESENT [5], due to its ultra-lightweight nature. Moreover, both of these ciphers have been standardized by ISO. We define the configuration space of these ciphers for the encryption blocks only. The reader is kindly requested to refer to the documentation of these ciphers for a detailed understanding of their functionality [5, 45]. Table 3 shows the basic parameter configuration space for the 80-bit key version of the PRESENT cipher. The configuration parameters for PRESENT-80 are fed to the tool's GUI and are stored as an XML configuration file, a snapshot of which is shown in Fig. 5. A separate token ALGORITHM holds the basic parameters and the round and key round operational layers. The arguments for the S-Box and P-Box can either be loaded from a text file or entered by the user in the edit boxes. The round_final is specified by one layer of ARK, similar to the first layer of round_middle. Hence, the ciphertext is taken out after the first ARK layer in the last iteration of cipher encryption. Figure 5 shows the KROUND token that configures the key expansion information. For key expansion in PRESENT-80, kround_init and kround_final are not required and are hence defined as having no layers. The kround_middle requires three layers of operations, as defined below.
PRESENT-80
- layer0 is the ROTATE operation, configured to carry out a left rotation by 61.
- layer1 is the S-Box. The user specifies one S-Box inserted at word number 19 of the key, the most significant nibble of the layer input. The rest of the bits are passed on unaltered.
- layer2 is the AddCounter that XORs the selected bits of the data input to the layer (bits 19 down to 15) with a 5-bit counter (the round counter).

A round counter increments till it reaches N_r − 1, at which point a valid ciphertext is available.
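The three kround_middle layers listed above can be sketched directly on an 80-bit integer. This is a minimal sketch, not RunFein-generated code: the function names are hypothetical, the S-Box values are those of the PRESENT specification [5], and two details come from that specification rather than from the text above, namely that the round counter starts at 1 and that each sub-key is the 64 most significant bits of the key state.

```python
# PRESENT 4-bit S-Box, as given in the PRESENT specification [5].
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
MASK80 = (1 << 80) - 1


def kround_middle(k_state: int, round_counter: int) -> int:
    """Apply layer0..layer2 of the PRESENT-80 key round to the 80-bit key state."""
    # layer0: ROTATE left by 61 (equivalently, right by 19) on 80 bits.
    k_state = ((k_state << 61) | (k_state >> 19)) & MASK80
    # layer1: S-Box on word number 19, the most significant nibble;
    # the remaining 76 bits pass through unaltered.
    top = SBOX[k_state >> 76]
    k_state = (top << 76) | (k_state & ((1 << 76) - 1))
    # layer2: AddCounter, XOR the 5-bit round counter into bits 19..15.
    k_state ^= (round_counter & 0x1F) << 15
    return k_state


def round_keys(key: int, count: int):
    """Return the first `count` 64-bit sub-keys (assumed: top 64 bits of the state)."""
    keys = []
    k_state = key & MASK80
    for counter in range(1, count + 1):  # counter assumed to start at 1
        keys.append(k_state >> 16)
        k_state = kround_middle(k_state, counter)
    return keys
```

For an all-zero key, the first sub-key is zero, since kround_init has no layers and the input key is used as is; later sub-keys become non-zero purely through the S-Box and counter layers.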
AES-128 [45]
For AES-128, the corresponding parameters for RunFein are specified as given in Table 3 . The round_init requires one operation layer, i.e., ARK. round_middle is defined by 4 layers.
- layer0 is an S-Box. The user specifies 16 S-Boxes to be inserted, along with the S-Box definition of 256 bytes.
- layer1 is a Shift-rows operation. It is a compound operation that takes the layer input as a 2-D matrix and re-arranges the words of each row with fixed offsets.
- layer2 is a GF-Mix, a compound operation assuming 2-D arranged data. The user specifies a 4 × 4 matrix of column coefficients for GF(2^8) multiplication.
- layer3 is the ARK that XORs the key with the data.
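The GF-Mix layer can be illustrated with a small sketch of GF(2^8) multiplication applied to one column of the state. The function names are hypothetical; the reduction polynomial 0x11B and the 4 × 4 coefficient matrix are those of the AES specification [45], which is what a user would enter for this layer.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES reduction polynomial

# MixColumns coefficient matrix from the AES specification.
MIX_MATRIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def gf_mul(a: int, b: int, poly: int = AES_POLY) -> int:
    """Multiply two bytes in GF(2^8) using shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce as soon as the degree reaches 8
        b >>= 1
    return result


def gf_mix_column(column, matrix=MIX_MATRIX):
    """Multiply the coefficient matrix by one 4-byte column over GF(2^8)."""
    mixed = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        mixed.append(acc)
    return mixed
```

Applying `gf_mix_column` to each of the four columns of the 2-D state corresponds to one GF-Mix layer; a different cipher would only swap the polynomial and the coefficient matrix.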
Using this layered architecture, a cipher may have multiple valid definitions. The Shift-rows operation in layer1 may have been defined using various layers, each rotating one row of the state matrix, as defined by the AES specifications. We define it as a standard compound operation since it is a common operation used in ciphers other than AES, e.g., LED.
The round_final is defined by three layers, same as layer0, layer1 and layer3 of round_middle. For each round, a subkey is generated through a kround. The kround_init is a nop layer, since the first sub-key is the input key itself. The kround_final is not required and hence not defined. kround_middle requires 7 layers of operations for its definition as shown in Fig. 6 .
- layer0 is a ROTATE left by 8 layer. It takes the least significant 32-bit word of the key. This layer also expands 128 bits of input to 160 bits of output by concatenating the unaltered input bits with the rotated word output.
- layer1 is the S-Box; four S-Boxes are inserted on the four least significant bytes of the layer1 input.
- layer2 is an XOR with counter-dependent constants (RCON). The constants are specified by the user using a text file.
- layer3-layer6 are XOR operations, performing selective XOR-ing of layer inputs as per the AES specifications.

Figure 16 shows a GUI snapshot of the round layer operational specification for AES-128 in RunFein.
Cipher model creation and validation
The RunFein framework provides a sophisticated configuration capture via a GUI (some snapshots of the GUI are presented in the "Appendix"). It provides convenient default values in the GUI wherever necessary; the configuration file with default values for micro-architecture and test vectors for PRESENT-80 is shown in Fig. 5 (further discussion follows in the next section). Other than the parameters specified by the user through the GUI, some parameters are inferred by the tool. A counter is required to keep track of the iterations of the cipher. It counts up or down during encryption or decryption of a block of data, respectively. Its size is taken as $\log_2(N_r)$ bits. Other than the counter, there are two variables, namely d_state and k_state, which contain the updated data state and key state, respectively (in a hardware implementation these values are held in D flip-flops instead).
Before creation of a valid cipher model, the configuration parameters given by the user undergo a list of defined rule checks. The user is prompted in case of a violation, and cipher implementation does not proceed unless a valid configuration is specified (some additional rules related to hardware micro-architectures are discussed in Sect. 5.3.6).
-Blocksize of any cipher by definition equals the sizes of plaintext/ciphertext: $S_B = S_P = S_C$.
-$S_B = 2m$, where $m \geq 1$.
The configuration file is parsed by RunFein and the cipher model is created; for PRESENT-80 and AES-128 it is shown in Fig. 6. The cipher model comprises a controller and a datapath. The controller is simply the inferred counter (not shown in Fig. 6), and the datapath of the cipher is constructed from the operational layers of round and kround. A multiplexer is also inferred at the input to the d_state and k_state registers, controlled by the round count. For PRESENT-80, the last round or round_final comprises the ARK layer only and hence the ciphertext is extracted after layer0. For AES-128, layer0 of kround expands the key and layer3 contracts it back to 128 bits.
Software generation engine
The software generation engine takes either the user-specified configuration of a new cipher or alternatively loads the design configuration of a known cipher (Fig. 4). One also specifies data for plaintext, key and IV (through text files or edit boxes) using the GUI. RunFein compiles the cipher model to generate a high-performance, fixed-point ANSI-C description. The code is enhanced by a simulation environment with user-controllable switches for verification, throughput profiling, data dumping, etc. The generated code is not specifically optimized for a particular general purpose processor (GPP); however, it has a regular structure and good code readability.
All the configuration parameters of the cipher (as specified in the XML file listing in Fig. 5) are #defined in a header file. This includes all basic configurations, test vectors and the micro-architecture, though the software implementation only caters to the default values of a simple iterative loop-folded implementation. Data types of registers, layers and all interfaces are typedef-ed in accordance with their respective specified granularity. Supplementary functions are kept in a separate file that is included in the main file during simulation. These functions include datatype conversion functions (e.g., conversion of hexadecimal to binary arrays and vice versa), data dumping and verbose simulations. For each operational layer of round and kround, a separate function is defined with interface and functionality as specified by the user. Layers may operate on operands with different granularity, i.e., P-Box operates on bits, S-Box operates on $S_W$, etc. The functions generated include relevant calls to granularity conversion functions in addition to the functionality of the layer operation.
The main body of code, having the controller and the datapath of the cipher model, is a separate file that #includes all supplementary and header files. To elaborate the code simulation environment, we refer to the simplistic pseudocode for encryption of one block of data given in Algorithm 1. The plaintext and key are assigned to the local variables d_state and k_state, respectively (lines 1 and 2). k_state is updated first by the kround_init function. Using the updated key, the d_state is updated using the round_init function (line 4). The controller part of the cipher comprises the counter variable, keeping track of the round under execution. The loop starting in line 6 iterates roundcount − 1 times and keeps updating the data and key registers. The final round generates the last k_state, which is used by round_final to generate the ciphertext (lines 9 and 10, respectively). RunFein-generated code for AES is presented in the appendix of [1].
Input: plaintext, key, configuration

Algorithm 1: RunFein Encryption Pseudocode
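The control flow described for Algorithm 1 can be sketched in Python as below. This is an illustrative reading of the prose, not the generated ANSI-C: `encrypt_block`, the `cipher` dictionary and the 8-bit `toy` layer functions are hypothetical placeholders, and the fallback from kround_final to kround_middle is an assumption for ciphers that do not define a final key round.

```python
def encrypt_block(plaintext, key, cipher, round_count):
    """Encrypt one block following the init / middle loop / final structure."""
    d_state = plaintext
    k_state = key
    k_state = cipher["kround_init"](k_state, 0)
    d_state = cipher["round_init"](d_state, k_state)
    # The controller: a counter tracking the round under execution.
    for counter in range(1, round_count):
        k_state = cipher["kround_middle"](k_state, counter)
        d_state = cipher["round_middle"](d_state, k_state)
    # Assumption: reuse kround_middle when no kround_final is defined.
    kround_final = cipher.get("kround_final", cipher["kround_middle"])
    k_state = kround_final(k_state, round_count)
    return cipher["round_final"](d_state, k_state)

# A toy 8-bit "cipher" used only to exercise the control flow.
toy = {
    "kround_init": lambda k, i: k,                                    # nop layer
    "round_init": lambda d, k: d ^ k,                                 # ARK
    "kround_middle": lambda k, i: (((k << 1) | (k >> 7)) & 0xFF) ^ i, # rotate, XOR counter
    "round_middle": lambda d, k: ((d + 1) & 0xFF) ^ k,
    "round_final": lambda d, k: d ^ k,                                # ARK
}

ciphertext = encrypt_block(0x00, 0x01, toy, 2)
```

Passing the layer functions in a dictionary mirrors the idea that the same driver serves any cipher whose rounds are described as layered functions.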
The software generation engine of RunFein generates a single-threaded, untimed, sequential C model of the block cipher with the necessary libraries and scripts. Some of its additional features are highlighted below.
-NIST Test Suite RunFein has integrated with it the NIST test suite [48] to characterize the statistical qualities of PRNGs. It serves as a first step in determining the suitability of a PRNG used for cryptographic purposes. Figure 17 gives a GUI snapshot of RunFein for the selection and parameterization of various statistical tests available for execution as per the user wishes (RunFein caters only to the block ciphers; however, they behave like stream ciphers and CSPRNGs under certain modes of operation).
-Verification For the verification of the generated model according to the user-specified test vectors, a verification environment is generated. For new proposals without defined test vectors, the verification switches may be turned off by the user.
-Performance profiling The user may enable a performance profiling environment in the generated software implementation to evaluate the encryption speed (in seconds, cycles/byte) of the cipher design. Encryption of bulk data from random plaintext is also supported for monitoring data randomness. A reasonably efficient generated implementation may be further manually optimized for a specific platform.
Hardware generation engine
The hardware generation engine of RunFein requires additionally the micro-architectural configuration of the cipher model to be specified by the user, other than the algorithmic configuration to generate a complete working model of the block cipher in synthesizable HDL along with a test bench and necessary scripts. First, the viability of the chosen micro-architecture configuration is evaluated by RunFein by a list of rule checks. Then some optimizations are carried out to enable hardware reuse, e.g., the reuse for middle and final rounds of the algorithm by gauging the commonalities between the two. Since for PRESENT-80, the round_final is a single ARK operation, the final ciphertext is therefore taken out after the first layer of round_middle. For AES-128, the middle round and last round differ only in one layer, i.e., GF-mul. A bypass mux is automatically inserted, enabled at the final round as shown in Fig. 6 . After design validation and optimizations, RunFein generates the digital design as an Architecture Description Language (ADL) and relies on Synopsys Processor Designer [49] as high-level synthesis framework for generation of synthesizable HDL code, as shown in Fig. 7 . The language allows full control over minute design decisions and preserves the overall structural organization neatly in the generated hardware description. This design is profiled to get critical parameters like the maximum clock frequency of the design, chip area and power consumption.
LISA ADL
RunFein generates the cipher in an ADL called Language for Instruction-Set Architectures (LISA) [50] . This language offers rich programming primitives to capture an implementation of a design with full programmability to an Application-Specific IC (ASIC).
Before discussing the language semantics, it is useful to understand the key ideas of the high-level modeling using LISA. In LISA, the complete implementation is viewed as a directed acyclic graph (DAG) of LISA OPERATIONs. The OPERATIONs contain state CODING (encoding), state BEHAVIOR and ACTIVATION to successor OPERATIONs. As can be seen in Fig. 8, showing a typical state machine modeling, one common OPERATION can be activated by multiple parent OPERATIONs. Similarly, one OPERATION can activate multiple children OPERATIONs. The complete structure is an annotated DAG $D = \langle V, E \rangle$, where $V$ represents the set of LISA OPERATIONs and $E$ the graph edges as a set of child-parent relations. On top of this description, the structural information, such as pipelining, memory partitioning and storage accesses, is added.
BEHAVIOR description
The BEHAVIOR section gives the behavior of a LISA OPERATION. It is described in an extended C programming language (supporting bitwise manipulations and user-defined datatypes). The BEHAVIOR description constitutes the combinatorial part of the design that can access the clocked resources too, declared in a global RESOURCE section.
State CODING description
The CODING section of a LISA operation is used to describe the state's encoding. The encoding of a LISA operation is described as a sequence of several coding elements. Each coding element is either a terminal bit sequence with "0", "1" and "don't care" bits or a nonterminal. A nonterminal coding element can point to either an instance of a LISA operation or a GROUP of LISA operations. The behavior of a LISA operation is executed only if all terminal coding bit patterns match, all nonterminal instances match and at least one member of each group matches. The root LISA operation containing a coding section is referred to as the coding root.
LISA RESOURCE section
RESOURCES consist of general hardware resources for storage and structure such as memory, registers, internal signals and external pins. Memory and registers provide storage capabilities. Signals and pins are internal and external resources without storage capabilities. RESOURCEs can be parameterized in terms of sign, bit-width and dimension. Memories can be more extensively parameterized. There the size, accessible block size, access pattern, access latency, endian-ness can be specified. RESOURCEs are globally accessible from any OPERATION. Memories are accessed via a pre-defined set of interface functions. These interface functions comprise blocking and non-blocking memory access possibilities. RESOURCE section allows definition of micro-architecture by using the keywords PIPELINE and PIPELINE_REGISTER. With the pipeline definition, all the LISA operations need to be assigned in a particular pipeline stage. Interfaces are defined by PINs.
LISA description generation
RunFein generates the LISA description of the cipher model as per the micro-architecture chosen by the user for implementation. Figure 9 gives a partial code listing of the LISA-based PRESENT-80 encryption-only description for a loop-folded implementation. The configuration specified by the user is parsed and converted from the XML file (as specified in Fig. 5 for PRESENT-80) to a header file that is included in all the code listing files. The RESOURCE section first specifies the I/Os of the cipher model using IN and OUT PINs; the Twire specifies non-buffering of the output ciphertext, while the sizes are specified as constants in the defines file (all capitalized). Figure 10 shows the interface of the generated HDL model. Other than the clock and reset pins (not specified in the LISA model), it also has the instruction (inst) coming into the cipher module, read from the program memory location that the program counter (PC) is currently pointing to. This circuitry is external to the design under test (DUT) and consequently is not included in the area/power estimates done later. The pipeline architecture of the cipher model is specified by a single pipeline stage named EX (no pipelining for loop-folded implementation) in line 17 of the code listing given in Fig. 9. The three registers of the model in Fig. 10, i.e., d_state, k_state and round_count, are specified as REGISTERs in lines 18, 19 and 20, respectively, of the LISA code listing.
Fig. 9 RunFein generated partial LISA code listing for PRESENT-80 encoding block
Fig. 10 Interface of RunFein-generated block cipher HDL
The execution starts from the main OPERATION, which ACTIVATES another OPERATION to fetch the next instruction. fetch maintains the update of the PC and reads the next instruction. It further ACTIVATES the decode child OPERATION, which is also the coding root for the model. Since the target application that RunFein generates is specifically a cipher algorithm, the LISA processor need not have more than two instructions, i.e., an initialization and a round instruction. Consequently, the instruction word is a single bit only, controlling the two multiplexers at the input of the two registers in the datapath of the design and clearing or incrementing the round_count in the controller of the DUT, as shown in Fig. 10. From the decode OPERATION, the control forks to either of the two instructions, as shown in line 37 of Fig. 9. If the current instruction is init, the init OPERATION is ACTIVATED, which initializes the registers d_state, k_state and round_count as shown by the BEHAVIOR section of the init OPERATION (line 46). Otherwise, the round OPERATION is activated, which has four OPERATIONs in its ACTIVATION list. Out of these four OPERATIONs, two execute the operational layers of k_round and round, i.e., kr_layers and r_layers, respectively, while two buffer the updated values of data and key in d_state and k_state, i.e., the d_reg and k_reg OPERATIONs, respectively. It should be highlighted here that the operations in the ACTIVATION list are executed in a non-blocking fashion and consequently dependence dictates the order of execution.
For PRESENT, the key round and the round each comprise three operational layers; hence kr_layers and r_layers further ACTIVATE three OPERATIONs each. Consequently, the r_layers OPERATION has r_layers0, r_layers1 and r_layers2 in its ACTIVATION list, as shown in line 68 of Fig. 9. For the first of these layers, i.e., the r_layers0 OPERATION that is an add round key (ARK) operation, the code listing is specified in line 73. The temp_data holds the XOR-ed value, which is assigned as the layer0 output to r_layer0_out, a global variable declared in the RESOURCE section of the code listing. For the last round, this value holds the ciphertext too, as can be seen in Fig. 6.
Each OPERATION of r_layers and k_layers is defined by RunFein according to the user-specified operation; a simplistic mapping of the layers into operations is carried out. (The extensible RunFein framework enables/encourages multiple customized LISA definitions of operations.)
-S-Boxes are implemented as read-only lookup tables (LUTs).
-Diffusion operations like rotation, shifting and P-Boxes are all implemented using rewiring of the inputs and consequently add no overhead to the combinational area and delay of the circuit.
-GF-mul is implemented by shifting and XOR-ing operations in accordance with the primitive polynomial of the finite field specified.
-Supported popular compound operations (e.g., MixColumns) have cascaded implementations of their constituent operations.
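The first three mappings above can be mirrored in a short Python sketch. It uses the published PRESENT S-Box and bit permutation and the AES reduction polynomial as concrete instances; the function names (`sbox_layer`, `p_layer`, `xtime`) are hypothetical and the code models behavior only, not the generated LISA or HDL.

```python
# PRESENT 4-bit S-Box as a read-only lookup table.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def sbox_layer(state):
    """Apply the LUT to each of the 16 nibbles of a 64-bit state."""
    out = 0
    for i in range(16):
        out |= PRESENT_SBOX[(state >> (4 * i)) & 0xF] << (4 * i)
    return out

def p_layer(state):
    """Move bit i to position 16*i mod 63 (bit 63 stays): pure rewiring."""
    out = 0
    for i in range(64):
        dest = 63 if i == 63 else (16 * i) % 63
        out |= ((state >> i) & 1) << dest
    return out

def xtime(a):
    """Multiply by x in GF(2^8): one shift and a conditional XOR (AES polynomial 0x11B)."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a
```

In hardware the loop in `p_layer` disappears entirely, since each output wire is simply connected to a different input wire; only the S-Box LUTs and the XORs of `xtime` cost gates.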
Other than the LISA description generated by the RunFein tool, the respective assembly file for the processor is also generated. For PRESENT-80, it comprises one init and several round instructions, as specified by the syntax portion of the instructions in lines 44 and 57 of Fig. 9. The design is validated using the Synopsys Processor Designer Compiler, Assembler and Debugger. It can then be converted to a synthesizable, hierarchical block cipher HDL with necessary scripts that can be further used to carry out
-Simulations for design verification, gate-level simulation (post-synthesis) using verification tools.
-Logic synthesis of the design for profiling critical parameters like the maximum clock frequency and chip area.
-Post-synthesis power consumption estimation using back-annotation.
It must be emphasized that the LISA description generated by RunFein contains the complete architectural details of the cipher processor. The translation of this generated LISA ADL design into a synthesizable HDL using Synopsys Processor Designer eases design debugging and conversion; however, it does not alter/improve the design architecture. Section 5.2 discusses the LISA description for a loop-folded architecture. For the rest of the micro-architectures offered by RunFein (discussed below), the cipher model and the consequent LISA description generation are adapted accordingly, e.g., for the bit-slicing micro-architecture, the interface sizes of plaintext and the d_state register are smaller than the blocksize, a counter s_cnt is added to keep track of the slices processed, etc. Since the description of a micro-architecture generated by RunFein and its LISA description are analogous, for the rest of the paper we discuss only the structural details of the micro-architectures that RunFein offers, skipping the equivalent LISA description.
Supported micro-architectures
Through RunFein, the user can quickly explore various micro-architecture design options residing at different points of the performance-area trade-off. The user always specifies the algorithmic configuration of the cipher design according to the simplistic loop-folded architecture. In addition, the user must specify the micro-architecture that RunFein is to implement automatically. By tweaking the micro-architecture configuration, the user may opt for parallel implementations (loop sub-pipelining/unrolling) that duplicate hardware to boost throughput, or for bit-sliced designs that economize area/power through resource sharing at the expense of lower throughput. We discuss these micro-architectures individually; they are depicted in Fig. 11.
Loop folded
A typical loop-folded block cipher implementation performing one round per clock cycle ($N_r$ cycles per block) is shown in Figs. 6 and 11a. It is the default hardware implementation micro-architecture of RunFein and serves as a middle point for the area-throughput trade-off between parallel implementations and bit-sliced implementations. The controller comprises a round counter register, incrementing every cycle (Fig. 12a). The selection of plaintext or folded data for the d_state register is controlled by this register. A valid ciphertext is generated when the counter register hits $N_r$.
Loop unrolled
The loop unrolled configuration replicates round (and kround) resources u times to execute multiple rounds in one clock cycle, where u is the unrolling factor. Consequently, the critical path of the circuit lengthens, decreasing the maximum operational frequency, and the area also increases. The counter increments by u per cycle, since the design requires $N_r/u$ cycles for encryption of a complete block ($N_r/u$ must be an integer), as shown in Fig. 12b. A higher throughput is expected since the register propagation delay and setup time are incurred only once in the combinational delay for u rounds. This gain in throughput is hard to quantify without experimentation; hence synthesis profiling is required (a twice unrolled hardware configuration is shown in Fig. 11b). Two critical design points relevant to loop unrolling are
-A fully unrolled architecture with $u = N_r$ encrypts/decrypts a block of data in a single cycle (Fig. 11c). The RunFein hardware generation engine optimizes the hardware for the round_final if it is different from the round_middle. The round_middle hardware is replicated (u − 1) times and the round_final hardware is instantiated once.
-A loop unrolling with pipelining architecture can be chosen by the user to automatically insert pipeline registers between unrolled rounds. Consequently, the critical path of the design does not increase due to unrolling, and the design handles multiple blocks of data simultaneously. The configuration in Fig. 11d processes two blocks of data in a total of $N_r$ cycles, boosting throughput by u. A supplementary counter, s_cnt, keeps track of the unroll factor, which when fulfilled generates the load signal for counter to increment directly by u (Fig. 12c). Hence in subsequent cycles, u-many valid ciphertexts are generated when counter equals the roundcount.
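The cycle arithmetic for unrolling can be sketched in Python as follows. The helper names are hypothetical, and the model only counts cycles; it says nothing about the achievable clock frequency, which the text notes must come from synthesis profiling.

```python
def valid_unroll_factors(n_rounds):
    """Unroll factors u for which N_r / u is an integer."""
    return [u for u in range(1, n_rounds + 1) if n_rounds % u == 0]

def cycles_per_block_unrolled(n_rounds, u):
    """Cycles to encrypt one block when u rounds execute per clock cycle."""
    if n_rounds % u != 0:
        raise ValueError("N_r / u must be an integer")
    return n_rounds // u

def unrolled_pipelined(n_rounds, u):
    """With pipeline registers between unrolled rounds: u blocks in N_r cycles."""
    if n_rounds % u != 0:
        raise ValueError("N_r / u must be an integer")
    return {"blocks_in_flight": u, "cycles_for_all_blocks": n_rounds}
```

For a cipher with 10 iterations, `valid_unroll_factors(10)` returns `[1, 2, 5, 10]`; a factor of 1 is the loop-folded case and a factor of 10 the fully unrolled one.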
Sub-pipelining
Using RunFein, the user may choose to insert a sub-pipeline between any two layers in a round to reduce the critical path of the design. To ensure data consistency, for s sub-pipelines inserted in a cipher round, an equal number of sub-pipelines should be specified by the user to be inserted in kround as well. To do so, the user must enable the sub-pipelining option in order to insert various operations along with a sub-pipeline register, as shown in the GUI snapshot (Fig. 18). Insertion of each sub-pipeline increments the number of blocks being processed, i.e., s sub-pipelines make the cipher design handle (s + 1) data blocks simultaneously (Fig. 11e shows s = 1). A supplementary register s_cnt is inserted to keep track of the sub-pipeline (Fig. 12d). If the user wishes to insert a sub-pipeline within a layer, the user must first redefine that layer as two layers, split at the cut-set point.
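The interleaving of s + 1 = 2 blocks through a single sub-pipeline register can be illustrated with a small cycle-level Python model. Everything here is a toy under stated assumptions: `f` and `g` stand in for the layers before and after the cut, the key path is omitted, and the loading schedule (block B enters one cycle after block A) is one plausible choice, not necessarily the one RunFein generates.

```python
def f(x):
    """Toy stand-in for the round layers before the sub-pipeline cut."""
    return (x * 5 + 1) & 0xFF

def g(x):
    """Toy stand-in for the round layers after the cut (rotate left by 3)."""
    return ((x << 3) | (x >> 5)) & 0xFF

def loop_folded(block, n_rounds):
    """Reference: one full round g(f(.)) per cycle on a single block."""
    for _ in range(n_rounds):
        block = g(f(block))
    return block

def sub_pipelined(block_a, block_b, n_rounds):
    """Two blocks share the datapath; each register update is half a round."""
    d_state = block_a                      # cycle 0: block A is loaded
    pipe, d_state = f(d_state), block_b    # cycle 1: A advances, B is loaded
    out_a = out_b = None
    for cycle in range(2, 2 * n_rounds + 2):
        # Both registers update in the same clock edge (non-blocking).
        d_state, pipe = g(pipe), f(d_state)
        if cycle == 2 * n_rounds:
            out_a = d_state
        elif cycle == 2 * n_rounds + 1:
            out_b = d_state
    return out_a, out_b, 2 * n_rounds + 1
```

In this model each block needs two cycles per round, but two blocks are in flight, so the blocks-per-cycle rate matches the loop-folded design while each cycle covers only half of the combinational path.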
Hybrid micro-architectures
Using RunFein, the user may opt for some hybrid parallel micro-architecture configurations supporting both subpipelining and unrolling. Figure 11f shows a hybrid microarchitecture with sub-pipeline (s = 1) and unrolling with pipeline by a factor (u = 2). It is a multiple block configuration, handling 4 data blocks simultaneously. Consequently, the controller needs a supplementary register s_cnt to keep track of the total iteration count (Fig. 12e) .
Bit-slicing
Through bit-slicing, RunFein tiles the parallel loop-folded architecture to work on $S_b$ bits at a time ($S_b < S_B$). Consequently, the design has lower area and lower throughput, a technique especially interesting for lightweight block ciphers. In most SPN ciphers, S-Boxes account for a significant portion of the area, e.g., more than 30 % of the PRESENT-80 loop-folded implementation area is contributed by its 17 S-Boxes [5]. Hence, $S_b$ is generally taken as $S_W$ or a multiple of it. The krounds and rounds are sliced to perform the task of one cycle in $S_B/S_b$ cycles. The controller of the bit-sliced architecture changes so that the counter increments once after s_cnt hits $S_B/S_b$. The encryption of one block therefore requires $(S_B/S_b) \times N_r$ cycles, as shown in Fig. 12f. The d_state and k_state are shift registers (with parallel load/store possible), with a shift granularity of $S_b$. Hence, the operations of each layer in a round are performed on $S_b$ bits and the result is stored in the d_state shift register. Similar to the bit-slicing of S-Boxes, operations like XOR and addition (with carry bit) can be bit-sliced. However, for some operations, e.g., P-Boxes and rotation, slicing requires large extra selection logic. Since these bit manipulation operations (when performed in parallel configurations) have no logic overhead, it is wiser not to bit-slice them.
RunFein takes the bit-slice factor ($S_b$) of a cipher and, after evaluating the validity of the design, generates a bit-sliced implementation. Figure 13 shows a bit-sliced ($S_b = 4$) PRESENT-80 implementation requiring $S_b/S_W = 1$ S-Box per round, shared between k_round and round calculations. A similar design has been presented for the smallest area footprint of PRESENT-80 in [51]. Since the key expansion is generally inexpensive in terms of resources, bit-slicing is not applied to krounds. Hence, the key is loaded in $S_K/S_b$ cycles into the k_state shift register, but a sub-key is calculated in a single cycle. For the round calculation, 4 bits are XOR-ed with one key nibble and passed through the S-Box in each cycle. As the P-Box is not bit-sliced, the round calculation requires $S_B/S_b$ cycles plus one for the P-Box calculation. Since the key expansion requires only one S-Box, the round and kround share one. Through RunFein, the bit-sliced and optimized
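The throughput cost of bit-slicing can be estimated with a short Python sketch, assuming each round takes $S_B/S_b$ cycles as stated above. The function names and the optional `extra_cycles_per_round` parameter (for operations that are executed unsliced in a separate cycle) are assumptions for illustration.

```python
def cycles_per_block(block_bits, slice_bits, n_rounds, extra_cycles_per_round=0):
    """Cycles per block when each round is processed slice_bits at a time."""
    if block_bits % slice_bits != 0:
        raise ValueError("S_B must be a multiple of S_b")
    return (block_bits // slice_bits + extra_cycles_per_round) * n_rounds

def throughput_bps(block_bits, cycles, clock_hz):
    """Throughput in bits per second for one block every `cycles` cycles."""
    return block_bits * clock_hz / cycles

# Loop-folded reference point: S_b = S_B = 64, 32 rounds, 100 kHz clock.
folded = throughput_bps(64, cycles_per_block(64, 64, 32), 100_000)
# Nibble-serial design point: S_b = 4.
sliced = throughput_bps(64, cycles_per_block(64, 4, 32), 100_000)
```

With these assumptions the loop-folded point evaluates to 200 kbps at 100 kHz, and the nibble-serial point is lower by the slicing factor $S_B/S_b = 16$.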
Micro-architecture validation checks
When the user desires the LISA-based HDL generation, the cipher configuration and the selected micro-architecture undergo the following checks.
-Table 2. For example, in OFB mode, the micro-architecture for encryption and decryption should not be sub-pipelined.
-Bit-slicing cannot be combined with any other micro-architecture to generate a hybrid configuration.
RunFein limitations
We list here some micro-architectural limitations of RunFein.
-Both software and hardware implementations generated by RunFein follow the on-the-fly key expansion methodology. Alternatively, sub-key pre-computation requires large memory for storing $S_{SK} \times N_r$ bits of data. Additionally, the delay of sub-key computation has to be incurred whenever a new key is used. RunFein does not pre-compute sub-keys; however, converting the generated code to the pre-computed keys approach requires only trivial tweaking.
-Ciphers requiring an unequal number of iterations for rounds and krounds cannot be implemented using RunFein. Though this is uncommon for most of today's ciphers, the exceptions are the AES-192/256 configurations.
-For ciphers having Mix Column as a diffusion operation, bit-slicing requires large multiplexing logic whose overhead exceeds the potential saving achieved by bit-slicing [52]. Currently, RunFein does not support a bit-sliced micro-architecture for ciphers with a Mix Column operation (e.g., AES). For ciphers with P-Boxes, a parallel execution of the P-Box operation is performed instead of a bit-sliced implementation, as discussed in the previous section (for PRESENT-80).
-Currently, RunFein does not support a unified micro-architecture performing both encryption and decryption.
Experimental results and analysis
Using RunFein we implemented the software realizations of PRESENT (80, 128), AES (128), KLEIN (64, 80, 96) and LED (64, 128). The software efficiency in terms of lines of code and execution time has already been discussed in [1] . The randomness test using NIST test suite was also successfully conducted by generating long streams of encrypted data in CBC, PCBC, OFB and CFB modes of operation.
Hardware implementation and benchmarking
We implemented various hardware micro-architectures for PRESENT-80 and AES-128. The generated high-level design description models in LISA were tested with its software tools generated by 
Micro-architectures for PRESENT-80
For lightweight block ciphers, low operating frequencies are more relevant due to their stringent power constraints; hence 100 KHz clock frequency is considered. The results at 10 MHz are also reported. At 100 KHz, our RunFein generated PRESENT-80 encryption only loop-folded implementation has a throughput of 200 Kbps and occupies 1649 GE for 65 nm CMOS technology library as indicated by the first row of Table 4 . The power and area results for the same loop-folded implementation, synthesized at 10 MHz, are indicated in the first row of Table 5 .
For comparison with the manually optimized implementations reported in the literature, we take the loop-folded PRESENT-80 encryption estimates in [51] for three different CMOS technology libraries, as indicated by the first column of Table 6. This implementation on 180 nm reportedly consumes 1650 and 1706 gates at 100 KHz and 10 MHz, respectively. Our implementation, on a comparable technology library, consumes 1750 gates at both 100 KHz and 10 MHz operating frequency, i.e., 100 and 46 gates more, respectively, than [51]. This area gap is too small to be considered an overhead and can possibly be attributed to differences in the vendor libraries, synthesis optimization settings or versions of the synthesis tool.
Bit-slicing
For bit-slicing, we generated implementations with various possible bit-slice width, i.e., S b = 4, 8, 16, 32. Consequently, the reduction in area, power and throughput is seen as a trend on 65 nm CMOS technology library and an operating frequency of 100 KHz in Table 4 and 10 MHz in Table 5 . Figures 14 and 15 graphically show the trade-off design points for area and power saving, respectively, against the loss in throughput for various S b widths.
For comparison, we take up the smallest reported area for PRESENT-80 by hand-crafted implementation, requiring 1000 gates [51] . Their implementation area footprints for S b = 4 on various technology libraries are reported in Table 6 . For the same operating frequency (and consequently the same throughput), our area estimates when synthesized on 90 nm technology library come as close as 1081 GE. The implementation results for PRESENT-80 with higher bit-slice-widths have not yet been reported. RunFein accelerates the exploration of these intermediate design points by enabling prototyping of bit-sliced architectural customizations. Some novel results are presented in Figs. 14 and 15 for resources-performance trade-off. 
Unrolling without pipelining
Using RunFein, we employ various unroll factors for the 32 rounds of PRESENT-80 encryption design. Table 7 gives the area, power and throughput estimates when the design is unrolled by various factors. A fully unrolled design achieves the highest throughput per area ratio; however, it also consumes the most area and power in comparison.
Sub-pipelining
Through sub-pipelining we generated some novel high-throughput realizations of the PRESENT-80 cipher that have not been reported to date. For a loop-folded implementation, the maximum operating frequency is profiled to be 3.7 GHz, as indicated in Table 8. We sub-pipeline it twice for achieving high-throughput performance.
-First sub-pipeline The critical path for the loop-folded implementation (Fig. 6, left) exists from the k_state register, through the three round layers and the multiplexer, to the d_state register. Since the P-Box poses no combinational delay due to rewiring, it is prudent to break the critical path by inserting a sub-pipeline between layer0 and layer1 of the cipher, shown by the single dotted line in Fig. 6. A corresponding sub-pipeline between layer1 and layer2 of the k_round is also inserted. Consequently, the sub-pipelined circuit's operating frequency increases, raising the throughput to 8.1 Gbps.
-Second sub-pipeline The critical path now exists between the sub-pipeline register and the d_state register in the round. For a further increase in the operating frequency, we break this critical path between layer1 and layer2 of the round by a second sub-pipeline (with a corresponding sub-pipeline between layer0 and layer1 of k_round), as shown by the double dotted lines in Fig. 6. The corresponding operating frequency, however, decreases. This is attributed to the supporting control hardware inserted to handle the 2 sub-pipelines. A 2-bit supplementary counter (s_counter) counting up to the number of sub-pipelines is inserted in addition to the 5-bit counter for rounds. The critical path now exists in the controller, i.e., between s_counter and counter, prohibiting further speedup by pipelining.
Micro-architectures for AES-128
For comparison of RunFein-generated realization for AES-128 with a similar architecture hand-crafted realization, we took up the RTL implementation of a loop-folded AES-128 encryption core available at Open Cores [53] . Since RunFein does not register the I/Os of the cipher implementation, we removed the registers for plaintext and ciphertext from open cores RTL for enabling equitable comparisons. Both of these RTL realizations were synthesized using the 65 nm technology library with the same versions of synthesis tools and settings at 10 and 100 MHz operating frequencies, and the area footprints obtained are comparable as shown in Table 10 . The area overhead of around 5 % for the opencores RTL is attributed to its several differences compared to RunFein design. Firstly, instead of putting a multiplexer for bypassing the GF-mul stage in the AES round, a separate layer of 128-bit XORs is inserted to get the ciphertext after the last round. Secondly, it maintains a 32-bit register to retain RCON value from a LUT, and RunFein has no register for that. The consequent sequential area overhead can be seen in Table 10 . Thirdly, it does not reuse the 32-bit XORs for calculation of keywords in layer3 till layer6 of the key rounds. Consequently, 5 XORs (32 bits each) are used for the least significant keyword, 4 XORs for the words next to it and so on. RunFein uses only 5 XORs in total for that; consequently, their area overhead for combinational logic is higher.
Unrolling without pipelining
The loop-based AES-128 implementation may be unrolled by a factor of 2, 5 or 10 for a potential increase in the throughput performance of the design. Table 9 gives the increase in area and consequently the throughput improvement when the design is unrolled and profiled for the maximum achievable frequency. Interestingly, the highest throughput/area efficiency of the design is achieved with unroll factor 2. For higher values of loop unrolling, the gain in throughput is diminished by the large number of S-Boxes and wide bus-based selection circuitry (Table 10).
Sub-pipelining
For the generated loop-folded implementation of AES-128, the maximum operating frequency is profiled to be 1.65 GHz, as indicated in Table 11. The critical path runs from the d_state register, through the 4 round layers and the multiplexer, back to the d_state register. To break this critical path, we instruct RunFein to place a sub-pipeline between layer0 and layer1 of the cipher round and a corresponding sub-pipeline between layer1 and layer2 of the k_round, as shown by the single dotted line in Fig. 6. The RTL for the pipelined architecture is profiled to operate at a frequency as high as 2.25 GHz, with a throughput of 28.8 Gbps. The critical path now lies between the d_state register and the pipeline register, i.e., in the S-Box layer. The critical path can be broken further by partitioning the S-Box tables into 2 or more levels (instead of one 256-entry S-Box, eight 32-entry S-Boxes are used) and inserting pipeline registers in between. Similarly, the Galois field inversion of the S-Box can be computed using subfields of 4 and 2 bits for a lower area footprint. The multiple layers of operations required for subfield inversion can be sub-pipelined to achieve higher performance [54].
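The quoted throughput is consistent with a simple relation between block size, clock frequency and cycles per block. The sketch below assumes that one 128-bit block completes every 10 clock cycles on average (one cycle per AES-128 round); the function name `throughput_gbps` and that cycle count are illustrative assumptions, not values read from the generated RTL.

```python
def throughput_gbps(block_bits, freq_hz, cycles_per_block):
    """Throughput in Gbps for a core that finishes one block of
    `block_bits` bits every `cycles_per_block` clock cycles."""
    return block_bits * freq_hz / cycles_per_block / 1e9


if __name__ == "__main__":
    # Sub-pipelined AES-128 at 2.25 GHz, assuming 10 cycles per block.
    print(round(throughput_gbps(128, 2.25e9, 10), 2))
```

Under this assumption, 128 bits at 2.25 GHz over 10 cycles gives 28.8 Gbps, the figure reported for the sub-pipelined design.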
Conclusion and future work
We present RunFein, an extensible framework for the rapid prototyping of block ciphers into customizable hardware and software implementations. It captures the algorithmic and structural specifications of a cipher from the user through a GUI. The algorithmic design requires the specification of layers of atomic operations for key expansion and round transformations. The hardware implementation is aided by a commercial high-level synthesis framework. The architectural specification of a loop-folded configuration of the cipher is automatically transformed by RunFein according to the micro-architecture configuration specified by the user (loop unrolling, bit-slicing, sub-pipelining).
A thorough validation of design viability precedes rapid prototyping. We implemented several notable block ciphers with different architectural specifications using RunFein and carried out equitable area-throughput-power comparisons. Our results rival the best available handwritten IP cores. Additionally, some novel optimization results for PRESENT-80 (bit-slicing) have been reported. RunFein's high-level design approach eliminates the laborious development effort for VLSI realization and verification of block ciphers. It aids the cryptographic community by enabling speedy benchmarking against critical resources such as area, throughput, power and latency, and it allows exploration of various micro-architectural design alternatives. We see RunFein as a first instance of a tool suite for high-level realization of domain-specific cryptographic functions (block ciphers). Extensions to other cryptographic functions will follow. We intend to extend this work in several directions.
- The dependence of RunFein on Synopsys Processor Designer for conversion of the LISA design into synthesizable HDL is planned to be removed. This intermediate step could easily be skipped, and the hardware generation engine of the next version of RunFein will generate the synthesizable Verilog HDL for a block cipher specification.
- A similar rapid prototyping tool for stream ciphers, called RunStream, is in the pipeline.
- Inclusion of cryptanalytic tools for block ciphers is intended.
- Automatic generation of parallel software for GPU-accelerated machines is on the roadmap.
- We plan to take up a unified hardware micro-architecture supporting both encryption and decryption of ciphers.
