Abstract-We present RC4-AccSuite, a hardware accelerator that combines the flexibility of an application specific instruction set processor with the performance of an application specific IC for the most widely deployed commercial stream cipher, RC4, and eight of its prominent variants, including Spritz (CRYPTO-2014 Rump-session). Our carefully designed instruction set architecture reuses combinational and sequential logic across its pipeline stages and memories, saving up to 41% in area compared with the individual cores, while the power budget is dictated primarily by the variant in use. Moreover, state replication yields a noticeable throughput enhancement for the RC4 variants. RC4-AccSuite can be extended to future variants of RC4 with little or no tweaking.
I. INTRODUCTION AND MOTIVATION

WITH the emergence of the pervasive computing paradigm, ensuring security for the ever increasing information exchange is becoming more and more challenging. Modern applied cryptography in communication networks requires secure kernels that also lend themselves to low-cost and high-performance realizations. The need for better performance justifies the effort to design high-performance embedded application specific ICs (ASICs) dedicated to a certain cipher. Another critically required feature of these circuits, orthogonal to the performance offered by dedicated ASICs, is flexibility. The need for flexibility stems from the dynamic nature of cryptography, i.e., newer versions of algorithms that suit newer platforms and/or counteract successful cryptanalytic attacks are frequently proposed. Flexibility could also be exploited to enable a user-specified tradeoff between security and system performance.
Since its inception 20 years back, RC4 has been the target of keen cryptanalytic efforts, some of which have been successful. Note that for cryptographic algorithms, there are two kinds of attacks. The first kind exploits the specific use of an algorithm in a protocol and therefore needs some assumptions to mount the attack. The second kind focuses on mathematical analysis of the algorithm without any application-specific assumptions. Most of the attacks on the RC4 stream cipher belong to the first kind. For example, in the wired equivalent privacy (WEP) protocol, the first 3 bytes of the secret key are used as a public initialization vector (IV) and, hence, are known to the attacker. The WEP attacks [2] make use of this assumption. However, the actual specification of the RC4 algorithm mandates no such requirement, and the entire secret key remains private. Thus, the WEP attack strategy is not applicable to the base RC4 algorithm. The same holds for the attacks on other protocols using RC4, such as WiFi protected access (WPA) and transport layer security (TLS) [3]; none of these are applicable to the actual RC4 algorithm. Among the second kind of attacks, the best known key recovery attack recovers a 16-byte key from the knowledge of the secret internal state with a complexity of more than 2^53 [4]. But there is no direct key recovery attack on the exact RC4 algorithm from the knowledge of the keystream (which is what is actually observable under the known plaintext attack model). The best known attack for RC4 state recovery from the keystream has a complexity of 2^241 [5]. Though the Internet Engineering Task Force is currently seeking a replacement of RC4 in the TLS protocol [6], it is interesting to note that the base RC4 algorithm is still cryptographically secure and can be safely used with proper precautions. In addition, there are more secure variants of RC4 in the literature, such as RC4+ [7], Spritz [8], and so on.
The usability of RC4-like cipher kernels is reiterated by the recent proposal of Spritz [8] from the authors of the original RC4. Apart from being a drop-in replacement for RC4, Spritz also offers an entire suite of cryptographic functionalities based on sponge-like construction functions. As the NIST SHA-3 competition declared a sponge-based kernel called Keccak [9] the winner after a five-year-long competition, the usability, security, and efficiency of sponge functions have already been scrutinized and appreciated by the cryptanalytic community. Thus, even if RC4 is replaced by other stream ciphers in practical protocols, RC4 and its variants, with their elegant and robust structures, are likely to remain model stream ciphers for both designers and cryptanalysts for years to come.
The fact that RC4 has an entire class of well-known variants offering higher security, better performance, and versions fit for word-oriented platforms makes the study and design of a generic core for RC4 and its variants worthwhile. RC4-AccSuite is an application specific instruction set processor (ASIP) whose instruction set architecture (ISA) is designed by identifying the common operation kernels of the members of the RC4-like stream cipher family. This accelerator can switch between RC4 variants at run time and lets the user choose the variant that matches his/her performance, security, power, and platform needs. This flexibility generally comes at the cost of lower throughput; however, the design compensates for the loss using the technique of memory replication. The resultant RC4-AccSuite boasts the flexibility of an ASIP and the performance of an ASIC.
A. RC4 Stream Cipher
RC4 was designed by Ron Rivest of M.I.T. for RSA Data Security in 1987. It is also known as ARC4 or Alleged RC4, since it remained a trade secret until its code was leaked on the Internet in 1994. The simplicity of its design and implementation attracted a lot of attention, making it one of the most widely deployed stream ciphers in industrial applications. Originally, it was considered primarily a software stream cipher. Due to the diversity and applicability of today's computing platforms, however, the boundary between software and hardware ciphers is fast fading away. RC4 is commonly used to protect Internet traffic in the secure sockets layer, TLS, WEP, and WPA protocols, as well as in several application layer software packages.
The RC4 algorithm was described in [10] (see Table I). It has an internal state comprising a 256-byte array, denoted by S[0 ... N − 1] and accessed by indices i and j. The three phases of RC4 operation (and of most other stream ciphers) are the state initialization (SI), the key scheduling algorithm (KSA), and the pseudorandom generation algorithm (PRGA). Some stream ciphers require an IV scheduling algorithm (IVSA) phase after KSA as well. The secret key k[0 ... l − 1] is expanded by repetition to a size equal to that of array S:
K[y] = k[y mod l], for 0 ≤ y ≤ N − 1.
In every iteration of KSA and PRGA, i is incremented, j is updated, and the values of S[i] and S[j] are swapped. In addition, each PRGA iteration produces 1 byte of output, which is XOR-ed with 1 byte of the message to produce 1 byte of ciphertext (or plaintext in the case of decryption).
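For illustration, the three phases described above can be sketched in software as follows. This is a minimal Python model of the textbook RC4 algorithm, not of the RC4-AccSuite datapath; the function names rc4_ksa, rc4_prga, and rc4_crypt are ours.

```python
def rc4_ksa(key: bytes, n: int = 256) -> list[int]:
    """State initialization (SI) followed by the key scheduling algorithm (KSA)."""
    s = list(range(n))                      # SI: identity permutation
    j = 0
    for i in range(n):
        # The key is expanded by repetition: K[i] = k[i mod l]
        j = (j + s[i] + key[i % len(key)]) % n
        s[i], s[j] = s[j], s[i]
    return s


def rc4_prga(s: list[int], count: int) -> bytes:
    """Generate `count` keystream bytes, updating the state in place."""
    n = len(s)
    i = j = 0
    out = bytearray()
    for _ in range(count):
        i = (i + 1) % n                     # i is incremented
        j = (j + s[i]) % n                  # j is updated
        s[i], s[j] = s[j], s[i]             # S[i] and S[j] are swapped
        out.append(s[(s[i] + s[j]) % n])    # one output byte per iteration
    return bytes(out)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XOR-ing the data with the keystream."""
    keystream = rc4_prga(rc4_ksa(key), len(data))
    return bytes(d ^ z for d, z in zip(data, keystream))
```

Since encryption and decryption are the same XOR operation, calling rc4_crypt twice with the same key returns the original message.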
B. Variants of RC4
A compact description of some of the notable variants of RC4 proposed to counteract cryptanalytic attacks follows (the list is arranged in decreasing order of similarity with RC4 rather than chronologically). The reader is referred to the respective references for a detailed description.
1) RC4+: RC4+ recommends complementary layers of computation for the KSA and PRGA phases on top of the original RC4 proposal to achieve a better security margin [7]. These layers achieve better scrambling and avoid key recovery attacks during RC4+ KSA and RC4+ PRGA, respectively. Some intermediate designs trading off security against performance, namely, PRGAα and PRGAβ, have also been undertaken as VLSI designs [1].
2) VMPC: The algorithm is named after the hard-to-invert VMPC function, used during the KSA, IVSA, and PRGA of the VMPC variant of RC4 [11]. The VMPC function for an N-variable permutation array P, transformed into Q, requires a single modulo addition and three accesses of P:
Q[x] = P[(P[P[x]] + 1) mod N], for 0 ≤ x ≤ N − 1.
3) RC4A: RC4A was introduced to remove a statistical bias in consecutive bytes of the RC4 PRGA [12]. It uses two keys to carry out KSA on two arrays, S1 and S2. Similarly, two indices, j1 and j2, are used for S1 and S2, respectively, during PRGA based on the exchange shuffle model, in line with the RC4 PRGA. The difference from the original RC4 is that here the index S1[i] + S1[j1] produces output from S2 and vice versa.
4) RC4B: A recent work exposed the vulnerability of both RC4 and RC4A to new classes of statistical biases [13]. To overcome that, a new RC4 variant known as RC4B was introduced, which differs from RC4A only in that it mixes the contents of S1 and S2 during the update of j1 and j2.
5) RC4b: A byte variant of RC4, called RC4b, was described in [14]. The author claimed to remove the known biases in RC4 by shuffling state elements twice and by explicitly discarding the first N bytes during KSA.
6) NGG(n, m): NGG(n, m) is a word variant of RC4, extensible to 32/64-bit words with S much smaller than 2^32/2^64 [15], where S = 2^n is the state size in words and m is the word size in bits (n ≤ m). The SI for NGG uses a precomputed random array, and the KSA and PRGA phases are similar to those of RC4, extended to words. NGG is named after the initials of its authors.
7) GGHN: GGHN is an improved version of NGG, also named after its designers' initials [16]. It recommends multiple iterations of the KSA phase, depending on the word size, for maintaining a high degree of randomness. For better security, a key-dependent third variable k is used in addition to i and j for the exchange shuffle model in GGHN PRGA.
8) Spritz: Spritz is the recent proposal from the author of RC4, formulated as a sponge and, consequently, capable of being used as a block cipher, stream cipher, hash function, deterministic random bit generator (DRBG), message authentication code (MAC), and authenticated encryption (AE) scheme [8]. It follows RC4-like general design principles and attempts to repair weak design decisions of RC4.
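As a small illustration of the VMPC function mentioned in item 2), the transformation of a permutation P into Q with three accesses of P and one modulo addition can be sketched as follows. The sketch assumes the commonly cited form Q[x] = P[(P[P[x]] + 1) mod N] from [11]; the function name vmpc_function is ours.

```python
def vmpc_function(p: list[int]) -> list[int]:
    """Transform permutation P into Q, assuming Q[x] = P[(P[P[x]] + 1) mod N]."""
    n = len(p)
    q = []
    for x in range(n):
        inner = p[p[x]]                 # first and second access of P
        q.append(p[(inner + 1) % n])    # modulo addition, then third access
    return q
```

Each access depends on the value returned by the previous one, which is why these three reads cannot be issued in parallel.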
C. Previous Work and Motivation
In the context of flexible cryptographic implementations, the idea of resource sharing for the exclusive execution of more than one mode or version of a cipher algorithm is not novel. The motivation for designing flexible hardware coprocessors (or weakly programmable ASICs) stems from the need for multiple cryptographic functions to ensure privacy, authenticity, and integrity. For block ciphers, after the widespread acceptance and use of the advanced encryption standard (AES), many unified configurable cores for AES with other ciphers were proposed, e.g., AES-128 with block cipher ARIA [17], AES-128/192/256 and AES-extended [18], and AES-128 and Camellia [19]. Similarly, stream ciphers ZUC and SNOW 3G were combined in an area-frugal single ASIC, since both of them were included in the long-term evolution (LTE)-advanced security portfolio [20]. This case study was extended to a unified implementation of stream ciphers HC-128 and RC4 in [21]. Unified coprocessors were also designed to include hash functions along with block ciphers, thereby providing confidentiality and authenticity simultaneously; examples include AES and Grøstl [22] and AES and Fugue [23].
More generic cryptographic processors include CryptoManiac [24], a flexible four-core very long instruction word (VLIW) processor with a 32-bit instruction set. Based on an analysis of the considered cryptographic applications, the added instructions combined logical operations with arithmetic and memory operations. It provided moderate performance enhancement over a wide variety of algorithms. Cryptonite [25] was a cryptoprocessor with a small specialized instruction set and a two-cluster architecture. Since it was not based on an existing instruction set and was designed from scratch, it was lightweight, had lower register port pressure, and had fewer routing constraints in comparison with CryptoManiac [24]. It combined up to three standard logic, arithmetic, and memory operations and outperformed CryptoManiac for many block ciphers and hash functions. Along the same lines, another proposal was CCproc [26], a simple 32-bit coprocessor with an extended reduced instruction set computing (RISC) instruction set and datapath structure. It had a five-stage pipelined datapath and a specifically designed instruction set to improve the processing of symmetric-key algorithms. It offered limited compound instructions but promised support for future cryptographic proposals due to its generality. These unified cores successfully achieve area efficiency, compared with the sum of individual cores, due to resource sharing. Moreover, the throughput penalty in most cases is small, when compared with the slower of the implemented algorithms, due to the existence of a common critical path. A more recent configurable coprocessor, addition, rotation and xor (CoARX), exploits the operational similarity between cryptographic functions to implement different block ciphers, hash functions, and stream ciphers that are based on the ARX family of ciphers [27]. Field-programmable gate array (FPGA) implementations of AES and Keccak multifunctional cores have been compared with each other in [28].
D. Original Contribution
Our RC4-AccSuite stands out in three respects, as discussed in the following.
1) Flexibility: Other than the basic proposal of RC4, RC4-AccSuite is flexible enough to execute six well-known byte variants of RC4, namely, RC4+ [7], VMPC [11], RC4A [12], RC4B [13], RC4b [14], and Spritz [8], and two of its word variants, namely, NGG [15] and GGHN [16]. RC4-AccSuite can switch to any of these RC4 variants as per the user requirements and can be extended to accommodate future RC4 variants.
2) Performance: To deliver higher performance, we systematically apply the state replication technique and integrate it in the design individually for all RC4 variants. Consequently, the performance degradation due to a flexible design is compensated.
3) Resource Minimization: Both performance and flexibility are achieved in processors at the cost of resources. For RC4-AccSuite, we identify the resources that can be reused between RC4 and all the variants, including registers, pipeline registers, combinational macros, and memory blocks, whenever possible. Consequently, the resource budget of RC4-AccSuite is much smaller than the accumulation of the individual cores of the RC4 variants.
To the best of our knowledge, this is the first endeavor to develop an ASIP for a well-known cryptographic cipher family with all three of these features together. The design and development of RC4-AccSuite is an extension of the unified core for RC4 and RC4+ [7] proposed in [1]; that proposal, however, lacked a conscious effort for resource reuse except where an entire instruction could be reused. We performed an incremental design of RC4-AccSuite, adding one RC4 variant in each step. All the intermediate design points were implemented using a hardware description language (HDL) and synthesized with a 65-nm CMOS technology. The area savings and the power budget for each of these cores have been documented. The rest of this paper is organized as follows.
Section II explains the high-level architecture and interfaces of our RC4-AccSuite. We review the performance enhancement techniques applied so far to RC4 implementations and specify the memory replication technique with a case study for RC4+ and Spritz in Section III. The resource economization is explained in detail in Section IV, with the merger of the RC4 and RC4+ cores as an example. Section V explains the area, power, and throughput results of various versions of RC4-AccSuite along with a comparison with existing work. Section VI concludes this paper and provides a future roadmap.
II. HIGH-LEVEL ARCHITECTURE OF RC4-AccSuite
Fig. 1 presents a high-level architectural diagram of the RC4-AccSuite system. The processor core performing one or more variants of RC4 is referred to as RC4-AccSuite. A program memory keeps the instructions, while a program counter (PC) serves as the memory address. The internal pipeline architecture of the processor core, its input/output interface granularities, and the external memory bank configuration depend on the RC4 variant or variants it supports. The IOs of the core are discussed in the following.
1) Instruction is an input to the core and is log2(n) bits wide (rounded up to an integer) for the n instructions specified for the RC4 variant. If the core supports two variants with n1 and n2 distinct instructions, then n is taken as n = n1 + n2.
2) Keystream is the output of the core, and its width is taken to be the maximum keystream word size of the supported variants. Hence, for RC4/RC4+, it will be 8 bit, and for RC4/GGHN, it will be 32 bit.
Due to the large sizes of the internal states (or S-Boxes), a preferred storage medium for the RC4 variants is the vendor-supplied SRAMs, which are optimized for throughput. The memory bank may include SRAMs for the key (K memory, 32 words), the IV (IV memory, 32 words), and the internal states (S0-S3 memories, 256 words), each being 8 bit wide. An SRAM is included in the memory bank only if at least one of the supported variants requires it, as given in Table II. K and IV need an external interface, so that a new key and IV may be supplied from the host processor before KSA and IVSA are initiated. RC4A keeps two internal S arrays and, therefore, requires both S0 and S1. For the two word variants, only NGG(8, 32) and GGHN(8, 32) are currently supported; hence, the S array has to have 256 words of 32 bit each. We instead use S0-S3, in order to reuse the same memories for the byte variants as well. For the rest of the discussion, NGG(8, 32) and GGHN(8, 32) are referred to as NGG and GGHN, respectively.
TABLE II BYTE-WIDE MEMORY REQUIREMENTS (Instances × Depth) FOR RC4 VARIANTS
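The two sizing rules above, the instruction width for a combined core and the mapping of one 32-bit state word onto the four byte-wide memories S0-S3, can be sketched as follows. The helper names are ours, and the byte order (S0 holding the least significant byte) is an assumption made only for this illustration; the instruction counts of six and seven are those quoted in Section IV for RC4 and the RC4+ extension.

```python
def instruction_width(*instruction_counts: int) -> int:
    """Bits needed to encode n = n1 + n2 + ... distinct instructions."""
    total = sum(instruction_counts)
    # (total - 1).bit_length() equals ceil(log2(total)) for total >= 2
    return max(1, (total - 1).bit_length())


def split_word(word: int) -> tuple:
    """Split a 32-bit word into four bytes for S0..S3 (S0 = LSB, assumed)."""
    return tuple((word >> (8 * lane)) & 0xFF for lane in range(4))


def join_word(lanes) -> int:
    """Reassemble a 32-bit word from the four byte lanes S0..S3."""
    return sum(byte << (8 * lane) for lane, byte in enumerate(lanes))


rc4_only_bits = instruction_width(6)        # RC4 alone
rc4_rc4plus_bits = instruction_width(6, 7)  # RC4 plus the RC4+ extension
```

With this layout, a word access for NGG or GGHN touches the same address in all four memories, while a byte variant can address each memory independently.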
III. PERFORMANCE ENHANCEMENT FOR RC4-AccSuite
The performance of a cryptoprocessor is usually benchmarked by the initialization time (KSA and IVSA, if applicable) and the PRGA throughput in bytes/words/bits per second. Instead, we evaluate performance in cycles per keystream word/byte, a more generic and technology-independent metric. The rationale is that mapping a processor design onto different CMOS technology libraries yields different critical paths. Thus, the maximum operating frequency and the throughput differ from one design to the next. Similarly, if the SRAM access time dictates the critical path of the processor, the performance will depend on the memory modules. Nevertheless, in Section V, various design points of RC4-AccSuite are benchmarked on a CMOS technology library, and the throughput results for typical parameters are discussed as well.
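The relation between the technology-independent metric and an absolute throughput figure is a simple scaling by the clock frequency. The sketch below illustrates the conversion; the function name and the 1 GHz clock in the example are hypothetical and do not correspond to any reported result.

```python
def throughput_bps(clock_hz: float, cycles_per_word: float, word_bits: int = 8) -> float:
    """Keystream throughput in bits per second for a given clock frequency."""
    return clock_hz * word_bits / cycles_per_word


# Hypothetical example: a byte variant at 2 cycles/byte on an assumed 1 GHz clock
example_bps = throughput_bps(1e9, 2)
# A word variant (32-bit keystream word) at the same cycle count, same assumed clock
example_word_bps = throughput_bps(1e9, 2, word_bits=32)
```

The same cycles-per-word figure therefore translates into different absolute throughputs on different technology libraries, which is why the cycle count is used as the primary metric.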
A. Performance Enhancement for RC4 Variants
This section discusses the performance enhancement techniques undertaken in the literature for individual implementations of RC4 and its variants. We then extend the discussion to the more general implementation undertaken in RC4-AccSuite. The previous work in this regard and the throughput enhancement limit, wherever applicable, are also described. In one of the earliest RC4 implementations, Kitsos et al. [29] used three single-ported SRAMs for S[i], S[j], and S[t] accesses. Due to data dependencies, the three reads and two writes required per PRGA byte are distributed over three cycles, resulting in an encryption speed of 3 cycles/byte. A similar idea was published by Matthews [30]. We list the improvement techniques taken up for RC4 implementation; they are applicable to most of its variants as well.
1) Using Multiported SRAMs: The number of read/write ports of an SRAM restricts multiple simultaneous accesses and hence the algorithm throughput. The idea of using a multiported SRAM was first proposed by Matthews [30], who hinted at an RC4 PRGA throughput of 5/3/1 cycles/byte using a single, dual, or five-ported SRAM, respectively, provided that all the data dependencies are taken care of. Along these lines, throughput was improved to 2 cycles/byte using a triported SRAM [31]. For RC4-AccSuite, we choose dual-ported SRAMs, since they are the most common SRAM configuration, with many vendors offering optimized, low-priced designs.
2) Loop Unrolling: Unrolling the RC4 PRGA loop boosts throughput, but an additional penalty is paid to counter possible read after write (RAW) hazards when the ith and the (i + 1)th loop iterations write to and read from the same S memory location, respectively. Gupta et al. [32] unrolled the PRGA loop twice and reported a throughput of up to 0.5 cycles/byte using register-based storage in a pipelined RC4 circuit. For RC4-AccSuite, loop unrolling was not employed, since the additional hazard-avoiding checks arising from simultaneous loop processing are algorithm-dependent and cannot be reused, resulting in an area-inefficient design.
3) State Replication: The use of multiple copies of the S array with multiple SRAM instances increases the number of simultaneous access ports and hence the throughput, in line with Matthews's proposal [30]. Using two dual-ported SRAMs for RC4 PRGA, a throughput of 2 cycles/byte was reported [1]. This idea was viably extended to other stream ciphers, such as HC-128 [33]. RC4-AccSuite supports word variants of RC4 requiring four internal byte-wide memories; consequently, state replication is justifiably applied to the byte variants of RC4 whenever possible.
4) State Splitting: Splitting the large state array into smaller parts with known address distribution and keeping each smaller part in a separate memory enables more parallel accesses and in turn faster keystream generation. A 2× and 4× throughput enhancement by two-way and four-way memory splitting is reported for HC-128 [34]. Unlike HC-128, which allows multiple parallel independent accesses in a PRGA, RC4 memory accesses must be performed sequentially due to possible RAW hazards. To avoid these hazards, algorithm-dependent extra checks for pipeline stalling would be required. To avoid the consequent area-inefficient design, memory splitting was not exploited in RC4-AccSuite.
B. State Replication in RC4-AccSuite
State replication using a dual-ported SRAM enables two additional state accesses and consequently lowers the clock cycles per PRGA, increasing throughput. For any RC4 byte variant, memory replication to increase simultaneous state memory accesses is carried out provided that no data incoherence arises as a result. For an algorithm, if a state memory S is replicated m× to achieve parallelism, the implementation is dubbed ALGO_S_m, e.g., RC4_S_0 and RC4_S_1 have been discussed in [1] and [30] with a throughput of 3 and 2 cycles/byte, respectively.
A limitation in this context is noteworthy. Given l SRAMs, each being n-ported, a critical question is whether all the l×n×k access ports can be utilized in k cycles. The architecture of current SRAMs prohibits this maximum usage, as one or more clock cycles are required for turnaround to change the SRAM access between read and write. As a result, for an SRAM requiring a single turnaround cycle, one clock cycle is required between a write immediately followed by a read (or vice versa) on the same access port. Regarding efficient use of memory ports in the PRGA of any RC4 variant, the following guidelines can help minimize the turnaround performance penalty. Memory replication may considerably boost throughput at the cost of additional area and power. To economize power, memory replication should be skipped, even if memory modules are available on a platform. Memory replication also costs additional writes to keep all state copies updated with correct data, e.g., RC4_S_0 and RC4_S_1 require one and two writes per PRGA, respectively. Memory replication should be applied incrementally to an algorithm, followed by a systematic cycle-by-cycle design reevaluation exploiting the additional parallelism. Various interesting design points may arise, with a performance-area-power tradeoff.
The remainder of this section walks through case studies for Spritz and RC4+ in which throughput is increased using state replication of the SRAM.
1) Spritz: Table III (top) shows the Spritz PRGA, which requires six state read accesses and two writes per PRGA byte generated. Fig. 2 shows the memory accesses when the Spritz PRGA steps are mapped on a dual-ported SRAM (no replication). Due to data dependencies, no pipeline stage entertains more than one memory read, although two simultaneous requests are possible.
There should be five (or more) nop instructions between two consecutive instructions, so that cycle 1 of the next instruction overlaps with cycle 7 of the current instruction, resulting in a throughput of 6 cycles/byte. Further overlap of cycles for consecutive instructions is not possible because of the structural hazard caused by the limited number of access ports of the memory. Fig. 3 shows the memory accesses for Spritz_S0_1. This replication requires two additional memory writes, as indicated in the DP3 and DP4 pipeline stages. The last three reads for the output calculation of Spritz PRGA are directed to S1. This enables the next consecutive instruction to execute after three nops. The throughput is improved to 4 cycles/byte.
By carefully placing the accesses on the memory ports, we ensure that the overlap of consecutive instructions causes no data incoherence or resource hazard. Fig. 4 shows one PRGA byte generated after every four cycles. Further parallelization through memory replication is not considered, since the data dependencies in the algorithm do not allow it.
TABLE IV RC4+_S0_0, THROUGHPUT = 5 cycles/byte
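The access counts quoted for Spritz PRGA (six state reads and two writes per byte) can be checked with a small software model. The sketch below follows our reading of the Update and Output steps of Spritz in [8]; the class CountingState and the function spritz_step are illustrative names, and the model counts accesses only, without modeling ports, pipeline stages, or replication.

```python
class CountingState:
    """State array that counts read and write accesses."""

    def __init__(self, values):
        self.values = list(values)
        self.reads = 0
        self.writes = 0

    def read(self, addr: int) -> int:
        self.reads += 1
        return self.values[addr % len(self.values)]

    def write(self, addr: int, value: int) -> None:
        self.writes += 1
        self.values[addr % len(self.values)] = value


def spritz_step(s: CountingState, regs: tuple) -> tuple:
    """One Update + Output step; regs = (i, j, k, z, w)."""
    i, j, k, z, w = regs
    n = len(s.values)
    i = (i + w) % n
    si = s.read(i)                      # read 1
    j = (k + s.read(j + si)) % n        # read 2
    sj = s.read(j)                      # read 3
    k = (i + k + sj) % n
    s.write(i, sj)                      # swap S[i] and S[j] using the
    s.write(j, si)                      # values already read (2 writes)
    z = s.read(j + s.read(i + s.read(z + k)))   # reads 4, 5, 6 (output)
    return (i, j, k, z, w)


state = CountingState(range(256))
registers = spritz_step(state, (0, 0, 0, 0, 1))
```

Each of the six reads uses an address that depends on a previously read value, which is consistent with the observation that no pipeline stage issues more than one read without replication.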
2) RC4+: For RC4+ PRGA, given in Table III (bottom), the relevant simplistic RC4+_S0_0 memory access mapping is given in Table IV. Out of the four extra reads here compared with RC4 PRGA, two are initiated in cycle 4 and two in cycle 5. Due to bus turnaround, it is not possible to initiate a read on port 1 in cycle 3; hence, in cycle 4, only one access is possible. For simplicity, we dub … , respectively. There should be four (or more) nop instructions between two consecutive RC4+ PRGA instructions, so that no structural hazard is caused. A throughput of 5 cycles/byte results.
We try out higher replication versions of RC4+ for throughput enhancement. RC4+_S0_1 and RC4+_S0_2 improve the throughput to 4 and 3 cycles/byte, respectively. Table V (top) shows that for RC4+_S0_2, all six available simultaneous ports are used only during the third cycle. There is still room for improvement: an overlap of cycle 2 and cycle 4 of consecutive RC4+ PRGA instructions would be possible, provided we had enough access ports. With a total of three memories, this is not possible, since the number of operations exceeds the number of access ports. RC4+_S0_3 further improves the throughput to 2 cycles/byte, as shown in Table V (bottom). By carefully placing the accesses on the ports, we ensure that no more than one nop is required between two consecutive RC4+ PRGA instructions. Overlapping of the second and the fourth cycle also does not cause any data incoherence, since a read priority is set for all SRAMs. Further parallelization through memory replication is not considered, as RC4+_S0_3 uses all four available S memories, along with the key and IV memories.
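A rough sanity check on these throughput figures is the port-count bound implied by the l×n×k argument: k cycles on l dual-ported memories offer at most 2×l×k accesses. The sketch below computes the resulting lower bound on cycles per byte, assuming (our simplification) that every write must be repeated on every state copy while each read is served by a single copy. It ignores data dependencies and bus turnaround, so the reported figures can only be equal to or higher than the bound. The function name min_cycles is ours.

```python
def min_cycles(reads: int, writes: int, copies: int, ports_per_sram: int = 2) -> int:
    """Lower bound on cycles per PRGA step from port availability alone."""
    accesses = reads + writes * copies          # writes go to every copy (assumed)
    ports_per_cycle = copies * ports_per_sram
    return -(-accesses // ports_per_cycle)      # ceiling division


# RC4+ PRGA: three reads as in RC4 plus four extra reads, and two writes
rc4plus_bounds = [min_cycles(7, 2, copies) for copies in (1, 2, 3, 4)]

# Spritz PRGA: six reads and two writes
spritz_bounds = [min_cycles(6, 2, copies) for copies in (1, 2)]
```

Under these assumptions, the bound coincides with the reported 5, 3, and 2 cycles/byte for RC4+ with one, three, and four memories, while the remaining configurations (RC4+ with two memories, and both Spritz mappings) stay above the bound because of data dependencies and turnaround.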
3) State Replication in RC4 Variants:
For mapping any of the remaining RC4 variants on RC4-AccSuite, the strategy followed is the same as described for RC4+. First, with a replication factor of 0, we try to place all accesses in chronological order, unless disturbing the order helps in utilizing an available port, carefully checking for data incoherence. Next, we remap the accesses with a replication factor of 1, with doubled writes, and keep increasing the replication factor as long as a throughput boost is achievable or until all four memories are utilized. Table VI describes the effect of these efforts on performance. For VMPC, the parallelization cannot be exploited, since each read in the VMPC function depends on the value returned by the previous read. RC4A and RC4B require two S memories; replicating both once occupies the four available memories. RC4A generates 2 bytes simultaneously after each PRGA, and parallelization by memory replication further improves the throughput to 1 byte per clock cycle. For the word variants, the parallelization is not possible, since these algorithms use all four S memories. For both word variants, the throughput is specified in terms of cycles per word (4 bytes), and the read/write accesses are 32 bit wide too (all four memories are accessed simultaneously).
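To make the two-bytes-per-PRGA behavior of RC4A concrete, a minimal software sketch is given below. It follows our reading of the RC4A description in [12], with two independently keyed arrays and cross-wise output; the function names are ours, and the sketch does not model the memory replication or the pipeline of RC4-AccSuite.

```python
def ksa(key: bytes, n: int = 256) -> list[int]:
    """RC4-style key scheduling, applied once per array."""
    s = list(range(n))
    j = 0
    for i in range(n):
        j = (j + s[i] + key[i % len(key)]) % n
        s[i], s[j] = s[j], s[i]
    return s


def rc4a_prga(s1: list[int], s2: list[int], iterations: int) -> bytes:
    """Each iteration yields two keystream bytes, one from each array."""
    n = len(s1)
    i = j1 = j2 = 0
    out = bytearray()
    for _ in range(iterations):
        i = (i + 1) % n
        j1 = (j1 + s1[i]) % n
        s1[i], s1[j1] = s1[j1], s1[i]
        out.append(s2[(s1[i] + s1[j1]) % n])    # index from S1, output from S2
        j2 = (j2 + s2[i]) % n
        s2[i], s2[j2] = s2[j2], s2[i]
        out.append(s1[(s2[i] + s2[j2]) % n])    # index from S2, output from S1
    return bytes(out)


# Hypothetical keys, for illustration only
keystream = rc4a_prga(ksa(b"first key"), ksa(b"second key"), 8)
```

The two halves of an iteration update different arrays, which is what allows a hardware mapping to work on S1 and S2 in parallel.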
Out of the various memory replication versions of an algorithm, as given in Table VI, we map the fastest one on
IV. RESOURCE ECONOMIZATION IN RC4-AccSuite
This section discusses the potential for resource sharing when two or more RC4 variants are combined in RC4-AccSuite. We undertake the case study of an incremental build, starting with the design for RC4 and then adding the functionality for RC4+ on top.
A. RC4-AccSuite Architecture (for RC4)
RC4-AccSuite has a pipelined architecture. It is equipped with a set of 8-bit arithmetic logic units (ALUs) with data registers, pipeline registers, and a memory bank. The memory bank comprises S0 and S1 for mapping RC4_S0_1, in addition to K and the program memory. The processor has a six-stage pipeline, of which the last four stages (DP1-DP4) form the datapath that completes the RC4 PRGA. Accommodating other variants alongside RC4 requires more datapath stages.
The ISA comprises six instructions for RC4, given in Table VII. The nop instruction serves to relieve structural hazards caused by the limited ports in the processor when multiple KSA and PRGA instructions are issued consecutively. The set_regs0 and set_regs1 instructions set the initial value of the counter register to 0 and 1, respectively. The former is required before the start of the SI and KSA phases, while the latter is required before PRGA. Fig. 5 shows the opcodes for these instructions as the selection inputs of multiplexers (shown in bold font). … value of counter (assigned to a pipeline register i) and its one-incremented value in two consecutive locations of memory using both ports.
B. Case Study: RC4 + in RC4-AccSuite
To accommodate RC4+ in addition to RC4 in RC4-AccSuite, additional resources are added only if the existing logic cannot be reused. In terms of additional memories, RC4+_S0_3 requires the S2 and S3 memories as well as an IV memory. The ISA requires seven additional instructions (in addition to the RC4 instructions in Table VII), given in Table VIII. For the initialization phase, only S0 and S1 are used. The replication boosts throughput only during PRGA+; hence, for better energy utilization, the actual replication is delayed until the last layer of RC4+ KSA. The first layer of RC4+ KSA is the same as RC4 KSA, and hence, no new instruction is added. The second layer of RC4+ KSA requires two iterations, KSA_2a and KSA_2b, while the third layer is implemented by the instruction KSA_3. Fig. 6 shows the pipeline for RC4-AccSuite capable of executing both RC4 and RC4+. Here, all the resources shown in Fig. 5 that are completely reused for RC4+ are shown in gray, while the additional resources to accommodate RC4+ are shown in color, following the same convention as in Fig. 5. The sizes of most of the multiplexers have been increased to accommodate the new set of instructions. As can be seen, the FE stage is completely reused. The DI stage, however, is reused only partially, since additional logic for the various KSA instructions of RC4+ is added. KSA_2a and KSA_2b require counter initialization with 127 and 128, provided by the additional instructions set_regs2 and set_regs3, respectively. A decrementing counter required for KSA_2a and KSA_3 is also supported.
TABLE VIII INSTRUCTION SET EXTENDED FOR RC4+
The next four datapath pipeline stages for RC4+ PRGA can be tallied with the memory accesses, as given in Table V (bottom). The resources of the DP1 stage are almost completely reused for RC4+, except for the additional read request from the IV memory that was not previously required for RC4. The second datapath stage, DP2, requires memory replication into S2 and S3 using port 1 for KSA_3 and PRGA+, as shown in Fig. 6. Moreover, for KSA_2a and KSA_2b, the j register update is carried out with one additional 8-bit adder and an XOR. For PRGA+, the t′ calculation requires two intermediate reads, t1 and t2, whose addresses are derived from register j and pipeline register i.
In DP3, the calculation of t is reused as for RC4; however, additional logic is required for calculating t′ and t″. The respective simultaneous reads are initiated using port 0 of S1, S2, and S3, thanks to the high replication factor. In the DP4 datapath stage, S[t], S[t′], and S[t″] are read from S1_P0, S2_P0, and S3_P0, respectively, and the keystream byte for RC4+ is calculated after an 8-bit addition and an XOR.
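The three reads S[t], S[t′], and S[t″] and the final addition and XOR can be traced in a software model of PRGA+. The following is a minimal sketch under stated assumptions: the index formulas for t′ and t″ (shift amounts 3 and 5, constant 0xAA) are taken from the published RC4+ specification [7] as commonly stated, not from this section, and the function name rc4plus_prga is hypothetical. The state is passed in as any 256-entry permutation, since the three-layer KSA is not modelled here.

```python
def rc4plus_prga(state, n):
    """Generate n keystream bytes from a 256-entry permutation (illustrative only)."""
    S = list(state)  # work on a copy; a single list stands in for S0-S3 replicas
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        # t is computed exactly as in RC4.
        t = (S[i] + S[j]) & 0xFF
        # Two intermediate read addresses derived from i and j (assumed formulas).
        t1 = ((i >> 3) ^ (j << 5)) & 0xFF
        t2 = ((i << 5) ^ (j >> 3)) & 0xFF
        t_prime = ((S[t1] + S[t2]) & 0xFF) ^ 0xAA
        t_dprime = (j + S[j]) & 0xFF
        # Three reads, one 8-bit addition, one XOR.
        out.append(((S[t] + S[t_prime]) & 0xFF) ^ S[t_dprime])
    return bytes(out)


identity = list(range(256))
sample = rc4plus_prga(identity, 16)
```

In hardware, the three final reads are issued in parallel to the replicated memories; the sequential list accesses above only reproduce the arithmetic.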
C. Instruction Datapath Reuse
Accommodating other RC4 variants within the existing pipeline structure of RC4-AccSuite required new instructions and additional pipeline stages. The VMPC PRGA instruction required additional pipeline stages to accommodate multiple interdependent memory accesses. Similarly, for Spritz, the four DP pipeline stages do not suffice; consequently, it has seven DP pipeline stages, as shown in Fig. 4.
1) Entire Instruction Datapath Reuse:
Most of the RC4 variants reuse instructions that are part of the RC4 instruction set. Two such instructions are nop and set_regs0, which are required by all the RC4 variants, and hence, their entire DP pipeline stages are reused. Similarly, both the set_regs0 and set_regs1 instructions are entirely reused by all RC4 variants except VMPC.
2) Partial Instruction Datapath Reuse: Whenever possible, pipeline datapath reuse is maximized, both within the instructions of one algorithm and across different algorithms, even if reuse is possible for only a few pipeline stages. The KSA and PRGA instructions reuse the 8-bit adder in DP4, as shown in Fig. 5. Similarly, the PRGA and PRGA+ instructions share the calculation and reading of S[t] in the four DP pipeline stages, as seen in Fig. 6. As RC4A is a parallelized version of RC4, the logic for RC4 KSA and PRGA in all the pipeline stages is reused (except for the j register update).
D. Registers/Memories Reuse
For area economization, an aggressive reuse of registers, pipeline registers, and memory modules is carefully designed when undertaking multiple RC4 variants. Since a pipeline register has more overhead than a plain register, its use should be carefully justified. A processor with (n + 1) pipeline stages may have each pipeline register replicated up to n times. Table IX shows the reuse of registers, pipeline registers, and memory modules in RC4-AccSuite. Please note that memory use is given as per the highest memory replication factor for each algorithm (refer to Table VI).
1) Registers Reuse: RC4 requires only three registers for its execution: a PC, an incrementing/decrementing counter for keeping track of loop iterations, and an index j. It is noteworthy that these registers are used by all the RC4 variants. In addition, integrating RC4+, VMPC, and NGG into RC4-AccSuite requires no additional [...]
Hence, when RC4+ is accommodated in RC4-AccSuite along with RC4, all three memories, i.e., K, S0, and S1, are reused, and three new memories, i.e., IV, S2, and S3, are required. Similarly, Spritz has a 100% reuse of the three memories used by RC4.
V. IMPLEMENTATION AND BENCHMARKING
For experimentation and modeling of the configurable RC4-AccSuite, an incremental build was followed to accommodate one additional RC4 variant at every step. For each design point, a pipelined processor design was developed and optimized, maximizing memory replication and resource reuse. The integrated configurable core of RC4-AccSuite executing RC4 and RC4+ is termed RC4C-1. Following similar nomenclature, the version after integration of VMPC is called RC4C-2, and so on. Hence, RC4C-7 is the most flexible version of RC4-AccSuite, capable of being configured to execute any of the variants of RC4. Consequently, a series of interesting design points were encountered, which were benchmarked for resource economization, power utilization, and performance. It is worth mentioning that the user may choose any number or any combination of RC4 variants for RC4-AccSuite as per their requirements; RC4C-1 to RC4C-7 show only a trend of how the area is economized and the power budget rises.
All the designs were modeled using a high-level processor description language, the language for instruction set architecture (LISA) [35] (Synopsys PD version 2012.06-SP2). Synthesis was carried out with Synopsys Design Compiler, version 2009.06-SP4, using the Faraday standard cell libraries in topographical mode, with the UMC SP/RVT Low-K 65-nm CMOS process as the technology node. For synthesizing the memory macros, the Faraday memory compiler at the 65-nm technology node was used; the best case for memory modules with a column multiplexer width of 4 was recorded. We estimated the power consumption with Synopsys Power Compiler, based on register transfer level switching activity annotation.
A. Throughput
The operating frequency of any version of RC4-AccSuite is determined by the access time of the largest memory in the memory bank, since the memory modules have a longer critical path than the core. For our design, the 256-word memories S0-S3 have an access time of 0.7644 ns, indicating a maximum operating frequency of 1.3 GHz. The operating frequency of the individual cores of the RC4 variants and of any combined version of RC4-AccSuite is the same.
The SI, KSA, and IVSA phases are together named keystream initialization. Due to the use of dual-ported SRAMs, initializing 256 bytes/words requires no more than 128 cycles for SI. During KSA and IVSA, intermediate nops are put in the assembly instructions. For RC4, two nops are inserted between all consecutive 256 KSA instructions (see Table X). RC4+ has three layers of KSA, each having 256 instructions. For VMPC, both KSA and an optional IVSA take 768 cycles with seven nops in between consecutive instructions. RC4b has equal randomization cycles for KSA and IVSA, while GGHN requires 20 cycles for scrambling. For Spritz, a 32-byte key requires 64 absorb_nibble instructions (with one nop after every instruction). When shuffle is required, Whip and Crush of the state are called 3 and 2 times, respectively, requiring an additional 5.3 μs as initialization time.
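The relation between memory access time, clock frequency, and initialization latency can be checked with a few lines of arithmetic. The following is a back-of-the-envelope sketch for RC4 only, assuming one instruction issue per cycle and ignoring pipeline fill and drain; it is not meant to reproduce the exact entries of Table X, and the helper names are hypothetical.

```python
ACCESS_TIME_NS = 0.7644  # access time of the 256-word memories, from the text


def max_frequency_ghz(access_time_ns: float) -> float:
    """Clock frequency when the memory access time sets the critical path."""
    return 1.0 / access_time_ns


def cycles_with_nops(instructions: int, nops_between: int) -> int:
    """Cycles for a run of instructions with nops between consecutive ones."""
    return instructions + nops_between * (instructions - 1)


freq_ghz = max_frequency_ghz(ACCESS_TIME_NS)   # about 1.308 GHz
si_cycles = 256 // 2                           # two writes per cycle (dual port)
ksa_cycles = cycles_with_nops(256, 2)          # 256 KSA instructions, two nops between
init_cycles = si_cycles + ksa_cycles
init_time_us = init_cycles * ACCESS_TIME_NS / 1000.0

print(f"f_max = {freq_ghz:.3f} GHz, RC4 init = {init_cycles} cycles, "
      f"{init_time_us:.3f} us")
```

The same helper can be applied to the other variants once their instruction and nop counts are taken from Table X.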
An interesting observation is that the throughput of RC4 and RC4+ (after memory replication) is the same, in spite of the added security margin for the latter. The parallelization for RC4A and RC4B doubles the throughput compared with RC4. The word variants have the highest throughput performance, since they generate a 32-bit word per PRGA instruction.
B. Area
The core area estimates of sequential as well as combinational logic, along with the memory area for different incremental versions of RC4-AccSuite, are specified in Table XI in terms of equivalent NAND gates. The first half of Table XI specifies the area of the individual pipelined cores of RC4-AccSuite, while the second half refers to the configurable combination versions. Dual-ported, byte-wide SRAMs were considered, having 32 words for the K and IV memories (4.609 kilo-gate equivalent (KGE) each) and 256 words for S0-S3 (7.889 KGE each). The total area is clearly dominated by the memory area.
The extent of resource sharing and the consequent core area economization can be visualized in Fig. 7. The area of each configurable RC4C-x core is compared against the sum of the areas of the single-algorithm cores that this version is able to support; e.g., the areas of the RC4 and RC4+ cores are added up and compared with the RC4C-1 core area, which is found to be 12.3% smaller due to aggressive resource reuse. This area economization margin increases as more flexible versions of RC4C-x are analyzed, from left to right in Fig. 7. Hence, for RC4C-7, this margin grows to reach 41.12%, justifying the need and rationale of developing configurable cores. A noteworthy point is that the area economization calculation for the RC4-AccSuite core excludes the memory bank contribution; considering that contribution, the area saving for RC4C-7 reaches up to 79%. Since multiple arbiters time-share the large memory banks that are extensively used in modern heterogeneous systems, for resource budgeting, it is fair to consider only the core of a cryptoprocessor. Moreover, for most FPGAs and coarse-grained hardware platforms, such as coarse-grained reconfigurable architectures (CGRAs), memories of the desired size can be configured using block RAM modules that are available as macros.
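The area bookkeeping described above can be expressed compactly. The following is a minimal sketch: the per-memory areas are the figures quoted in the text, and the memory sets (K, S0, S1 for RC4; K, IV, S0-S3 for RC4+) follow Section IV-D, whereas the core areas passed to area_saving_percent are hypothetical placeholders and are not taken from Table XI.

```python
SMALL_MEM_KGE = 4.609  # 32-word K or IV memory, from the text
STATE_MEM_KGE = 7.889  # 256-word S0-S3 memory, from the text


def memory_area_kge(n_small: int, n_state: int) -> float:
    """Memory bank area for a given number of 32-word and 256-word SRAMs."""
    return n_small * SMALL_MEM_KGE + n_state * STATE_MEM_KGE


def area_saving_percent(combined_core: float, individual_cores: list) -> float:
    """Saving of one configurable core relative to the sum of dedicated cores."""
    total = sum(individual_cores)
    return 100.0 * (total - combined_core) / total


rc4_memory = memory_area_kge(n_small=1, n_state=2)      # K, S0, S1
rc4plus_memory = memory_area_kge(n_small=2, n_state=4)  # K, IV, S0-S3

# Hypothetical core areas (in KGE), chosen only to exercise the formula.
example_saving = area_saving_percent(7.0, [3.0, 5.0])

print(f"RC4 memory: {rc4_memory:.3f} KGE, RC4+ memory: {rc4plus_memory:.3f} KGE")
print(f"example core saving: {example_saving:.1f}%")
```

Substituting the core areas of Table XI for the placeholders yields the margins plotted in Fig. 7.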
C. Power
The power consumption of a design is a function of its complexity/flexibility and the clock frequency. Table XII specifies the increasing core power consumption of the same algorithm, i.e., RC4, when run on various versions of RC4-AccSuite at 1.3 GHz. The user may keep the power budget minimal by choosing a version with no more flexibility than required. The Faraday memory compiler reports the dynamic power of the 32- and 256-byte memories to be 4.94 and 5.53 pJ/access, respectively. Since these memory power budgets do not depend on the design of the core, these values are not included in the total core power calculation in Table XII. A similar trend is seen in Fig. 8, which shows the core dynamic power of RC4-AccSuite versions for different variants. All the power estimations include the initialization of the stream ciphers and the generation of 1024 bits of keystream. The lower power budget utilized by VMPC is due to its least unshared resources, i.e., seven-stage datapath pipeline while none of the rest of the variants need more than 4.
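Although the memory power is excluded from Table XII, it can be estimated from the per-access energies quoted above. The following is a minimal sketch, assuming a hypothetical access rate of one access per cycle per memory; the actual access rate depends on the variant and on the nops in its instruction stream. The function name is hypothetical.

```python
def memory_dynamic_power_mw(energy_pj_per_access: float,
                            accesses_per_cycle: float,
                            freq_ghz: float) -> float:
    """Dynamic power in mW; pJ per access times 1e9 accesses/s equals mW."""
    return energy_pj_per_access * accesses_per_cycle * freq_ghz


# Energies from the text; one access per cycle at 1.3 GHz is an assumption.
state_mem_mw = memory_dynamic_power_mw(5.53, 1.0, 1.3)  # 256-byte memory
key_mem_mw = memory_dynamic_power_mw(4.94, 1.0, 1.3)    # 32-byte memory

print(f"256-byte memory: {state_mem_mw:.3f} mW, 32-byte memory: {key_mem_mw:.3f} mW")
```

Scaling accesses_per_cycle by the fraction of non-nop cycles gives a closer estimate for a particular variant.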
D. Comparison With Hardware Performance
We compare our implementations with the best known SRAM-based hardware implementations of RC4 variants in terms of area efficiency [throughput per area (TPA)], as reported in Table XIII. Though FPGA and ASIC implementations cannot be fairly compared, the fastest FPGA implementations are mentioned in Table XIII for reference. A configurable core supporting both RC4 and a less computationally intensive version of RC4+, i.e., PRGAα, is reported in [1]. It is the fastest CMOS implementation of RC4 and also reports a throughput of 2 cycles/byte. However, due to aggressive resource sharing and memory replication, RC4C-1 justifiably outperforms it in area efficiency. Their storage class memory (SCM)-based RC4 implementation results in an encryption speed of 3.24 Gb/s with 22.09 KGE of area, resulting in a TPA of 0.30. The fastest CMOS-based RC4 implementation on a comparable technology using SCMs for S-boxes reports 17.76 Gb/s with an area of 50.58 KGE [32]. This design has a better TPA (0.35) than our implementation; however, it does not set a good framework for flexibility extensions: first, because of its large area budget, primarily due to numerous access ports per SCM, which may not be fully utilized by other RC4 variants, and second, because of the unlikely reusability of the algorithm-specific data coherency check logic.
No CMOS implementation is reported for any other RC4 variant. For RC4A, an FPGA implementation is reported with a throughput of 0.18 Gb/s [36], which is around 57× slower than our RC4A implementation. What remains incomparable is the extent of area economization due to resource reusability, since flexible, configurable designs such as the RC4-AccSuite versions have not been taken up for hardware implementation before.
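The figures of merit used in this comparison reduce to two ratios. The following is a minimal sketch, assuming that TPA is expressed in Gb/s per KGE (an assumption about the unit that reproduces the value of 0.35 quoted for [32]); the function names are hypothetical.

```python
def tpa(throughput_gbps: float, area_kge: float) -> float:
    """Throughput per area, assumed here to be Gb/s per KGE."""
    return throughput_gbps / area_kge


def implied_throughput_gbps(reference_gbps: float, speedup: float) -> float:
    """Throughput implied by a speedup factor over a reference design."""
    return reference_gbps * speedup


tpa_ref32 = tpa(17.76, 50.58)                      # design of [32]
rc4a_implied = implied_throughput_gbps(0.18, 57)   # "around 57x" over [36]

print(f"TPA of [32]: {tpa_ref32:.2f}, implied RC4A throughput: about "
      f"{rc4a_implied:.2f} Gb/s")
```

Because the 57× factor is stated as approximate, the implied RC4A throughput should be read as a rough figure only.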
E. Comparison With Software Performance
Software performance on general-purpose computers for various RC4 variants is tabulated in Table XIV. RC4-AccSuite renders initialization for RC4 and RC4+ that is about 15× faster than the one reported in [7]. Similarly, RC4-AccSuite performance for NGG and GGHN is more than 4.3× faster than their respective references [15], [16]. Their initialization time is not specified for comparison. Spritz on RC4-AccSuite has PRGA performance that is 27× faster than the one reported on a MacBook Air (1.8-GHz Core i5) [8]. For VMPC on RC4-AccSuite, the KSA and PRGA phases are only comparable to the reported software performance [11]: the strong dependence between sequential memory accesses in the VMPC function forces nops between two PRGA instructions, which slows performance.
VI. CONCLUSION
In the context of flexible yet efficient cryptographic accelerators for stream ciphers, this paper proposes RC4-AccSuite, a configurable coprocessor for the family of RC4-like ciphers. Its flexibility stands out due to its ability to switch to another algorithm on-the-fly as the user requirements of throughput, power, or security change, while its rich instruction set can be used to map newer RC4 variants. RC4-AccSuite significantly stands out in its area efficiency against SRAM-based dedicated hardware accelerators for stream ciphers. The detailed physical design of the processor is on our road-map. The idea of resource reuse for common kernel cryptographic algorithms will be further probed for other classes of similar block ciphers.
APPENDIX
DIFFERENCE WITH THE VLSI-SoC 2012 PAPER [1]
This is a substantially revised and extended version of the conference paper [1], authored by Chattopadhyay (third author of this paper) and Paul (second author of this paper), which was accepted at VLSI-SoC 2012. In the following, we summarize the important differences between this paper and the earlier published paper.
1) The conference paper talks about a reconfigurable processor capable of stream cipher encryption by RC4 and its one variant RC4+. This journal version extends the idea to a number of RC4 variants, i.e., VMPC, RC4A, RC4B, RC4b, NGG, GGHN, and Spritz.
2) The conference paper lacked a conscious effort/discussion of resource reuse for similar cores, except for an entire instruction/memory. This version takes on an extensive reuse economization of combinational resources (entire or partial instruction datapath reuse, as discussed in Sections IV-C1 and IV-C2) and sequential resources (registers/memories reuse). Consequently, RC4-AccSuite saves up to 41% in terms of area, compared with individual cores, with the power budget dictated primarily by the variant used (discussed in Section V-B).
3) For the memory replication undertaken in the conference paper, a replication factor of 2 is considered for RC4-α (a lighter version of RC4+). This version instead takes up the replication of memories for RC4+ up to a factor of 3. Consequently, the throughput improves too.
4) This paper did not start from the core presented in the conference version; all the work here has been taken up from scratch. Consequently, the results for RC4 and RC4+ stand out, and those for the other variants are entirely new.
