Abstract. LED and PHOTON are new ultra-lightweight cryptographic algorithms targeting resource-constrained devices. In this article, we describe three different hardware architectures of the LED and PHOTON family optimized for Field-Programmable Gate Array (FPGA) devices. The first architecture is a round-based implementation, while the second is a fully serialized architecture performing operations on a single cell per clock cycle. We then propose a novel architecture designed around commonly available building blocks (SRL16). This new architecture, which relies on a complex scheduling of the operations, seems very well suited for recent designs that use serial matrices. We implemented both the lightweight block cipher LED and the lightweight hash function PHOTON on the Xilinx FPGA series Spartan-3 (low-cost) and Artix-7 (high-end), and our new architecture provides very competitive area-throughput trade-offs. In comparison with other recent lightweight block ciphers, the implementation results of LED show a significant improvement in hardware efficiency, and we obtain the smallest known FPGA implementation (as of today) of any hash function.
Introduction
Lightweight devices such as RFID tags, wireless sensor nodes and smart cards are increasingly common in applications of our daily life. These smart lightweight devices might manipulate sensitive data and thus usually require some security. Classical cryptographic algorithms are not very suitable for this type of application, especially in very constrained environments, and thus many lightweight cryptographic schemes have been proposed recently (block ciphers [20, 30, 16, 11, 39, 36, 5] or hash functions [2, 19, 6]). The main focus of lightweight cryptography research has been on the trade-offs between cost, security and performance in terms of speed, area and computational power. These primitives can be implemented either in software or on hardware platforms such as Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). Compared to ASICs, FPGAs offer additional advantages in terms of time-to-market, reconfigurability and cost.
Recently, Guo et al. proposed the lightweight block cipher LED [20] and the lightweight family of hash functions PHOTON [19], for which the hardware performance has only been investigated on ASICs. LED is based on AES-like design principles with a very simple key schedule. The internal unkeyed permutations of PHOTON can also be seen as AES-like primitives. Up to now, no design space exploration of LED on FPGAs has been published. The proposed architectures are suited for applications where low-cost FPGAs are deployed, such as FPGA-based RFID tags [15], and for battery-powered applications where low-power FPGAs [38] are deployed, such as FPGA-based wireless sensor nodes [14]. Hence, FPGAs represent popular platforms for lightweight cryptographic applications.
Our contributions. In this study, we propose three architectures optimized for the implementation of the LED block cipher and the five different flavors of the PHOTON hash functions family on FPGAs. The first architecture computes one round per clock cycle, while the second is based on the architecture presented in LED [20] and PHOTON [19] for ASIC, and adapted in this paper to FPGA with slight modifications. Our most interesting contribution is the third architecture, also serial by nature, which performs the LED and PHOTON computations based on shift registers (SRL16), thanks to a non-trivial scheduling of the successive operations. This structure is actually strictly better than the second one since it achieves lower area and better throughput.
We emphasize that the goal of this paper is to cover a wide variety of new implementation trade-offs offered by crypto primitives using serialized or recursive MDS (Maximum Distance Separable) matrices (for which LED and PHOTON are the main representatives), on a wide variety of Xilinx FPGA families, ranging from low-cost (Spartan-3) to high-end (Artix-7). Using our novel architecture, based on SRL16s, one requires only 77 slices for LED-64 and 112 slices for PHOTON-80 on a Xilinx Spartan-3 (XC3S50) device, and 40 slices for LED-64 and 58 slices for PHOTON-80 on an Artix-7 (XC7A100T) device (while achieving reasonable throughputs of 9.93 Mbps and 22.93 Mbps for LED-64, and 6.57 Mbps and 18.33 Mbps for PHOTON-80). To the best of our knowledge, these are the most compact hash function implementations on FPGAs.
The article is structured as follows. First we provide the description of LED and PHOTON in Section 2. Then, we provide in Section 3 and Section 4 our architectures and FPGA implementations of LED and PHOTON respectively. We finally draw conclusions in Section 5.
Algorithms descriptions
In this section, we describe the different versions of the LED block cipher [20] and of the PHOTON family of hash functions [19].
LED
LED is a 64-bit block cipher based on a substitution-permutation network (SPN). It supports any key length from 64 to 128 bits. In this article, we focus on the two main versions: 64-bit key LED (named LED-64) and 128-bit key LED (named LED-128). The number of rounds N depends on the key size: LED-64 has N = 32 rounds, while LED-128 has N = 48 rounds.
One can view the 64-bit internal state as a 4 × 4 matrix of 4-bit nibbles and the round function as an AES-like permutation composed of the following four operations:
• AddConstants: the internal state is bitwise XORed with a round-dependent constant (generated with an LFSR);
• SubCells: the PRESENT [7] S-box is applied to each 4-bit nibble of the internal state;
• ShiftRows: nibble row i of the internal state is cyclically shifted by i positions to the left;
• MixColumnsSerial: each nibble column of the internal state is transformed by multiplying it once with the MDS matrix χ^4 (or two times with the matrix χ^2, or four times with the matrix χ).
The key schedule of LED is very simple. In the case of LED-64, the key K is repeatedly XORed to the internal state every 4 rounds (with a whitening key operation). In the case of LED-128, the key K is divided into two 64-bit subparts K = K1||K2, which are XORed alternately to the internal state every 4 rounds. The 4-round operation between two key additions is called a step.
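The equivalence between one multiplication by χ^4 and four multiplications by χ can be checked numerically. The following is a minimal sketch, not the authors' code: the matrix χ (last row 4, 1, 2, 2), the resulting MDS matrix and the field polynomial x^4 + x + 1 are taken from the LED specification [20] and are not stated in this article; the helper names are hypothetical.

```python
# Assumption: GF(2^4) with reduction polynomial x^4 + x + 1, as in the LED specification [20].
def gf16_mul(a: int, b: int) -> int:
    """Multiply two 4-bit field elements."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:          # reduce modulo x^4 + x + 1
            a ^= 0x13
    return result


def mat_mul(x, y):
    """Multiply two square matrices over GF(2^4)."""
    n = len(x)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc ^= gf16_mul(x[i][k], y[k][j])
            out[i][j] = acc
    return out


def mat_vec(m, column):
    """Multiply a matrix by a column of nibbles."""
    return [
        gf16_mul(row[0], column[0]) ^ gf16_mul(row[1], column[1])
        ^ gf16_mul(row[2], column[2]) ^ gf16_mul(row[3], column[3])
        for row in m
    ]


# Serial matrix chi (assumed from [20]): three shifts plus one linear combination.
CHI = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [4, 1, 2, 2]]
CHI2 = mat_mul(CHI, CHI)
CHI4 = mat_mul(CHI2, CHI2)

column = [0x1, 0x2, 0x3, 0x4]
serial = column
for _ in range(4):            # four applications of chi
    serial = mat_vec(CHI, serial)
direct = mat_vec(CHI4, column)  # one application of chi^4
```

Here `serial` and `direct` hold the same column, which is why the three implementation options only differ in area and number of clock cycles, not in the result.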
PHOTON
In this section we describe the PHOTON family of hash functions, for which five versions exist with digest sizes of 80, 128, 160, 224 and 256 bits. PHOTON is based on the sponge construction. First, after padding, the input message is divided into blocks of r bits each. At each iteration, the t-bit internal state (t = r + c) absorbs the incoming message block by simply XORing it to the r-bit bitrate part (the remaining c-bit part is called the capacity). Then, after the absorption of the message block, a t-bit permutation P is applied to the internal state. Once all message blocks have been processed, the squeezing phase starts. During this phase, at each iteration r' bits are output from the internal state and the permutation P is applied. One continues to squeeze until the proper digest size n is reached.
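The absorbing and squeezing phases can be sketched as follows. This is only an illustration under stated assumptions: `toy_permutation` is a placeholder and not PHOTON's permutation P, the padding rule (a single 1 bit followed by zeros) and the all-zero default IV are assumptions rather than the values specified in [19], and the bitrate is arbitrarily placed in the most significant bits of the state.

```python
def toy_permutation(state: int, t: int) -> int:
    """Placeholder t-bit permutation (NOT PHOTON's P): rotate left by one bit, flip the low bit."""
    mask = (1 << t) - 1
    rotated = ((state << 1) | (state >> (t - 1))) & mask
    return rotated ^ 1


def pad(message_bits: str, r: int) -> str:
    """Assumed padding rule: append a '1' bit, then zeros up to a multiple of r bits."""
    padded = message_bits + "1"
    return padded + "0" * (-len(padded) % r)


def sponge_hash(message_bits: str, n: int, r: int, r_out: int, c: int,
                permutation=toy_permutation, iv: int = 0) -> str:
    """Return an n-bit digest as a bit string; r_out plays the role of r'."""
    t = r + c
    state = iv
    padded = pad(message_bits, r)
    # Absorbing phase: XOR each r-bit block into the bitrate part, then permute.
    for i in range(0, len(padded), r):
        block = int(padded[i:i + r], 2)
        state ^= block << c
        state = permutation(state, t)
    # Squeezing phase: output r' bits, permute, repeat until n bits are available.
    digest = ""
    while True:
        digest += format(state >> (t - r_out), f"0{r_out}b")
        if len(digest) >= n:
            return digest[:n]
        state = permutation(state, t)


# PHOTON-80/20/16 sized sponge: n = 80, r = 20, r' = 16, c = 80.
digest = sponge_hash("1011", n=80, r=20, r_out=16, c=80)
```

With these parameters a one-block message needs one permutation call while absorbing and four more while squeezing, since five 16-bit outputs make up the 80-bit digest.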
The PHOTON internal permutation P is also AES-like and consists of 12 rounds. The internal state is represented as a (d × d) matrix of s-bit cells and each round is defined as the application of 4 operations:
• AddConstants: the internal state is bitwise XORed with a round-dependent constant (generated with an LFSR);
• SubCells: the S-box is applied to each s-bit cell of the internal state (the PRESENT S-box [7] if s = 4, the AES S-box [9] if s = 8);
• ShiftRows: cell row i of the internal state is cyclically shifted by i positions to the left;
• MixColumnsSerial: each cell column of the internal state is transformed by multiplying it once with the MDS matrix χ^d (or two times with the matrix χ^(d/2), ..., or d times with the matrix χ).
The values of t, c, r, r', s and d depend on the hash output size n, and we give in Table 1 the 5 versions of PHOTON (we refer to [19] for the various matrices χ depending on the PHOTON versions). Note that one always uses a cell size of 4 bits, except for the PHOTON-256/32/32 version, for which one uses 8-bit cells.
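The relations between these parameters (t = r + c and t = d · d · s) can be checked mechanically. In the sketch below, the parameter values are recalled from the PHOTON specification [19] and not copied from Table 1, which remains the authoritative source; the type and helper names are hypothetical.

```python
from typing import NamedTuple


class PhotonParams(NamedTuple):
    n: int       # digest size in bits
    r: int       # input bitrate
    r_out: int   # output bitrate r'
    c: int       # capacity
    d: int       # the state is a d x d matrix
    s: int       # cell size in bits


# Assumed values, recalled from the PHOTON specification [19].
VARIANTS = {
    "PHOTON-80/20/16": PhotonParams(80, 20, 16, 80, 5, 4),
    "PHOTON-128/16/16": PhotonParams(128, 16, 16, 128, 6, 4),
    "PHOTON-160/36/36": PhotonParams(160, 36, 36, 160, 7, 4),
    "PHOTON-224/32/32": PhotonParams(224, 32, 32, 224, 8, 4),
    "PHOTON-256/32/32": PhotonParams(256, 32, 32, 256, 6, 8),
}


def state_bits(p: PhotonParams) -> int:
    """Size t of the internal state, computed from the matrix dimensions."""
    return p.d * p.d * p.s


def squeeze_outputs(p: PhotonParams) -> int:
    """Number of r'-bit outputs needed to reach n digest bits (ceiling division)."""
    return -(-p.n // p.r_out)


for name, p in VARIANTS.items():
    # Both ways of computing t must agree.
    assert state_bits(p) == p.r + p.c, name
```

For PHOTON-80/20/16, for example, this gives a 100-bit state (a 5 × 5 matrix of 4-bit cells) and five squeezing outputs of 16 bits.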
LED implementations
In this section, we present three different architectures for the FPGA implementation of the lightweight block cipher LED. The first one is a round-based implementation, while the second one is a fully serialized implementation, performing operations on a single cell during each clock cycle. The third one is a novel architecture, also fully serial, but based on SRL16s and aiming at the smallest possible area. As we are interested in the performance of the plain LED core, we did not include any I/O logic such as a UART interface. We have also investigated the performance of the LED cipher with different trade-offs. Indeed, since the diffusion matrix of LED is serial, one can view the MixColumnsSerial diffusion layer as a single application of χ^4, two successive applications of χ^2, or four successive applications of χ. We have implemented both LED versions (the 64-bit key version LED-64 and the 128-bit key version LED-128) in Verilog HDL and targeted the Xilinx FPGAs Spartan-3 [23] and Artix-7 [25]. We used Mentor Graphics ModelSim PE for simulation and Xilinx ISE v14.4 WebPACK for design synthesis. In Xilinx ISE, the design goal is kept balanced, the strategy is kept default (unlocked), and the synthesis optimization goal is set to area.
Round-based
We give in Figure 1 the block diagram of the round-based implementation of LED. Naturally, the data register (Dreg) is updated after every round operation. The keys are selected according to the key length (K1 is loaded without modification every four rounds in LED-64, while K1 and K2 are loaded alternately every four rounds in LED-128). Table 2 provides the detailed results of our round-based FPGA implementations of LED with three different approaches to the computation of the diffusion matrix: we compute χ^4 either by applying the matrix χ four times, by applying the matrix χ^2 twice, or by directly applying the entire matrix χ^4. As expected, the last option provides a higher throughput (since we directly compute the entire diffusion matrix), but at the price of a higher resource consumption. In contrast, the first option saves resources, but at the expense of a lower throughput. The second option offers a trade-off in between.
We also added in Table 2 a comparison with known round-based FPGA implementations of other (lightweight) block ciphers on the same FPGA device. One can see that our proposed round-based implementations of LED-64 and LED-128 outperform all previous works in terms of area.
Serialized
Our first serialized implementation of LED is derived from the architecture proposed in [20] for ASICs, but with some architectural modifications for the MCS state operations in order to improve the performance. This implementation stores the data and key in the registers (FF) and it has a 4-bit wide datapath, i.e. only 4 bits are processed in one clock cycle (see Figure 2 ). It consists of 4 states: Init, Sbox, Srow and MCS:
The Init state: initial data and key values are stored in the data registers and key registers, respectively.
The Sbox state: performs the SubCells (SC) and AddConstants (AC) operations simultaneously, as well as the XORing of the round key (AK) every fourth round. It requires 16 clock cycles.
The Srow state: performs the ShiftRows operation. It can be completed in 3 clock cycles with no additional hardware cost, because it just shifts the row positions of the state matrix.
The MCS state: performs the MixColumnsSerial operation. The result is calculated in a fully serialized manner, that is, one cell in each clock cycle. It first processes the topmost cell of the leftmost column (cell 00) and stores the result in the last row of the rightmost column (cell 33) in Figure 2. At the same time, the entire state array is shifted to the left by one position, where the leftmost cell of every row is shifted into the rightmost cell of the row located above. This way, in the subsequent clock cycle the topmost cell of the second column is processed, leading to a serialized row-by-row calculation of MixColumnsSerial.
It is to be noted that during the MixColumnsSerial operation in the architecture proposed in [20] , the result is stored in the last row of the leftmost column (cell 30), leading to a serialized column-by-column calculation. Our new architecture is strictly better as it saves both area and time: As the leftmost column requires only 1-input FFs instead of 2-input FFs the area requirement is reduced significantly. Our proposed architecture has similarities with the work from [33] , regarding the way the storing and rotating of matrices are implemented. Furthermore, it takes only 16 clock cycles to perform the MixColumnsSerial instead of the usual 20 clock cycles [20] . This new architecture is applicable to all AES-like permutations that use a serialized MixColumns operation and we will also use it for the PHOTON implementations described in Section 4.
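The row-by-row MCS state described above can be modelled in software to see why 16 clock cycles suffice. The sketch below is a behavioural model under assumptions, not the authors' Verilog: the coefficients (4, 1, 2, 2) and the field polynomial x^4 + x + 1 come from the LED specification [20], and the state is held as a 16-cell shift chain in row-major order.

```python
# Assumption: GF(2^4) with reduction polynomial x^4 + x + 1 (LED specification [20]).
def gf16_mul(a: int, b: int) -> int:
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return result


COEFFS = (4, 1, 2, 2)  # last row of chi, assumed from [20]


def mcs_clock(state):
    """One clock cycle: combine the leftmost column, shift the chain, store the result in cell 33."""
    new_cell = 0
    for row, coeff in enumerate(COEFFS):
        new_cell ^= gf16_mul(coeff, state[4 * row])  # cells 00, 10, 20, 30
    return state[1:] + [new_cell]


def mcs_serial(state):
    """Run the 16 clock cycles of the MCS state."""
    for _ in range(16):
        state = mcs_clock(state)
    return state


def chi_on_column(col):
    """One application of chi to a single column."""
    x0, x1, x2, x3 = col
    return [x1, x2, x3, gf16_mul(4, x0) ^ x1 ^ gf16_mul(2, x2) ^ gf16_mul(2, x3)]


def mcs_reference(state):
    """Column-by-column reference: apply chi four times to every column."""
    out = list(state)
    for j in range(4):
        col = [state[4 * i + j] for i in range(4)]
        for _ in range(4):
            col = chi_on_column(col)
        for i in range(4):
            out[4 * i + j] = col[i]
    return out


state = list(range(16))
```

After every group of four clock cycles the chain holds the state with χ applied once to each column, so `mcs_serial(state)` and `mcs_reference(state)` agree after 16 cycles.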
This serialized architecture of LED requires 35 clock cycles to perform one round (16 for Sbox, 3 for Srow and 16 for MCS), resulting in a total latency of 1120 clock cycles for LED-64 and 1680 clock cycles for LED-128. Therefore, we have reduced the latency by 128 clock cycles for LED-64 and by 192 clock cycles for LED-128, respectively, when compared to the design proposed in [20]. We give in the first row of Table 3 the detailed results of our serialized implementations. For the χ version of the diffusion matrix computation, LED-64 and LED-128 require 140 slices and 167 slices, respectively, while the throughput reaches 9.11 Mbps and 5.2 Mbps, respectively. One can see that LED-64 and LED-128 seem to require much less area than most ciphers [27, 21, 10, 28] while having a higher throughput than SIMON [3]. Furthermore, an increased throughput can be reached by scaling the datapath to 16 bits and by computing the diffusion matrix in a less serial manner, i.e. by applying χ^2 twice or χ^4 directly. Moreover, our proposed serialized implementations, when using χ^4 directly, outperform most cipher implementations [28, 3] in terms of throughput per area ratio (Eff.). Using device-dependent building blocks, such as BRAMs and DSPs, is a great way to enhance performance and optimize implementations for a specific target device. However, it obviously also makes a fair comparison of the hardware costs (area) much more difficult. Therefore, we do not use any additional building blocks and instead compare the number of slices. In the next section we will explain how to further reduce area and latency.
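The latency figures follow directly from the per-state cycle counts given above. A short sketch of the arithmetic (the constant and function names are hypothetical; the 20-cycle MCS of [20] is the figure quoted earlier in this section):

```python
ROUNDS = {"LED-64": 32, "LED-128": 48}

SBOX_CYCLES = 16       # SubCells, AddConstants and key addition
SROW_CYCLES = 3        # ShiftRows
MCS_CYCLES = 16        # row-by-row MixColumnsSerial (this work)
MCS_CYCLES_REF = 20    # column-by-column MixColumnsSerial of [20]


def latency(rounds: int, mcs_cycles: int) -> int:
    """Total clock cycles for one encryption, ignoring the initial loading."""
    return rounds * (SBOX_CYCLES + SROW_CYCLES + mcs_cycles)


for name, rounds in ROUNDS.items():
    ours = latency(rounds, MCS_CYCLES)
    reference = latency(rounds, MCS_CYCLES_REF)
    print(f"{name}: {ours} cycles, {reference - ours} fewer than with a 20-cycle MCS")
```

The saving is simply 4 cycles per round, i.e. 4 × 32 = 128 cycles for LED-64 and 4 × 48 = 192 cycles for LED-128.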
Serialized using SRL16s
Our second serialized implementation of LED is based on a building block of Xilinx Spartan-3 FPGAs called SRL16 [24]. More precisely, an SRL16 is a look-up table (LUT) used as a 16-bit shift register whose internal state can be accessed in two ways (as shown in Figure 3): the last bit of its 16 stages (Q15) is always available, while a multiplexer gives access to one additional bit from any of its internal stages. The Configurable Logic Blocks (CLBs) are the basic logic units in an FPGA. Each CLB has four slices, but only the two at the left-hand side of the CLB can be used as shift registers. Spartan-3 FPGAs can configure some LUTs as 16-bit shift registers without using the flip-flops available in each slice. When a shift register is described in generic HDL code, the global reset signal has no impact on it and synthesis tools infer the use of SRL16s. Moreover, SRL16s are present in almost all Xilinx FPGA families, and [22] describes a way to use SRL16s on Altera devices.
We have investigated possible area reductions by scaling the 64-bit implementation down to an 8-bit datapath (when using χ^2) and a 16-bit datapath (when using χ and χ^4) using SRL16s. As MixColumnsSerial requires 16-bit inputs (4 times 4 bits) in every clock cycle, but each SRL16 only allows access to 2 bits, we have to use eight and sixteen SRL16s to store the state, respectively. Figure 4 shows the block diagram of the SRL16-based implementation of LED with an 8-bit datapath when using χ^2. It consists of 4 states: Init, SrSc, Re-update and MCS, where the content of each SRL16 is indicated in Table 4 for all the state operations. We also give in Tables 7 and 8 of Appendix A the SRL16 content for the 16-bit datapath implementations when using χ and χ^4, respectively.
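The two access paths of an SRL16 can be illustrated with a small behavioural model. This is a software sketch only, with hypothetical class and method names; it does not reproduce the exact port list of the Xilinx primitive, and the convention that stage 0 holds the most recently shifted-in bit is an assumption of the model.

```python
class SRL16:
    """Behavioural model of a 16-stage, 1-bit wide shift register with an addressable tap."""

    def __init__(self) -> None:
        # stages[0] is the most recently shifted-in bit, stages[15] the oldest one.
        self.stages = [0] * 16

    def clock(self, data_in: int) -> None:
        """Shift by one position; the oldest bit is dropped."""
        self.stages = [data_in & 1] + self.stages[:15]

    def tap(self, address: int) -> int:
        """Bit selected by the 4-bit address multiplexer."""
        return self.stages[address & 0xF]

    @property
    def q15(self) -> int:
        """Last stage, always available regardless of the address."""
        return self.stages[15]


srl = SRL16()
pattern = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
for bit in pattern:
    srl.clock(bit)

first_in = srl.q15      # the first bit that was shifted in
last_in = srl.tap(0)    # the last bit that was shifted in
```

Since only two bits per SRL16 are visible in a given clock cycle, a datapath that needs 16 state bits at once has to spread them over at least eight SRL16s, which matches the count given above for the 8-bit datapath.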
The Init state: initial data and key values are stored in the data SRL16s and key SRL16s, respectively. A special ordering of the nibbles is required as shown in Table 4 and in Figure 4 .
The SrSc state: performs ShiftRows, SubCells, AddConstants and AddRoundKey simultaneously by means of a carefully chosen SRL16 addressing schedule. Table 4 depicts in bold the bits that are selected in every clock cycle to achieve this. The round operation starts by bitwise XORing the incoming data with the round key and round constants, and then feeds this result to two S-boxes (8-bit datapath) or four S-boxes (16-bit datapath), respectively. The first nibbles processed are 00 and 11 (8-bit datapath) and 00, 11, 22, and 33 (16-bit datapath), respectively. Performing the ShiftRows, SubCells, AddConstants and AddRoundKey operations on the whole state takes 8 clock cycles (clk 9-16 in Table 4) using an 8-bit datapath, and 4 clock cycles (clk 5-8 in Table 8) using a 16-bit datapath, respectively.
The Re-update state: when using the 8-bit datapath, the 8-bit output from the S-boxes needs to be duplicated within the SRL16s. This is because the MixColumnsSerial operation reads four input vectors simultaneously and thus the leftmost bits of the SRL16s must be used. 8 clock cycles (clk 17-24 in Table 4) are required for this step. Note that this state only applies to the 8-bit datapath, which is why it is not present in Tables 7 and 8 of Appendix A.
The MCS state: the 4 × 4-bit input data is read from the bits indicated in bold in Table 4. It starts with the four 4-bit blocks 00, 11, 22 and 33, and using χ^2, the resulting 8-bit output is stored in the SRL16s labeled as 00', 10' (and 20', 30', respectively) to indicate the indices of the next round. In the next clock cycle, the input data is 01, 12, 23, and 30, and the corresponding result is labeled as 01', 11' (and 21', 31'), and so on. In total, 8 clock cycles (clk 25-32 in Table 4) are required to complete the MixColumnsSerial layer when using χ^2, 4 clock cycles (clk 9-12 in Table 8) when using χ^4, and 16 clock cycles (clk 9-24 in Table 7) when using χ, respectively. The next round starts with the SrSc state (clk 9) and inputs 00' and 11'.
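The read order 00, 11, 22, 33 followed by 01, 12, 23, 30 is exactly what folds ShiftRows into the addressing: reading the state along wrapped diagonals yields the columns of the shifted state. A minimal sketch (function names are hypothetical; the state is a d × d matrix of cell labels):

```python
def shift_rows(state):
    """Rotate row i of the state to the left by i positions."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]


def cells_read(d: int, clk: int):
    """(row, column) indices of the unshifted state read in clock cycle clk (starting at 0)."""
    return [(i, (i + clk) % d) for i in range(d)]


def labels(d: int, clk: int):
    """Same as cells_read, formatted like the cell labels used in the text."""
    return [f"{i}{j}" for i, j in cells_read(d, clk)]


d = 4
state = [[f"{i}{j}" for j in range(d)] for i in range(d)]
shifted = shift_rows(state)

for clk in range(d):
    column_after_shift_rows = [shifted[i][clk] for i in range(d)]
    read_from_unshifted = [state[i][j] for i, j in cells_read(d, clk)]
    assert column_after_shift_rows == read_from_unshifted
```

No data is moved for ShiftRows in this scheme; only the address taps of the SRL16s change from one clock cycle to the next.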
Concerning the key incorporation, we give in Table 9 (resp. Table 10) of Appendix A the SRL16 positions for the key when using the 8-bit datapath with χ^2 (resp. when using χ^4 or χ for the 16-bit datapath). For the 8-bit datapath, four and eight SRL16s are required in order to store the entire 64-bit and 128-bit key, respectively. The keys are always read 8 bits at a time from the 4-bit blocks indicated in bold in Table 9 and with a grey background in Figure 4. Then, the key blocks of SRL16s are rotated by one position. Eight clock cycles (clk 17-24 in Table 9) are required for the 8-bit datapath, but 8 extra clock cycles (clk 25-32 in Table 9) are required for 64-bit key blocks so as to reach the initial position. The next AddRoundKey starts with the SrSc state (clk 17 in Table 9) and inputs 00' and 11'.
We have used sixteen SRL16s in order to store the 64-bit or 128-bit key for the 16-bit datapath. Initially, the key values are stored in the key SRL16s 4 times for the 64-bit key (2 times for the 128-bit key). 16 clock cycles (clk 1-16 in Table 10) are required for this step. The keys are read 16 bits at a time from the 4-bit blocks of SRL16s by selecting address taps based on the ShiftRows position (clk 17-20 in Table 10). After every 16-bit key read, the key blocks of SRL16s are rotated by one position. The next AddRoundKey starts with the SrSc state (clk 17 in Table 10) and inputs 00', 11', 22' and 33'.
For the 8-bit datapath, 24 clock cycles are required in order to complete one round of LED (clk 9-32 in Table 4), resulting in a total latency of 768 clock cycles for LED-64 and 1152 clock cycles for LED-128. Table 3 shows the detailed results of our implementations of LED based on SRL16s for various MDS matrix computation approaches. Our χ^2 design occupies only 77 slices for LED-64 and 86 slices for LED-128, with a corresponding throughput of 9.93 Mbps and 6.71 Mbps, respectively. The throughput can be increased to 29.82 Mbps by scaling the 8-bit datapath to a 16-bit datapath and by directly computing the χ^4 matrix. It is noteworthy that our SRL16-based implementation on the Artix-7 FPGA occupies only 40 slices for LED-64 and 50 slices for LED-128, respectively, with a throughput almost three times higher than on Spartan-3 devices.
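The cycle counts quoted for the individual states can be combined into a per-round budget for each datapath variant. In the sketch below, the 8-bit figures reproduce the totals stated above; the 16-bit totals are derived here from the per-state counts (4 cycles for SrSc, 4 or 16 for MCS) and are not quoted from Table 3. Init cycles are ignored and the names are hypothetical.

```python
ROUNDS = {"LED-64": 32, "LED-128": 48}

# Clock cycles spent in each state during one round, per the state descriptions.
CYCLES_PER_STATE = {
    "8-bit, chi^2": {"SrSc": 8, "Re-update": 8, "MCS": 8},
    "16-bit, chi^4": {"SrSc": 4, "MCS": 4},
    "16-bit, chi": {"SrSc": 4, "MCS": 16},
}


def cycles_per_round(variant: str) -> int:
    return sum(CYCLES_PER_STATE[variant].values())


def latency(variant: str, cipher: str) -> int:
    """Total clock cycles for one encryption, ignoring the Init state."""
    return ROUNDS[cipher] * cycles_per_round(variant)


for variant in CYCLES_PER_STATE:
    print(variant, cycles_per_round(variant),
          latency(variant, "LED-64"), latency(variant, "LED-128"))
```

Compared with the 35 cycles per round of the flip-flop based serialized design, the 8-bit SRL16 variant saves 11 cycles per round even though it needs the extra Re-update state.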
We also give in Table 3 the performance of existing FPGA implementations of some other lightweight block ciphers. As can be seen from the table, our work seems to require much less area than most ciphers [40, 27, 21, 10, 28] while having a higher throughput than AES [28] implementations, and it also yields a better throughput per area ratio (Eff.) than most ciphers [27, 28]. Compared to FPGA implementations of the lightweight block cipher SIMON [3], we require more area but obtain a higher throughput (and also achieve a better throughput per area ratio (Eff.) when using the direct matrix χ^4). We remark that HIGHT [40] has a better throughput per area ratio than LED, but in this article our goal with serialized implementations is to reduce area, not to improve the throughput per area ratio. More importantly, one can see in the table that our SRL16 implementation technique both saves area and increases throughput compared to a classical optimized serial implementation. Therefore, we believe this technique is very interesting for implementing serial-matrix based cryptographic primitives in FPGA technology.
PHOTON implementations
In this section, we present three different architectures for the FPGA implementation of the lightweight hash function PHOTON. As in the previous section, the first architecture is a round-based implementation, the second one is a fully serialized implementation, and the third one is our new serial architecture based on SRL16s. The diffusion layer in PHOTON is based on a serial MDS matrix similar to the one in LED, thus we also tested different trade-offs concerning its implementation. We targeted the Xilinx FPGAs Spartan-3 [23] and Artix-7 [25]. Again, we used Mentor Graphics ModelSim PE for simulation and Xilinx ISE v14.4 WebPACK for design synthesis.
Round-based
In order to fully implement the sponge construction, the input data must be padded according to the sponge padding rule [19], and this is handled by the padding unit. A 2 × 1 multiplexer drives r bits of the data input from the message registers and applies the XOR operation with r bits of the input blocks. After the padding procedure, this multiplexer operates as a feedback multiplexer in order to apply the 12 rounds of the internal permutation of PHOTON. The data register Treg is updated every round, that is, after processing AddConstants, SubCells, ShiftRows, and MixColumnsSerial in one clock cycle. Another 2 × 1 multiplexer is devoted to driving either the IV value or the internal state. Finally, during the squeezing phase, r' bits are output from the internal state after every application of the permutation P, until the hash digest size n is reached. The round-based hardware architecture of the PHOTON hash function implementations is shown in Figure 5. The architectures were optimized for high throughput and minimal FPGA area consumption. The resulting designs fit in the smallest Xilinx devices: Spartan-3 XC3S50 for the variants PHOTON-80/20/16 and PHOTON-128/16/16, and Spartan-3 XC3S400 for the variants PHOTON-160/36/36, PHOTON-224/32/32 and PHOTON-256/32/32 (because Spartan-3 XC3S50 has only 768 slices). The main interest was to examine whether this method is appropriate for obtaining a high-throughput implementation of the PHOTON hash function. In Table 5, our results are compared to other hardware implementations [1, 4]. One can see that our proposed round-based implementations outperform all previous works in terms of throughput per area ratio (Eff.).
Serialized
Similarly to our work on LED in Section 3.2, we have built a serialized implementation of the different PHOTON versions. One can see in Figure 6 that our serialized implementation consists of 6 modules: MCS, IO, AC, SC, ShR, and Controller. These modules and the general hardware architecture that we propose are almost the same as those described in [19] for ASICs. Yet, we applied the same optimization for MixColumnsSerial that we have described for LED in detail in Section 3. [19] . Therefore, we obtain a total latency of 12 implementations [1] of the lightweight hash function SPONGENT [6], we get bigger area requirements but for a much higher throughput per area (Eff.). We will see in the next section that SRL16-based implementations of PHOTON lead to lower area and much higher throughput, and yield a better throughput per area ratio (Eff.) than SPONGENT.
Serialized using SRL16s
As for LED in Section 3.3, we considered a second serialized implementation of the PHOTON hash function based on the use of SRL16s [24]. Our architecture is based on a 20-bit datapath that uses χ. It is depicted in Figure 7 and consists of 3 states: Init, SrSc and MCS, where the content of each SRL16 is indicated in Table 11 of Appendix B for all the state operations.
The Init state: after the padding procedure, the IV value is stored into the data SRL16s (z = s · d bits) using a 3 × 1 multiplexer which drives either the IV input value, updates SrSc state value, or updates MCS state value.
The SrSc state: it reads the data values from SRL16s by selecting address taps according to the ShiftRows positions. The round operation starts by bitwise XORing the incoming data with r bits of the message input if applicable, and then adding the constants (round constants and internal constants). Next, the result goes through d S-boxes for a z-bit datapath. Finally, the output of the 4-bit S-boxes is given as input to the blocks 00, d clock cycles (clk 6-10 in Table 11 for PHOTON-80/20/16) for a z-bit datapath to perform AddConstants, ShiftRows and SubCells operations on the entire state.
The MCS state: the z-bit data is read from the bits indicated in bold in Table 11 for PHOTON-80/20/16. It starts with the five 4-bit blocks 00, 11, 22, 33 and 44, and using χ, the resulting 20-bit output is stored in the SRL16s labeled as 11, 22, 33, 44 and 00'. In the next clock cycle, the input is 01, 12, 23, 34 and 40, and the corresponding result is labeled as 12, 23, 34, 40 and 01', and so on, similarly to Table 7. In total, 25 clock cycles (clk 11-35 in Table 11) are required to complete the MixColumnsSerial operation for PHOTON-80/20/16. We have also implemented the remaining 4 versions of PHOTON using the same architecture and give below the labels of the input (x) and output (y) SRL16s of the MCS state for the first clock cycle.
• [32] and the implementation of SPONGENT [1] . We remark that SHABAL [12] has a better throughput per area ratio than PHOTON, but in this article our goal with serialised implementations is to reduce area, and not to improve throughput per area ratio.
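The clock ranges quoted for the SrSc and MCS states follow a simple pattern: d cycles for SrSc (one column per cycle) and d · d cycles for MCS when χ is applied d times to each of the d columns. The sketch below reproduces the ranges given in the text for PHOTON-80/20/16 and for the 16-bit χ datapath of LED; applying the same formula to the other PHOTON variants is an extrapolation made here, not a figure taken from the tables, and the function names are hypothetical.

```python
PHOTON_ROUNDS = 12  # number of rounds of the internal permutation P


def round_schedule(d: int, first_clk: int):
    """Inclusive clock ranges of the SrSc and MCS states for a d x d state using chi."""
    srsc = (first_clk, first_clk + d - 1)        # one column per clock cycle
    mcs = (srsc[1] + 1, srsc[1] + d * d)         # d applications of chi on d columns
    return {"SrSc": srsc, "MCS": mcs}


def cycles_per_round(d: int) -> int:
    return d + d * d


def cycles_per_permutation(d: int) -> int:
    """Clock cycles for the 12 rounds of P, ignoring the Init state."""
    return PHOTON_ROUNDS * cycles_per_round(d)


photon80 = round_schedule(d=5, first_clk=6)     # PHOTON-80/20/16
led16 = round_schedule(d=4, first_clk=5)        # LED, 16-bit datapath with chi
```

For PHOTON-80/20/16 this gives 30 clock cycles per round, i.e. 360 cycles for one call to the permutation under the stated assumptions.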
Conclusion
In this paper, we have analyzed the feasibility of creating very compact, low-cost FPGA implementations of LED and PHOTON. For both primitives, we studied round-based and serial architectures and we implemented several possible trade-offs when computing the diffusion matrix. In particular, we proposed an SRL16-based architecture that seems to be very well suited for all cryptographic primitives that use serial matrices. Our results show that LED and PHOTON are very good candidates for lightweight applications. For example, our implementations yield the smallest area of all lightweight hash function implementations published so far. Future work will include the investigation of side-channel attacks on our implementations and the application of countermeasures [18, 34, 33] in order to resist these attacks.
A SRL16s positions for LED
Table 9. Content of the key SRL16s after every state of LED-64 when using χ^2 for the 8-bit datapath. Every cell of the content shows the index of a nibble of the state. Printed in bold is the input to the subsequent operation (see also Figure 4). The indices of the next round are indicated with a prime (').
